24. Clustering#
When we deal with complex datasets of multiple observations and variables, it is often useful to understand the underlying structure of the data.
One approach to do so is to determine whether the data can be grouped into clusters containing points with similar characteristics. This can be useful in different application scenarios in data analysis:
Customer Segmentation:
Grouping customers based on their purchasing behavior, preferences, or demographics for targeted marketing strategies.
Image Segmentation:
Dividing an image into meaningful segments or regions based on similarity, aiding in object recognition and computer vision tasks.
Text Document Clustering:
Grouping similar documents together based on their content for tasks such as topic modeling, document organization, and information retrieval.
Speech and Audio Processing:
Clustering audio data for tasks such as speaker identification, music genre classification, and speech analysis.
E-commerce:
Grouping products or users based on purchasing patterns to optimize inventory management and improve the user experience.
24.1. Problem Definition#
Clustering aims to break the data into distinct groups with similar properties. Let us consider the “Old Faithful” dataset, which comprises two variables:
The duration of the eruption in minutes;
The time in minutes to the next eruption.
We can see each observation as a 2D vector which can be plotted in a Cartesian coordinate system:

We can clearly see two kinds of eruptions:
Short eruptions followed by other eruptions in a short time (bottom left);
Long eruptions followed by other eruptions in a long time (upper right).
A clustering algorithm aims to automatically find the two groups and assign each data point to the most likely group. Note that such groups are generically referred to as “clusters”.
Let $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ be a set of observations. The goal of clustering is to split $\mathcal{X}$ into $K$ disjoint groups (the clusters) $C_1, \ldots, C_K$ such that:

$$C_1 \cup C_2 \cup \ldots \cup C_K = \mathcal{X}, \qquad C_i \cap C_j = \emptyset \;\; \forall i \neq j$$

The clusters are expected to be such that elements within a group are similar to each other, whereas elements belonging to different groups are different from each other.
The number of clusters $K$ is generally a hyper-parameter which has to be fixed in advance (we will later see some heuristics to choose it).
24.1.1. Graphical Example#
For instance, consider the following 2D points:

Even if the points are not explicitly assigned to given groups (i.e., we are not observing any discrete variable indicating group membership), we can easily see that the data can be divided into two main groups, as shown below:

Also note that, while different partitions of the 2D points are possible, the one outlined in the plot guarantees that points within the same group are similar to one another (i.e., their Euclidean distance is small), whereas points from different groups are dissimilar (i.e., their Euclidean distance is large).
24.2. K-Means Clustering#
K-Means is the most popular clustering algorithm. As in the definition of clustering we just gave, the goal of K-Means is to break the data $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ into $K$ clusters $C_1, \ldots, C_K$ such that:

$$C_1 \cup C_2 \cup \ldots \cup C_K = \mathcal{X}, \qquad C_i \cap C_j = \emptyset \;\; \forall i \neq j$$

More specifically, K-Means attempts to create clusters which are as compact as possible, i.e., clusters in which the data points are close to the cluster center.
We will start by defining a prototype (or centroid) $\boldsymbol{\mu}_k$ for each cluster $C_k$, representing its center. Hence, we define the centroid as the mean of the data points assigned to the cluster:

$$\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$$

The notion that clusters should be as compact as possible is formalized by defining the following cost function, which is sometimes also called the distortion function:

$$J(C_1, \ldots, C_K) = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

Note that this function accumulates the sum of squared distances between each data point and the prototype of the assigned cluster. This is possible because the clusters form a partition of the data, so each data point is assigned to exactly one cluster (and hence to exactly one prototype).
We can also note that the variance of cluster $C_k$ can be written as:

$$\sigma_k^2 = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

We can hence see the cost function also as:

$$J(C_1, \ldots, C_K) = \sum_{k=1}^{K} |C_k| \, \sigma_k^2$$

Hence, minimizing $J$ corresponds to minimizing a weighted sum of the cluster variances (each weighted by the number of points in the cluster), i.e., obtaining clusters which are as compact as possible.
The K-Means problem is solved by finding a partition $C_1, \ldots, C_K$ (with the associated centroids $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$) which minimizes the cost function $J$.
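As a minimal sketch, the distortion function above can be computed in a few lines of NumPy. The function name `distortion` and the array names `X`, `labels`, and `centroids` are our own and only serve the illustration:

```python
import numpy as np

def distortion(X, labels, centroids):
    """Sum of squared distances between each point and its assigned centroid.

    X         : (N, d) array of data points
    labels    : (N,) array with the index of the cluster assigned to each point
    centroids : (K, d) array of cluster centroids
    """
    # centroids[labels] picks, for each point, the centroid of its cluster.
    diffs = X - centroids[labels]
    return np.sum(diffs ** 2)
```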
24.2.1. Optimization#
The minimization above can be performed using an iterative algorithm.
The first step of the algorithm is to choose $K$ initial centroids $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$, for instance by picking $K$ data points at random. The algorithm then alternates between the following two steps:
Assignment
Each element $\mathbf{x}_i$ is assigned to the cluster with the closest centroid:

$$C_k = \{\mathbf{x}_i \,:\, \|\mathbf{x}_i - \boldsymbol{\mu}_k\| \leq \|\mathbf{x}_i - \boldsymbol{\mu}_j\| \;\; \forall j\}$$

which will lead to the partition $C_1, \ldots, C_K$.
Update
The centroids $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$ are re-computed from the assigned sets $C_1, \ldots, C_K$ as the means of the points assigned to each cluster:

$$\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$$
The algorithm converges when the update step does not change any centroid. In some cases, the algorithm may never actually converge, hence it is common to introduce a maximum number of iterations as an additional termination criterion.
The algorithm is not guaranteed to find the global optimum and the solution reached may depend on the random initialization. However, in practice it usually leads to a reasonable solution.
24.2.1.1. Pseudocode#
We can see the optimization algorithm in pseudo-code as follows:
Randomly initialize K cluster centroids μ_1, ..., μ_K
Repeat until termination criterion is reached {
    for i = 1 to N
        c_i := index of the centroid closest to x_i          //assignment
    for j = 1 to K
        μ_j := mean of the points x_i such that c_i = j      //update
}
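The pseudocode can be translated almost line by line into NumPy. The following is only a sketch of a possible implementation (not the code used to produce the figures in this section), using randomly picked data points as initial centroids and the termination criteria discussed above:

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm) on an (N, d) data matrix X."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]

    # Random initialization: pick K distinct data points as initial centroids.
    centroids = X[rng.choice(N, size=K, replace=False)]

    for _ in range(max_iters):
        # Assignment: each point goes to the cluster with the closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Update: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        # Terminate if the update step did not move any centroid.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```

In practice, one would typically rely on `sklearn.cluster.KMeans`, which adds a smarter initialization scheme (k-means++) and multiple random restarts to mitigate the dependence on the initialization.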
24.2.2. Example Execution#
Let us see a graphical example of the process.
Dataset

Initialization (Random)

Assignment (1)

Update (1)

Assignment (2)

Update (2)

Assignment (3)

Update (3)

Assignment (4)

Update (4)

The optimization procedure terminates here, as the centroids did not move in the last update step. The data is now clustered into two groups.
The plot below shows the value of the cost function at the end of each iteration:

24.2.3. Choosing the Right Value for K#
In the K-Means clustering algorithm, the number of clusters $K$ is a hyper-parameter which has to be chosen before fitting the model to the data.
One common approach to determine the value of a hyper-parameter is to fit different models with several candidate values and then choose the model with the “best” value of the considered hyper-parameter.
When it comes to determining the optimal number of clusters $K$, two common heuristics are the elbow method and the silhouette method.
Note that these are both heuristic methods which do not give strong guarantees on the final selection of $K$.
24.2.3.1. Elbow Method#
The elbow method consists in fitting K-Means models for different values of $K$ and plotting the value of the cost function $J$ obtained at the end of the optimization for each value of $K$. Since the cost function generally keeps decreasing as $K$ grows (in the limit, when each data point is its own cluster, the cost is zero), we cannot simply pick the value of $K$ achieving the lowest cost.
However, we expect the cost function to slow its speed of decrease in the presence of the optimal (or a good) value of $K$: the chosen $K$ is the one at which the curve forms an “elbow”, i.e., the point after which the decrease becomes much slower.
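As a rough sketch, such a curve can be obtained with scikit-learn as follows (the function name `elbow_plot` and the range of candidate values of K are arbitrary choices):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_values=range(1, 11)):
    """Plot the final K-Means distortion (inertia) for several values of K."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(model.inertia_)  # sum of squared distances to the closest centroid

    plt.plot(list(k_values), inertias, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("Distortion J")
    plt.show()
```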

24.2.3.2. Silhouette Method#
The silhouette method tries to assess how good a given clustering model is by measuring how well the data is grouped in the identified clusters. This is done by computing, for each data point, a score indicating how well the data point fits the assigned cluster with respect to all the others.
If the model is a good one (hence a good value of $K$ has been chosen), we expect most data points to fit well in their assigned clusters.
To do so, for a given data point $\mathbf{x}_i$ assigned to cluster $C_k$, we first compute the mean distance between $\mathbf{x}_i$ and all the other points of the same cluster:

$$a(i) = \frac{1}{|C_k| - 1} \sum_{\mathbf{x}_j \in C_k,\, j \neq i} d(\mathbf{x}_i, \mathbf{x}_j)$$

where $d(\mathbf{x}_i, \mathbf{x}_j)$ is the distance (e.g., Euclidean) between the two points.
We now want to compare this number to the value we would have if $\mathbf{x}_i$ were assigned to another cluster. Hence, we compute the smallest mean distance between $\mathbf{x}_i$ and the points of any other cluster:

$$b(i) = \min_{h \neq k} \frac{1}{|C_h|} \sum_{\mathbf{x}_j \in C_h} d(\mathbf{x}_i, \mathbf{x}_j)$$

Now we compare $a(i)$ and $b(i)$ by defining the silhouette score of the data point:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

The obtained score is such that:

$$-1 \leq s(i) \leq 1$$

In practice:
A value of $s(i)$ close to $1$ is obtained when $a(i) \ll b(i)$, meaning that $\mathbf{x}_i$ fits in $C_k$ much better than it would fit in any other cluster;
A value of $s(i)$ close to $-1$ is obtained when $b(i) \ll a(i)$, meaning that another cluster exists in which $\mathbf{x}_i$ fits better;
A value of $s(i)$ close to $0$ indicates that the data point is close to the border between two natural clusters.
The score above is arbitrarily set to $s(i) = 0$ when the cluster $C_k$ contains a single data point.
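For reference, scikit-learn provides `silhouette_samples` (the per-point scores $s(i)$, used to build the silhouette plots discussed below) and `silhouette_score` (their average). The snippet below is a minimal sketch; the function name `silhouette_analysis` and the candidate values of K are our own choices:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_analysis(X, k_values=range(2, 7)):
    """Fit K-Means for several values of K and report silhouette statistics."""
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

        per_point = silhouette_samples(X, labels)   # s(i) for every data point
        average = silhouette_score(X, labels)       # mean silhouette over the dataset

        print(f"K={k}: average silhouette = {average:.3f}")
        for c in range(k):
            print(f"  cluster {c}: mean s(i) = {per_point[labels == c].mean():.3f}")
```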
The correct number of clusters is usually chosen by inspecting the silhouette plot, which displays all the silhouette scores of the data points within each cluster, arranged from largest to smallest. The plots below are taken from the scikit-learn documentation and show examples of silhouette plots for different choices of $K$:





The vertical red line displays the average silhouette score. Good choices of the number of clusters lead to silhouette plots in which the clusters achieve scores similar to each other and close to the average one, whereas poor choices lead to plots in which some clusters fall mostly below the average line.
In practice, it is also useful to plot the average silhouette score versus the number of clusters and choose the value of $K$ which maximizes it.
Such a plot would look as follows for the previous example:

The plot below shows another example of such a plot for a different dataset:

24.3. Other Clustering Algorithms#
It is important to note that various clustering algorithms exist beyond the commonly discussed k-means method. While k-means is a popular and widely used algorithm for partitioning data into distinct groups based on similarity, it is one approach among many. Numerous other clustering algorithms offer different perspectives and address specific challenges. Some examples include hierarchical clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and spectral clustering. These alternatives provide unique strengths and are tailored to different types of data and scenarios. While our focus has been on understanding clustering in general and the k-means algorithm in particular, it is important to know that other approaches may be more suitable depending on the specific problem at hand. In the next lectures, we will see Gaussian Mixture Models (GMMs), a probabilistic model for density estimation which can also be interpreted as a clustering algorithm.
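For reference, several of these alternatives are available in scikit-learn behind a similar interface. The snippet below only sketches how one might try them on the same data matrix `X`; the parameter values are arbitrary and would need tuning for a real dataset:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, SpectralClustering

def try_alternative_clusterings(X):
    """Run a few alternative clustering algorithms on the same data matrix X."""
    models = {
        # Hierarchical (agglomerative) clustering with a fixed number of clusters.
        "hierarchical": AgglomerativeClustering(n_clusters=2),
        # Density-based clustering: the number of clusters is not fixed in advance.
        "dbscan": DBSCAN(eps=0.5, min_samples=5),
        # Spectral clustering, based on a similarity graph built from the data.
        "spectral": SpectralClustering(n_clusters=2, random_state=0),
    }
    return {name: model.fit_predict(X) for name, model in models.items()}
```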
24.4. References#
Section 10.3.1 of [1]
Section 2.3.9 of [2]
Section 9.2 of [2]
[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
[2] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.