14. Clustering#

When we deal with complex datasets of multiple observations and variables, it can be useful to be able to understand the underlying structure of the data.

One approach to do so is to determine whether the data can be grouped into clusters containing points with similar characteristics. This can be useful in different application scenarios in data analysis:

  1. Customer Segmentation:

    • Grouping customers based on their purchasing behavior, preferences, or demographics for targeted marketing strategies.

  2. Image Segmentation:

    • Dividing an image into meaningful segments or regions based on similarity, aiding in object recognition and computer vision tasks.

  3. Text Document Clustering:

    • Grouping similar documents together based on their content for tasks such as topic modeling, document organization, and information retrieval.

  4. Speech and Audio Processing:

    • Clustering audio data for tasks such as speaker identification, music genre classification, and speech analysis.

  5. E-commerce:

    • Grouping products or users based on purchasing patterns to optimize inventory management and improve the user experience.

14.1. Problem Definition#

Clustering aims to break the data into distinct groups with similar properties. Let us consider the “Old Faithful” dataset, which comprises \(272\) observations of eruptions of the Old Faithful geyser in Yellowstone National Park, USA. Each observation includes two variables:

  • The duration of the eruption in minutes;

  • The time in minutes to the next eruption.

We can see each observation as a 2D vector, which can be plotted in a Cartesian coordinate system:

../_images/63f4f700550e363e8b81b830177a04567dddfa9f9834e064179f3c3189299fd0.png
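As an illustration, a scatter plot like the one above can be produced with a few lines of matplotlib. The following is a minimal sketch, assuming the observations are stored in a two-column CSV file; the file name and column names are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file: one row per observation, two columns (both in minutes)
data = pd.read_csv("old_faithful.csv")

plt.scatter(data["eruption_duration"], data["waiting_time"], s=10)
plt.xlabel("Eruption duration (min)")
plt.ylabel("Waiting time to next eruption (min)")
plt.title("Old Faithful eruptions")
plt.show()
```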

We can see how we clearly have two kinds of eruptions:

  • Short eruptions followed by other short eruptions in a short time (bottom left);

  • Long eruptions followed by other long eruptions in a long time (upper right).

A clustering algorithm aims to automatically find the two groups and assign each data point to the most likely group. Note that the two groups will be referred to generically as “cluster \(i\)”, meaning that the algorithm will not assign any semantics to the clusters, but it will aim to put “similar” data points in the same cluster.

Let

\[\mathbf{X} = \left\{ \mathbf{x}^{(i)} \right\}_{i = 1}^{N},\ \mathbf{x}^{(i)} \in \mathfrak{R}^{n}\]

be a set of observations. The goal of clustering is to split \(\mathbf{X}\) into \(K\) groups (called clusters):

\[S = \{ S_{1},S_{2},\ldots,S_{K}\}\]

Such that:

\[\forall\ \mathbf{x}^{(i)} \in \mathbf{X}\ \exists!S_{j} \in S\ :\mathbf{x}^{(i)} \in S_{j}\]

The clusters are expected to be such that elements within a group are similar to each other, whereas elements belonging to different groups are different from each other.

The number of clusters \(K\) is often a hyper-parameter of the algorithm.

14.1.1. Graphical Example#

For instance, consider the following 2D points:

../_images/360dcb37a7439df52d0567866db050cca20c3ba8cb039ac6ad087f80703d6479.png

Even if the points are not explicitly assigned to given groups (i.e., we are not observing any discrete variable \(y\) indicating the group to which each point belongs), we can clearly see that there are two distinct groups (one in the top-left part of the plot, and one in the bottom-right). The goal of clustering is to split the data into these two groups, as shown below:

../_images/3773291814ab9f16067daf3ec5a7ea4a4261daefff1192518c8055512b9de37d.png

Also note that, while different partitions of the 2D points are possible, the one outlined in the plot guarantees that points within the same group are similar to one another (i.e., their Euclidean distance is small), whereas points from different groups are dissimilar (i.e., their Euclidean distance is large).

14.2. K-Means Clustering#

K-Means is the most popular clustering algorithm. As in the definition we just gave of clustering, the goal of K-Means is to break the data

\[\mathbf{X} = \left\{ \mathbf{x}^{(i)} \right\}_{i = 1}^{N},\ \mathbf{x}^{(i)} \in \mathfrak{R}^{n}\]

into \(K\) clusters

\[S = \{ S_{1},S_{2},\ldots,S_{K}\}\]

such that:

\[\forall\ \mathbf{x}^{(i)} \in \mathbf{X}\ \exists!S_{j} \in S\ :\mathbf{x}^{(i)} \in S_{j}\]

More specifically, K-Means attempts to create \(K\) clusters which are as compact as possible, where \(K\) is a parameter specified by the user (its value may depend on the application).

We will start by defining \(K\) different vectors \(\mathbf{\mu}_k \in \Re^n\) which will act as prototypes for our clusters. We will shortly see that these vectors will be the clusters’ centers.

We then define \(N \times K\) binary variables \(r_{ij}\) which allow us to establish a mapping between data points \(\mathbf{x}^{(i)}\) and clusters \(S_j\). In particular, we define:

\[\begin{split}r_{ij} = \begin{cases}1 & \text{ if } \mathbf{x}^{(i)} \in S_j \\ 0 & \text{ otherwise} \end{cases} \end{split}\]

The notion that clusters should be as compact as possible is formalized by defining the following cost function, which is sometimes also called the distortion function:

\[J = \sum_{j=1}^K \sum_{i=1}^N r_{ij} ||\mathbf{x}^{(i)}-\mathbf{\mu}_j||^2\]

Note that this function accumulates the sum of squared distances between each data point and the prototype of its assigned cluster. This works because \(r_{ij}\) is one only if data point \(i\) belongs to cluster \(j\) (so most terms in the inner sum are zero).
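To make the definition concrete, the following is a minimal NumPy sketch computing \(J\) from a data matrix, a binary assignment matrix and the prototypes; all names are illustrative.

```python
import numpy as np

def distortion(X, r, mu):
    """J = sum_j sum_i r_ij ||x^(i) - mu_j||^2.

    X  : (N, n) data matrix
    r  : (N, K) binary assignment matrix (one 1 per row)
    mu : (K, n) cluster prototypes
    """
    # Squared distance between every data point and every prototype: shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # r zeroes out all terms except the distance to the assigned prototype
    return (r * sq_dists).sum()
```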

We can also note that the variance of cluster \(j\) can be defined as:

\[\sigma_j^2 = \frac{1}{\sum_{i=1}^N{r_{ij}}}\sum_{i=1}^N r_{ij} ||\mathbf{x}^{(i)}-\mathbf{\mu}_j||^2 = \frac{1}{|S_j|}\sum_{i=1}^N r_{ij} ||\mathbf{x}^{(i)}-\mathbf{\mu}_j||^2\]

We can hence see the cost function also as:

\[J = \sum_{j=1}^K |S_j|\sigma_j^2\]

Hence, minimizing \(J\) effectively minimizes the variances \(\sigma_j^2\) weighted by the cluster sizes \(|S_j|\); as a result, the clusters are encouraged to be compact and to have similar numbers of elements.

The K-Means problem is solved by finding the assignments \(\hat r_{ij}\) and prototypes \(\hat{\mathbf{\mu}}_j\) (and hence a partition \(\widehat{S}\)) minimizing the cost function:

\[\{\hat r_{ij}\}_{ij}, \{ \hat{\mathbf{\mu}}_j \}_j = \arg\min_{\{r_{ij}\}, \{\mathbf{\mu}_j\}} \sum_{j = 1}^{K}{\sum_{i = 1}^{N}{r_{ij}\left\| \mathbf{x}^{(i)} - \mathbf{\mu}_{j} \right\|^{2}}}\]

14.2.1. Optimization#

The minimization above can be performed using an iterative algorithm. The first step of the algorithm is to choose \(K\) random centroids \(\mathbf{\mu}_{j}\). After this initialization, the algorithm iterates the following two steps:

Assignment

Each element \(\mathbf{x}^{(i)}\) is assigned to the cluster with the closest centroid:

\[\begin{split}r_{ij} = \begin{cases}1 & \text{ if } j=\arg\min_k ||\mathbf{x}^{(i)}-\mathbf{\mu}_k||^2 \\ 0 & \text{otherwise} \end{cases}, \quad \forall i,j\end{split}\]

This leads to the partition:

\[S_{j} = \left\{ \mathbf{x} \in \mathbf{X} :\ \left\| \mathbf{x} - \mathbf{\mu}_{j} \right\|^{2} \leq \left\| \mathbf{x} - \mathbf{\mu}_{k} \right\|^{2}\ \ \forall\ k \in \left\{ 1,\ldots, K \right\} \right\}\]
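In vectorized form, the assignment step simply picks, for each point, the index of the closest centroid. A minimal NumPy sketch (illustrative names, reusing the conventions of the `distortion` example above):

```python
import numpy as np

def assign(X, mu):
    """One-hot (N, K) matrix r with r[i, j] = 1 iff centroid j is the closest to x^(i)."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    closest = sq_dists.argmin(axis=1)                               # nearest centroid per point
    r = np.zeros_like(sq_dists)
    r[np.arange(len(X)), closest] = 1
    return r
```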

Update

The centroids \(\mathbf{\mu}_{j}\) are re-computed from the assigned sets:

\[\mathbf{\mu}_{j} = \frac{1}{|S_{j}|}\sum_{\mathbf{x} \in S_{j}}^{}\mathbf{x} = \frac{\sum_{i = 1}^{N}{r_{ij}\mathbf{x}^{(i)}}}{\sum_{i = 1}^{N}{r_{ij}}}\]
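The update step is then a per-cluster mean, which can be written directly from the formula above. A minimal sketch, operating on NumPy arrays and assuming no cluster is left empty:

```python
def update(X, r):
    """mu_j = sum_i r_ij x^(i) / sum_i r_ij, for NumPy arrays X (N, n) and r (N, K)."""
    # r.T @ X sums the points of each cluster; r.sum(axis=0) counts them.
    # Assumes every cluster has at least one assigned point.
    return (r.T @ X) / r.sum(axis=0)[:, None]
```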

The algorithm converges when the update step no longer changes any centroid. Since convergence may require many iterations, it is common to also introduce a maximum number of iterations as a termination criterion.

The algorithm is not guaranteed to find the global optimum and the solution reached may depend on the random initialization. However, in practice it usually leads to a reasonable solution.

14.2.1.1. Pseudocode#

We can see the optimization algorithm in pseudo-code as follows:

Randomly initialize \(K\) cluster centroids \(\mu_{1},\mu_{2},\ldots,\mu_{K} \in \mathfrak{R}^{n}\)

Repeat until termination criterion is reached {

         for i = 1 to N

                 \(r_{ij} = \begin{cases}1 & \text{ if } j=\arg\min_k ||\mathbf{x}^{(i)}-\mathbf{\mu}_k||^2 \\ 0 & \text{otherwise} \end{cases}, \forall j\) //assignment

         for j = 1 to K

                 \(\mathbf{\mu}_{j} = \frac{\sum_{i = 1}^{N}{r_{ij}\mathbf{x}^{(i)}}}{\sum_{i = 1}^{N}{r_{ij}}}\) //update

}
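Putting the two steps together, the following is a minimal from-scratch sketch of the loop above (not an optimized implementation; it assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Return the centroids and the cluster index of each point in X (an (N, n) array)."""
    rng = np.random.default_rng(seed)
    # Initialization: K distinct data points chosen at random as centroids
    mu = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment: index of the closest centroid for every point
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        # Update: each centroid becomes the mean of the points assigned to it
        new_mu = np.array([X[new_labels == j].mean(axis=0) for j in range(K)])
        # Terminate when neither the assignments nor the centroids change
        if np.array_equal(new_labels, labels) and np.allclose(new_mu, mu):
            break
        labels, mu = new_labels, new_mu
    return mu, labels
```

For instance, `mu, labels = k_means(X, K=2)` would return two centroids and a cluster index for every observation of a two-cluster dataset such as the one above.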

14.2.2. Example Execution#

Let us see a graphical example of the process.

Dataset

../_images/0e6ee79ba9995a7f99a7d42318ab054565b3b4c386eb0844fcb619377ba7e4de.png

Initialization (Random)

../_images/4d1fb98072edc5b14f1b9ed4b6bfaa182bed43d9ccde3dad139e4398728e74a9.png

Assignment (1)

../_images/99c387f5f21ee8e5a9804138d58cc348fd7242f2ab878e245c86d6bd57750129.png

Update (1)

../_images/36e1c90ebc5bcfc520ce481efc45034f2c5e20046dd2a895aed3ea36825e49fb.png

Assignment (2)

../_images/4bc56fc212456df58d1de947a5a290f5e8aa8e14f5996c6d4bf0fbf7c03dda94.png

Update (2)

../_images/f1271611d0fa2ce218ce866265a36f3e38edead921def7186dfe97179c4296b4.png

Assignment (3)

../_images/20aeadd9bf300693e6b010914d5a74d5c7941761918018079342fc8913bb8059.png

Update (3)

../_images/9880baa80546e7243adb633ba2a3f0dffcda330919e706ebd361f0cb61d9c4d8.png

Assignment (4)

../_images/75b13263fb63875972f94039764f3cba63085bb43f6ae0f678fbcd898677db0d.png

Update (4)

../_images/ccc4efb1512ab07339a128d1a9ea519287b5614aa2e115bf36f6cfb4423a7e02.png

The optimization procedure terminates here as the centroids did not move in the last update step. The data is now clustered in two groups.

The plot below shows the value of the cost function at the end of each iteration:

../_images/b2accded7842cdfbb330249ed6b47961f1a41d846b22523a11ca017f2e1ed554.png
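In practice, rather than re-implementing the loop, one would typically rely on scikit-learn’s `KMeans`. A minimal usage sketch follows; the synthetic data only stands in for a real dataset, and `inertia_` is the library’s name for the final value of the cost \(J\).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with two well-separated groups (a stand-in for the geyser data)
X, _ = make_blobs(n_samples=272, centers=2, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index of each observation

print(kmeans.cluster_centers_)   # the K centroids
print(kmeans.inertia_)           # final value of the cost function J
```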

14.2.3. Choosing the Right Value for K#

In the K-Means clustering algorithm, \(K\) is a hyper-parameter. This means that it is a parameter of the method, affecting the final model we obtain, but its value is not automatically determined by the optimization procedure.

One common approach to determine the value of a hyper-parameter is to make some guesses and fit a different model for each guessed value. We can then choose the value which yields the best model according to some criterion.

When it comes to determining the optimal value of \(K\) for K-Means clustering, two main techniques are commonly used: the elbow method and the silhouette method, which are discussed in the following.

Note that these are both heuristic methods which give few guarantees on the final selection of \(K\), but they can still help decide on the number of clusters, in particular if guided by an intuition of what a good value of \(K\) would look like.

14.2.3.1. Elbow Method#

The elbow method consists in fitting K-Means models for different values of \(K\). For each \(K\), we then plot the final value of the cost function. We expect the cost to decrease as \(K\) increases: indeed, if \(K\) is very large (even larger than it should be), the cost will still shrink, simply because having more clusters inevitably lowers their individual variances.

However, we expect the rate of decrease to slow down beyond the optimal (or a good) value of \(K\). This is called the “elbow point”, as the curve should look like an elbow. We can see it in the example below:

../_images/f1d5a00e177d743540c151cfdbeb5dcfe56534cf266cd62719872f6a6b7d8c25.png
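The elbow plot can be obtained by fitting one model per candidate \(K\) and recording the final cost. A minimal sketch with scikit-learn, using synthetic data as a placeholder for the dataset at hand:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, costs, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Final cost J (inertia)")
plt.title("Elbow method")
plt.show()
```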

14.2.3.2. Silhouette Method#

The silhouette method tries to assess how good a given clustering model is by measuring how well the data is grouped into the identified clusters. This is done by computing, for each data point, a score indicating how well the data point fits the assigned cluster with respect to all others.

If the model is a good one (hence a good \(K\) value), we would expect each data point to fit its assigned cluster well and not to fit any other cluster well. If this does not happen, it means that the clusters are not well separated and we should probably use a larger or smaller number of clusters.

To do so, for a given data point \(i\) assigned to cluster \(C_I\), the silhouette method computes the following score:

\[a(i) = \frac{1}{|C_I|-1} \sum_{j \in C_I,\ j \neq i} d(i,j)\]

where \(d(i,j)\) is the Euclidean distance between data points \(i\) and \(j\). In practice, the score \(a(i)\) is the average distance between data point \(i\) and all other data points assigned to the same cluster. We expect this number to be small if the cluster has a low variance.

We now want to compare this number to the value we would have if \(i\) were to be assigned to another cluster \(C_J\). In particular, we will choose the cluster \(C_J\) which minimizes the average distance. This is done by computing the following score:

\[b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i,j)\]

Now we compare \(a(i)\) with \(b(i)\). Ideally, \(a(i)\) should be much smaller than \(b(i)\): if this happens, data point \(i\) fits cluster \(C_I\) much better than it would fit any other cluster. To quantify this, we compute the silhouette score of data point \(i\) as follows:

\[\begin{split}s(i) = \begin{cases}\frac{b(i)-a(i)}{\max\{a(i),b(i)\}} & \text{ if } |C_I|>1\\0 & \text{ otherwise}\end{cases}\end{split}\]

The obtained score is such that:

\[-1 \leq s(i) \leq 1\]

In practice:

  • A value of \(s(i)\) close to \(1\) is obtained when \(a(i) \ll b(i)\), meaning that \(i\) fits \(C_I\) much better than it would fit any other cluster;

  • A value of \(s(i)\) close to \(-1\) is obtained when \(a(i) \gg b(i)\), meaning that another cluster \(C_J\) exists in which \(i\) would fit better;

  • A value of \(s(i)\) close to \(0\) indicates that the data point is close to the border between two natural clusters.

The score above is arbitrarily set to \(0\) if \(|C_I|=1\), as in that case the cluster would only contain the data point \(i\); this is a neutral choice.
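The per-point scores \(s(i)\) defined above do not need to be implemented by hand: scikit-learn provides them as `silhouette_samples`. A minimal sketch, again with synthetic placeholder data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)  # one score s(i) per data point, in [-1, 1]
print(s.min(), s.mean(), s.max())
```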

The correct number of clusters is usually chosen by inspecting the silhouette plot, which displays the silhouette scores of all data points, grouped by cluster and arranged from largest to smallest. The plots below are taken from the scikit-learn documentation and show examples of silhouette plots for different choices of \(K\) on an example dataset:

../_images/d4905a3611e95874e694a173b13817bed5320bc3147546c6e13620df3e6148cc.png ../_images/c9ae75e6ae7b88460d3d78bd6a33c015f463f84d701da4532fc311f370bcc668.png ../_images/0957f560e99936f0acb8677364364e972cd4b436a4e90a243c03582652e58b8f.png ../_images/6299d0de1b96682a227bce45dd6e8109b4098acd017b7b1a8e5725d7523d15ec.png ../_images/a9dd38401a6443e27071ebe180da299e4e174c68ac8f6ae8cdbce16ba40d83ac.png

The red vertical line displays the average silhouette score. Good choices of the number of clusters yield silhouette plots in which the clusters achieve values similar to each other and close to the average. For example, \(n=2\) and \(n=4\) seem to be the best picks.

In practice, it is also useful to plot the average silhouette score versus the number of clusters and choose the value of \(K\) which maximizes the average score.

Such a plot would look as follows for the previous example:

../_images/dc460bed0fff84bbf845fc6dd12524459a19b2145767d993f73739c3e44aa1c1.png

The plot below shows another example for a different dataset:

../_images/3b9e151b3bc1e3d663f656139d2f01f71f844194eac631bbd3f4e3cb9173870f.png
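A sketch of this selection procedure, using `silhouette_score` (the average of the \(s(i)\) values) and synthetic placeholder data; note that the silhouette is undefined for \(K = 1\), so the search starts at \(K = 2\):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # average s(i) over all data points
    if score > best_score:
        best_k, best_score = k, score

print(f"Best K = {best_k} (average silhouette = {best_score:.3f})")
```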

14.3. Other Clustering Algorithms#

It is important to note that various clustering algorithms exist beyond the commonly discussed k-means method. While k-means is a popular and widely used algorithm for partitioning data into distinct groups based on similarity, it is one approach among many. Numerous other clustering algorithms offer different perspectives and address specific challenges. Some examples include hierarchical clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and spectral clustering. These alternatives provide unique strengths and are tailored to different types of data and scenarios. While our focus has been on understanding clustering in general and the k-means algorithm in particular, it is important to know that other approaches may be more suitable depending on the specific problem at hand. In the next lectures, we will see Gaussian Mixture Models (GMMs), a probabilistic model for density estimation which can also be interpreted as a clustering algorithm.
