K-Means Clustering and Bag of Visual Words
Notes by Antonino Furnari - antonino.furnari@unict.it
University of Catania, Department of Mathematics and Computer Science
Notes available at: https://antoninofurnari.github.io/lecture-notes/en/data-science-python/clustering-bovw/
In this laboratory we will see how to:
- Train a K-Means Clustering Model
- Build a Bag of Visual Words (BOVW) model;
- Use the BOVW model to represent images;
- Build a classifier using the BOVW representations;
- Build a Content Based Image Retrieval (CBIR) system;
1. K-Means Clustering
The K-Means algorithm aims to find groups (clusters) of data in a multidimensional space. Let’s see an example of how the algorithm can be used. Consider the “Old Faithful” dataset, which contains measurements on the eruptions of the Old Faithful geyser in Yellowstone National Park, USA. In particular, each record reports the duration of an eruption and the time elapsed between the current eruption and the next one. Let’s load the dataset:
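A minimal sketch of this step, assuming the dataset is available as a local CSV file named faithful.csv with columns eruptions and waiting (file name and location are assumptions):

```python
import pandas as pd

# Load the Old Faithful measurements (adjust the path to your copy of the data)
data = pd.read_csv('faithful.csv')

print("Number of records: {}".format(len(data)))
data.head()
```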
Number of records: 272
 | eruptions | waiting |
---|---|---|
0 | 3.600 | 79 |
1 | 1.800 | 54 |
2 | 3.333 | 74 |
3 | 2.283 | 62 |
4 | 4.533 | 85 |
Let’s apply a normalization by mean and standard deviation (z-scoring) to the data and plot it:
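A sketch of the normalization and plot, assuming the dataframe loaded above is called data:

```python
import matplotlib.pyplot as plt

# Z-scoring: subtract the mean and divide by the standard deviation
data_norm = (data - data.mean()) / data.std()

# Scatter plot of the normalized measurements
plt.figure(figsize=(6, 6))
plt.scatter(data_norm['eruptions'], data_norm['waiting'])
plt.xlabel('eruptions')
plt.ylabel('waiting')
plt.grid()
plt.show()
```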
Question 1 Does the data shown above have any particular characteristics?
Now let’s train the K-Means algorithm by specifying $2$ as the number of clusters:
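For instance, using scikit-learn (a sketch; data_norm is the normalized dataframe from the previous step):

```python
from sklearn.cluster import KMeans

# Fit a K-Means model with 2 clusters on the normalized data
kmeans = KMeans(n_clusters=2)
kmeans.fit(data_norm)
```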
KMeans(n_clusters=2)
The K-Means algorithm assigned each training sample to one of the clusters. We can access the assignments as follows:
[0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1
0 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 0 0
1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0
1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 1
0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1
0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
1 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
0 0 0 1 0 1 1 0 0 1 0 1 0]
Each element has been assigned to one of two clusters identified with $1$ or $0$. For convenience, let’s build a new dataframe in which we rename the variables to “X” and “Y” and add a variable “C” indicating the cluster each record has been associated with:
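One possible way to build such a dataframe (variable names are assumptions):

```python
import pandas as pd

# Rename the variables and add the cluster assignments found by K-Means
clustered = pd.DataFrame({
    'X': data_norm['eruptions'].values,
    'Y': data_norm['waiting'].values,
    'C': kmeans.labels_
})
clustered.head()
```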
 | X | Y | C |
---|---|---|---|
0 | 0.098318 | 0.596025 | 0 |
1 | -1.478733 | -1.242890 | 1 |
2 | -0.135612 | 0.228242 | 0 |
3 | -1.055558 | -0.654437 | 1 |
4 | 0.915755 | 1.037364 | 0 |
Let’s now plot the data, visualizing which cluster each point has been assigned to:
Question 2 Is the cluster assignment plausible? Run the K-Means algorithm several times. Does the solution change? Why? Are the solutions found equivalent?
We can access the centroids found by the K-Means algorithm as follows:
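With scikit-learn, the centroids are exposed by the cluster_centers_ attribute of the fitted model:

```python
# The centroids found by K-Means (one row per cluster)
kmeans.cluster_centers_
```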
array([[ 0.70839746, 0.67549972],
[-1.25776692, -1.19935664]])
Similarly to a Nearest Neighbor algorithm, K-Means also implicitly defines decision regions and a decision boundary. Let’s visualize them in our case:
1.1 Example with the DIGITS dataset
Now let’s use a larger dataset with higher-dimensional samples. Consider the “digits” dataset of handwritten digits provided by scikit-learn:
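A sketch of how the dataset can be loaded with scikit-learn (the variable names X and y are assumptions used in the following snippets):

```python
from sklearn.datasets import load_digits

# Load the digits dataset: each sample is a flattened 8x8 grayscale image
digits = load_digits()
X = digits.data
y = digits.target

print("Number of samples: {}, Number of dimensions: {}".format(X.shape[0], X.shape[1]))
```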
Number of samples: 1797, Number of dimensions: 64
Each element of the dataset is a 64-dimensional vector representing an $8 \times 8$ image. Therefore, we can visualize each sample as an image by means of a reshape operation. For example:
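A sketch (assuming the data matrix is stored in X, as above):

```python
import matplotlib.pyplot as plt

# Reshape the first sample into an 8x8 image and display it
plt.imshow(X[0].reshape(8, 8), cmap='gray')
plt.show()
```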
Let’s assume for a moment that the dataset is not labeled, but that we still know it contains images of the 10 handwritten digits. Let’s use the K-Means algorithm to split the data into $10$ groups:
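A sketch of this step (the variable name kmeans10 is an assumption reused in the following snippets):

```python
from sklearn.cluster import KMeans

# Fit K-Means with 10 clusters on the digits data
kmeans10 = KMeans(n_clusters=10)
kmeans10.fit(X)
```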
KMeans(n_clusters=10)
The K-Means algorithm found $10$ centroids and assigned each training sample to one of them. The centroids have the same dimensionality as the input data ($64$ elements) and therefore they too can be seen as $8 \times 8$ images. In particular, each centroid is the average image of all images that have been associated with that cluster. Let’s visualize the 10 centroids as images:
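A sketch of the visualization, assuming the fitted model is stored in kmeans10:

```python
import matplotlib.pyplot as plt

# Show each 64-dimensional centroid as an 8x8 image
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(kmeans10.cluster_centers_[i].reshape(8, 8), cmap='gray')
    ax.set_title('Cluster {}'.format(i))
    ax.axis('off')
plt.show()
```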
We would expect that the K-Means has put in a given cluster all and only the elements belonging to a given class (for example, a cluster should contain all and only the samples with the label “0”). If so, the cluster is said to be “pure”. Different clusters can have different degrees of purity.
Question 3 Looking at the centroid images, can we predict which clusters are purer and which are less pure?
We can measure the purity of a cluster as follows:
- We determine the most frequent label among those of the training samples assigned to the cluster;
- We count the fraction of elements assigned to the cluster that have the label corresponding to the most frequent one.
Let’s see how to do it:
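A sketch of the purity computation, assuming the labels are stored in y and the fitted model in kmeans10:

```python
import numpy as np

# For each cluster: find the most frequent ground-truth label among the samples
# assigned to it, then compute the fraction of samples with that label (purity)
for c in range(10):
    labels_in_cluster = y[kmeans10.labels_ == c]
    most_frequent = np.bincount(labels_in_cluster).argmax()
    purity = (labels_in_cluster == most_frequent).mean()
    print("Cluster {}. Most frequent label: {}. Purity: {:0.2f}".format(c, most_frequent, purity))
```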
Cluster 0. Most frequent label: 2. Purity: 0.85
Cluster 1. Most frequent label: 1. Purity: 0.62
Cluster 2. Most frequent label: 6. Purity: 0.97
Cluster 3. Most frequent label: 7. Purity: 0.84
Cluster 4. Most frequent label: 9. Purity: 0.56
Cluster 5. Most frequent label: 8. Purity: 0.45
Cluster 6. Most frequent label: 4. Purity: 0.98
Cluster 7. Most frequent label: 5. Purity: 0.91
Cluster 8. Most frequent label: 3. Purity: 0.87
Cluster 9. Most frequent label: 0. Purity: 0.99
Let’s now display this information together with the images related to the centroids:
Question 4 Do the predictions made previously match the purity values found? Why are some clusters purer and others less pure?
2. Bag of Visual Words
2.1 Dataset
For this laboratory, we will use a simple dataset of images belonging to $8$ different scene types: coast, mountain, forest, open country, street, inside city, tall buildings, highways. Let’s download the data from the following URL:
After downloading, extract all files into a scenes directory. This directory should contain $2688$ images. Let’s retrieve the list of filenames using the glob function:
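A sketch of this step (assuming the images have been extracted into a scenes directory in the current working path):

```python
from glob import glob

# List all jpg images inside the "scenes" directory
filenames = glob('scenes/*.jpg')
print(filenames[:10])
```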
['scenes/mountain_natu749.jpg', 'scenes/highway_gre141.jpg', 'scenes/insidecity_art1496.jpg', 'scenes/highway_bost389.jpg', 'scenes/highway_gre155.jpg', 'scenes/insidecity_hous28.jpg', 'scenes/opencountry_land352.jpg', 'scenes/opencountry_fie24.jpg', 'scenes/opencountry_fie30.jpg', 'scenes/insidecity_hous14.jpg']
As we can see, the class (e.g., “coast”) is contained as part of the filename. We can “extract” the correct class from each filename as follows:
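For instance, the class name is the part of the filename that precedes the first underscore:

```python
from os.path import basename

# "scenes/mountain_natu749.jpg" -> "mountain"
classes = [basename(f).split('_')[0] for f in filenames]
print(classes[:10])
```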
['mountain', 'highway', 'insidecity', 'highway', 'highway', 'insidecity', 'opencountry', 'opencountry', 'opencountry', 'insidecity']
We can now create a DataFrame including filenames and related classes:
 | filename | class |
---|---|---|
0 | scenes/mountain_natu749.jpg | mountain |
1 | scenes/highway_gre141.jpg | highway |
2 | scenes/insidecity_art1496.jpg | insidecity |
3 | scenes/highway_bost389.jpg | highway |
4 | scenes/highway_gre155.jpg | highway |
We will need the data to be split into training and test set. This can be easily done as follows:
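A sketch using train_test_split from scikit-learn (the DataFrame name dataset and the random seed are assumptions; a 25% test size is consistent with the sizes reported below):

```python
from sklearn.model_selection import train_test_split

# Split into training (75%) and test (25%) sets; "dataset" is the DataFrame
# with "filename" and "class" columns created above
train_set, test_set = train_test_split(dataset, test_size=0.25, random_state=42)
print(len(train_set), len(test_set))
train_set.head()
```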
2016 672
 | filename | class |
---|---|---|
2065 | scenes/street_art762.jpg | street |
633 | scenes/mountain_nat1207.jpg | mountain |
1955 | scenes/opencountry_land617.jpg | opencountry |
970 | scenes/street_urb244.jpg | street |
515 | scenes/coast_n672003.jpg | coast |
We can now write a simple function to show some examples. To load the images, we will use the io module of the scikit-image library for image processing:
2.1.1 Data Subset
We are going to use this data to demonstrate how to build and use a Bag of Visual Words model. For the purpose of this laboratory, let’s consider a subset of just a few images. This is to avoid long waiting times while experimenting with the model. Let’s consider a subset containing just $5\%$ of the images:
135
The new small subset contains just $135$ images. Let’s split it into training and test set:
114 21
2.2 Building the Bag of Visual Words Model
We will now see how to build a Bag of Visual Words model to properly represent images. This will involve the following steps, as shown in the image below:
- Extracting all patches from training images;
- Describing all patches using the SIFT descriptor;
- Clustering the descriptors using K-Means;
2.3 Patch Extraction and Description
We will start with patch extraction and description. For this purpose, we will use cyvlfeat, a Python wrapper to the VLFeat library (http://vlfeat.org/), which was originally designed for MATLAB and C. We can install the library by typing the following command:
conda install -c conda-forge cyvlfeat
As can be seen from the official page of cyvlfeat (https://github.com/menpo/cyvlfeat), the wrapper gives access to many features of VLFeat. We will use only the dsift function. This function conveniently allows us to extract and describe all patches in an image in a single pass. This happens by first computing the gradients of the whole image and then sliding a window over the image and computing the SIFT descriptor using the gradients under the window. This is shown in the image below:
As shown in the image, the dsift function accepts the following parameters:
- bounds: the coordinates of a bounding box from which to extract patches. This is needed when we want to sample patches only from a part of the image. By default, it is set to the whole image;
- size: the size of a single SIFT cell, which is set to $3$ by default. Considering that a SIFT descriptor is computed over a $4 \times 4$ grid of cells, the default patch size is $4 \times 3 = 12$ pixels;
- step: the sampling step. It is set to $1$ by default, which moves the sliding window by $1$ pixel at a time.
Let’s try to load an image and use this function to extract SIFT descriptors (visual words) from it:
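A sketch of this step (the image chosen here is arbitrary; as_gray=True loads the image as a single-channel float array, which is then cast to float32 for dsift):

```python
import numpy as np
from skimage import io
from cyvlfeat.sift import dsift

# Load an image as grayscale and extract dense SIFT descriptors
# using the default parameters
im = io.imread(filenames[0], as_gray=True).astype(np.float32)
positions, descriptors = dsift(im)
```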
The dsift function returns two outputs: the positions of the extracted patches in the image and the extracted descriptors:
(61009, 2)
(61009, 128)
[[4.5 4.5]
[4.5 5.5]
[4.5 6.5]]
Question 5 Why is the number of columns of descriptors equal to $128$?
By using the default parameters, we have extracted a large number of patch descriptors from a single image. We can play with the available parameters to reduce the number of patches extracted. For instance, let’s extract patches at a step of $10$ pixels and with a bin size of $5$:
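For instance (reusing the image loaded in the previous sketch):

```python
# Sample a patch every 10 pixels, with a SIFT cell size of 5 pixels
positions, descriptors = dsift(im, step=10, size=5)
print(positions.shape, descriptors.shape)
```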
(625, 2) (625, 128)
Question 6 Do the extracted patches have some overlap? How many pixels are in common between two patches extracted one after another? What is the size of each patch? What is the size of the descriptors? Is it the same as before? Why?
To have an idea of how patches extracted from the same image can have different descriptors, let’s select two random patches and plot their descriptors:
Question 7 Compare the two plots. Are the descriptors similar? Try to pick another pair. Is the situation different? How many bars are in each plot?
We can extract and describe patches from different images as follows:
(625, 128) (625, 128) (625, 128)
To create a vocabulary of visual words, we will need to put together the patch descriptors computed over all the images of the dataset. We can concatenate the features of two or more images as follows:
(1875, 128)
We can now write a function that extracts all descriptors from all the images contained in a DataFrame:
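A sketch of such a function, assuming the training portion of the small subset is stored in a DataFrame called train_set (tqdm is used only to display a progress bar):

```python
import numpy as np
from tqdm import tqdm
from skimage import io
from cyvlfeat.sift import dsift

def extract_features(dataframe, step=10, size=5):
    """Extract and stack dense SIFT descriptors from all images in a DataFrame."""
    descriptors_list = []
    for filename in tqdm(dataframe['filename'], 'Extracting/Describing Patches'):
        im = io.imread(filename, as_gray=True).astype(np.float32)
        _, descriptors = dsift(im, step=step, size=size)
        descriptors_list.append(descriptors)
    return np.vstack(descriptors_list)

train_features = extract_features(train_set)
print(train_features.shape)
```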
Extracting/Describing Patches: 100%|██████████| 114/114 [00:02<00:00, 38.42it/s]
(71250, 128)
Question 8 Try with different parameters for step and size. Does the speed at which patches are extracted/described change? Do we obtain more/less features? Why?
2.4 Creating the Visual Vocabulary
We now need to create a vocabulary of visual words. First, we need to decide the size of our vocabulary. We will use $500$, but larger vocabularies can be used. To create the vocabulary, we have to cluster all training descriptors using the K-Means algorithm. In particular, we will use MiniBatchKMeans from scikit-learn, which is an implementation of K-Means optimized to work with large sets of data:
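A sketch of the clustering step (train_features is the matrix of training descriptors obtained above):

```python
from sklearn.cluster import MiniBatchKMeans

# Cluster all training descriptors into a vocabulary of 500 visual words
kmeans = MiniBatchKMeans(n_clusters=500)
kmeans.fit(train_features)
```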
MiniBatchKMeans(n_clusters=500)
We can access the cluster centroids (the visual words) as follows:
(500, 128)
Each visual word has the same shape as a descriptor and contains $128$ dimensions.
Question 9 Try fitting a K-Means with more visual words. Is the process slower? Why?
Let’s now choose $6$ random visual words and plot them:
The K-Means object also allows us to associate a new descriptor with the corresponding cluster. This is done using the predict method:
array([174], dtype=int32)
Question 10 How is the “corresponding cluster” selected by the K-Means object?
2.5 Representing images with the Bag of Visual Words Model
We can now use the bag of visual words model to represent images. Given an input image, this can be done following these steps:
- We extract and describe all the patches contained in the image. At this stage, it usually makes sense to use the same parameters used to extract the training descriptors;
- Each descriptor is associated with the closest visual word in the vocabulary;
- A Bag of Words representation is obtained from the extracted “tokens”.
This is illustrated in the image below.
Let’s start by writing a function which extracts “tokens” from an image:
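A sketch of such a function, here called load_and_describe (the same name used later with the text vectorizers). Its body follows the steps described above and assumes the MiniBatchKMeans model is stored in kmeans:

```python
import numpy as np
from skimage import io
from cyvlfeat.sift import dsift

def load_and_describe(filename, step=10, size=5):
    """Load an image, extract dense SIFT descriptors and map each
    descriptor to the index of the closest visual word (a "token")."""
    im = io.imread(filename, as_gray=True).astype(np.float32)
    _, descriptors = dsift(im, step=step, size=size)
    return kmeans.predict(descriptors)
```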
Let’s now try to use this function on the first training image:
array([174, 63, 267, 134, 96, 348, 61, 119, 163, 267, 96, 143, 63,
328, 96, 378, 184, 36, 322, 17, 96, 211, 31, 60, 267, 275,
246, 96, 174, 96, 184, 31, 96, 211, 329, 429, 348, 184, 149,
149, 370, 149, 331, 288, 293, 322, 211, 296, 407, 36], dtype=int32)
We can create a first representation by counting the number of instances of each token. This can be done using the histogram function of numpy. Since we want the histogram to be normalized, we can pass the parameter density=True (called normed=True in older NumPy versions):
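For example, for the first training image (a sketch; the 500 unit-width bins correspond to the 500 visual words):

```python
# Extract the tokens of the first training image
tokens = load_and_describe(train_set['filename'].iloc[0])

# Normalized histogram over the 500 visual words
representation, _ = np.histogram(tokens, bins=np.arange(501), density=True)
print(representation.shape, representation.sum())
```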
We can now load another image, compute its BOVW representation and compare it with the one we have just obtained:
Question 11 Are the two representations similar? What would happen if we picked two images from the same class?
2.5.1 Representing all training and test images
We have now seen a simple way to compute Bag of Visual Words representations. However, this can be done more easily using the scikit-learn API for text processing. In particular, the load_and_describe function can be used as an image tokenizer with a CountVectorizer object to obtain visual word counts. Also, since we know that the vocabulary contains exactly 500 visual words and that tokens will be numbers between $0$ and $499$, we can pass this information to the count vectorizer with the parameter vocabulary. Let’s create a CountVectorizer and fit it to the training set:
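A sketch of this step: filenames are passed as “documents” and load_and_describe acts as the tokenizer (lowercase=False avoids altering the filenames before tokenization):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each filename is a "document"; load_and_describe turns it into a
# sequence of visual word indices ("tokens")
count_vect = CountVectorizer(
    tokenizer=load_and_describe,
    vocabulary=range(500),
    lowercase=False
)
x_train_counts = count_vect.fit_transform(train_set['filename'])
x_train_counts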
<114x500 sparse matrix of type '<class 'numpy.int64'>'
with 20409 stored elements in Compressed Sparse Row format>
Note that fitting the CountVectorizer object will take care of loading all images, extracting patches and describing them, matching descriptors to visual words, and counting the number of instances of each visual word in each image. We can use the CountVectorizer to obtain visual word counts as follows:
(1, 500)
We can finally obtain a bag of words representation by applying some normalization. To simply normalize the counts as done in the previous example, we can use TF-IDF without the IDF component and specify the ‘l1’ norm:
TfidfTransformer(norm='l1', use_idf=False)
We can now compute the BOVW representation as:
1.000000000000001
Count vectorization and normalization can be performed in one step using the TfidfVectorizer object. Let’s see an example with TF-IDF:
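A sketch consistent with the object shown below:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Count vectorization + TF-IDF normalization in a single object
tfidf_vect = TfidfVectorizer(
    tokenizer=load_and_describe,
    vocabulary=range(500),
    lowercase=False
)
tfidf_vect
```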
TfidfVectorizer(lowercase=False,
tokenizer=<function load_and_describe at 0x7f84189d7040>,
vocabulary=range(0, 500))
We can now use this object to represent training and test images:
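A sketch of this step (y_train and y_test are the class labels taken from the DataFrames):

```python
# Compute TF-IDF BOVW representations for training and test images
x_train = tfidf_vect.fit_transform(train_set['filename'])
y_train = train_set['class'].values

x_test = tfidf_vect.transform(test_set['filename'])
y_test = test_set['class'].values

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```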
(114, 500) (114,) (21, 500) (21,)
We can now apply the machine learning algorithms we have seen in the previous laboratories to this data. For instance, let’s train a logistic regressor to classify the images:
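A sketch (the max_iter value is an arbitrary choice to help convergence):

```python
from sklearn.linear_model import LogisticRegression

# Train a logistic regression classifier on the BOVW representations
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
```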
We can evaluate the predictions using the accuracy:
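For instance:

```python
from sklearn.metrics import accuracy_score

print("Training accuracy: {:0.2f}".format(accuracy_score(y_train, lr.predict(x_train))))
print("Testing accuracy: {:0.2f}".format(accuracy_score(y_test, lr.predict(x_test))))
```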
Training accuracy: 0.89
Testing accuracy: 0.71
2.6 Content-Based Image Retrieval (CBIR)
Now that we can represent images, we can build a Content-Based Image Retrieval (CBIR) system. Given a query image, our CBIR system will find images in the database which have similar content, based on the image representation. To this aim, we will use our training set as the database and the test set as the set of query images.
Given a query image, we hence need a way to compute distances between the query representation and the representations of all images in the database. As this can be computationally expensive, we will use an efficient indexing structure called K-D Tree (https://en.wikipedia.org/wiki/K-d_tree). In particular, we will use the KDTree object from scikit-learn:
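A sketch of the indexing step (KDTree requires a dense array, hence the toarray() call on the sparse TF-IDF matrix):

```python
from sklearn.neighbors import KDTree

# Index the database (training set) representations with a K-D Tree
tree = KDTree(x_train.toarray())
```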
Now the KDTree has “memorized” and “indexed” the database of representations in order to be able to find the points closest to a query in an efficient way. Let’s consider a query image with the related representation:
We can find the index of the most similar image in the database using the query method:
[[1.09494036]] [[24]]
Let’s load and show this image:
Question 12 Compare the query image and the matched ones. Are they similar? Are their representations similar?
We can get the ids of the closest k images by specifying a k parameter:
(array([[1.09494036, 1.13737878, 1.14285961]]), array([[ 24, 103, 68]]))
This will return the distances to the closest images in the database and the related indices.
2.7 Saving to File
The models and features can be saved to file for later use. This is particularly useful when the computation takes a lot of time. For instance, we can save what we have computed so far as follows:
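A sketch using pickle (the file name is an assumption; the saved keys match the ones shown below):

```python
import pickle

# Save the BOVW representations and the visual vocabulary model to disk
with open('bovw_data.pkl', 'wb') as f:
    pickle.dump({
        'x_train': x_train,
        'x_test': x_test,
        'kmeans': kmeans
    }, f)
```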
The data saved to file can be read again using pickle:
dict_keys(['x_train', 'x_test', 'kmeans'])
References
[1] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual Categorization with Bags of Keypoints. Workshop on Statistical Learning in Computer Vision. ECCV 1 (1–22), 1–2.
[2] K-Means (scikit-learn): http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
[3] Bag of Visual Words models: http://www.robots.ox.ac.uk/~az/icvss08_az_bow.pdf
[4] CBIR: https://it.wikipedia.org/wiki/CBIR
[5] K-d tree: https://en.wikipedia.org/wiki/K-d_tree
[6] Pickle: https://wiki.python.org/moin/UsingPickle
Exercises
Exercise 1 Consider the Boston dataset. Apply the K-Means to the data of the variable |
Exercise 2 All the experiments in this laboratory have been performed considering a small set of images. Repeat the experiments using the whole dataset (split into training/test sets). Does the performance of the classifier improve?
Exercise 3 We have also considered a very coarse way to extract descriptors. Play with the patch extraction parameters in order to extract a larger set of descriptors and repeat all experiments on the whole dataset. Does the performance of the classifier improve?
Exercise 4 The performance of the logistic regressor for image classification is a bit disappointing. Try to improve it by experimenting with different feature normalization methods. Compare the performance with Naive Bayes and K-NN.
Exercise 5 Write a |