13. Data as N-Dimensional Points#

So far, we have seen data as a collection of observations of multiple variables. We have often represented a dataset as a table or a data matrix. In all previous analyses, we have focused on the interpretation of the values of each variable.

However, in many cases (and especially when there are many variables), it is useful to give a geometric interpretation of the data and see each observation as a d-dimensional data point, where \(d\) is the number of variables in the dataset. Consider the following data:

| Subject | Height (cm) | Weight (kg) |
|---------|-------------|-------------|
| 1       | 175         | 70          |
| 2       | 160         | 60          |
| 3       | 180         | 78          |
| 4       | 160         | 75          |
| 5       | 155         | 58          |
| 6       | 190         | 110         |

We can see each of these observations (the rows of the matrix) as a data point. Further, we can geometrically represent such points as follows:

Figure: the six observations plotted as 2D points in the height/weight plane (../_images/4cb91a1123c70ff8b38008ef6e3493c1ed0b1b641f1085218de518d93a9cd672.png)

Each observation is a 2D point in the Cartesian plane. This geometric interpretation of data is central to data analysis and is particularly convenient when many different variables are available.
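As a quick illustration, the following sketch (assuming NumPy and Matplotlib are available) plots the six observations of the table above as 2D points:

```python
import numpy as np
import matplotlib.pyplot as plt

# Each row is one observation (subject): [height in cm, weight in kg].
X = np.array([
    [175, 70],
    [160, 60],
    [180, 78],
    [160, 75],
    [155, 58],
    [190, 110],
])

plt.scatter(X[:, 0], X[:, 1])
for i, (h, w) in enumerate(X, start=1):
    plt.annotate(str(i), (h, w))  # label each point with its subject number
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Observations as 2D data points")
plt.show()
```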

13.1. Features and Representation Functions#

In this context, in which we do not always give explicit names or interpretations to the variables, each variable is also referred to as a feature of the data. For instance, a dataset of images of \(100 \times 100\) pixels can be seen as a collection of \(10000\)-dimensional vectors, where each dimension of a vector is a “feature”, i.e., the value of a specific pixel of the image.

Given this interpretation of variables as features, it is common to talk about feature extraction to refer to any process transforming the data \(\mathbf{x} \in \Re^d\) into another form \(\mathbf{y} \in \Re^m\). In this context, a function:

\[f : \Re^d \to \Re^m\]

will often be referred to as a “feature extraction function” (or feature extractor) or a “representation function”. We will recall these terms later in the course.

We will also refer to \(\Re^d\) and \(\Re^m\) as “feature spaces” or “representation spaces” as they will be vector spaces to which data points (also called feature vectors) belong.
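As a minimal sketch of a representation function \(f : \Re^d \to \Re^m\), the function below maps a \(100 \times 100\) image (seen as a \(10000\)-dimensional data point) to a 2-dimensional feature vector; the specific features (mean and standard deviation of the pixel values) are an illustrative assumption, not a standard choice:

```python
import numpy as np

def extract_features(x: np.ndarray) -> np.ndarray:
    """Map a d-dimensional data point to an m-dimensional feature vector.

    Here d = 10000 (a flattened 100x100 image) and m = 2: the features are
    simply the mean and the standard deviation of the pixel values.
    """
    x = np.asarray(x, dtype=float).ravel()  # flatten to a d-dimensional vector
    return np.array([x.mean(), x.std()])    # y in R^2

image = np.random.rand(100, 100)  # a random "image" standing in for real data
y = extract_features(image)
print(y.shape)  # (2,)
```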

13.2. Important Properties of Feature Spaces#

Since observations \(\mathbf{x}\) are vectors living in some vector space, all the basic properties of vectors and vector spaces introduced in linear algebra apply here as well. We will recall the most important concepts as we need them in the course but, for the moment, it is useful to summarize the main properties.

13.2.1. Norms#

Given a vector space \(S\), a norm \(p\) is a function from \(S\) to the non-negative real numbers:

\[p : S \to \Re^+_0\]

A commonly used family of norms is the family of L-p norms:

\[||\mathbf{x}||_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}\]

where \(\mathbf{x}\) is d-dimensional, \(x_i\) is the i-th component of \(\mathbf{x}\), and \(p\) is a parameter defining the behavior of the norm.

Note that a norm computes some kind of measure of distance of the vector from the origin of the space. In practice, the following norms are commonly used:

L-2 Norm

\[||\mathbf{x}||_2 = \sqrt{\sum_{i=1}^{d} x_i^2}\]

This is what we know as “computing the magnitude (or modulus) of the vector”. It is the Euclidean distance between the origin and the vector.

When we use the L2 norm, we can omit the \(p=2\) subscript:

\[||\mathbf{x}||_2 = ||\mathbf{x}||\]

Another commonly used notation is the one for the squared L2 norm:

\[||\mathbf{x}||_2^2 = ||\mathbf{x}||^2 = \sum_{i=1}^{d} x_i^2\]

L-1 Norm

\[||\mathbf{x}||_1 = \sum_{i=1}^{d} |x_i|\]

This is the sum of the absolute values of the components of the vector.

L-\(\infty\) Norm

\[||\mathbf{x}||_\infty = \max\{|x_i|\}_{i=1}^{d}\]

This is the maximum absolute value of the components of the vector.
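The three norms above can be computed with NumPy as in the following sketch (the vector \((3, -4)\) is just an example):

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                # default ord=2: sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(x, ord=1)         # |3| + |-4| = 7.0
linf = np.linalg.norm(x, ord=np.inf)  # max(|3|, |-4|) = 4.0

print(l1, l2, linf)  # 7.0 5.0 4.0
```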

To visualize the difference between the norms, it is common to display the unit circles according to the different norms. Each shape contains all the vectors with unit norm:

Figure: unit circles according to the different norms (../_images/708b4a81262ac6a6763a5ecfef4762de94f1da07c6bd5062b1999408b7c1e758.png)
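A figure similar to the one above can be reproduced with the following sketch, which rescales directions sampled on the standard circle so that their L-p norm is exactly one (the choice of norms to display is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
U = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # directions on the L2 circle

fig, ax = plt.subplots(figsize=(5, 5))
for p, label in [(1, "L-1"), (2, "L-2"), (np.inf, "L-inf")]:
    norms = np.linalg.norm(U, ord=p, axis=1)
    V = U / norms[:, None]  # rescale each direction to have unit L-p norm
    ax.plot(V[:, 0], V[:, 1], label=label)

ax.set_aspect("equal")
ax.legend()
ax.set_title("Unit circles according to different L-p norms")
plt.show()
```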

13.2.2. Metrics#

Given a space \(S\), a function

\(m : S \times S \to \Re\)

is a metric if the following properties are satisfied:

  1. The distance of a point from itself is zero: \(m(\mathbf{x}, \mathbf{x}) = 0\);

  2. The distance between two distinct points is always positive (positivity): \(m(\mathbf{x},\mathbf{y}) > 0, \forall \mathbf{x} \neq \mathbf{y} \in S\);

  3. The distance between \(\mathbf{x}\) and \(\mathbf{y}\) is the same as the distance between \(\mathbf{y}\) and \(\mathbf{x}\) (symmetry): \(m(\mathbf{x}, \mathbf{y}) = m(\mathbf{y}, \mathbf{x})\);

  4. Triangle inequality: \(m(\mathbf{x},\mathbf{y}) \leq m(\mathbf{x},\mathbf{z}) + m(\mathbf{z},\mathbf{y}), \forall \mathbf{x}, \mathbf{y}, \mathbf{z} \in S\).
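As a quick numerical sanity check (a sketch, not a proof), we can verify the four properties above for the Euclidean distance on a few random points:

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 4))  # three random 4-dimensional points

assert np.isclose(euclidean(x, x), 0.0)                      # 1. m(x, x) = 0
assert euclidean(x, y) > 0                                   # 2. positivity (x != y)
assert np.isclose(euclidean(x, y), euclidean(y, x))          # 3. symmetry
assert euclidean(x, y) <= euclidean(x, z) + euclidean(z, y)  # 4. triangle inequality
```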

13.2.2.1. L-p Metrics#

From the L-p norms, we can derive a family of metrics as follows:

\[L_p(\mathbf{x},\mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_p = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}\]

L2 Distance

Note that with \(p=2\) we have the Euclidean (or L-2) distance:

\[L_2(\mathbf{x},\mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}\]

L1 Distance

With \(p=1\), we obtain the L1 distance, which is also known as the Manhattan distance:

\[L_1(\mathbf{x},\mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_1 = \sum_{i=1}^{d} |x_i - y_i|\]

L-\(\infty\) Distance

With \(p = \infty\), we obtain the L-\(\infty\) distance:

\[L_\infty(\mathbf{x},\mathbf{y}) = ||\mathbf{x} - \mathbf{y}||_\infty = \max_{i=1}^d\{|x_i - y_i|\}\]

The difference between the L1 and L2 distances is notable, as shown in the figure below:

Figure: Manhattan (L1), Euclidean (L2), and L-\(\infty\) distances between two points: L1 = 7.00, L2 = 5.00, L-\(\infty\) = 4.00 (../_images/9875267b3656cd73af5b63e55da1108a69cf4731c852cea3aae17b670605f9ba.png)

While the Euclidean distance measures the length of the straight segment connecting the two points, the Manhattan distance is the one shown in grey (dashed), which measures the distance a taxi driver would have to drive in Manhattan (or any other city based on square blocks) to reach the destination.
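The distances reported in the figure can be reproduced with the sketch below; the two points \((0, 0)\) and \((3, 4)\) are an assumption consistent with the values shown above:

```python
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])  # hypothetical points matching the reported distances

print(np.linalg.norm(x - y, ord=1))       # Manhattan (L1) distance: 7.0
print(np.linalg.norm(x - y))              # Euclidean (L2) distance: 5.0
print(np.linalg.norm(x - y, ord=np.inf))  # L-inf distance: 4.0
```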

13.2.2.2. Cosine Distance#

The cosine distance is useful when we need to compare two vectors without caring about differences arising from scaling factors. Note that the Euclidean distance between two proportional vectors is in general non-zero:

\[||\mathbf{x} - \alpha \mathbf{x}||^2 > 0, \quad \alpha \neq 1, \mathbf{x} \neq \mathbf{0}\]

Nevertheless, \(\mathbf{x}\) and \(\alpha \mathbf{x}\) are very similar. If we want to compare two vectors considering only the relationships between their coordinates, rather than their values, we can use the cosine similarity, which is defined as follows:

\[S_C (\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}|| ||\mathbf{y}||}\]

The cosine similarity computes the cosine of the angle \(\theta\) between the two vectors, as shown in the plot below:

Figure: the angle \(\theta\) between two vectors (../_images/e64e2c520cf9e4504e6b8df2037730c271b45a5bcfec47f00c3f284020ab8af2.png)

Note that the cosine similarity would be the same if the two vectors had different scales (i.e., if they were longer or shorter but had the same orientation).

This kind of similarity measure is useful when the scale of the vectors is not important. For instance, if \(\mathbf{x}\) and \(\mathbf{y}\) are vectors of word counts of two documents of different lengths, we care about the proportions of words in each document, while longer documents will have vectors with larger L2 norms.

The cosine similarity is a normalized number between \(-1\) and \(1\), where:

  • \(+1\) means maximum alignment (similarity) between the two vectors;

  • \(-1\) means maximum dissimilarity: the two vectors point in opposite directions;

  • \(0\) means that the two vectors are orthogonal.

The plot below shows different examples of cosine similarity of vector pairs:

Figure: examples of cosine similarity for different vector pairs (../_images/196b6c029f390b83b1d4259ea91b0c7b50c7948d8ad6313018d8c3c8bbcbad9e.png)

The cosine distance is defined from the cosine similarity as:

\[D_C(\mathbf{x}, \mathbf{y}) = 1 - S_C(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}|| \, ||\mathbf{y}||}\]

It should be noted that the cosine distance is not a metric, as it does not satisfy the triangle inequality.
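A minimal sketch of the cosine similarity and distance with NumPy; the two word-count-like vectors are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

# Two "word count" vectors with the same proportions but different lengths.
x = np.array([2.0, 1.0, 0.0])
y = 10 * x  # a longer "document" with the same word proportions

print(cosine_similarity(x, y))  # ~1.0: same orientation, scale is ignored
print(cosine_distance(x, y))    # ~0.0
print(np.linalg.norm(x - y))    # > 0: the Euclidean distance is not scale-invariant
```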
