Recommendation Systems, Sentiment Analysis
Notes by Antonino Furnari - antonino.furnari@unict.it
University of Catania, Department of Mathematics and Computer Science
Notes available at: https://antoninofurnari.github.io/lecture-notes/en/data-science-python/recommendation-systems-sentiment-analysis/
In this laboratory we will see how to:
- Build a recommendation system based on collaborative filtering;
- Use word embeddings with SpaCy;
- Perform sentiment analysis with VADER;
1. Recommendation Systems
1.1 Dataset
We will use a dataset of movies rated by users. We can load the data directly from its URL as follows:
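A minimal sketch of the loading step (the CSV URL below is a placeholder, not the original location of the data):

```python
import pandas as pd

# load the ratings table; the URL is a placeholder
data = pd.read_csv('https://example.com/movielens_ratings.csv')
data.info()
data.head()
```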
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100003 entries, 0 to 100002
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 100003 non-null int64
1 item_id 100003 non-null int64
2 rating 100003 non-null int64
3 item_title 100003 non-null object
dtypes: int64(3), object(1)
memory usage: 3.1+ MB
 | user_id | item_id | rating | item_title |
---|---|---|---|---|
0 | 0 | 50 | 5 | Star Wars (1977) |
1 | 290 | 50 | 5 | Star Wars (1977) |
2 | 79 | 50 | 4 | Star Wars (1977) |
3 | 2 | 50 | 5 | Star Wars (1977) |
4 | 8 | 50 | 5 | Star Wars (1977) |
The dataset contains $100003$ observations. Each observation consists of a rating given by a user to a movie. Specifically, the variables have the following meaning:
- `user_id`: the id of the user rating an item;
- `item_id`: the id of the item (a movie);
- `rating`: a number between $1$ and $5$ indicating the rating given by the user to the movie;
- `item_title`: the title of the rated movie.
Each user can rate an item at most once, hence different lines relate either to different users rating the same item or to different items rated by the same user. Let’s count how many unique users and items we have in the dataset:
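Counting unique values can be done as follows:

```python
print(f"Number of users: {data['user_id'].nunique()}")
print(f"Number of items: {data['item_id'].nunique()}")
```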
Number of users: 944
Number of items: 1682
Question 1 How many rows would we have in the dataset if each user had rated each item? If we constructed a utility matrix, would it be sparse or dense?
Before proceeding, let’s split the dataset into a training and a test set:
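A possible split (the test fraction and random seed here are assumptions):

```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.25, random_state=42)
```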
Let’s now build the utility matrix. Recall that it is a matrix showing how a user has rated a given item. We can create the utility matrix using a pivot table:
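A sketch of the pivot operation:

```python
# rows: users, columns: items, values: ratings (NaN where no rating exists)
utility = train.pivot_table(index='user_id', columns='item_id', values='rating')
utility.head()
```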
item_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 5.0 | 3.0 | 4.0 | 3.0 | 3.0 | 5.0 | NaN | NaN | 5.0 | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 1638 columns
The matrix is sparse: it contains `NaN` values for the user-movie pairs for which we do not have a rating.
1.2 Collaborative Filtering
We will now implement a user-user collaborative filter. Remember that a user-user collaborative filter works as follows:
- Consider a user $x_i$ and an item $s_j$ which has not been rated by user $x_i$;
- Build a profile for each user by considering the rows of the utility matrix, normalized by subtracting their means;
- Find a set $N$ of similar users who have rated item $s_j$;
- Estimate the utility value $u(x_i, s_j)$ by computing a weighted average of the ratings given by the similar users.
Let’s first see an example for `N=3`, `x_i=0` and `s_j=1`:
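For instance:

```python
N = 3    # number of most similar users to consider
x_i = 0  # the target user
s_j = 1  # the item whose rating we want to predict
```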
First, we need to compute user profiles for all users. We can do this by normalizing each row by subtracting its mean (computed on the observed ratings only) and then replacing all missing values with zeros:
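One way to build the profiles:

```python
# subtract each user's mean (computed on observed ratings only), then fill NaNs with 0
profiles = utility.sub(utility.mean(axis=1), axis=0).fillna(0)
profiles.head()
```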
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 1.300518 | -0.891304 | 1.285714 | -1.315789 | 0.253968 | 1.426829 | 0.0 | 0.0 | 0.722222 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.300518 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 1642 columns
Question 2 Why do we need to subtract the mean to obtain user profiles?
Now we should compute the cosine similarity between the row corresponding to $x_i$ and all the other rows:
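A sketch using scikit-learn's `cosine_similarity`:

```python
from sklearn.metrics.pairwise import cosine_similarity

# similarity between the profile of x_i and all user profiles
similarities = pd.Series(
    cosine_similarity(profiles.loc[[x_i]], profiles)[0],
    index=profiles.index)
similarities.head()
```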
user_id
0 1.000000
1 0.004347
2 0.087988
3 0.000000
4 0.116171
dtype: float64
The $j^{th}$ row reports the similarity between the $j^{th}$ profile and the profile of user $x_i$. The largest similarity ($1$) is the one between $x_i$ and itself. Let’s remove this row:
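For instance:

```python
# drop the similarity of x_i with itself
similarities = similarities.drop(x_i)
similarities.head()
```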
user_id
1 0.004347
2 0.087988
3 0.000000
4 0.116171
5 0.000000
dtype: float64
We should now select only the users who have rated item $s_j$:
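One way to obtain the indices of these users:

```python
# users with a non-missing rating for item s_j (excluding x_i itself)
raters = utility[s_j].dropna().index.drop(x_i, errors='ignore')
raters
```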
Int64Index([ 1, 2, 5, 6, 10, 13, 17, 20, 21, 25,
...
923, 924, 927, 929, 930, 932, 933, 936, 938, 941],
dtype='int64', name='user_id', length=355)
Question 3 Why do we need to select only the users who have rated the item under consideration?
Let’s now select the similarity values corresponding to the selected users:
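For example:

```python
similarities = similarities.loc[raters]
similarities.head()
```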
user_id
1 0.004347
2 0.087988
5 0.000000
6 0.011841
10 -0.108046
dtype: float64
We can now sort the similarities with `sort_values`. The last rows correspond to the most similar users; let’s select the last $N=3$ of them:
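A sketch of this step:

```python
# sort in ascending order and keep the last N rows (the most similar users)
top = similarities.sort_values().tail(N)
top
```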
user_id
395 0.207021
679 0.214723
97 0.345497
dtype: float64
We can now store the user ids in an array:
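For instance:

```python
neighbors = top.index
neighbors
```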
Int64Index([395, 679, 97], dtype='int64', name='user_id')
Let’s now see how the most similar users have rated movie $s_j$:
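For example:

```python
# ratings given by the most similar users to item s_j
ratings = utility.loc[neighbors, s_j]
print(*ratings.values)
```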
5.0 3.0 4.0
We can now estimate the rating of item $s_j$ as a weighted average of the neighbors’ ratings, using the similarities as weights:
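A sketch of the weighted average:

```python
import numpy as np

# weighted average of the neighbors' ratings, weighted by their similarity to x_i
prediction = np.sum(ratings.values * top.values) / np.sum(top.values)
prediction
```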
3.9899620761918646
For convenience, let’s now write an object to perform recommendations following the `scikit-learn` API:
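A possible implementation assembling the steps above (a sketch; the class and method names are ours, not necessarily those used to produce the outputs below, and `pd` is assumed imported as above):

```python
from sklearn.metrics.pairwise import cosine_similarity

class UserUserCollaborativeFilter:
    """A user-user collaborative filter with a scikit-learn-like fit/predict API."""
    def __init__(self, N=3):
        self.N = N

    def fit(self, ratings):
        # build the utility matrix and the mean-centered user profiles
        self.utility = ratings.pivot_table(index='user_id', columns='item_id', values='rating')
        self.profiles = self.utility.sub(self.utility.mean(axis=1), axis=0).fillna(0)
        return self

    def predict(self, user_id, item_id):
        # similarity between the target user and all other users
        sims = pd.Series(
            cosine_similarity(self.profiles.loc[[user_id]], self.profiles)[0],
            index=self.profiles.index).drop(user_id)
        # restrict to the users who rated the item, keep the N most similar
        raters = self.utility[item_id].dropna().index.drop(user_id, errors='ignore')
        top = sims.loc[raters].sort_values().tail(self.N)
        # weighted average of the neighbors' ratings
        ratings = self.utility.loc[top.index, item_id]
        return (ratings * top).sum() / top.sum()
```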
We can use the Collaborative Filter as follows:
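For example (the second user/item pair below is a hypothetical one, chosen for illustration):

```python
cf = UserUserCollaborativeFilter(N=3)
cf.fit(train)
print(cf.predict(0, 1))   # the example worked out above
print(cf.predict(15, 5))  # a second, hypothetical user/item pair
```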
3.9899620761918646
4.125945120766069
1.3 Performance Assessment
To assess the performance of our method, we can see how the filter works with the test data, for which we have the ground truth ratings:
 | user_id | item_id | rating | item_title |
---|---|---|---|---|
87184 | 510 | 326 | 4 | G.I. Jane (1997) |
98871 | 151 | 1264 | 4 | Nothing to Lose (1994) |
1956 | 1 | 265 | 4 | Hunt for Red October, The (1990) |
11529 | 546 | 219 | 5 | Nightmare on Elm Street, A (1984) |
39495 | 457 | 183 | 5 | Alien (1979) |
Let’s define a function to iterate over the rows of the test set and compute the related ratings:
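A sketch of such a function:

```python
from tqdm import tqdm

def predict_test(cf, test_set):
    """Predict a rating for each (user_id, item_id) pair in the test set."""
    predictions = []
    for _, row in tqdm(test_set.iterrows(), total=len(test_set)):
        predictions.append(cf.predict(row['user_id'], row['item_id']))
    return np.array(predictions)
```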
To speed up testing, let’s randomly select a few items from the test set:
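For instance (the sample size and seed here are assumptions):

```python
test_sample = test.sample(501, random_state=42)
len(test_sample)
```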
501
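Running the predictions:

```python
predictions = predict_test(cf, test_sample)
```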
100%|█████████████████████████████████████████| 501/501 [00:33<00:00, 14.87it/s]
We can see recommendation as a regression problem and evaluate the system using the Mean Absolute Error (MAE):
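Using scikit-learn's `mean_absolute_error`:

```python
from sklearn.metrics import mean_absolute_error

mean_absolute_error(test_sample['rating'], predictions)
```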
0.8396598738455379
Let’s compare this with a collaborative filter with `N=5`:
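For instance:

```python
cf5 = UserUserCollaborativeFilter(N=5).fit(train)
mean_absolute_error(test_sample['rating'], predict_test(cf5, test_sample))
```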
100%|█████████████████████████████████████████| 501/501 [00:34<00:00, 14.51it/s]
0.788623628125405
Let’s push `N` to a larger value such as $50$:
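Similarly:

```python
cf50 = UserUserCollaborativeFilter(N=50).fit(train)
mean_absolute_error(test_sample['rating'], predict_test(cf50, test_sample))
```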
100%|█████████████████████████████████████████| 501/501 [00:32<00:00, 15.34it/s]
0.8032535120960659
Question 4 Why do we obtain better results for larger values of $N$? What is the effect of choosing a very small $N$?
If you are interested in recommendation systems, you can check the `surprise` library: http://surpriselib.com/.
2. Word Embeddings
SpaCy provides different pre-trained word embeddings which can be used “off-the-shelf”. The small model we have used so far does not include word embeddings, so we need to download a larger one. Specifically, SpaCy provides different models with different vocabulary sizes:
- en_core_web_md (116MB) 20K embeddings
- en_core_web_lg (812MB) 685K embeddings
- en_vectors_web_lg (631MB) 1.1M embeddings
All embeddings have $300$ dimensions. The embeddings have not been learned using the algorithms seen in this course (GloVe). Instead, they have been learned using an algorithm called “word2vec”. The underlying principle behind this algorithm is however the same: words which are used in similar ways have similar embeddings, also called “vectors”.
We will use `en_core_web_md`, but the larger models should also be considered when building applications which strongly rely on word vectors. We can install the model with the following command:
python -m spacy download en_core_web_md
After installing the model, we can load it as follows:
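For instance:

```python
import spacy

nlp = spacy.load('en_core_web_md')
```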
We can access the word vector with the `vector` property of each token:
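For example (the sentence below is a stand-in; any text works):

```python
doc = nlp("Hello how are you?")
token = doc[0]
print(len(token.vector))  # the embeddings have 300 dimensions
token.vector[:10]         # first 10 components
```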
300
array([ 1.6597 , 4.7975 , 0.49976 , -0.39231 , -3.1763 , 2.5721 ,
0.023483, -0.047588, -2.3754 , 3.5058 ], dtype=float32)
2.1 Token Properties
Each token also has three additional properties:
- `is_oov`: indicates whether the word is out of vocabulary;
- `has_vector`: indicates whether the word has a word embedding;
- `vector_norm`: the L2 norm of the word vector.
Let’s see an example:
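Something along these lines reproduces the output below:

```python
doc = nlp("Heilo how are you?")
for token in doc:
    print(token.text, token.is_oov, token.has_vector, token.vector_norm)
```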
Heilo False True 37.983173
how False True 90.45195
are False True 89.23195
you False True 70.9396
? False True 68.08072
The word “hello” has been misspelled as “heilo”. When a word is identified as “out of vocabulary”, no word embedding is available for it, and a vector containing all zeros is assigned (hence the vector norm is $0$); note, however, that with the model used here the misspelled token still receives a vector. Let’s check the word vector:
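For instance:

```python
doc[0].vector[:10]  # first components of the vector assigned to the token "Heilo"
```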
array([-2.6324 , -0.39889, 1.0277 , -0.22824, 0.58977, -2.9811 ,
-0.34429, 0.91616, 0.7227 , -2.3488 ], dtype=float32)
We can obtain a word vector for a document by averaging the vectors of all the words:
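A sketch of the averaging:

```python
import numpy as np

doc = nlp("Heilo how are you?")
# average the word vectors of all the tokens in the document
np.mean([token.vector for token in doc], axis=0)[:10]
```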
array([-1.77478 , 2.635882 , -5.93964 , -1.59687 , -1.28356 ,
1.2811 , -1.7377741, 2.723094 , -3.9060802, 1.7923 ],
dtype=float32)
The same result can be obtained by simply calling `vector` on the document:
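For example:

```python
doc.vector[:10]  # spaCy averages the token vectors for us
```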
array([-1.77478 , 2.635882 , -5.93964 , -1.59687 , -1.28356 ,
1.2811 , -1.7377741, 2.723094 , -3.9060802, 1.7923 ],
dtype=float32)
Question 5 Are there any shortcomings in averaging word vectors to obtain embeddings for sentences? What happens if the sentence is very long and related to different topics?
2.2 Similarity
Each document/vector has a `similarity` method which allows us to compute the similarity between two documents/vectors based on their word embeddings:
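Something like the following reproduces the comparisons below (the sentence is inferred from the output):

```python
doc = nlp("Is this the region, this the soil, the clime")
that = nlp("that")[0]
for token in doc:
    print(f"{that.text} vs {token.text} {that.similarity(token):.6f}")
```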
that vs Is -0.127594
that vs this 0.716428
that vs the 0.549108
that vs region 0.337794
that vs , 0.405670
that vs this 0.716428
that vs the 0.549108
that vs soil 0.281650
that vs , 0.405670
that vs the 0.549108
that vs clime 0.440914
Similarly with documents:
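For instance (the three sentences below are stand-ins, not the original ones):

```python
doc1 = nlp("I like apples")
doc2 = nlp("I like oranges")
doc3 = nlp("The car is red")
print(f"Similarity between document 1 and 2: {doc1.similarity(doc2):.6f}")
print(f"Similarity between document 1 and 3: {doc1.similarity(doc3):.6f}")
```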
Similarity between document 1 and 2: 0.735970
Similarity between document 1 and 3: 0.503123
2.3 Word Arithmetic
Let’s see an example of arithmetic between word embeddings. For instance, let’s try to see what is the closest word to:
"brother" - "man" + "woman"
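A possible sketch: compute the target vector, then rank by cosine similarity the lexemes currently cached in the model's vocabulary (which explains the small count of $790$ below):

```python
import numpy as np
from tqdm import tqdm

target = nlp("brother")[0].vector - nlp("man")[0].vector + nlp("woman")[0].vector

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# lexemes seen so far which have a vector
lexemes = [lex for lex in nlp.vocab if lex.has_vector and lex.is_lower]
sims = [cosine(target, lex.vector) for lex in tqdm(lexemes)]
# words sorted from most to least similar to the target vector
np.array([lex.text for lex in lexemes])[np.argsort(sims)[::-1]]
```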
100%|█████████████████████████████████████| 790/790 [00:00<00:00, 213059.42it/s]
array(['sister', 'brother', 'she', 'and', 'who', 'woman', 'havin', 'b.',
       'accounting', 'where', '’cause', 'that', 'that’s', 'who’s',
       'lovin', 'lovin’', 'semester', 'when', 'was', 'the', 'co', 'would',
       'somethin', 'r.', 'c.', 'region', 'designed', 'had', 'love', 'd.',
       'to', 'they', 'two', 'were', 'he', 'those', 'could', 'might',
       '’bout', 'cause', 'there', "there's", 'clime', 'course', 'should',
       'nothin', 'these', 'meet', 'ought', "'s", 'a', 'not', 'and/or',
       'e.g.', 'have', 'did', 'has', 'o.o', 'real', 'you', 'may', 'this',
       'what', "what's", 'of', '-o', '’s', 'c', 'ä', 'ö', 'i.e.', 'all',
       'how', 'space', 'e.', 'u.', 'does', 'b', "somethin'", 'somethin’',
       'must', 'scope', 'm.', 'are', 'it', 'bout', 's.', 'why', 're',
       'soil', 'need', 'f.', 'n.', 'j.', 'l.', 'i.', 'o.', 'or', 'is',
       "that's", 'co.', 'v.', 'p', 'p.', 'h', 'h.', 'on', 'ü.', 'j', 'z.',
       'y.', 'x.', 'z', 'q', 'k', 'k.', 'q.', "n't", 'n’t', 'can', '’s',
       'am', 'we', "'cause", 'g.', '-p', '-x', '\\t', 'e', 'let’s', 'y',
       'ü', 'r', 'p.m.', 'g', 'ol', 'd', 'a.m.', '’m', 'a.', 'n', 'f',
       'x', 'm', 's', 'man', 'i', 'do', 'w.', 'w', ':x', 'let', 'vs.',
       'sha', 'v', "who's", "let's", "'m", ':o)', 'pm', 'a.m', 'p.m',
       "'d", "o'clock", 'o’clock', 've', 'l', '’re', "he's", "she's",
       "it's", 'v.s.', 'e.g', 'i.e', "'re", '’d', 'c’m', "c'm", '’ve',
       "havin'", 'w/o', 'got', ':o', ':p', ':-p', 'ca', 'u', "'ve", 'gon',
       'll', '’ll', "'ll", 'na', 't.', 't', 'wo', "nuthin'", 'nothin’',
       "nothin'", '\\n', 'it’s', 'dare', 'vs', "doin'", "'bout", 'doin’',
       'doin', 'o', "'coz", 'nt', "ma'am", 'y’', "y'", 'goin', 'goin’',
       "goin'", 'ta', 'ma’am', 'cuz', "'cuz", 'cos', "'cos", 'coz', '’em',
       'em', 'nuff', "'nuff", 'ai', "'em", 'ol’', "ol'", "lovin'"],
      dtype='<U10')
As one could expect, the closest word is “sister”.
Question 6 Why are “4-year-old” and “daughter” also close to the computed vector?
3. Sentiment Analysis
We will now see how to perform sentiment analysis on a text using VADER. We will use a module included in the NLTK library. First, we need to download the VADER lexicon as follows:
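For instance:

```python
import nltk

nltk.download('vader_lexicon')
```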
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/...
True
We will use the `SentimentIntensityAnalyzer` object:
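For example:

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
```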
To obtain the `negative`, `neutral`, `positive` and `compound` scores of VADER, we will use the `polarity_scores` method:
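For example (the sentence below is a stand-in for the original one, which is not shown):

```python
sia.polarity_scores("The movie was good")
```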
{'neg': 0.0, 'neu': 0.423, 'pos': 0.577, 'compound': 0.6249}
This method returns a dictionary containing the four scores. Let’s look at the outputs obtained for a few more example sentences:
{'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.4588}
{'neg': 0.0, 'neu': 0.481, 'pos': 0.519, 'compound': 0.7509}
{'neg': 0.0, 'neu': 0.506, 'pos': 0.494, 'compound': 0.4466}
Question 7 Compare the third example to the first one. Why does the third one have a larger compound value than the first one?
3.1 Sentiment Analysis and Movie Reviews
Let’s now see how we can use VADER to analyze movie reviews. We will use the movie reviews dataset seen in the previous laboratories:
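A sketch of the loading step (the CSV URL is a placeholder):

```python
# load the movie reviews dataset; the URL is a placeholder
reviews = pd.read_csv('https://example.com/movie_reviews.csv')
reviews.info()
reviews.head()
```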
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5006 entries, 0 to 5005
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 author 5006 non-null object
1 review 5006 non-null object
2 rating 5006 non-null float64
dtypes: float64(1), object(2)
memory usage: 117.5+ KB
 | author | review | rating |
---|---|---|---|
0 | Dennis_Schwartz | in my opinion , a movie reviewer's most import... | 0.1 |
1 | Dennis_Schwartz | you can watch this movie , that is based on a ... | 0.2 |
2 | Dennis_Schwartz | this is asking a lot to believe , and though i... | 0.2 |
3 | Dennis_Schwartz | no heroes and no story are the main attributes... | 0.2 |
4 | Dennis_Schwartz | this is not an art movie , yet i saw it an art... | 0.2 |
Let’s analyze the first review with VADER:
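For instance:

```python
sia.polarity_scores(reviews['review'][0])
```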
{'neg': 0.134, 'neu': 0.753, 'pos': 0.113, 'compound': -0.8923}
The “compound” score is negative, which is coherent with the low rating assigned to the review. Let’s see if this happens systematically. To do so, we need to compute the “compound” value for each review. We will first define a `vader_polarity` function to compute the polarity of a review from its compound score:
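A minimal definition:

```python
def vader_polarity(text):
    """Return the compound VADER score of a text."""
    return sia.polarity_scores(text)['compound']

vader_polarity(reviews['review'][0])
```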
-0.8923
Let’s now compute the polarity score for each review (this might take a while):
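Using tqdm's pandas integration to display a progress bar:

```python
from tqdm import tqdm
tqdm.pandas()  # enables Series.progress_apply

reviews['polarity'] = reviews['review'].progress_apply(vader_polarity)
reviews.head()
```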
100%|██████████████████████████████████████| 5006/5006 [00:13<00:00, 378.55it/s]
 | author | review | rating | polarity |
---|---|---|---|---|
0 | Dennis_Schwartz | in my opinion , a movie reviewer's most import... | 0.1 | -0.8923 |
1 | Dennis_Schwartz | you can watch this movie , that is based on a ... | 0.2 | 0.8927 |
2 | Dennis_Schwartz | this is asking a lot to believe , and though i... | 0.2 | 0.9772 |
3 | Dennis_Schwartz | no heroes and no story are the main attributes... | 0.2 | 0.0316 |
4 | Dennis_Schwartz | this is not an art movie , yet i saw it an art... | 0.2 | 0.9903 |
3.2 Classifying reviews with no training
To see if the polarity tells us something about the rating, let’s consider a binary classification task. We will consider a review as negative if the rating is smaller than $0.5$. Similarly, we will classify a review as positive if the polarity is positive:
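A sketch of the labeling:

```python
# ground-truth label: positive if the rating is at least 0.5
reviews['label'] = reviews['rating'] >= 0.5
# predicted label: positive if the compound polarity is positive
reviews['predicted_label'] = reviews['polarity'] > 0
reviews.head()
```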
 | author | review | rating | polarity | label | predicted_label |
---|---|---|---|---|---|---|
0 | Dennis_Schwartz | in my opinion , a movie reviewer's most import... | 0.1 | -0.8923 | False | False |
1 | Dennis_Schwartz | you can watch this movie , that is based on a ... | 0.2 | 0.8927 | False | True |
2 | Dennis_Schwartz | this is asking a lot to believe , and though i... | 0.2 | 0.9772 | False | True |
3 | Dennis_Schwartz | no heroes and no story are the main attributes... | 0.2 | 0.0316 | False | True |
4 | Dennis_Schwartz | this is not an art movie , yet i saw it an art... | 0.2 | 0.9903 | False | True |
Let’s now assess the performance of our classifier:
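Using scikit-learn's metrics:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

acc = accuracy_score(reviews['label'], reviews['predicted_label'])
print(f"Accuracy: {acc * 100:.2f}%")
confusion_matrix(reviews['label'], reviews['predicted_label'])
```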
Accuracy: 74.03%
array([[ 434, 823],
[ 477, 3272]])
The model is not perfect, but the result is not bad, considering that we have not trained the model at all!
Question 8 How does this model compare with respect to the model based on bag of words seen in the previous laboratories?
3.3 VADER Scores as Features
We can do something slightly more complicated by considering all VADER scores as features and training a logistic regressor on top of them. We should now define a function that maps a review to a feature vector containing the VADER scores:
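A possible implementation:

```python
def vader_features(text):
    """Map a text to the vector of the four VADER scores."""
    scores = sia.polarity_scores(text)
    return np.array([scores['neg'], scores['neu'], scores['pos'], scores['compound']])

vader_features(reviews['review'][0])
```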
array([ 0.134 , 0.753 , 0.113 , -0.8923])
We will use a very small training set to see what can be done with very little training data:
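For instance (the seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# keep only 50 reviews for training
rev_train, rev_test = train_test_split(reviews, train_size=50, random_state=42)
len(rev_train)
```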
50
Let’s compute a feature vector for each review of the training and test sets:
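For example:

```python
train_features = np.stack([vader_features(r) for r in tqdm(rev_train['review'])])
test_features = np.stack([vader_features(r) for r in tqdm(rev_test['review'])])
```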
100%|██████████████████████████████████████████| 50/50 [00:00<00:00, 341.12it/s]
100%|██████████████████████████████████████| 4956/4956 [00:12<00:00, 382.48it/s]
Let’s obtain the related labels:
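Using the same positive/negative criterion as before:

```python
train_labels = rev_train['rating'] >= 0.5
test_labels = rev_test['rating'] >= 0.5
```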
Let’s now train the logistic regressor. We will normalize features using a MinMaxScaler:
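A sketch using a scikit-learn pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

clf = make_pipeline(MinMaxScaler(), LogisticRegression())
clf.fit(train_features, train_labels)
clf.score(test_features, test_labels)  # classification accuracy on the test set
```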
0.7487893462469734
Using only $50$ reviews for training, we have obtained a small boost in performance. We can also try to fit a linear regressor to predict the ratings:
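For instance (whether the original used feature scaling here too is an assumption):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

reg = make_pipeline(MinMaxScaler(), LinearRegression())
reg.fit(train_features, rev_train['rating'])
mean_absolute_error(rev_test['rating'], reg.predict(test_features))
```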
0.14381416319971027
We have obtained a relatively small MAE, considering the range of the ratings.
References
[1] Recommendation systems library. http://surpriselib.com/
[2] Word vectors in Spacy. https://spacy.io/usage/vectors-similarity
[3] NLTK and VADER. https://www.nltk.org/_modules/nltk/sentiment/vader.html
Exercises
Exercise 1 Build an item-item Collaborative Filter and compare its performance with respect to the user-user Collaborative Filter built in this laboratory. Experiment with different values of $N$.
Exercise 2 Represent each review of the movie review dataset using word embeddings. Use a small portion of the data to train a logistic regressor for review classification on the dataset. How does this model compare with respect to the one built using VADER scores as features?
Exercise 3 Consider the ham vs spam dataset seen in the previous laboratories. Represent each element using features extracted with VADER and with word embeddings. Train two logistic regressors using the two sets of features, using a small portion of the data as training set. Compare the two models with a logistic regressor built starting from a bag of words representation. Use the same train/test split for training/evaluation. Which of the models performs best? Why?