Linear, Polynomial, and Logistic Regression
Notes by Antonino Furnari - antonino.furnari@unict.it
University of Catania, Department of Mathematics and Computer Science
Notes available at: https://antoninofurnari.github.io/lecture-notes/en/data-science-python/linear-polynomial-logistic-regression/
In this laboratory, we will see:
- Linear Regression;
- Polynomial Regression;
- Logistic Regression.
1. Linear Regression
To experiment with linear regression, we will use a dataset of movie reviews collected from IMDb (https://www.imdb.com/). The dataset is available here: http://www.cs.cornell.edu/people/pabo/movie-review-data/ (in particular here: http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz ).
1.1 Dataset
To make things faster, for this laboratory we will use a pre-processed version of this dataset available at http://antoninofurnari.it/downloads/reviews.csv. Let’s load and inspect the csv with Pandas:
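The original loading cell is not preserved in these notes; a minimal sketch of this step could look as follows (the variable names are assumptions, and a tiny inline sample stands in for the real CSV, which in the lab is read directly from the URL above):

```python
import io

import pandas as pd

# In the lab the file would be read directly from the URL:
# data = pd.read_csv('http://antoninofurnari.it/downloads/reviews.csv')
# Here a tiny inline sample stands in for it.
csv_text = """author,review,rating
Dennis_Schwartz,"in my opinion , a movie reviewer's most important task ...",0.1
Dennis_Schwartz,"you can watch this movie , that is based on a ...",0.2
"""
data = pd.read_csv(io.StringIO(csv_text))
data.info()    # summary of columns, dtypes, and non-null counts
print(data.head())
```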
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5006 entries, 0 to 5005
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 author 5006 non-null object
1 review 5006 non-null object
2 rating 5006 non-null float64
dtypes: float64(1), object(2)
memory usage: 117.5+ KB
| | author | review | rating |
|---|---|---|---|
| 0 | Dennis_Schwartz | in my opinion , a movie reviewer's most import... | 0.1 |
| 1 | Dennis_Schwartz | you can watch this movie , that is based on a ... | 0.2 |
| 2 | Dennis_Schwartz | this is asking a lot to believe , and though i... | 0.2 |
| 3 | Dennis_Schwartz | no heroes and no story are the main attributes... | 0.2 |
| 4 | Dennis_Schwartz | this is not an art movie , yet i saw it an art... | 0.2 |
The dataset contains $5006$ reviews. For each review, the dataset reports its author, the text of the review, and the rating given by the author. Let's briefly explore some statistics of the data using describe:
| | rating |
|---|---|
| count | 5006.000000 |
| mean | 0.581422 |
| std | 0.181725 |
| min | 0.000000 |
| 25% | 0.490000 |
| 50% | 0.600000 |
| 75% | 0.700000 |
| max | 1.000000 |
We obtained statistics for the only quantitative variable in the dataset (rating). We can note that the ratings are numbers between $0$ and $1$. The mean rating is slightly below $0.6$. Let's now see the number of authors and the number of reviews per author:
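The original cells are missing here; a plausible sketch of these per-author statistics with pandas (a toy DataFrame stands in for the real one, so the printed values will differ) is:

```python
import pandas as pd

# toy stand-in for the reviews DataFrame
df = pd.DataFrame({
    "author": ["Dennis_Schwartz", "Dennis_Schwartz", "James_Berardinelli"],
    "review": ["a short review", "another short review", "a slightly longer review"],
    "rating": [0.1, 0.2, 0.7],
})
print(df["author"].nunique())                 # number of distinct authors
print(df["author"].value_counts())            # number of reviews per author
print(df.groupby("author")["rating"].mean())  # average rating per author
# average review length (in characters) per author
print(df["review"].str.len().groupby(df["author"]).mean())
```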
We have $4$ authors and a variable number of reviews per author. Let's have a look at the average rating per author:
The average rating is overall balanced across all authors. Let’s see the average length of reviews per author:
The average length of reviews is less balanced, but still ok.
1.1.1 Train/Test Splitting
Let’s split our dataset randomly into training and test sets. We will use $15\%$ of the data as the test set:
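The splitting cell is missing; a sketch of this step with scikit-learn's train_test_split (the toy DataFrame and the random_state value are assumptions) could be:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the reviews DataFrame
data = pd.DataFrame({"review": [f"review number {i}" for i in range(100)],
                     "rating": [i / 100 for i in range(100)]})

# hold out 15% of the rows as test set; a fixed random_state makes the split reproducible
train, test = train_test_split(data, test_size=0.15, random_state=42)
print(len(train), len(test))  # 85 15
```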
1.2 Data Representation
We will begin by representing the data in a suitable way. For the moment, we will consider a Bag of Words representation. Let’s use a CountVectorizer to obtain a set of features. We will fit the model on the training set and use it to transform both the training and testing data. We will also save the training and testing target values (the labels) into separate variables:
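The vectorization cell is missing; the key point is that the vocabulary is fit on the training set only and then reused for the test set. A small sketch (toy texts in place of the real reviews):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["a great and moving film", "boring plot and weak acting",
               "great acting and a great story"]
test_texts = ["a boring film"]

vect = CountVectorizer()
X_train = vect.fit_transform(train_texts)  # fit the vocabulary on training data only
X_test = vect.transform(test_texts)        # reuse the same vocabulary for test data
# the target values (the ratings) would be saved separately,
# e.g. y_train = train['rating'].values
print(X_train.shape, X_test.shape)
```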
Let’s see the dimensionality of the training and testing features:
((4255, 39499), (751, 39499))
Question 1 How many words does our vocabulary contain? Why are the numbers of columns in the training and testing design matrices the same, while the numbers of rows are not?
Question 2 Representations are task-dependent. What is the main rationale behind using bag of words for this regression task? Can there be better representations of the data? (i.e., is this representation overlooking some aspects of the data?)
1.3 Linear Regression
To fit a linear regressor to the data, we can use the LinearRegression object of scikit-learn, which has the same interface we have seen for other supervised machine learning algorithms such as Naive Bayes and KNN:
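The fitting cell is missing; a sketch on a toy design matrix (the real code fits on the bag-of-words features, and all names here are assumptions) could be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy design matrix (rows = documents, columns = word counts) and ratings
X_train = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])
y_train = np.array([0.8, 0.2, 0.5, 0.9])

regressor = LinearRegression()
regressor.fit(X_train, y_train)
print(regressor.coef_)       # one theta_i per feature
print(regressor.intercept_)  # theta_0
```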
LinearRegression()
The linear regressor has now been fitted. We can access its parameters as follows:
[-0.00651757 0.0056682 0.00903341 ... 0.00282769 -0.00056079
-0.0003679 ]
0.48495048975822536
Question 3 How many parameters does the model have in total? Why?
We have fitted the linear model:
\begin{equation} f(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \end{equation}
We can now apply this model to the training and test data by using the predict method:
We now need to understand how good our model is at making predictions. We can compute the MAE (Mean Absolute Error) score as follows:
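A sketch of the MAE computation with scikit-learn's mean_absolute_error (toy values in place of the real predictions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([0.6, 0.7, 0.5])
y_pred = np.array([0.5, 0.9, 0.5])

# MAE = average of the absolute differences between predictions and targets
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.4f}")  # MAE: 0.1000
```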
Let’s see the MAE values on the training and test sets:
MAE on the training set: 0.0000
MAE on the test set: 0.1218
Question 4 Compare the errors obtained on the training and test sets. Which one is higher? Which one is lower? Why is there such a difference? What is happening?
How can we understand if the performance of our method is good? To see that, we can compare it with simple baselines, e.g., one which always predicts a rating of $0.5$:
Baseline MAE: 0.17
Baseline MAE: 0.15
Baseline MAE: 0.15
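The constant-prediction baseline above can be sketched as follows (toy targets, so the number differs from the lab's results):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_test = np.array([0.6, 0.7, 0.5, 0.2])

# baseline which always predicts the constant rating 0.5
baseline_predictions = np.full(y_test.shape, 0.5)
print(f"Baseline MAE: {mean_absolute_error(y_test, baseline_predictions):.2f}")
```

scikit-learn also offers DummyRegressor (in sklearn.dummy), whose 'mean', 'median', and 'constant' strategies cover this kind of baseline with the usual fit/predict interface.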
We can see that our regressor performs better than the baselines. For reference, we will create a DataFrame containing the MAE scores of the different regressors:
1.4 Regularization
By comparing the training and testing performance, we could argue that we have an overfitting problem. We could try to address it by adding a regularization term to the cost function. Linear regression with L2 regularization is implemented in scikit-learn by the Ridge object. Let's repeat the whole pipeline using the regularized regressor:
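A sketch of the regularized fit on toy data (the real pipeline reuses the bag-of-words features; the alpha value shown is scikit-learn's default, not necessarily the one used in the lab):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# toy data standing in for the bag-of-words design matrix
X_train = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])
y_train = np.array([0.8, 0.2, 0.5, 0.9])

# alpha controls the strength of the L2 regularization term (1.0 is the default)
regressor = Ridge(alpha=1.0)
regressor.fit(X_train, y_train)
train_mae = mean_absolute_error(y_train, regressor.predict(X_train))
print(f"MAE on the training set: {train_mae:.4f}")
```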
MAE on the training set: 0.0016
MAE on the test set: 0.1176
Question 5 What happened to the training performance? What happened to the testing performance? Why?
Question 6 Compute the L2 norm (vector magnitude) of the coefficients obtained using the regularized linear regressor and the standard one. Which of the two has a smaller norm? Why?
It seems that regularization improved the results! Let's add this result to our scores:
1.5 Improving the Representation
We are currently using simple word counts as the representation for our reviews. However, this representation has some shortcomings:
- Long and short reviews will get very different representations because there is no normalization;
- Frequent words might dominate the representation.
We can try to fix this by using TF-IDF. Let’s see if this improves results:
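The TF-IDF cell is missing; the change with respect to the previous pipeline amounts to swapping CountVectorizer for TfidfVectorizer (toy texts below):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["a great and moving film", "boring plot and weak acting",
               "great acting and a great story"]

# TF-IDF normalizes each document vector (so review length matters less)
# and down-weights words that appear in many documents
vect = TfidfVectorizer()
X_train = vect.fit_transform(train_texts)
print(X_train.shape)
```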
MAE on the training set: 0.0560
MAE on the test set: 0.0955
It seems that TF-IDF improved the results. Let's put this result in our scores DataFrame and compare:
1.6 Our regressor at work
Let’s see our model in action on some samples of the test set. We will define a function which takes a message and predicts a review score for it:
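The original function is missing; a plausible sketch (toy training data, and the function name rate_review is an assumption) chains the fitted vectorizer and regressor:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# toy training data standing in for the real reviews
train_texts = ["a great and moving film", "boring plot and weak acting",
               "great acting and a great story", "weak film with a boring story"]
train_ratings = [0.8, 0.2, 0.9, 0.1]

vect = TfidfVectorizer()
regressor = Ridge()
regressor.fit(vect.fit_transform(train_texts), train_ratings)

def rate_review(text):
    """Predict a rating for a single review string."""
    return regressor.predict(vect.transform([text]))[0]

print(round(rate_review("a great story"), 2))
```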
We can now rate any test review:
True rating: 0.73/Predicted rating: 0.71
Review: ron howard has a history of making entertaining films that tend towards weak conclusions ( one could call this the " happy endings " syndrome ) . joining parenthood and far and away in this category is mr . howard's newest release , the paper . for something that starts out as a boisterous , amusing , potentially-incisive look at the newspaper business and the people who run it , this motion picture finishes with a relative whimper . no viewer is likely to be bored during the paper . the film moves along quickly , using mobile cameras and rapid cuts to keep the energy level high and the pacing crisp . this is especially true in the newsroom scenes , as director ron howard attempts to convey the spirit and atmosphere of the disorganized-but-productive work environment . his eye for detail is evident . away from the paper's offices , the movie is less successful . the personal lives of henry and his co-workers aren't all that interesting , and seem scripted rather than real . this is especially true during the drawn-out melodrama that characterizes the last fifteen minutes . the contrivance used to incorporate these into the overall plot is a little too obvious . the climactic struggle that ensues between two of the main characters looks like it belongs more in a three stooges movie than a pseudo-real look at the behind-the-scenes newspaper business . but there's a lot in the paper that works . although most of the characters are types , the actors play them with sympathy and understanding . three-dimensionality may not be achieved , but attempts are made to give the men and women in this film some depth , and the freshness of the performances helps a great deal . randy quaid is magnificent as the gun- toting , paranoid mcdougal . although not a developed character in any sense , mcdougal has some of the best one-liners . the dialogue is frequently snappy--presumably the product of screenwriters with ears for the way people actually talk . 
there's a fair amount of natural humor , and only a few occasions when the laughs are forced . the paper is a crowd pleaser and , regardless of any viewer's experience ( or lack thereof ) with the behind-the-scenes wrangling that goes on in newspaper offices , the story is compelling enough to be affable and entertaining . while there are no startling revelations here , the film's atmosphere contains enough strength of realism that more than one viewer may momentarily think of the goings-on at the sun as they sit down with their morning cup of coffee and look at the day's headlines .
We can also try to rate our own text:
0.6663085163223353
0.10607236353046512
Question 7 What are the minimum and maximum lengths of strings our linear regressor can handle? Why? What if we do not apply TF-IDF?
2 Polynomial Regression
Let’s now see an example of polynomial regression. Since applying polynomial regression to problems with a very large number of features can be computationally expensive, we will first simplify our representation by including fewer words. We can do so by passing a min_df parameter to the CountVectorizer, which specifies the minimum document frequency a word must have in order to be included. If we set this value to a number between $0$ and $1$, it is interpreted as a fraction of documents, and the number of selected words changes accordingly:
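A small sketch of the min_df behavior (toy corpus; the lab uses the real reviews, hence the different shapes):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["great film", "great plot", "boring film", "great boring acting"]

# min_df as a float is a fraction of documents: keep only words appearing
# in at least 50% of the documents
vect = CountVectorizer(min_df=0.5)
X = vect.fit_transform(texts)
print(X.shape)                   # (4, 3)
print(sorted(vect.vocabulary_))  # ['boring', 'film', 'great']
```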
(4255, 53)
We can also decide to exclude words which are too frequent by passing a max_df parameter, which denotes the maximum document frequency a word may have in order to be included in the vocabulary:
(4255, 37)
Question 8 How do min_df and max_df affect the size of the vocabulary?
Let’s try to solve the regression problem with this reduced set of features:
MAE on the training set: 0.1396
MAE on the test set: 0.1444
Since we are discarding words, we obtained reduced performance. We will now apply polynomial regression using a polynomial of degree 2. To do so, we have to transform the data by adding new columns to the design matrix. This is done by the PolynomialFeatures object:
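A sketch of the feature expansion on a tiny matrix. Note that with $37$ input features, a degree-2 expansion yields $1 + 37 + 37 \cdot 38 / 2 = 741$ columns, matching the shape reported below:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# degree-2 expansion adds a bias column, squares, and pairwise products:
# [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X.shape, X_poly.shape)  # (3, 2) (3, 6)
```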
(4255, 37) (4255, 741)
As we can see, we have greatly increased the number of features using the polynomial feature transformation. Let's see what happens if we train a linear regressor on these features; the result will be a polynomial regressor:
MAE on the training set: 0.1216
MAE on the test set: 0.1519
As we can see, we did not improve the performance of the model. In general, whether polynomial regression will improve the results depends on the task. The common way to proceed is to test whether it improves the results and decide accordingly whether it should be included.
Question 9 Was the training slower or faster with polynomial regression? In the case seen in this laboratory, was it worth including it?
3 Logistic Regression
We will now see an example of logistic regression. For the purpose of this laboratory, we will define a simple binary classification task on the dataset at hand. In particular, we will attach a label to each element of the dataset: we will consider a review positive if its rating is larger than $0.5$, and negative otherwise. We can compute these values as follows:
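A sketch of the labeling step on a few toy ratings (a boolean comparison on the pandas Series produces the labels):

```python
import pandas as pd

ratings = pd.Series([0.60, 0.70, 0.50, 0.20])

# a review is positive when its rating is strictly larger than 0.5
labels = ratings > 0.5
print(labels.tolist())  # [True, True, False, False]
```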
4164 0.60
893 0.70
591 0.50
1318 0.50
4629 0.70
627 0.60
1896 0.74
4542 0.70
4672 0.70
661 0.60
Name: rating, dtype: float64 4164 True
893 True
591 False
1318 False
4629 True
627 True
1896 True
4542 True
4672 True
661 True
Name: rating, dtype: bool
0.5870740305522915
Since it is more convenient to have integers rather than booleans, we can cast the labels to integers as follows:
4164 0.60
893 0.70
591 0.50
1318 0.50
4629 0.70
627 0.60
1896 0.74
4542 0.70
4672 0.70
661 0.60
Name: rating, dtype: float64 4164 1
893 1
591 0
1318 0
4629 1
627 1
1896 1
4542 1
4672 1
661 1
Name: rating, dtype: int32
We can apply the same process to training and test labels:
Let’s check if the dataset is balanced by counting the fraction of positives and negatives:
Question 10 Is the dataset balanced? Why is it important to assess this?
We can train a logistic regressor with the API we have already seen for other classifiers:
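The training cell is missing; a sketch with toy texts and binary labels (the real code reuses the TF-IDF features of the reviews):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great wonderful film", "boring weak plot",
               "great acting and story", "weak boring film"]
train_labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

vect = TfidfVectorizer()
X_train = vect.fit_transform(train_texts)
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)
print(classifier.coef_.shape)  # a single row of weights, one per word
print(classifier.intercept_)
```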
LogisticRegression()
We can access the internal parameters of the logistic regressor as follows:
(array([[-0.26977792, 0.16778254, 0.52593232, ..., 0.16984559,
0.02966805, -0.00539487]]),
array([-0.16798022]))
Question 11 What is the total number of parameters? Why?
Let’s compute training and testing predictions:
Now we need to assess the performance of the logistic regressor. Let’s use the accuracy:
Accuracy on training: 0.91
Accuracy on test: 0.80
We can obtain the same results by using the score method of the logistic regressor:
Accuracy on training: 0.91
Accuracy on test: 0.80
Question 12 What can we say about the performance of our method? Is the algorithm overfitting? Is the performance good? Is accuracy enough to evaluate it? Why?
Let’s see also the confusion matrices:
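A sketch of both evaluations with scikit-learn's metrics (toy labels in place of the real predictions):

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# rows are true classes, columns are predicted classes (class 0 first, then 1)
print(confusion_matrix(y_true, y_pred))
print(f"F1: {f1_score(y_true, y_pred):.2f}")  # F1 of the positive class
```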
Training confusion matrix:
[[1453 304]
[ 64 2434]]
Test confusion matrix:
[[197 102]
[ 47 405]]
To obtain per-class scores, let's compute the $F_1$ score:
F1 training scores: 0.93
F1 test scores: 0.84
Question 13 What additional information are the confusion matrix and the $F_1$ scores providing? Is this information important? Why?
Exercises
Exercise 1 In the example of the linear regressor, try to replace the TF-IDF normalization with a simple L1 normalization, in which each vector of word counts is divided by the sum of its values. How does this regressor perform with respect to the others?
Exercise 2 Try to improve the performance of the linear classifier by including as features a Bag of Words representation based on lemmas, POS, NER, etc.
Exercise 3 The |
References
- Scikit-learn tutorial on text processing: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- https://scikit-learn.org/stable/modules/linear_model.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html