Text Analysis (NLP), Classification
Notes by Antonino Furnari - antonino.furnari@unict.it
University of Catania, Department of Mathematics and Computer Science
Notes available at: https://antoninofurnari.github.io/lecture-notes/en/data-science-python/text-analysis-classification/
In this laboratory, we will see the main tools for text analysis. In particular, we will see:
- The basic element of an NLP pipeline:
- Word tokenization;
- Identifying stop words;
- Lemmatization;
- POS tagging;
- NER tagging;
- Sentence segmentation.
- The Bag Of Words (BOW) representation;
- Text classification.
1. Text Processing Basics
We will use the spaCy Python library, which can be installed with the following commands from the command line/Anaconda prompt:
conda install -c conda-forge spacy
Alternatively:
pip install spacy
Then, download the English language model:
python -m spacy download en_core_web_sm
To use spaCy, we first need to import it and load a language model trained on a corpus of text. Assuming that we will work with the English language, we will load the en_core_web_sm model:
|
|
The nlp object is associated with a vocabulary (i.e., a set of known words), which depends on the chosen model. We can see the size of the vocabulary as follows:
|
|
773
We can see the list of all the words in the vocabulary as follows:
|
|
['nuthin', 'ü.', 'p.m', 'Kan', 'Mar', "When's", ' ', 'Sept.', 'c.', 'Mont.']
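The hidden code cells above likely contain something along these lines (a minimal sketch; the exact vocabulary size and word order depend on the installed model version):

```python
import spacy

# Load the small English model (must be downloaded beforehand)
nlp = spacy.load('en_core_web_sm')

# Number of entries in the vocabulary associated with the nlp object
print(len(nlp.vocab))

# A few of the words (lexemes) stored in the vocabulary
print([lex.text for lex in nlp.vocab][:10])
```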
To analyse text with spaCy, we first need to create a document object:
|
|
"Let's go to N.Y.!"
When we define a document, the vocabulary is updated by inserting any word which is present in the document but was not present in the corpus. For instance, the size of the vocabulary is now larger:
|
|
776
We can see which are the new words as follows:
|
|
{'!', 'go', 'to'}
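A sketch of how this comparison might be carried out (taking a snapshot of the vocabulary before creating the document is an assumption of this sketch):

```python
# Snapshot of the known words before creating the document
words_before = set(lex.text for lex in nlp.vocab)

# Creating a document tokenizes the text and updates the vocabulary
doc = nlp("\"Let's go to N.Y.!\"")

words_after = set(lex.text for lex in nlp.vocab)
print(len(nlp.vocab))               # the vocabulary is now larger
print(words_after - words_before)   # the newly added words
```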
As we will see, a spaCy document allows us to easily perform some basic text processing operations.
1.1 Tokenization
Given a spaCy document, it is possible to easily iterate over tokens with a simple for loop:
|
|
"
Let
's
go
to
N.Y.
!
"
We can alternatively obtain the list of all tokens simply by passing doc to list:
|
|
[", Let, 's, go, to, N.Y., !, "]
The doc object can also be indexed directly. So, if we want to access the $5^{th}$ token, we can simply write:
|
|
to
We should note that each token is not a string, but actually a token object:
|
|
spacy.tokens.token.Token
We can access the text contained in the token as follows:
|
|
to
<class 'str'>
Similarly, it is possible to access multiple tokens by slicing. This will return a span object:
|
|
's go
<class 'spacy.tokens.span.Span'>
Even if we can index a document to obtain a token, tokens cannot be re-assigned:
|
|
Error!
As we can see, spaCy takes care of all the steps required to obtain a proper tokenization, including recognizing prefixes, suffixes, infixes and exceptions, as shown in the image below (image from https://spacy.io/usage/spacy-101#annotations-token).
Question 1 Is the tokenization mechanism offered by spaCy useful at all? Compare the obtained tokenization with the result of splitting the string on spaces with |
1.2 Lemmatization
Lemmatization is a much more complex way to group words according to their meaning. This is done by looking both at a vocabulary (this is necessary to understand that, for instance ‘knives’ is the plural of ‘knife’) and at the context (for instance to understand if ‘meeting’ is used as a noun or as a verb). SpaCy performs lemmatization automatically and associates the correct lemma to each token.
In particular, apart from text, each token is assigned two properties:
- lemma: a numerical id which univocally identifies the lemma (this is for machines);
- lemma_: a string explaining the lemma (this is for humans).
For instance:
|
|
Let
278066919066513387
let
Let’s see an example with a sentence:
|
|
I -> I
will -> will
meet -> meet
you -> you
in -> in
the -> the
meeting -> meeting
after -> after
meeting -> meet
the -> the
runner -> runner
when -> when
running -> run
. -> .
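The loop producing the output above is probably similar to this sketch:

```python
doc = nlp("I will meet you in the meeting after meeting the runner when running.")

# lemma_ holds the human-readable lemma of each token
for token in doc:
    print(token.text, '->', token.lemma_)
```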
The lemmatizer correctly identified the first occurrence of “meeting” as a noun (its lemma is “meeting”) and the second one as a verb (its lemma is “meet”).
1.3 Stop Words
Not all words are equally important. Some words such as “a” and “the” appear very frequently in the text and tell us little about the nature of the text (e.g., its category). These words are usually referred to as “stop words”. SpaCy has a built in list of stop words for the English language. We can access them as follows:
|
|
326
['get', 'see', 'without', 'so', "'ve", 'elsewhere', 'sixty', 'me', 'somewhere', 'herein']
We can check if a word is a stop word as follows:
|
|
True
To make things easier, spaCy allows us to check whether a given token is a stop word using the is_stop attribute:
|
|
" -> False
Let -> False
's -> True
go -> True
to -> True
N.Y. -> False
! -> False
" -> False
As we can see, some common words such as “’s”, “go” and “to” are stop words.
Depending on the problem we are trying to solve, we may want to remove some stop words or add our own stop words. For instance, let’s say we think “go” is valuable, and it should not be considered a stop word. We can remove it as follows:
|
|
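One possible way to do this (a sketch: both the default stop word list and the lexeme flag are updated, since is_stop reads the flag stored in the vocabulary):

```python
# Remove "go" from the default stop word list and clear its lexeme flag
nlp.Defaults.stop_words.remove('go')
nlp.vocab['go'].is_stop = False
```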
Now, “go” is not considered as a stop word anymore:
|
|
" -> False
Let -> False
's -> True
go -> False
to -> True
N.Y. -> False
! -> False
" -> False
Similarly, we can add a stop word as follows:
|
|
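Again a sketch, mirroring the removal above:

```python
# Add "!" to the stop word list and set its lexeme flag
nlp.Defaults.stop_words.add('!')
nlp.vocab['!'].is_stop = True
```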
Let’s check if “!” is now a stop word:
|
|
" -> False
Let -> False
's -> True
go -> False
to -> True
N.Y. -> False
! -> True
" -> False
Question 2 Why would we want to get rid of words which are very frequent? In which way is this related to the concept of information? |
1.4 Part of Speech (POS) Tagging
SpaCy also allows us to easily perform Part of Speech tagging. It is possible to determine which role a given token plays in the text by using two properties of the tokens:
- pos: a numerical id which identifies the type of POS (for machines);
- pos_: a textual representation of the POS (for humans).
Let’s see an example:
|
|
's
95
PRON
Let’s see a more thorough example:
|
|
" -> PUNCT
Let -> VERB
's -> PRON
go -> VERB
to -> ADP
N.Y. -> PROPN
! -> PUNCT
" -> PUNCT
The text contained in the pos_ tags is a bit terse. We can obtain a more detailed explanation with spacy.explain:
|
|
" -> punctuation
Let -> verb
's -> pronoun
go -> verb
to -> adposition
N.Y. -> proper noun
! -> punctuation
" -> punctuation
We can access fine-grained POS tags using tag and tag_:
|
|
" -> ``
Let -> VB
's -> PRP
go -> VB
to -> IN
N.Y. -> NNP
! -> .
" -> ''
Similarly, we can obtain an explanation for each of the fine-grained tags:
|
|
" -> opening quotation mark
Let -> verb, base form
's -> pronoun, personal
go -> verb, base form
to -> conjunction, subordinating or preposition
N.Y. -> noun, proper singular
! -> punctuation mark, sentence closer
" -> closing quotation mark
Question 3 What is the difference between coarse and fine grained tags? Are there applications in which coarse grained tags can still be useful? |
1.5 Named Entity Recognition (NER)
Named entity recognition allows us to identify which tokens refer to specific entities such as companies, organizations, cities, money, etc. Named entities can be accessed with the ents property of a spaCy document:
|
|
(Boris Johnson, EU, Brexit, this week, October 31)
Each entity has the following properties:
- text: contains the text of the entity;
- label_: a string denoting the type of entity;
- label: a numerical id for the entity type.

As usual, we can use spacy.explain to get more information on an entity:
|
|
Boris Johnson - 380 - PERSON - People, including fictional
EU - 383 - ORG - Companies, agencies, institutions, etc.
Brexit - 380 - PERSON - People, including fictional
this week - 391 - DATE - Absolute or relative dates or periods
October 31 - 391 - DATE - Absolute or relative dates or periods
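The exact sentence used in the cell is not shown; the sketch below uses a hypothetical sentence mentioning the same entities:

```python
doc = nlp("Boris Johnson has asked the EU for a Brexit extension this week, "
          "ahead of the October 31 deadline.")

for ent in doc.ents:
    print(ent.text, '-', ent.label, '-', ent.label_, '-', spacy.explain(ent.label_))
```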
SpaCy has a built-in visualizer for named entities:
|
|
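A sketch of how the visualizer can be invoked (inside a Jupyter notebook; outside a notebook, displacy.serve can be used instead):

```python
from spacy import displacy

# Highlight the named entities of the document
displacy.render(doc, style='ent')
```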
Question 4 How are named entities different from POS tags? Isn’t this the same as knowing that a given token is a noun? |
1.6 Sentence Segmentation
SpaCy also allows us to perform sentence segmentation very easily by providing a sents property for each document:
|
|
[I gave you $3.5.,
Do you remember?,
Since I owed you $1.5, you should now give me 2 dollars.]
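A sketch of the cell above:

```python
doc = nlp("I gave you $3.5. Do you remember? Since I owed you $1.5, "
          "you should now give me 2 dollars.")

# sents is a generator of Span objects, one per sentence
print(list(doc.sents))
```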
Also, we can check if a given token is the first token of a sentence using the is_sent_start property:
|
|
I -> True
gave -> False
you -> False
$ -> False
3.5 -> False
. -> False
Do -> True
you -> False
remember -> False
? -> False
Since -> True
I -> False
owed -> False
you -> False
$ -> False
1.5 -> False
, -> False
you -> False
should -> False
now -> False
give -> False
me -> False
2 -> False
dollars -> False
. -> False
Question 5 Is the sentence segmentation algorithm necessary at all? Compare the obtained segmentation with the result of splitting the string on punctuation with |
2. Bag of Words Representation
We will now see how to represent text using a bag of words representation. To deal with a concrete example, we will consider the classification task of distinguishing spam messages from legitimate messages. We will consider the SMS spam dataset, available here: https://www.kaggle.com/uciml/sms-spam-collection-dataset/version/1#.
The dataset can be downloaded after logging in. Download the file spam.csv and place it in the current working directory.
We will load the csv using Pandas:
|
|
For this lab, we will use only the first two columns (the others contain mostly None elements). We will also rename them from ‘v1’ and ‘v2’ to something more meaningful:
|
|
class | text | |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
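The loading and renaming steps might look like this sketch (the latin-1 encoding is an assumption that usually matches the Kaggle file; adjust if needed):

```python
import pandas as pd

# The Kaggle csv is typically latin-1 encoded
data = pd.read_csv('spam.csv', encoding='latin-1')

# Keep only the first two columns and give them meaningful names
data = data[['v1', 'v2']]
data.columns = ['class', 'text']
data.head()
```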
Here v1 represents the class, while v2 contains the messages. Legitimate messages are called ham, as opposed to spam messages.
Let’s inspect some messages:
|
|
ham --- Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
spam --- XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
ham --- Just forced myself to eat a slice. I'm really not hungry tho. This sucks. Mark is getting worried. He knows I'm sick when I turn down pizza. Lol
Since we will need to apply machine learning algorithms at some point, we should start by splitting the current dataset into training and testing sets. We will use the train_test_split function from scikit-learn. If the library is not installed, you can install it with the command:
conda install scikit-learn
or
pip install scikit-learn
Let’s split the dataset:
|
|
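A sketch of the split (a 25% test size matches the set sizes printed below; the random_state is an assumption):

```python
from sklearn.model_selection import train_test_split

# Hold out 25% of the messages for testing
data_train, data_test = train_test_split(data, test_size=0.25, random_state=42)
```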
Let’s print some information about the two sets:
|
|
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4179 entries, 5062 to 2863
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 4179 non-null object
1 text 4179 non-null object
dtypes: object(2)
memory usage: 97.9+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1393 entries, 1537 to 4118
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 1393 non-null object
1 text 1393 non-null object
dtypes: object(2)
memory usage: 32.6+ KB
Let’s see the first elements of each set:
|
|
class | text | |
---|---|---|
5062 | ham | Ok i also wan 2 watch e 9 pm show... |
39 | ham | Hello! How's you and how did saturday go? I wa... |
4209 | ham | No da:)he is stupid da..always sending like th... |
4500 | ham | So wat's da decision? |
3578 | ham | Multiply the numbers independently and count d... |
|
|
class | text | |
---|---|---|
1537 | ham | All sounds good. Fingers . Makes it difficult ... |
963 | ham | Yo chad which gymnastics class do you wanna ta... |
4421 | ham | MMM ... Fuck .... Merry Christmas to me |
46 | ham | Didn't you get hep b immunisation in nigeria. |
581 | ham | Ok anyway no need to change with what you said |
Question 6 Compare the indexes of the two DataFrames with the indexes of the full |
We may want to check how many elements belong to each category:
|
|
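This can be checked with a one-liner such as:

```python
# Number of training messages per class
print(data_train['class'].value_counts())
```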
Question 7 The dataset is very unbalanced. Why is this something we should keep in mind? |
2.1 Tokenizing and counting words with CountVectorizer
In order to create our bag of words representation, we will need to tokenize each message, remove stop words and compute word counts. We could do this using spaCy. However, scikit-learn makes available some tools to perform feature extraction in an automated and efficient way.
To perform tokenization, we will use the CountVectorizer object. This object allows us to process a series of documents (the training set), extract a vocabulary out of them (the set of all words appearing in the documents) and transform each document into a vector reporting the number of occurrences of each word. Let’s import and create a CountVectorizer object:
|
|
CountVectorizer uses a syntax which we will see is common to many scikit-learn objects:
- A fit method can be used to tune the internal parameters of the CountVectorizer object. In this case, this is mainly the vocabulary. The input to the method is a list of text messages;
- A transform method can be used to transform a list of documents into a sparse matrix in which each row is a vector containing the number of occurrences of each vocabulary word in the corresponding document. Since most of these numbers will be zero (documents don’t usually contain all words), a sparse matrix is used instead of a conventional dense matrix to save memory;
- A fit_transform method which performs fit and transform at the same time.
Let’s see an example:
|
|
CountVectorizer()
We can access the vocabulary created by CountVectorizer
as follows:
|
|
{'this': 5, 'is': 0, 'list': 1, 'of': 3, 'short': 4, 'messages': 2}
The vocabulary is a dictionary which maps each word to a unique integer identifier. Note that CountVectorizer also dropped the single-character word a (by default, its token pattern only keeps words of at least two characters). We can now transform text using the transform method:
|
|
<3x6 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
As previously mentioned, the output will be a sparse matrix for memory efficiency. Since the matrix is small in our simple example, we can visualize its dense version with no trouble:
|
|
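The cells above might correspond to something like the following sketch; the toy corpus is a guess, chosen only to be consistent with the vocabulary and counts shown:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A hypothetical toy corpus ("a" is dropped by the default token pattern)
corpus = ['This is a', 'list of', 'short messages']

vect = CountVectorizer()
vect.fit(corpus)                 # builds the vocabulary
print(vect.vocabulary_)

X = vect.transform(corpus)       # sparse document-term matrix
print(X)
print(X.toarray())               # dense view, fine for a tiny example
```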
Each row of the matrix corresponds to a document. Each column corresponds to a word, according to the ids included in the vocabulary dictionary. For instance, if we compare the first document with the corresponding vector:
|
|
We note that it contains one instance of this (index $5$ in the vocabulary) and one instance of is (index $0$).
Interestingly, if a new document contains words which were not contained in the original corpus of documents, they are simply discarded:
|
|
As we can see, new and message have been ignored as they were not in the original training set.
Let’s compute word counts on all the training set:
|
|
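A sketch of this step (the variable names are assumptions, reused in the following sketches):

```python
vect = CountVectorizer()
X_train_counts = vect.fit_transform(data_train['text'])
print(X_train_counts.shape)   # (number of messages, size of the vocabulary)
```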
Question 8 Why |
The feature matrix is now a large sparse matrix with many rows (the number of samples) and many columns (the number of words). We can also check the length of the vocabulary as follows:
|
|
3. Nearest Neighbor and Multinomial Naive Bayes Classification
Word counts are a simple representation for text. Let’s see how well we can classify samples by using this representation. We will consider the K-Nearest Neighbor classifier for this task. Using it is very straightforward with scikit-learn. Let’s import the KNeighborsClassifier object and create a 1-NN:
|
|
This object has a similar interface:
- A fit method is used to tune the parameters of the classifier. In this case, it is used to provide (and memorize) the training set. We need to provide both the input features and the corresponding labels;
- A predict method is used to classify a sample.
|
|
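A sketch of the import, construction and fitting steps:

```python
from sklearn.neighbors import KNeighborsClassifier

# 1-Nearest Neighbor classifier trained on the word-count features
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_counts, data_train['class'])
```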
We can now use the knn object to classify a new message, but first, we need to extract features from our message. Let’s consider an example from the test set:
|
|
To see if this is spam or ham, we first need to extract features from it:
|
|
Now we can classify it with predict:
|
|
We can compare this with the actual class of the message:
|
|
|
|
|
|
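A sketch of the classification of a single test message (which message is picked is an assumption):

```python
# Take one message from the test set
message = data_test['text'].iloc[0]

# Extract features with the already-fitted vectorizer (transform, not fit_transform)
x = vect.transform([message])

print(knn.predict(x))              # predicted class
print(data_test['class'].iloc[0])  # actual class, for comparison
```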
Apparently, the KNN classifier got it wrong. A good idea would be to assess the performance on a larger set of samples. Let’s do this for the whole test set:
|
|
We now need to evaluate how good our classifier is at guessing the right class. We can compute accuracy using the accuracy_score function from scikit-learn:
|
|
Alternatively, we can directly obtain the accuracy using the score method of the KNN object, providing both the test features and labels:
|
|
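A sketch of the evaluation on the whole test set:

```python
from sklearn.metrics import accuracy_score

X_test_counts = vect.transform(data_test['text'])
predictions = knn.predict(X_test_counts)

print(accuracy_score(data_test['class'], predictions))

# Equivalent shortcut
print(knn.score(X_test_counts, data_test['class']))
```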
Question 9 Our method achieved a good accuracy on the test set. Is this enough to evaluate its performance? What would be the performance of an approach systematically classifying messages as |
Scikit-learn also offers a confusion_matrix function to compute confusion matrices:
|
|
Question 10 Compare the confusion matrix with the accuracy. Are we learning something different? |
We can compute precision and recall as follows:
|
|
We have a high precision and a low recall. Similarly, we can compute per-class $F_1$ scores with the f1_score function:
|
|
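A sketch of the metric computations in the last few cells (treating spam as the positive class is an assumption):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

print(confusion_matrix(data_test['class'], predictions))

print(precision_score(data_test['class'], predictions, pos_label='spam'))
print(recall_score(data_test['class'], predictions, pos_label='spam'))

# Per-class F1 scores
print(f1_score(data_test['class'], predictions, average=None))
```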
Question 11 What are the F1 scores telling us? Compare this with the accuracy and the confusion matrix. |
We can easily repeat the process for a KNN with a different K. Let’s try a 5-NN:
|
|
Let us also compute precision and recall:
|
|
Question 12 Which of the two classifiers worked best? |
3.1 Optimizing the hyperparameters with cross validation
We can optimize the value of K using cross validation on the training set. To do so, we can use the GridSearchCV object:
|
|
We can obtain the best parameters as follows:
|
|
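A sketch of the grid search (the candidate values of K and the number of folds are assumptions):

```python
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X_train_counts, data_train['class'])

print(grid.best_params_)
```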
3.2 Multinomial Naive Bayes
Using a different classifier is pretty straightforward with scikit-learn. Let’s see how to perform classification with a multinomial Naive Bayes classifier:
|
|
Let us compute precision and recall:
|
|
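A sketch of the Naive Bayes training and evaluation:

```python
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_counts, data_train['class'])
nb_predictions = nb.predict(X_test_counts)

print(nb.score(X_test_counts, data_test['class']))
print(precision_score(data_test['class'], nb_predictions, pos_label='spam'))
print(recall_score(data_test['class'], nb_predictions, pos_label='spam'))
```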
Question 13 Which of the classifiers seen so far performs best on the test set? Is the margin narrow or large? |
4. Advanced Tools
In this section, we will see some advanced tools which may be useful to obtain more sophisticated natural language processing pipelines.
4.1 TF-IDF
We have seen that, in practice, it can be useful to weight words by their frequency in the corpus and in the single document by using TF-IDF. This can be done very easily in scikit-learn using a TfidfTransformer object. Let’s see how this changes the results of a KNN classifier. For clarity, we will report the entire training/test processing pipeline:
|
|
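A sketch of the TF-IDF pipeline (counts are computed first and then re-weighted; the 1-NN is kept for comparability):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vect = CountVectorizer()
tfidf = TfidfTransformer()

# Fit on the training set, then apply the same transformations to the test set
X_train_tfidf = tfidf.fit_transform(vect.fit_transform(data_train['text']))
X_test_tfidf = tfidf.transform(vect.transform(data_test['text']))

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_tfidf, data_train['class'])
print(knn.score(X_test_tfidf, data_test['class']))
```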
When performing TF-IDF, we can skip the “idf” part. This way, words will only be weighted by their frequency in each document. To do so, we need to specify the use_idf=False flag when creating the TF-IDF transformer:
|
|
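Only one line changes with respect to the previous sketch:

```python
# Term-frequency only: the idf weighting is disabled
tfidf = TfidfTransformer(use_idf=False)
```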
Question 14 Compare the results of the classifiers using TF-IDF with those obtained with the classifier using word counts. Which achieves better performance? Repeat the same comparison with a Naive Bayes classifier. Do we observe similar patterns? Why? |
4.2 Custom Tokenization
Scikit-learn does not include tools to perform stemming, lemmatization, part-of-speech tagging etc. In some cases, however, it can be useful to consider these as features (or as additional features). To achieve this, we can combine scikit-learn with an external library such as spaCy.
This is done by providing the CountVectorizer object with a custom tokenizer which splits a sentence into tokens. Let’s build a tokenizer which considers parts of speech:
|
|
We can use the tokenizer to split any sentence into tokens:
|
|
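A possible implementation of the tokenizer (the name POSTokenizer and the use of coarse-grained tags follow the text, but the exact implementation is an assumption):

```python
class POSTokenizer:
    """Map a text to the sequence of coarse-grained POS tags of its tokens."""
    def __call__(self, text):
        return [token.pos_ for token in nlp(text)]

tokenizer = POSTokenizer()
print(tokenizer("Let's go to N.Y.!"))  # e.g. ['VERB', 'PRON', 'VERB', 'ADP', 'PROPN', 'PUNCT']
```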
We can then build a CountVectorizer which uses the POSTokenizer as follows:
|
|
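A sketch of the vectorizer built on top of the tokenizer (lowercase=False keeps the uppercase tag names intact):

```python
pos_vect = CountVectorizer(tokenizer=POSTokenizer(), lowercase=False)
X_train_pos = pos_vect.fit_transform(data_train['text'])
X_test_pos = pos_vect.transform(data_test['text'])
```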
This can be easily integrated into a classification pipeline as follows:
|
|
Note that by using coarse-grained POS, we only have $18$ features:
|
|
We can combine these $18$ features with the previous representation based on word counts by concatenating the features:
|
|
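One way to concatenate the two sparse feature matrices is scipy's hstack (a sketch, assuming the matrices computed in the previous sketches):

```python
from scipy.sparse import hstack

# Column-wise concatenation of word-count and POS-count features
X_train_combined = hstack([X_train_counts, X_train_pos])
X_test_combined = hstack([X_test_counts, X_test_pos])
```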
Let’s integrate this into the classification pipeline:
|
|
Similar kinds of processing can be performed using other tokens such as named entities.
4.3 Scikit-Learn Pipelines
As we have seen in the previous examples, classification algorithms are always made up of several components forming a pipeline which includes feature extraction and inference. This kind of processing can be simplified using Scikit-Learn pipelines. Let us see an example:
|
|
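A sketch of a pipeline chaining feature extraction, TF-IDF weighting and a classifier (the specific stages are an example, not necessarily the ones used in the hidden cell):

```python
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

pipeline.fit(data_train['text'], data_train['class'])
print(pipeline.score(data_test['text'], data_test['class']))
```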
We can also merge multiple representations using the FeatureUnion object:
|
|
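A sketch of how FeatureUnion can merge the word-based and POS-based representations inside a single pipeline:

```python
from sklearn.pipeline import FeatureUnion

features = FeatureUnion([
    ('words', CountVectorizer()),
    ('pos', CountVectorizer(tokenizer=POSTokenizer(), lowercase=False)),
])

pipeline = Pipeline([('features', features), ('clf', MultinomialNB())])
pipeline.fit(data_train['text'], data_train['class'])
print(pipeline.score(data_test['text'], data_test['class']))
```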
Exercises
Exercise 1 Consider the following wikipedia page: https://en.wikipedia.org/wiki/Harry_Potter Download it and extract the text using Beautiful Soup. Then, perform the following processing:
|
Exercise 2 Train ham-vs-spam classifiers based on a bag of words representation which considers only named entities. Compare the performance of a 1-NN with those of a Naive Bayes classifier. |
Exercise 3 Use the |
References
- spaCy documentation: https://spacy.io/
- NLTK documentation: https://www.nltk.org/
- Scikit-learn tutorial on text processing: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html