NLP - Text Preprocessing
From now on, we will focus on a specific domain, Natural Language Processing (NLP), in part because my summer project is about Named Entity Recognition (NER). Therefore, I need to know some text preprocessing techniques and have a good understanding of the state-of-the-art NLP models, particularly BiLSTM + CRF. The very first step in NLP is text preprocessing, so I am going to start from here.
Converting text to lowercase
For some letter-based languages such as English, letters can be either lowercase or uppercase, e.g. book, Book or BOOK. Since these forms represent the same word book with the same semantics, there is no need to encode it three times. The common practice is to convert all words into a consistent written style, usually lowercase, for its readability and conciseness.
However, this method can sometimes affect the semantics of specific words; for instance, May and may are two different words. For tasks that involve part-of-speech analysis, this matters.
In Python, it's easy to convert a string to lowercase,
text = text.lower()
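For example, with a toy sentence of my own:
print('May I borrow this Book?'.lower())
# may i borrow this book?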
Removing Punctuation
For some tasks, punctuation marks such as , . " @ # are meaningless to us, so we should remove them. In Python, we can do this using the sub() function of the re module to replace any matched punctuation with an empty string,
re.sub('[,\'.!"]', '', '"20.2 dollars! That\'s impossible", he said.')
#>> 202 dollars Thats impossible he said
From the above code, we can see some issues:
- Contractions: That's -> Thats
- Prices: 20.2 dollars -> 202 dollars
Maybe we can write corpus-specific rules to remove only particular punctuation marks, but that would be time-consuming and inflexible.
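As a rough illustration of such a rule, here is a sketch (my own regex, not a general solution) that keeps periods between digits and apostrophes inside words while stripping the other punctuation marks used above:
import re

text = '"20.2 dollars! That\'s impossible", he said.'
# keep '.' only between digits and apostrophes only between letters;
# remove commas, quotes, exclamation marks, '@' and '#'
pattern = r"(?<!\d)\.(?!\d)|(?<![A-Za-z])'|'(?![A-Za-z])|[,!\"@#]"
cleaned = re.sub(pattern, '', text)
# -> 20.2 dollars That's impossible he said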
Tokenization
Tokenization is a technique to split a piece of text into smaller units. A unit could be
- a single word
- a character
- a combination of several words
Why do we tokenize text? The answer is that the ultimate goal of tokenization is to build a vocabulary for the corpus. After all, the words in each sentence ultimately come from this vocabulary.
Okay, so how do we perform tokenization to get a sequence of words? A naive way is to split a string by whitespace, as shown below.
text = text.split(' ')
However, there are some issues. For example, United Kingdom will be split into two separate tokens, United and Kingdom, even though it names a single entity. A common practice is to use the popular Natural Language Toolkit (NLTK) to do this,
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
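Continuing from the import above, word_tokenize also separates punctuation and contractions into their own tokens (a quick illustration with my own sentence; the exact splitting may vary slightly across NLTK versions):
word_tokenize("I can't do this.")
# ['I', 'ca', "n't", 'do', 'this', '.']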
With these tokens, how do we build the vocabulary? Usually, there are two ways to do this (see the sketch after this list),
- include each unique token in the vocabulary
- include only the top K most frequently occurring tokens; the idea is that the most frequent tokens carry most of the useful information
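Here is a minimal sketch of the second approach; the helper name build_vocab, the toy sentences, and the choice to reserve index 0 for unknown tokens are my own:
from collections import Counter

def build_vocab(tokenized_sentences, top_k=None):
    # count how often each token appears across all sentences
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # keep only the top-K most frequent tokens if requested
    tokens = [tok for tok, _ in counts.most_common(top_k)]
    # reserve index 0 for unknown / out-of-vocabulary tokens
    return {tok: idx + 1 for idx, tok in enumerate(tokens)}

vocab = build_vocab([['the', 'cat', 'sat'], ['the', 'dog', 'sat']], top_k=3)
# 'the' and 'sat' appear twice, so they are guaranteed to be kept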
Stemming & Lemmatization
Grammatically, a word can have several variants that express tense, mood, and so on. For example,
- go, goes, went, gone are different tenses of go
- higher, highest are the comparative and superlative forms of high
In other words, these variants originate from their root words. Most words follow regular rules to generate the corresponding forms, though there are some irregular cases, as shown below,
- eat, ate, eaten
- good, better, best
Stemming and lemmatization are two common methods to convert each word back to its original form, or root.
Stemming
Stemming is a simple and crude method that simply chops off letters from the end of a word until a common root is found, which is the idea behind the Porter stemming algorithm.
from nltk.stem import PorterStemmer, WordNetLemmatizer
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'saw']
p_stemmer = PorterStemmer()
for word in words:
    print(word + '-->' + p_stemmer.stem(word))
# run-->run
# runner-->runner
# running-->run
# ran-->ran
# runs-->run
# easily-->easili
# saw-->saw
We can see that the result of stemming may not be a real word, e.g. easili, and that irregular forms such as ran and saw are not processed correctly.
Lemmatization
In contrast, lemmatization returns the true root, or lemma, of a word by considering the whole vocabulary of a language and the context of the word in the sentence. In the case of ran and saw, the true root words, or lemmas, are run and see respectively.
NLTK provides a popular lemmatizer for us, the WordNet Lemmatizer, which is built on WordNet, a large lexical database of English. But before using it, we need to know the part of speech of each word, which can also be done with NLTK's pos_tag() function.
# we manually specify the part of speech for each word
words = [
("grows", 'v'), ('running', 'v'), ("better", 'a'), ("cats", 'n'), ('quickly', 'r')
]
lemmatizer = WordNetLemmatizer()
[ lemmatizer.lemmatize(w, p) for w, p in words ]
# ['grow', 'run', 'good', 'cat', 'quickly']
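In practice, the tags come from pos_tag(), which returns Penn Treebank tags such as 'NNS' (plural noun) or 'VBD' (past-tense verb). A quick illustration with my own sentence (the exact tags depend on the tagger NLTK loads):
from nltk import pos_tag, word_tokenize

pos_tag(word_tokenize('The cats ran quickly'))
# e.g. [('The', 'DT'), ('cats', 'NNS'), ('ran', 'VBD'), ('quickly', 'RB')]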
Removing stop words
It seems that we've obtained the tokens as desired. But wait, let's plot the number of occurrences of each word.
Figure 1 shows the frequency of each token in the built-in NLTK corpus named 1789-Washington.txt. It can be seen that most tokens are determiners, prepositions, conjunctions or other function words that carry little lexical meaning. Obviously, we need to remove them, which can be done easily using NLTK,
from nltk.corpus import stopwords
stop_words_en = stopwords.words('english')
tokens = [ w for w in tokens if w not in stop_words_en ]
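For reference, a frequency distribution like the one in Figure 1 can be computed with NLTK's FreqDist. This sketch assumes the corpus comes from nltk.corpus.inaugural, which contains the file 1789-Washington.txt (the original plotting code is not shown in this post):
from nltk import FreqDist
from nltk.corpus import inaugural

words = inaugural.words('1789-Washington.txt')
freq = FreqDist(w.lower() for w in words)
freq.plot(30)  # plot the 30 most frequent tokens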
Putting it together
So far, we've introduced some basic and necessary text preprocessing techniques. Now let's put them together to build the whole text preprocessing pipeline.
import string, re
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
def get_wordnet_pos(tag):
    # map Penn Treebank tags to the WordNet POS constants the lemmatizer expects
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("R"):
        return wordnet.ADV
    elif tag.startswith("V"):
        return wordnet.VERB
    else:
        return wordnet.NOUN
lem = WordNetLemmatizer()
stop_words_en = stopwords.words('english')
def clean_sentence(text):
    # convert to lowercase
    text = text.lower()
    # replace newlines with spaces and remove punctuation
    text = re.sub('\n', ' ', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # tokenization
    # text.split(' ') is the naive method, e.g. New York would be split into ['New', 'York']
    tokens = word_tokenize(text)
    # POS-tag, then lemmatize each token according to its part of speech
    tokens = pos_tag(tokens)  # [(a, 'NN'), (went, 'VB'), (c, 'NN')]
    tokens = [ lem.lemmatize(w, get_wordnet_pos(tag)) for w, tag in tokens ]
    # remove stop words and single-character tokens
    tokens = [ w for w in tokens if w not in stop_words_en ]
    tokens = [ w for w in tokens if len(w) > 1 ]
    return ' '.join(tokens)
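A quick usage check with my own example sentence (the exact output depends on the tagger and lemmatizer versions):
clean_sentence('The striped bats were hanging on their feet.')
# something like 'striped bat hang foot'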