Text preprocessing is traditionally an important step for natural language processing (NLP) tasks.
It transforms text into a more digestible form so that machine learning algorithms can perform better.
Outline of Text Preprocessing:
Generally, there are 3 main components:
1. Tokenization
2. Normalization
3. Noise removal
Tokenization is about splitting strings of text into smaller pieces, or “tokens”. Paragraphs can be tokenized
into sentences, and sentences can be tokenized into words. Normalization aims to put all text on a level playing field, e.g.,
converting all characters to lowercase. Noise removal cleans up the text, e.g., removing extra whitespace.
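For instance, tokenizing with spaCy (a minimal sketch; it assumes the small English model en_core_web_sm has been downloaded):

```python
import spacy

# Load spaCy's small English model
# (install first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Text preprocessing matters. It helps models perform better.")

# Sentence tokenization
for sent in doc.sents:
    print(sent.text)

# Word tokenization
print([token.text for token in doc])
```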
We performed a series of steps under each component (a sketch of the full pipeline follows this list):
1. Remove HTML tags
2. Remove extra whitespaces
3. Convert accented characters to ASCII characters
4. Expand contractions
5. Remove special characters
6. Lowercase all text
7. Convert number words to numeric form
8. Remove numbers
9. Remove stopwords
10. Lemmatization
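How these steps are wired together will vary from project to project. Below is a minimal sketch of one possible pipeline using Python's standard library, BeautifulSoup, and spaCy. The tiny CONTRACTIONS and NUMBER_WORDS dictionaries and the preprocess helper are illustrative assumptions, not a complete mapping (fuller mappings are available in packages such as contractions and word2number):

```python
import re
import unicodedata

import spacy
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

nlp = spacy.load("en_core_web_sm")

# Illustrative mini-dictionaries; real pipelines need fuller mappings.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}

def preprocess(text):
    # 1. Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # 3. Convert accented characters to ASCII characters
    text = (unicodedata.normalize("NFKD", text)
                       .encode("ascii", "ignore")
                       .decode("utf-8"))
    # 4. Expand contractions
    for contraction, expanded in CONTRACTIONS.items():
        text = re.sub(re.escape(contraction), expanded, text,
                      flags=re.IGNORECASE)
    # 5. Remove special characters
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # 6. Lowercase all text
    text = text.lower()
    # 7. Convert number words to numeric form
    words = [NUMBER_WORDS.get(w, w) for w in text.split()]
    # 8. Remove numbers
    words = [w for w in words if not w.isdigit()]
    # 2. Remove extra whitespaces (split/join collapses runs of spaces)
    text = " ".join(words)
    # 9. Remove stopwords and 10. lemmatize, both via spaCy
    doc = nlp(text)
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop)

print(preprocess("<p>He said he can't visit the THREE cafés!</p>"))
```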
Stopwords:
Stopwords are very common words. Words like “we” and “are” probably do not help at all in NLP tasks such as sentiment
analysis or text classification. Hence, we can remove stopwords to save computing time and effort when processing large volumes of
text.
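With spaCy, each token exposes an is_stop flag, and the default stopword list is a plain Python set that can be inspected and extended (a sketch; the added word “learning” is just an example of a domain-specific stopword):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("We are learning how text preprocessing works")
print([tok.text for tok in doc if not tok.is_stop])

# Inspect and customize the default stopword list.
print(len(nlp.Defaults.stop_words))
nlp.Defaults.stop_words.add("learning")  # treat "learning" as a stopword
nlp.vocab["learning"].is_stop = True     # keep the lexeme flag in sync
```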
Lemmatization:
Lemmatization is the process of converting a word to its base form, e.g., “caring” to “care”. We use spaCy’s lemmatizer to obtain
the lemma, or base form, of the words.
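For example:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("He was caring for the children")
for token in doc:
    print(token.text, "->", token.lemma_)  # e.g., "caring" -> "care"
```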
Another method to obtain the base form of a word is stemming, which we can consider if processing speed
is of utmost concern. But do take note that stemming is a crude heuristic that chops the ends off words, so the
result may not be an actual word; e.g., an aggressive stemmer may reduce “caring” to “car”.
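spaCy does not ship a stemmer, so here is a sketch using NLTK's stemmers instead (an assumption, not part of the pipeline above):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()  # more aggressive than Porter

for word in ["caring", "studies", "troubling"]:
    # Stemmers strip suffixes heuristically, so the output
    # is not guaranteed to be a real word.
    print(word, porter.stem(word), lancaster.stem(word))
```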