
NLP Text Preprocessing



Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

Outline of Text Preprocessing:

Generally, there are 3 main components:
1. Tokenization
2. Normalization
3. Noise removal

Tokenization is about splitting strings of text into smaller pieces, or “tokens”. Paragraphs can be tokenized into sentences, and sentences can be tokenized into words. Normalization aims to put all text on a level playing field, e.g., converting all characters to lowercase. Noise removal cleans up the text, e.g., removing extra whitespace.
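The two tokenization levels above can be sketched with the standard library alone. This is a deliberately naive illustration, assuming English punctuation conventions; real pipelines would use a tokenizer from NLTK or spaCy, which handle abbreviations, quotes, and other edge cases.

```python
import re

def sent_tokenize(text):
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    # Naive word splitter: keep runs of letters, digits, and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Tokenization splits text.  It works on sentences and words!"
sentences = sent_tokenize(text)
print(sentences)
print([word_tokenize(s) for s in sentences])
```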

We perform a series of steps under each component:

1. Remove HTML tags
2. Remove extra whitespaces
3. Convert accented characters to ASCII characters
4. Expand contractions
5. Remove special characters
6. Lowercase all texts
7. Convert number words to numeric form
8. Remove numbers
9. Remove stopwords
10. Lemmatization
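Several of the steps above can be sketched in one small function using only the standard library. The contraction map is a tiny hand-made stand-in for illustration; a real pipeline would use a fuller resource, and an HTML parser rather than a regex for tag removal.

```python
import re
import unicodedata

# Tiny illustrative contraction map; real pipelines use a much fuller list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def preprocess(text):
    # Step 1: remove HTML tags (crude regex; use a parser for real HTML).
    text = re.sub(r"<[^>]+>", " ", text)
    # Step 3: convert accented characters to their ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Step 6: lowercase all text (done before contraction lookup so keys match).
    text = text.lower()
    # Step 4: expand contractions.
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    # Steps 5 and 8: remove special characters and numbers.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Step 2: collapse extra whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("<p>Café   visitors can't wait — it's 5 stars!</p>"))
```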

Stopwords:
Stopwords are very common words. Words like “we” and “are” probably do not help at all in NLP tasks such as sentiment analysis or text classification. Hence, we can remove stopwords to save computing time and effort when processing large volumes of text.
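Stopword removal is a simple set-membership filter. The list below is a small illustrative sample; in practice you would use the full stopword lists shipped with NLTK or spaCy.

```python
# Small illustrative stopword set; real lists (NLTK, spaCy) are far larger.
STOPWORDS = {"we", "are", "the", "a", "an", "is", "of", "to", "in"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stopword set.
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "we are learning the basics of text preprocessing".split()
print(remove_stopwords(tokens))  # → ['learning', 'basics', 'text', 'preprocessing']
```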

Lemmatization:
Lemmatization is the process of converting a word to its base form, e.g., “caring” to “care”. We use spaCy’s lemmatizer to obtain the lemma, or base form, of the words.

Another method to obtain the base form of a word is stemming. We can consider stemming if processing speed is of utmost concern. But do take note that stemming is a crude heuristic that chops the ends off words, so the result may not be an actual word. E.g., stemming “caring” will result in “car”.
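The crude suffix-chopping heuristic can be seen in a toy stemmer like the one below. This is an illustration only, assuming a handful of English suffixes; real code would use NLTK's PorterStemmer for stemming or spaCy's lemmatizer for proper lemmas.

```python
def toy_stem(word):
    # Toy stemmer: chop a known suffix off the end if enough of the
    # word remains. Real stemmers (e.g. Porter) apply many more rules.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("caring"))  # → "car", not the lemma "care"
print(toy_stem("cats"))    # → "cat"
```

The "caring" → "car" output shows exactly why stemming can produce misleading results where a lemmatizer would return "care".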
