
Introducing NLPretext, a unified framework to facilitate text preprocessing.

Insights / Data Science – Machine Learning

23 February 2021
Working on NLP projects? Tired of always looking for the same silly preprocessing functions on the web, such as removing accents from French posts? Tired of spending hours on regex to efficiently extract email addresses from a corpus? Amale El Hamri will show you how NLPretext has you covered!


NLPretext overview

NLPretext is composed of 4 modules: basic, social, token and augmentation.

Each of them includes different functions to handle the most important text preprocessing tasks.

Basic preprocessing

The basic module is a catalogue of transversal functions that can be used in any use case. They allow you to handle:


  • Stray whitespace and end-of-line characters
  • Encoding issues
  • Special characters such as currency symbols, numbers, punctuation marks, Latin and non-Latin characters
  • Emails and phone numbers

from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to john.doe@example.com"  # placeholder address
example = replace_emails(example, replace_with="*EMAIL*")
# "I have forwarded this email to *EMAIL*"
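For intuition, a helper like replace_emails boils down to a regex substitution. Below is a simplified, self-contained sketch (replace_emails_sketch and its pattern are illustrative, not NLPretext's actual implementation; real-world email matching is considerably more involved):

```python
import re

# Simplified email pattern; NLPretext's internal regex is more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def replace_emails_sketch(text, replace_with="*EMAIL*"):
    """Replace every email-like substring with a placeholder token."""
    return EMAIL_RE.sub(replace_with, text)
```

Using the library function instead of a hand-rolled regex saves you from rediscovering the edge cases (plus-addressing, subdomains, trailing punctuation) one bug at a time.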

Social preprocessing


The social module is a catalogue of handy functions that can be useful when processing social data, such as:

  • hashtag extraction/removal
  • emoji extraction/removal
  • mention extraction/removal
  • HTML tag cleaning


from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
# [':grinning_face:']
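To see what these social helpers do, here is a self-contained sketch of hashtag extraction and mention removal using plain regexes (the _sketch functions are illustrative stand-ins, not NLPretext's implementation):

```python
import re

# Simplified patterns: a hashtag or mention is a marker followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")

def extract_hashtags_sketch(text):
    """Return all hashtag-like substrings found in the text."""
    return HASHTAG_RE.findall(text)

def remove_mentions_sketch(text):
    """Strip mention-like substrings and normalize the leftover whitespace."""
    return " ".join(MENTION_RE.sub("", text).split())
```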

Token preprocessing

The token module helps you clean your text on a token level. First you can load a tokenizer to split your sentence into tokens. Then you can:

  • remove stopwords
  • remove small words
  • remove tokens with special characters

from nlpretext.token.preprocess import remove_stopwords
example = ["I", "like", "when", "you", "move", "your", "body"]
example = remove_stopwords(example, lang="en")
# ['I', 'move', 'body']
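The token-level helpers above amount to filtering a list of tokens. The sketch below is illustrative (the toy stopword list and the _sketch names are assumptions, not NLPretext's code, which ships full per-language stopword lists), but it reproduces the example output above:

```python
# Toy stopword list for demonstration only; NLPretext bundles real ones per language.
STOPWORDS_EN = {"like", "when", "you", "your", "the", "a"}

def remove_stopwords_sketch(tokens, stopwords=STOPWORDS_EN):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

def remove_small_words_sketch(tokens, min_len=3):
    """Drop tokens shorter than min_len characters."""
    return [t for t in tokens if len(t) >= min_len]
```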

Text augmentation

The augmentation module helps you generate new texts from your existing examples by modifying some words in the initial ones, while keeping any associated entities unchanged in the case of NER tasks. If you want words other than entities to remain unchanged, you can list them in the stopwords argument. Modifications depend on the chosen method; the ones currently supported by the module are substitutions with synonyms using WordNet or BERT from the nlpaug library.

from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method="wordnet_synonym", entities=entities)
# "I need to buy a small black pocketbook please."
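To illustrate how entities can survive augmentation, here is a toy sketch: it swaps non-entity words using a hand-made synonym table and leaves entity words untouched (the synonym table and augment_text_sketch are illustrative assumptions; NLPretext delegates the real substitutions to nlpaug):

```python
# Hand-made synonym table for demonstration; WordNet/BERT provide these in practice.
SYNONYMS = {"want": "need", "handbag": "pocketbook"}

def augment_text_sketch(text, entities=()):
    """Replace non-entity words with synonyms; never touch entity words."""
    protected = {e["word"] for e in entities}
    out = []
    for word in text.split():
        bare = word.strip(".,!?")
        if bare in protected:
            out.append(word)  # entity word: keep verbatim
        else:
            out.append(word.replace(bare, SYNONYMS.get(bare, bare)))
    return " ".join(out)
```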

Create your end-to-end pipeline

Default pipeline

Our library provides a Preprocessor object to efficiently pipe all preprocessing operations.
If you need to keep all elements of your text and perform minimum cleaning, use the default pipeline. It normalizes whitespace, removes newline characters, fixes unicode problems and removes recurrent artifacts from social data such as mentions, hashtags and HTML tags.

from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
# "I just got the best dinner in my life !!! I recommend"
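Conceptually, the Preprocessor just chains text-to-text functions. A minimal stand-in (PipelineSketch is an illustrative name, not part of NLPretext) might look like:

```python
class PipelineSketch:
    """Minimal stand-in for NLPretext's Preprocessor: chain text -> text functions."""

    def __init__(self):
        self.operations = []

    def pipe(self, func, args=None):
        """Register a function (with optional keyword args) to apply in order."""
        self.operations.append((func, args or {}))
        return self

    def run(self, text):
        """Apply every registered operation to the text, in registration order."""
        for func, args in self.operations:
            text = func(text, **args)
        return text
```

Each operation receives the output of the previous one, which is exactly why the order you pipe functions in matters (e.g. lowercasing before stopword removal).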

Custom pipeline

If you have a clear idea of what preprocessing functions you want to pipe in your preprocessing pipeline, you can add them in your own Preprocessor.

from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters, remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
# "dinner life recommend"

NLPretext installation

To install the library, please run:

pip install nlpretext

You can find the GitHub repository here and the library documentation here.