NLPretext is composed of 4 modules: basic, social, token and augmentation.
Each of them includes different functions to handle the most important text preprocessing tasks.
The basic module is a catalogue of general-purpose functions that apply to any use case. They let you handle:
- Irregular whitespace and end-of-line characters
- Encoding issues
- Special characters such as currency symbols, numbers, punctuation marks, Latin and non-Latin characters
- Emails and phone numbers
from nlpretext.basic.preprocess import replace_emails

example = "I have forwarded this email to firstname.lastname@example.org"
example = replace_emails(example, replace_with="*EMAIL*")
# "I have forwarded this email to *EMAIL*"
The social module is a catalogue of handy functions that can be useful when processing social data, such as:
- HTML tag cleaning
from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
The augmentation module generates new texts from your examples by modifying some of their words while keeping any associated entities unchanged, which matters for NER tasks. To keep words other than entities unchanged, pass them in the stopwords argument. The modifications depend on the chosen method; the module currently supports substitution with synonyms using WordNet or BERT, both through the nlpaug library.
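To make the idea concrete, here is a minimal sketch of synonym substitution that leaves entity spans and stopwords untouched. It is not the module's implementation: the `SYNONYMS` table is a toy stand-in for WordNet or BERT suggestions, and the `augment` function, its parameters and the entity dictionary layout (`start` and `end` character offsets) are assumptions made for illustration.

```python
import random
import re

# Toy synonym table standing in for WordNet/BERT suggestions (hypothetical data).
SYNONYMS = {
    "bought": ["purchased"],
    "car": ["vehicle"],
    "yesterday": ["recently"],
}


def augment(text, entities=(), stopwords=(), seed=0):
    """Swap words for synonyms, keeping entity spans and stopwords unchanged.

    Returns the new text and the entities with offsets recomputed for it.
    """
    rng = random.Random(seed)
    protected = {w.lower() for w in stopwords}
    pieces, new_spans, position = [], {}, 0
    for match in re.finditer(r"\S+", text):
        word = match.group()
        # Index of the first entity overlapping this token, if any.
        hit = next(
            (i for i, e in enumerate(entities)
             if match.start() < e["end"] and e["start"] < match.end()),
            None,
        )
        candidates = SYNONYMS.get(word.lower(), [])
        if hit is None and word.lower() not in protected and candidates:
            word = rng.choice(candidates)
        if hit is not None:
            # Entity text is unchanged, but its offsets may have shifted.
            start = new_spans.get(hit, (position, None))[0]
            new_spans[hit] = (start, position + len(word))
        pieces.append(word)
        position += len(word) + 1  # +1 for the joining space
    new_entities = [
        dict(e, start=new_spans[i][0], end=new_spans[i][1])
        for i, e in enumerate(entities) if i in new_spans
    ]
    return " ".join(pieces), new_entities


new_text, new_entities = augment(
    "Alice bought a car yesterday",
    entities=[{"entity": "PERSON", "start": 0, "end": 5}],
    stopwords=["yesterday"],
)
```

The sketch splits on whitespace only, so a word with attached punctuation (such as "car.") would not be matched against the synonym table.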
Create your end-to-end pipeline
Our library provides a Preprocessor object to efficiently pipe all preprocessing operations.
If you need to keep all elements of your text and perform only minimal cleaning, use the default pipeline. It normalizes whitespace, removes newline characters, fixes Unicode problems, and removes recurrent artifacts of social data such as mentions, hashtags, and HTML tags.
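As a rough illustration of what such minimal cleaning involves, the sketch below reproduces the listed steps with the standard library only. It is an approximation, not the library's default pipeline: `minimal_clean` is a hypothetical name, the regular expressions are simplistic, and NFKC normalization is used here as a stand-in for Unicode fixing.

```python
import re
import unicodedata


def minimal_clean(text):
    """Approximate a minimal cleaning pass using only the standard library."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = re.sub(r"\s+", " ", text)            # collapse newlines and whitespace runs
    return text.strip()


cleaned = minimal_clean(
    "I just got the best dinner in my life @latourdargent !!! "
    "<b>I recommend</b> #food #paris \n"
)
```

Note that the naive `[@#]\w+` pattern would also clip the domain of an email address, so a real pipeline needs to handle emails before mentions.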
If you already know which preprocessing functions you want to chain, you can add them to your own Preprocessor.
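The piping pattern itself can be sketched in a few lines. `MiniPreprocessor` below is a hypothetical stand-in written for this example, not the library's Preprocessor; the `pipe(func, args=...)` and `run(text)` method shapes are assumptions, and the three cleaning functions are toy versions defined locally.

```python
import re


class MiniPreprocessor:
    """Hypothetical stand-in: applies registered functions in order."""

    def __init__(self):
        self._steps = []

    def pipe(self, func, args=None):
        # Store the function with its keyword arguments for later use.
        self._steps.append((func, args or {}))
        return self

    def run(self, text):
        # Feed the output of each step into the next one.
        for func, kwargs in self._steps:
            text = func(text, **kwargs)
        return text


def lower_text(text):
    return text.lower()


def remove_punct(text, marks="!?.,;:"):
    return text.translate(str.maketrans("", "", marks))


def normalize_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


preprocessor = MiniPreprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_punct, args={"marks": "!"})
preprocessor.pipe(normalize_whitespace)
result = preprocessor.run("Best dinner   EVER !!!")
```

Order matters in such a chain: whitespace is normalized last here so that gaps left by removed punctuation are also collapsed.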
To install the library please run