	{"id":52316,"date":"2021-02-22T17:27:52","date_gmt":"2021-02-22T17:27:52","guid":{"rendered":"https:\/\/www.artefact.com\/?post_type=news&#038;p=52316"},"modified":"2024-09-20T17:45:40","modified_gmt":"2024-09-20T16:45:40","slug":"introducing-nlpretext-a-unified-framework-to-facilitate-text-preprocessing","status":"publish","type":"blog","link":"https:\/\/www.artefact.com\/de\/blog\/introducing-nlpretext-a-unified-framework-to-facilitate-text-preprocessing\/","title":{"rendered":"Introducing NLPretext, a unified framework to facilitate text preprocessing"},"content":{"rendered":"<p><div class=\"fusion-fullwidth fullwidth-box fusion-builder-row-1 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling\" style=\"--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-flex-wrap:wrap;\" ><div class=\"fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap\" style=\"max-width:calc( 1440px + 20px );margin-left: calc(-20px \/ 2 );margin-right: calc(-20px \/ 2 );\"><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-0 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:10px;--awb-margin-bottom-large:0px;--awb-spacing-left-large:10px;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:10px;--awb-spacing-left-medium:10px;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:10px;--awb-spacing-left-small:10px;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-1 description\"><p>23 February 2021<br \/>\nWorking on NLP projects? 
Tired of always looking for the same silly preprocessing functions on the web, such as removing accents from French posts? Tired of spending hours on Regex to efficiently extract email addresses from a corpus? Amale El Hamri will show you how NLPretext got you covered!<\/p>\n<\/div><div class=\"fusion-image-element\" style=\"text-align:center;--awb-caption-title-font-family:var(--h2_typography-font-family);--awb-caption-title-font-weight:var(--h2_typography-font-weight);--awb-caption-title-font-style:var(--h2_typography-font-style);--awb-caption-title-size:var(--h2_typography-font-size);--awb-caption-title-transform:var(--h2_typography-text-transform);--awb-caption-title-line-height:var(--h2_typography-line-height);--awb-caption-title-letter-spacing:var(--h2_typography-letter-spacing);\"><span class=\" fusion-imageframe imageframe-none imageframe-1 hover-type-none\"><a class=\"fusion-no-lightbox\" href=\"https:\/\/medium.com\/artefact-engineering-and-data-science\" target=\"_self\" aria-label=\"1*s986xIGqhfsN8U&#8211;09_AdA\" rel=\"noopener\"><img decoding=\"async\" width=\"300\" height=\"74\" alt=\"Medium Tech Blog by Artefact\" src=\"https:\/\/www.artefact.com\/\/wp-content\/uploads\/2021\/03\/1s986xIGqhfsN8U-09_AdA.png\" data-orig-src=\"https:\/\/www.artefact.com\/\/wp-content\/uploads\/2021\/03\/1s986xIGqhfsN8U-09_AdA-300x74.png\" class=\"lazyload img-responsive wp-image-59273\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%274000%27%20height%3D%27992%27%20viewBox%3D%270%200%204000%20992%27%3E%3Crect%20width%3D%274000%27%20height%3D%27992%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/www.artefact.com\/\/wp-content\/uploads\/2021\/03\/1s986xIGqhfsN8U-09_AdA-200x50.png 200w, https:\/\/www.artefact.com\/\/wp-content\/uploads\/2021\/03\/1s986xIGqhfsN8U-09_AdA-400x99.png 400w, https:\/\/www.artefact.com\/\/wp-content\/uploads\/2021\/03\/1s986xIGqhfsN8U-09_AdA-600x149.png 600w, 
https:\/\/www.artefact.com\/\/wp-content\/uploads\/2021\/03\/1s986xIGqhfsN8U-09_AdA-800x198.png 800w, https:\/\/www.artefact.com\/\/wp-content\/uploads\/2021\/03\/1s986xIGqhfsN8U-09_AdA-1200x298.png 1200w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 640px) 100vw, 300px\" \/><\/a><\/span><\/div><\/div><\/div><\/div><\/div><article class=\"fusion-fullwidth fullwidth-box fusion-builder-row-2 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling\" style=\"--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-flex-wrap:wrap;\" ><div class=\"fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-justify-content-center fusion-flex-content-wrap\" style=\"max-width:calc( 1440px + 20px );margin-left: calc(-20px \/ 2 );margin-right: calc(-20px \/ 2 );\"><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-1 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-color:#ffffff;--awb-bg-color-hover:#ffffff;--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:10px;--awb-margin-bottom-large:0px;--awb-spacing-left-large:10px;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:10px;--awb-spacing-left-medium:10px;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:10px;--awb-spacing-left-small:10px;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-title title fusion-title-1 fusion-sep-none fusion-title-text fusion-title-size-two\" style=\"--awb-text-color:#ff0066;--awb-margin-bottom:20px;--awb-margin-bottom-small:8px;--awb-font-size:30px;\"><h2 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" style=\"font-family:&quot;PT 
Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:30;line-height:1.33;\"><div><\/div>\n<div><\/div>\n<p>NLPretext overview<\/h2><\/div><div class=\"fusion-text fusion-text-2\"><p>NLPretext is composed of 4 modules: basic, social, token and augmentation.<\/p>\n<p>Each of them includes different functions to handle the most important text preprocessing tasks.<\/p>\n<\/div><div class=\"fusion-title title fusion-title-2 fusion-sep-none fusion-title-text fusion-title-size-two\" style=\"--awb-text-color:#ff0066;--awb-margin-top:40px;--awb-margin-bottom:20px;--awb-margin-bottom-small:8px;--awb-font-size:30px;\"><h2 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" style=\"font-family:&quot;PT Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:30;line-height:1.33;\">Basic preprocessing<\/h2><\/div><div class=\"fusion-text fusion-text-3\"><p>The basic module is a catalogue of transversal functions that can be used in any use case. 
They allow you to handle:<\/p>\n<\/div><ul style=\"--awb-line-height:27.2px;--awb-icon-width:27.2px;--awb-icon-height:27.2px;--awb-icon-margin:11.2px;--awb-content-margin:38.4px;\" class=\"fusion-checklist fusion-checklist-1 fusion-checklist-default type-icons paddingListArticle\"><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">Irregular whitespace and end-of-line characters<\/div><\/li><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">Encoding issues<\/div><\/li><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">\n<p>Special characters such as currency symbols, numbers, punctuation marks, Latin and non-Latin characters<\/p>\n<\/div><\/li><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">Emails and phone numbers<\/div><\/li><\/ul><div class=\"fusion-text fusion-text-4\"><div class=\"code\">from nlpretext.basic.preprocess import replace_emails<br \/>\nexample = \"I have forwarded this email to o&#98;&#x61;&#x6d;a&#64;&#119;&#x68;&#x69;t&#101;&#x68;&#x6f;u&#115;&#101;&#x2e;&#x67;o&#118;\"<br \/>\nexample = replace_emails(example, replace_with=\"*EMAIL*\")<br \/>\nprint(example)<br \/>\n# \"I have forwarded this email to *EMAIL*\"<\/div>\n<\/div><div class=\"fusion-title title fusion-title-3 fusion-sep-none fusion-title-text fusion-title-size-two\" 
style=\"--awb-text-color:#ff0066;--awb-margin-top:40px;--awb-margin-bottom:20px;--awb-margin-bottom-small:8px;--awb-font-size:30px;\"><h2 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" style=\"font-family:&quot;PT Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:30;line-height:1.33;\">Social preprocessing<\/h2><\/div><div class=\"fusion-text fusion-text-5\"><p>The <strong>social<\/strong> module is a catalogue of handy functions for processing social media data, such as:<\/p>\n<\/div><ul style=\"--awb-line-height:27.2px;--awb-icon-width:27.2px;--awb-icon-height:27.2px;--awb-icon-margin:11.2px;--awb-content-margin:38.4px;\" class=\"fusion-checklist fusion-checklist-2 fusion-checklist-default type-icons paddingListArticle\"><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">\n<p>hashtag extraction\/removal<\/p>\n<\/div><\/li><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">\n<p>emoji extraction\/removal<\/p>\n<\/div><\/li><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">\n<p>mention extraction\/removal<\/p>\n<\/div><\/li><li class=\"fusion-li-item\" style=\"\"><span class=\"icon-wrapper circle-no\"><i class=\"fusion-li-icon awb-icon-check\" aria-hidden=\"true\"><\/i><\/span><div class=\"fusion-li-item-content\">\n<p>HTML tag cleaning<\/p>\n<\/div><\/li><\/ul><div class=\"fusion-text fusion-text-6\"><div class=\"code\">from nlpretext.social.preprocess import extract_emojis<br \/>\nexample = \"I take care of my skin \ud83d\ude00\"<br 
\/>\nexample = extract_emojis(example)<br \/>\nprint(example)<br \/>\n# [':grinning_face:']<\/div>\n<\/div><div class=\"fusion-title title fusion-title-4 fusion-sep-none fusion-title-text fusion-title-size-two\" style=\"--awb-text-color:#ff0066;--awb-margin-top:40px;--awb-margin-bottom:20px;--awb-margin-bottom-small:8px;--awb-font-size:30px;\"><h2 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" style=\"font-family:&quot;PT Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:30;line-height:1.33;\">Text augmentation<\/h2><\/div><div class=\"fusion-text fusion-text-7\"><p>The augmentation module helps you generate new texts from your examples by modifying some of their words while keeping any associated entities unchanged, which matters for NER tasks. If you want words other than entities to remain unchanged, you can list them in the stopwords argument. The modifications depend on the chosen method; the module currently supports synonym substitution using WordNet or BERT, both through the nlpaug library.<\/p>\n<\/div><div class=\"fusion-title title fusion-title-5 fusion-sep-none fusion-title-text fusion-title-size-two\" style=\"--awb-text-color:#ff0066;--awb-margin-top:40px;--awb-margin-bottom:20px;--awb-margin-bottom-small:8px;--awb-font-size:38px;\"><h2 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" style=\"font-family:&quot;PT Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:38;line-height:1.05;\">Create your end-to-end pipeline<\/h2><\/div><div class=\"fusion-title title fusion-title-6 fusion-sep-none fusion-title-text fusion-title-size-three\" style=\"--awb-text-color:#ff0066;--awb-margin-top:20px;--awb-margin-bottom:0px;--awb-margin-bottom-small:8px;--awb-font-size:30px;\"><h3 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" 
style=\"font-family:&quot;PT Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:30;line-height:1.33;\">Default pipeline<\/h3><\/div><div class=\"fusion-text fusion-text-8\"><p>The library provides a Preprocessor object that chains preprocessing operations into a single pipeline.<br \/>\nIf you need to keep all elements of your text and perform only minimal cleaning, use the default pipeline. It normalizes whitespace, removes newline characters, fixes Unicode problems and removes recurrent artifacts from social data such as mentions, hashtags and HTML tags.<\/p>\n<\/div><div class=\"fusion-text fusion-text-9\"><div class=\"code\">from nlpretext import Preprocessor<\/div>\n<div class=\"code\">text = \"I just got the best dinner in my life @latourdargent !!! I recommend \ud83d\ude00 #food #paris \\n\"<\/div>\n<div class=\"code\">preprocessor = Preprocessor()<\/div>\n<div class=\"code\">text = preprocessor.run(text)<\/div>\n<div class=\"code\">print(text)<\/div>\n<div class=\"code\"># &#8220;I just got the best dinner in my life !!! 
I recommend&#8221;<\/div>\n<\/div><div class=\"fusion-title title fusion-title-7 fusion-sep-none fusion-title-text fusion-title-size-three\" style=\"--awb-text-color:#ff0066;--awb-margin-top:20px;--awb-margin-bottom:0px;--awb-margin-bottom-small:8px;--awb-font-size:30px;\"><h3 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" style=\"font-family:&quot;PT Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:30;line-height:1.33;\">Custom pipeline<\/h3><\/div><div class=\"fusion-text fusion-text-10\"><p>If you have a clear idea of what preprocessing functions you want to pipe in your preprocessing pipeline, you can add them in your own Preprocessor.<\/p>\n<\/div><div class=\"fusion-text fusion-text-11\"><div class=\"code\">from nlpretext import Preprocessor<\/div>\n<div class=\"code\">from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters, remove_stopwords, lower_text)<\/div>\n<div class=\"code\">from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji<\/div>\n<div class=\"code\">text = \"I just got the best dinner in my life @latourdargent !!! 
I recommend \ud83d\ude00 #food #paris \\n\"<\/div>\n<div class=\"code\">preprocessor = Preprocessor()<\/div>\n<div class=\"code\">preprocessor.pipe(lower_text)<\/div>\n<div class=\"code\">preprocessor.pipe(remove_mentions)<\/div>\n<div class=\"code\">preprocessor.pipe(remove_hashtag)<\/div>\n<div class=\"code\">preprocessor.pipe(remove_emoji)<\/div>\n<div class=\"code\">preprocessor.pipe(remove_eol_characters)<\/div>\n<div class=\"code\">preprocessor.pipe(remove_stopwords, args={'lang': 'en'})<\/div>\n<div class=\"code\">preprocessor.pipe(remove_punct)<\/div>\n<div class=\"code\">preprocessor.pipe(normalize_whitespace)<\/div>\n<div class=\"code\">text = preprocessor.run(text)<\/div>\n<div class=\"code\">print(text)<\/div>\n<div class=\"code\"># \"dinner life recommend\"<\/div>\n<\/div><div class=\"fusion-title title fusion-title-8 fusion-sep-none fusion-title-text fusion-title-size-two\" style=\"--awb-text-color:#ff0066;--awb-margin-top:40px;--awb-margin-bottom:20px;--awb-margin-bottom-small:8px;--awb-font-size:30px;\"><h2 class=\"fusion-title-heading title-heading-left fusion-responsive-typography-calculated\" style=\"font-family:&quot;PT Serif&quot;;font-style:normal;font-weight:700;margin:0;font-size:1em;--fontSize:30;line-height:1.33;\">NLPretext installation<\/h2><\/div><div class=\"fusion-text fusion-text-12\"><p>To install the library, run:<\/p>\n<\/div><div class=\"fusion-text fusion-text-13\"><div class=\"code\">pip install nlpretext<\/div>\n<\/div><div class=\"fusion-text fusion-text-14\"><p>You can find the GitHub repository <a href=\"https:\/\/github.com\/artefactory\/NLPretext\/tree\/master\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a> and the library documentation <a href=\"https:\/\/nlpretext.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.<\/p>\n<\/div><\/div><\/div><\/div><\/article><div class=\"fusion-fullwidth fullwidth-box fusion-builder-row-3 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling\" 
style=\"--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-flex-wrap:wrap;\" ><div class=\"fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap\" style=\"max-width:calc( 1440px + 20px );margin-left: calc(-20px \/ 2 );margin-right: calc(-20px \/ 2 );\"><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-2 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:10px;--awb-margin-bottom-large:0px;--awb-spacing-left-large:10px;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:10px;--awb-spacing-left-medium:10px;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:10px;--awb-spacing-left-small:10px;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-15\"><p>This article was first published on the Artefact Tech Blog on Medium.<\/p>\n<\/div><div ><a class=\"fusion-button button-flat fusion-button-default-size button-default fusion-button-default button-1 fusion-button-default-span fusion-button-default-type button-primary-medium\" target=\"_self\" href=\"https:\/\/medium.com\/artefact-engineering-and-data-science\" rel=\"noopener\"><span class=\"fusion-button-text awb-button__text awb-button__text--default\">Discover Our Medium Blog<\/span><\/a><\/div><\/div><\/div><\/div><\/div><\/p>\n","protected":false},"excerpt":{"rendered":"<p>23 February 2021<br \/>\n Working on NLP projects? Tired of always looking for the same silly preprocessing functions on the web, such as removing accents from French posts? 
Tired of spending hours on Regex to efficiently extract email addresses from a corpus? Amale El Hamri will show you how NLPretext has you covered!<\/p>","protected":false},"featured_media":52321,"parent":0,"template":"","meta":{"_acf_changed":false,"ep_exclude_from_search":false},"blog-category":[22035,21930],"blog-language":[2991],"class_list":["post-52316","blog","type-blog","status-publish","has-post-thumbnail","hentry","blog-category-data-ai-consulting","blog-category-finance","blog-language-en"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.artefact.com\/de\/wp-json\/wp\/v2\/blog\/52316","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.artefact.com\/de\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.artefact.com\/de\/wp-json\/wp\/v2\/types\/blog"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.artefact.com\/de\/wp-json\/wp\/v2\/media\/52321"}],"wp:attachment":[{"href":"https:\/\/www.artefact.com\/de\/wp-json\/wp\/v2\/media?parent=52316"}],"wp:term":[{"taxonomy":"blog-category","embeddable":true,"href":"https:\/\/www.artefact.com\/de\/wp-json\/wp\/v2\/blog-category?post=52316"},{"taxonomy":"blog-language","embeddable":true,"href":"https:\/\/www.artefact.com\/de\/wp-json\/wp\/v2\/blog-language?post=52316"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}