{"id":1534,"date":"2020-09-14T19:03:21","date_gmt":"2020-09-14T19:03:21","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/14\/20-popular-nlp-text-preprocessing-techniques-implementation-in-python\/"},"modified":"2020-09-14T19:03:21","modified_gmt":"2020-09-14T19:03:21","slug":"20-popular-nlp-text-preprocessing-techniques-implementation-in-python","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/14\/20-popular-nlp-text-preprocessing-techniques-implementation-in-python\/","title":{"rendered":"20+ Popular NLP Text Preprocessing Techniques Implementation In Python"},"content":{"rendered":"<div id=\"tve_editor\" data-post-id=\"5595\">\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748bf7fa95\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/1-NLP-Text-Preprocessing-Techniques.png?resize=626%2C376&amp;ssl=1\" class=\"tve_image wp-image-5598\" alt=\"NLP Text Preprocessing Techniques\" data-id=\"5598\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"NLP Text Preprocessing Techniques\" loading=\"lazy\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-5598\" alt=\"NLP Text Preprocessing Techniques\" data-id=\"5598\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"NLP Text Preprocessing Techniques\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/1-NLP-Text-Preprocessing-Techniques.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\" data-css=\"tve-u-1748bf69da7\">\n<p dir=\"ltr\">Using the text preprocessing techniques we can remove noise from 
raw data and make it more valuable for building models.\u00a0<\/p>\n<p dir=\"ltr\">Here, raw data is simply the data we collect from different sources, such as reviews from websites, documents, social media, <a href=\"https:\/\/dataaspirant.com\/twitter-sentiment-analysis-using-r\/\" target=\"_blank\" rel=\"noopener noreferrer\">twitter tweets<\/a>, news articles, etc.\u00a0<\/p>\n<p dir=\"ltr\">Data preprocessing is the first and most crucial step in any data science project. Preprocessing the collected data is an integral part of any Natural Language Processing, Computer Vision, deep learning, or machine learning problem. The preprocessing methods to apply depend on the type of dataset.\u00a0<\/p>\n<p dir=\"ltr\">This means that machine learning data preprocessing techniques differ from deep learning or natural language processing (NLP) data preprocessing techniques.<\/p>\n<p dir=\"ltr\">So we need to learn these techniques to build effective <a href=\"https:\/\/dataaspirant.com\/build-email-spam-classification-model-spacy-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">natural language processing models<\/a>.<\/p>\n<p dir=\"ltr\">In this article we will discuss different text preprocessing techniques or methods like <strong>normalization, stemming, lemmatization<\/strong>, etc. 
for handling text when building various Natural Language Processing models.\u00a0<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" data-css=\"tve-u-1748bf69da7\">\n<p dir=\"ltr\">Moreover, we don&#8217;t limit ourselves to the theory; we will also implement these techniques in python.<\/p>\n<p dir=\"ltr\">Before we go further, below is the list of topics you will learn in this article.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" data-css=\"tve-u-1748bf69dec\">\n<h2 class=\"\" id=\"t-1600076346678\">Text Preprocessing Importance in NLP<\/h2>\n<p dir=\"ltr\">As we said before, text preprocessing is the first step in the Natural Language Processing pipeline. Preprocessing is increasingly important in NLP because the data extracted or collected from different sources is noisy and unclear.\u00a0<\/p>\n<p dir=\"ltr\">Most text data is collected from reviews on e-commerce websites like Amazon or Flipkart, tweets from twitter,\u00a0 comments from Facebook or Instagram, and other websites like Wikipedia.\u00a0<\/p>\n<p dir=\"ltr\">We can observe that users use <strong>short forms, emojis, misspellings<\/strong> of words, etc. 
in their comments, tweets, and so on.<\/p>\n<p dir=\"ltr\">We should not feed raw data to models without preprocessing, because preprocessing the text directly <a href=\"https:\/\/dataaspirant.com\/six-popular-classification-evaluation-metrics-in-machine-learning\/\" target=\"_blank\" rel=\"noopener noreferrer\">improves the model&#8217;s performance<\/a>.<\/p>\n<p dir=\"ltr\">If we feed data without applying any text preprocessing techniques, the models we build will not learn the real significance of the data. In some cases, models fed raw data without any preprocessing get confused and give random results.\u00a0<\/p>\n<p dir=\"ltr\">In that confusion, the model learns harmful patterns that are not valuable, and its performance drops significantly.<\/p>\n<p dir=\"ltr\">So we should remove all this noise from the text and bring it into a clearer, more structured form for building models.<\/p>\n<p dir=\"ltr\">Here we have to keep one thing in mind.<\/p>\n<p dir=\"ltr\">Text preprocessing techniques vary from problem to problem. This means we cannot blindly apply the text preprocessing techniques used for one NLP problem to another NLP problem.\u00a0<\/p>\n<p dir=\"ltr\">For example, in <strong>sentiment analysis classification problems<\/strong>, we can remove or ignore numbers within the text because numbers are not significant in this problem statement. <\/p>\n<p dir=\"ltr\">However, we should not ignore numbers if we are dealing with finance-related problems, because numbers play a key role in those problems.<\/p>\n<p dir=\"ltr\">So, while performing NLP text preprocessing techniques, we need to focus on the domain we are applying them to, and the <strong>order<\/strong> of the methods also plays a key role.<\/p>\n<p dir=\"ltr\">Don&#8217;t worry about the order of these techniques for now.\u00a0 We will give a generic order in which to apply them.<\/p>\n<p dir=\"ltr\">Our suggestion is to try the preprocessing methods on a <strong>subset of the aggregate data<\/strong> (take a few sentences randomly). We can then easily observe whether the result is in the expected form. If it is, apply the methods to the complete dataset; otherwise, change the order of the preprocessing techniques.<\/p>\n<p dir=\"ltr\">We will provide a <strong>python file<\/strong> with a preprocess class covering all the preprocessing techniques at the end of this article. <\/p>\n<p dir=\"ltr\">You can download that file and import the class into your code. We can get preprocessed text by calling the preprocess class with a list of sentences and the sequence of preprocessing techniques we need to use.<\/p>\n<p dir=\"ltr\">Again, the order of techniques to use will differ from problem to problem.<\/p>\n<h2 class=\"\" id=\"t-1600076346679\">Different Text Preprocessing Techniques<\/h2>\n<p dir=\"ltr\">Let us jump in and learn the different types of text preprocessing techniques.\u00a0<\/p>\n<p dir=\"ltr\">In the next few minutes, we will discuss the importance and implementation of each of these techniques.<\/p>\n<h3 class=\"\" id=\"t-1600076346680\">Converting to Lower case<\/h3>\n<p>Converting all our text into lower case is a simple and very effective approach. 
\u00a0If we do not apply lower case conversion, words like NLP, nlp, and Nlp are treated as <strong>different<\/strong> words.\u00a0<\/p>\n<p>After lower casing, all three are treated as a single word, <strong>nlp.<\/strong><\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c0000e2\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5610\" alt=\"Converting to Lower Case\" data-id=\"5610\" width=\"626\" data-init-width=\"1024\" height=\"428\" data-init-height=\"700\" title=\"Converting to Lower Case\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/2-Converting-Lower-Case.png?resize=626%2C428&amp;ssl=1\" data-width=\"626\" data-height=\"428\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Converting To Lower Case<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">This method is useful for problems that depend on the <strong>frequency<\/strong> of words, such as document classification.\u00a0<\/p>\n<p dir=\"ltr\">In such cases, we count the frequency of words by <a href=\"https:\/\/dataaspirant.com\/word-embedding-techniques-nlp\/\" target=\"_blank\" rel=\"noopener noreferrer\">using bag-of-words, TFIDF, etc.<\/a><\/p>\n<p dir=\"ltr\">It is better to <strong>lower case<\/strong> the text as the first step of text preprocessing. 
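The lower-casing step itself is a one-liner; a minimal sketch in plain Python (function name assumed):

```python
def to_lower_case(text):
    # Treat "NLP", "Nlp", and "nlp" as the same token by lower-casing everything.
    return text.lower()

print(to_lower_case("NLP and Nlp and nlp"))  # nlp and nlp and nlp
```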
Because if we are trying to remove stop words, all words need to be in lower case. <\/p>\n<p dir=\"ltr\">For example, a few sentences may start with the word <strong>&#8220;The&#8221;<\/strong>; if we do not perform the lower casing technique before stop word removal, we cannot remove all <strong>stopwords<\/strong>.\u00a0<\/p>\n<p dir=\"ltr\">The other case is calculating frequency counts. If we do <strong>not<\/strong> convert the text into lower case, <strong>Data Science<\/strong> and <strong>data science<\/strong> will be treated as different tokens. <\/p>\n<p dir=\"ltr\">In natural language processing, the lower-level units of text, which are <strong>words<\/strong>, are called <strong>tokens<\/strong>.<\/p>\n<p dir=\"ltr\">We can apply this method to most text-related problems. Still, it may not be suitable for projects like <strong>Parts-Of-Speech tag<\/strong> recognition or <strong>dependency parsing<\/strong>, where proper word casing is essential to recognize nouns, verbs, etc.<\/p>\n<h4 class=\"\">Implementation of lower case conversion<\/h4>\n<\/div>\n<h3 class=\"\" id=\"t-1600077497351\">Removal of HTML tags<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c0fcccd\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5616\" alt=\"Removing html tags\" data-id=\"5616\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Removing html 
tags\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/3-Remove-html-tags.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-css=\"tve-u-1748c16822a\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"wp-caption-text thrv-inline-text\">Html Tags Removal<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">This is the second essential preprocessing technique. The chances to get HTML tags in our text data is quite common when we are extracting or scraping data from different websites.\u00a0<\/p>\n<p dir=\"ltr\">We don&#8217;t get any valuable information from these HTML tags. So it is better to remove them from our text data. We can remove these tags by using <strong>regex<\/strong> and we can also use the <strong>BeautifulSoup<\/strong> module from bs4 libraries.\u00a0<\/p>\n<p dir=\"ltr\">Let us see the implementation using python.<\/p>\n<h4 class=\"\">HTML tags removal Implementation using regex module<\/h4>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c15c1e7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/3-Html-Tag-removal.png?resize=626%2C421&amp;ssl=1\" class=\"tve_image wp-image-5620\" alt=\"Html Tag removal Example\" data-id=\"5620\" width=\"626\" data-init-width=\"1842\" height=\"421\" data-init-height=\"1238\" title=\"Html Tag removal Example\" loading=\"lazy\" data-width=\"626\" data-height=\"421\" data-css=\"tve-u-1748c15dc44\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-5620\" alt=\"Html Tag removal Example\" data-id=\"5620\" width=\"626\" data-init-width=\"1842\" height=\"421\" data-init-height=\"1238\" title=\"Html Tag removal Example\" loading=\"lazy\" 
src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/3-Html-Tag-removal.png?resize=626%2C421&amp;ssl=1\" data-width=\"626\" data-height=\"421\" data-css=\"tve-u-1748c15dc44\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"wp-caption-text thrv-inline-text\">HTML tags removal example<\/p>\n<\/div>\n<h4 class=\"\">Implementation of Removing HTML tags using bs4 library<\/h4>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>We can observe <strong>both<\/strong> the functions are giving the <strong>same<\/strong> result after removing HTML tags from our example text.<\/p>\n<h3 class=\"\" id=\"t-1600077497352\">Removal of URLs<\/h3>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c1b1e89\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/4-Remove-Urls.png?resize=626%2C376&amp;ssl=1\" class=\"tve_image wp-image-5627\" alt=\"Remove Urls\" data-id=\"5627\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Remove Urls\" loading=\"lazy\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-5627\" alt=\"Remove Urls\" data-id=\"5627\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Remove Urls\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/4-Remove-Urls.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Remove Urls<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>URL is the short-form of <strong>Uniform Resource Locator<\/strong>. The URLs within the text refer to the location of another website or anything else. 
<\/p>\n<p>If we are performing <strong>backlink analysis<\/strong> for a website, or analyzing twitter or Facebook data, URLs can be worth keeping in the text. <\/p>\n<p>Otherwise, URLs do not give us any useful information, so we can remove them from our text using the python <strong>Regex<\/strong> library.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c1c30a1\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5630\" alt=\"Urls removal Example\" data-id=\"5630\" width=\"626\" data-init-width=\"1544\" height=\"444\" data-init-height=\"1094\" title=\"Urls removal Example\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/4-Urls-removal.png?resize=626%2C444&amp;ssl=1\" data-width=\"626\" data-height=\"444\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Urls removal Example<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Implementation of Removing URLs \u00a0using python regex<\/h4>\n<p dir=\"ltr\">In the below script, we take example text containing URLs and then call the two functions with that example text. 
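The HTML-tag and URL removal steps described above can be sketched with the `re` module; the patterns below are common illustrative choices, not necessarily the article's exact ones:

```python
import re

def remove_html_tags(text):
    # Drop anything that looks like an HTML tag, e.g. <div>, </p>, <br/>.
    return re.sub(r"<[^>]+>", "", text)

def remove_urls(text):
    # Replace http(s):// and www.-style URLs with a space.
    url_pattern = re.compile(r"https?://\S+|www\.\S+")
    return url_pattern.sub(" ", text)

sample = "<p>Read more at https://example.com/post now</p>"
print(remove_urls(remove_html_tags(sample)))
```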
In the <strong>remove_urls<\/strong> function, we assign a regular expression that matches URLs to url_pattern; after that, we substitute URLs within the text with a space by calling the re library&#8217;s <strong>sub<\/strong> function.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h3 class=\"\" id=\"t-1600077497353\">Removing Numbers<\/h3>\n<p>We can remove numbers from the text if our problem statement <strong>doesn&#8217;t<\/strong> require them.\u00a0<\/p>\n<p>However, if we are working on <strong>finance<\/strong>-related problems, such as in the banking or insurance sectors, we may get useful information from numbers. <\/p>\n<p>In those cases, we <strong>shouldn&#8217;t<\/strong> remove numbers.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c22903e\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5636\" alt=\"Removing Numbers\" data-id=\"5636\" width=\"626\" data-init-width=\"1548\" height=\"448\" data-init-height=\"1108\" title=\"Removing Numbers\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/5-Removing-Numbers.png?resize=626%2C448&amp;ssl=1\" data-width=\"626\" data-height=\"448\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Removing Numbers<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Implementation of Removing numbers \u00a0using python regex<\/h4>\n<p dir=\"ltr\">In the code 
below, we will call the <strong>remove_numbers<\/strong> function with example text that contains numbers. <\/p>\n<p dir=\"ltr\">Let&#8217;s see how to implement it.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>In the <strong>remove_numbers<\/strong> function above, we specify a pattern that recognizes numbers within the text and then substitute the numbers with a space using the re library&#8217;s sub function. <\/p>\n<p>The text with the numbers removed is then returned into the <strong>numbers_result<\/strong> variable.<\/p>\n<h3 class=\"\" id=\"t-1600077497354\">Converting numbers to words<\/h3>\n<p>If our problem statement <strong>needs<\/strong> the valuable information carried by numbers, we have to convert the numbers to words instead. This applies to the same kinds of problem statements discussed in the removing numbers section above.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c2746ff\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5641\" alt=\"Converting Numbers to Words\" data-id=\"5641\" width=\"626\" data-init-width=\"1788\" height=\"429\" data-init-height=\"1226\" title=\"Converting Numbers to Words\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/6-Converting-Numbers-to-Words.png?resize=626%2C429&amp;ssl=1\" data-width=\"626\" data-height=\"429\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text 
Converting">
wp-caption-text\">Converting Numbers to Words<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Implementation of Converting numbers to words using python num2words library<\/h4>\n<p dir=\"ltr\">We can convert numbers to words by importing the <strong>num2words<\/strong> library. In the code below, we will call the <strong>num_to_words<\/strong> function with example text that contains numbers.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">In the above code, the num_to_words function takes the text as input and splits it on spaces, using python&#8217;s string split function, to get the individual words. \u00a0<\/p>\n<p dir=\"ltr\">It then takes each word and checks whether that word is a digit; if it is, it converts the digit into words.<\/p>\n<h3 class=\"\" id=\"t-1600077497355\">Apply spelling correction<\/h3>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c2d8185\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5648\" alt=\"Spelling Checking\" data-id=\"5648\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Spelling Checking\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/7-Spelling-Checking.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text 
Checking">
wp-caption-text\">Checking Spelling<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Spelling correction is another important preprocessing technique when working with tweets, comments, etc., because we often see <strong>incorrectly spelled<\/strong> words in those kinds of text. We need to convert those misspelled words into correctly spelled words.<\/p>\n<p>We can check and replace misspelled words with correct spellings by using two python libraries: one is pyspellchecker, and the other is autocorrect.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c2e89b1\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5651\" alt=\"Example of Spelling Correction\" data-id=\"5651\" width=\"626\" data-init-width=\"1502\" height=\"450\" data-init-height=\"1080\" title=\"Example of Spelling Correction\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/7-Spelling-Correction.png?resize=626%2C450&amp;ssl=1\" data-width=\"626\" data-height=\"450\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Example of Spelling Correction<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Implementation of spelling correction using python pyspellchecker library<\/h4>\n<p dir=\"ltr\">Below we are calling a <strong>spell_correction<\/strong> function with example text. 
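Checking">
A spell_correction function of this kind generally has the shape below. To keep the sketch dependency-free, it stands in for pyspellchecker by using the standard library's difflib against a small assumed vocabulary; pyspellchecker's SpellChecker.correction plays a similar role against its built-in word frequency list:

```python
import difflib

# Tiny stand-in vocabulary; a real checker uses a large frequency dictionary.
VOCABULARY = ["spelling", "correction", "is", "an", "important", "preprocessing", "step"]

def spell_correction(text):
    corrected = []
    for word in text.split():
        # Pick the closest known word, if any is similar enough.
        match = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=0.8)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

print(spell_correction("speling correctin is an importnt step"))
# spelling correction is an important step
```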
Example text has incorrectly spelled words, so we can check whether the spell_correction function returns the correct words or not.<\/p>\n<\/div>\n<h4 class=\"\">Implementation of spelling correction using python autocorrect library<\/h4>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>We can observe that both methods give the correct, expected results.<\/p>\n<h3 class=\"\" id=\"t-1600077497356\">Convert accented characters to ASCII characters<\/h3>\n<p>This is another common preprocessing technique in NLP. Accented characters are letters with marks above or below them, which appear, for example, when we long-press a key while typing: <strong>r\u00e9sum\u00e9<\/strong>.\u00a0<\/p>\n<p>If we do not remove this type of noise from the text, the model will consider <strong>resume<\/strong> and <strong>r\u00e9sum\u00e9<\/strong> to be two different words, even though they are the same. <\/p>\n<p>We can convert <strong>accented characters to ASCII<\/strong> characters by using the <strong>unidecode<\/strong> library.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c370aad\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5659\" alt=\"Convert accented characters to ASCII\" data-id=\"5659\" width=\"626\" data-init-width=\"1562\" height=\"443\" data-init-height=\"1106\" title=\"Convert accented characters to ASCII\" loading=\"lazy\" 
src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/8-Convert-accented-characters-to-ASCII.png?resize=626%2C443&amp;ssl=1\" data-width=\"626\" data-height=\"443\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Convert accented characters to ASCII<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Implementation of accented text to ASCII converter in python<\/h4>\n<p dir=\"ltr\">We will define the <strong>accented_to_ascii<\/strong> function to convert accented characters to their ASCII values in the below script. \u00a0<\/p>\n<p dir=\"ltr\">We will do this function with example text.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>In the above code, we use the unidecode method of the unidecode library with input text. Which <strong>converts<\/strong> accented characters to ASCII values.<\/p>\n<h3 class=\"\" id=\"t-1600077497357\">Converting chat conversion words to normal words<\/h3>\n<p>This is another essential preprocessing technique if we work with <strong>chat conversions<\/strong>, or our problem statement requires chat conversion analysis. We need to handle short-form. 
Nowadays, people use <strong>short-form<\/strong> words in their chat conversations for simplicity.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c3e0d35\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5665\" alt=\"Chat conversion to normal words\" data-id=\"5665\" width=\"626\" data-init-width=\"1528\" height=\"441\" data-init-height=\"1076\" title=\"Chat conversion to normal words\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/9-Chat-conversion-to-normal-words.png?resize=626%2C441&amp;ssl=1\" data-width=\"626\" data-height=\"441\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Chat conversion to normal words<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<p dir=\"ltr\">A better way to handle those words is to replace the short-form words with their original words. 
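Replacing short-form chat words is a simple dictionary lookup; the small map below is an assumed sample, and a fuller mapping can be loaded from a slang file:

```python
# Assumed sample of chat short-forms; a fuller mapping can be loaded from a slang file.
CHAT_WORDS = {"u": "you", "r": "are", "gr8": "great", "idk": "i do not know"}

def chat_words_to_normal(text):
    # Replace each known short-form token with its expanded form.
    return " ".join(CHAT_WORDS.get(word.lower(), word) for word in text.split())

print(chat_words_to_normal("u r gr8"))  # you are great
```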
<\/p>\n<p dir=\"ltr\">We can find all those short-form words and its actual words in this <a href=\"https:\/\/github.com\/rishabhverma17\/sms_slang_translator\/blob\/master\/slang.txt\" class=\"tve-froala\">Github Repo<\/a> to save that file into our system; click right click and then press on save as option.<\/p>\n<h4 class=\"\">Implementation of python script<\/h4>\n<\/div>\n<h3 class=\"\" id=\"t-1600077497358\">Expanding Contractions<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c4268d9\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/10-Expanding-Contractions.png?resize=626%2C445&amp;ssl=1\" class=\"tve_image wp-image-5669\" alt=\"Expanding Contractions\" data-id=\"5669\" width=\"626\" data-init-width=\"1498\" height=\"445\" data-init-height=\"1066\" title=\"Expanding Contractions\" loading=\"lazy\" data-width=\"626\" data-height=\"445\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-5669\" alt=\"Expanding Contractions\" data-id=\"5669\" width=\"626\" data-init-width=\"1498\" height=\"445\" data-init-height=\"1066\" title=\"Expanding Contractions\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/10-Expanding-Contractions.png?resize=626%2C445&amp;ssl=1\" data-width=\"626\" data-height=\"445\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Expanding Contractions<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Contractions are words or combinations of words created by dropping a few letters and replacing those letters by an apostrophe. 
<\/p>\n<p dir=\"ltr\">An example of a contraction word.<\/p>\n<ul>\n<li class=\"\">&#8220;don&#8217;t&#8221; is <strong>&#8220;do not&#8221;<\/strong>\u00a0<\/li>\n<li class=\"\">&#8220;should&#8217;ve&#8221; is <strong>&#8220;should have&#8221;<\/strong>\u00a0<\/li>\n<\/ul>\n<p dir=\"ltr\">Nlp models don&#8217;t know about these <strong>contractions<\/strong>; they will consider &#8220;don&#8217;t&#8221; and &#8220;do not&#8221; both are two different words. <\/p>\n<p dir=\"ltr\">We have to choose this technique if our problem statement is required. Otherwise, \u00a0leave it as it is.<\/p>\n<h4 class=\"\">Implementation of expanding contractions<\/h4>\n<p dir=\"ltr\">In the code below, we are importing the <strong>CONTRACTION_MAP<\/strong> dictionary from the contraction file. And then define expand_contractions function to expand contractions if our input text has.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>We can observe in the output, the contraction of &#8220;doesn&#8217;t&#8221; in the example text expanded to &#8220;does not&#8221;. <\/p>\n<p>In the <strong>expand_contractions<\/strong> function, we take contraction words from our text matching with contraction map words. If we are not performing a lower case conversion technique before this, we have to take the first character to display the result of contraction &#8220;Doesn&#8217;t&#8221; like &#8220;Does not&#8221;. 
<\/p>\n<p>Otherwise, we can skip a few steps in the script.<\/p>\n<h3 class=\"\" id=\"t-1600077497359\">Stemming<\/h3>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c4661b6\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5673\" alt=\"NLP Technique Stemming\" data-id=\"5673\" width=\"512\" data-init-width=\"274\" height=\"561\" data-init-height=\"300\" title=\"NLP Technique Stemming\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/11-Stemming.png?resize=512%2C561&amp;ssl=1\" data-width=\"512\" data-height=\"561\" data-css=\"tve-u-1748c5286ed\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">NLP Technique Stemming<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Stemming reduces words to their base or root form by removing <strong>suffix<\/strong> characters. Stemming is a <strong>text normalization<\/strong> technique. <\/p>\n<p>Many stemming algorithms are available, but the most widely used one is <strong>Porter stemming<\/strong>. 
<\/p>\n<p>For example, the result of stemming &#8220;books&#8221; is &#8220;book&#8221;, and the result of stemming &#8220;learning&#8221; is &#8220;learn&#8221;.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c476de8\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5676\" alt=\"Stemming words example\" data-id=\"5676\" width=\"613\" data-init-width=\"996\" height=\"795\" data-init-height=\"1292\" title=\"Stemming words example\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/11-Stemming-words-example.png?resize=613%2C795&amp;ssl=1\" data-width=\"613\" data-height=\"795\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Stemming words example<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">But stemming <strong>doesn&#8217;t<\/strong> always produce the correct form of a word, because it follows rules such as stripping suffix characters to reach a base word.<\/p>\n<p dir=\"ltr\">Sometimes stemmed words don&#8217;t relate to the original ones, and sometimes they are not proper dictionary words. \u00a0<\/p>\n<p dir=\"ltr\">We can observe this in the above table in the stemming results for &#8220;<strong>caring<\/strong>&#8221; and &#8220;console\/consoling&#8221;. 
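A Porter-stemmer sketch illustrating such results, assuming the nltk library is installed; a plain split() stands in for word_tokenize so the example needs no extra tokenizer data:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def porter_stemmer(text):
    """Stem every token and rebuild the text from the stemmed words."""
    # A plain split stands in for nltk's word_tokenize in this sketch.
    return " ".join(stemmer.stem(word) for word in text.split())

print(porter_stemmer("learning books programming"))
# -> "learn book program"
```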
Because of such results, the stemming technique does not suit all NLP tasks.<\/p>\n<h4 class=\"\">Implementation of Stemming using PorterStemmer from nltk library<\/h4>\n<p dir=\"ltr\">In the Python script below, we define the <strong>porter_stemmer<\/strong> function to implement the stemming technique and call it with example text. <\/p>\n<p dir=\"ltr\">Before calling the function, we have to initialize an object of the <strong>PorterStemmer<\/strong> class to use the stem function from that class.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>In the <strong>porter_stemmer<\/strong> function, we tokenize the input using <strong>word_tokenize<\/strong> from the nltk library, apply the stem function to each of the tokenized words, and rebuild the text from the stemmed words.<\/p>\n<\/div>\n<h3 class=\"\" id=\"t-1600077497360\">Lemmatization<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c4c753a\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5688\" alt=\"Lemming words example\" data-id=\"5688\" width=\"626\" data-init-width=\"829\" height=\"773\" data-init-height=\"1024\" title=\"Lemming words example\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/12-Lemming-words-example-1.png?resize=626%2C773&amp;ssl=1\" data-width=\"626\" data-height=\"773\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Lemming words example<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">The aim of <strong>lemmatization<\/strong> is similar to that of stemming: to reduce inflected words to their original or base words. But the lemmatization process is different from the above approach. <\/p>\n<p dir=\"ltr\">Lemmatization does not simply trim the <strong>suffix<\/strong> characters; instead, it uses lexical knowledge bases to get the original words. The result of lemmatization is always a <strong>meaningful<\/strong> word, unlike stemming.<\/p>\n<p dir=\"ltr\">Because of the disadvantages of stemming, people prefer to use lemmatization to get the base or root forms of words. This preprocessing technique is also optional; we have to apply it based on our problem statement. <\/p>\n<p dir=\"ltr\">Suppose we are working on a <strong>POS (parts-of-speech)<\/strong> tagging problem; the original words of the data carry more information. Compared to stemming, lemmatization is a little bit slow. <\/p>\n<p dir=\"ltr\">Let&#8217;s see the implementation of lemmatization using the <strong>nltk<\/strong> library.<\/p>\n<h4 class=\"\">Implementation of lemmatization using nltk<\/h4>\n<p dir=\"ltr\">In the script below, before calling the lemmatization function, we have to initialize an object of WordNetLemmatizer to use it.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>We can see the differences between the outputs of stemming and lemmatization. 
&#8220;Programmers&#8221;, &#8220;program&#8221;, and &#8220;programming&#8221; are all different words; for a word like &#8220;languages&#8221;, the lemma is a meaningful word, but the stemmed form is meaningless.<\/p>\n<p><strong>Differences between Stemming and Lemmatization <\/strong><\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv-page-section thrv-lp-block\" data-inherit-lp-settings=\"1\" data-css=\"tve-u-1748c7bcc92\" data-keep-css_id=\"1\">\n<div class=\"tve-page-section-in tve_empty_dropzone  \" data-css=\"tve-u-1748c7bd102\">\n<div class=\"thrv_wrapper thrv-columns dynamic-group-kbulxqe6\" data-css=\"tve-u-1748c7bcc93\">\n<div class=\"tcb-flex-row v-2 tcb--cols--2\" data-css=\"tve-u-1748c7bcc94\">\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbulxl9a\" data-css=\"tve-u-1748c7bcc95\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbulxc3q\" data-css=\"tve-u-1748c7bcc96\">\n<div class=\"tve-cb\">\n<h4 class=\"\" id=\"t-1600081660720\">Stemming<\/h4>\n<div class=\"thrv_wrapper thrv-styled_list dynamic-group-kbulx7a0\" data-icon-code=\"icon-check\" data-css=\"tve-u-1748c7bcc9a\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcc9b\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bcc9d\">A rule-based text normalization technique.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcc9b\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bcc9d\">In the stemming process, the suffix of a word is removed to get a base word.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcc9b\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bcc9d\">Stemming does not always produce meaningful or dictionary words as its result.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcc9b\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bcc9d\">The stemming process is fast. <\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\" data-css=\"tve-u-1748c812c1c\">\n<div class=\"tcb-col dynamic-group-kbulxl9a\" data-css=\"tve-u-1748c7bcc9e\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbulxc3q\" data-css=\"tve-u-1748c7bcc9f\">\n<div class=\"tve-cb\">\n<h4 class=\"\" id=\"t-1600081660721\">Lemmatization<\/h4>\n<div class=\"thrv_wrapper thrv-styled_list dynamic-group-kbulx7a0\" data-icon-code=\"icon-times-solid\" data-css=\"tve-u-1748c7bcca2\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcca3\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bcca5\">Like stemming, lemmatization is a text normalization technique.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcca6\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bcca8\">Lemmatization uses lexical knowledge to get the root word of the original one.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcca9\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bccab\">The resulting words of lemmatization are always meaningful dictionary words.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbulwyg8\" data-css=\"tve-u-1748c7bcca9\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbulwoj9\" data-css=\"tve-u-1748c7bccab\">Compared to stemming, the lemmatization process is slow.<\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<h3 class=\"\">Removal of Emojis<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c83223b\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5700\" alt=\"No emojis please\" data-id=\"5700\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"No emojis please\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/13-No-emojis-please.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">No emojis please<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">In today&#8217;s online communication, emojis play a very crucial role.<\/p>\n<p dir=\"ltr\">Emojis are small images. Users use them to express their present feelings, and we can share them with anyone globally. 
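When a problem statement does not need them, emojis can be stripped with a Unicode-range regex; the ranges below are a commonly used subset, not an exhaustive list:

```python
import re

# A few common emoji code-point ranges; fuller lists cover more blocks.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator (flag) symbols
    "]+",
    flags=re.UNICODE,
)

def remove_emoji(text):
    """Delete every character that falls in the emoji ranges above."""
    return EMOJI_PATTERN.sub("", text)

print(remove_emoji("game is on 🔥🔥"))
```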
For some problem statements, we need to remove emojis from the text. <\/p>\n<p dir=\"ltr\">Let&#8217;s see how we can remove emojis for that type of problem statement.<\/p>\n<h4 class=\"\">Implementation of emoji removing<\/h4>\n<p>For this, we take code snippets from this <a href=\"https:\/\/gist.github.com\/slowkow\/7a7f61f495e3dbb7e3d767f97bd7304b\" class=\"\">GitHub Repo<\/a>.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c85238e\"><span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5703\" alt=\"Remove Emojis\" data-id=\"5703\" width=\"626\" data-init-width=\"1024\" height=\"456\" data-init-height=\"746\" title=\"Remove Emojis\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/13-remove-emojis-1024x746.png?resize=626%2C456&amp;ssl=1\" data-width=\"626\" data-height=\"456\" data-recalc-dims=\"1\"><\/span><\/div>\n<h3 class=\"\" id=\"t-1600081660723\">Removal of Emoticons<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c889a50\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5706\" alt=\"Emoticons Removal\" data-id=\"5706\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Emoticons Removal\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/14-Emoticons-Removal.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Emoticons Removal<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<p dir=\"ltr\">Emojis and emoticons are different. An emoticon portrays a human facial expression using just keyboard characters, such as letters, numbers, and punctuation marks. <\/p>\n<p dir=\"ltr\">The approach is the same as with emojis: if the problem statement doesn&#8217;t require emoticons, we can remove them.<\/p>\n<h4 class=\"\">Implementation of removing emoticons<\/h4>\n<div>\nTo remove emoticons from the text, we need a list of emoticons; in this <a href=\"https:\/\/github.com\/NeelShah18\/emot\/blob\/master\/emot\/emo_unicode.py\" class=\"tve-froala\">GitHub Repo<\/a>, we can find all emoticons as a dictionary.\n<\/div>\n<div>\nWe take the <strong>EMOTICONS<\/strong> dictionary from that GitHub <a href=\"https:\/\/github.com\/NeelShah18\/emot\/blob\/master\/emot\/emo_unicode.py\" class=\"tve-froala\">repo<\/a> and save it in our system as emoticons_list.py. 
After that, we import that file into our preprocessing code.\n<\/div>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c89c7fe\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5709\" alt=\"Emoticons Removal example\" data-id=\"5709\" width=\"626\" data-init-width=\"1668\" height=\"444\" data-init-height=\"1182\" title=\"Emoticons Removal example\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/14-Emoticons-Removal-example.png?resize=626%2C444&amp;ssl=1\" data-width=\"626\" data-height=\"444\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Emoticons Removal Example<\/p>\n<\/div>\n<h3 class=\"\" id=\"t-1600081660724\">Converting Emojis to words<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c8d0949\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5712\" alt=\"Converting Emojis to words\" data-id=\"5712\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Converting Emojis to words\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/15-Emojis-to-words.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Converting Emojis to words<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">In the previous section, we removed emojis from the text, but for some problem statements we get information from emojis. <\/p>\n<p dir=\"ltr\">In that case, we <strong>shouldn&#8217;t<\/strong> remove emojis. <\/p>\n<p dir=\"ltr\">For example, suppose we are working on sentiment analysis of restaurant reviews data. One review is <\/p>\n<blockquote class=\"\"><p>&#8220;i ordered fried rice that is, <img role=\"img\" class=\"emoji\" alt=\"\ud83d\ude0b\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/1f60b.svg\"> <img role=\"img\" class=\"emoji\" alt=\"\ud83d\ude0b\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/1f60b.svg\">&#8221; <\/p><\/blockquote>\n<p dir=\"ltr\">another review is <\/p>\n<blockquote class=\"\"><p>&#8220;i ordered fried rice that is <img role=\"img\" class=\"emoji\" alt=\"\ud83d\ude1e\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/1f61e.svg\"> <img role=\"img\" class=\"emoji\" alt=\"\ud83d\ude20\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/1f620.svg\">&#8221;. <\/p><\/blockquote>\n<p dir=\"ltr\">If we remove the emojis from these two sentences, we cannot tell the users&#8217; sentiments apart. So, in this case, we can convert emojis into words.\u00a0<\/p>\n<h4 class=\"\">Implementation of converting emojis to words using python<\/h4>\n<div>\nFrom this <a href=\"https:\/\/github.com\/NeelShah18\/emot\/blob\/master\/emot\/emo_unicode.py\" class=\"\">GitHub Repo<\/a>, we can also get emoji words and the Unicode of the corresponding emojis in a dictionary.\n<\/div>\n<p>Take the <strong>EMO_UNICODE<\/strong> dictionary from that repo and save it in a Python file; then we can import the EMO_UNICODE dictionary into our code.<\/p>\n<p><strong>EMO_UNICODE<\/strong> has emoji words as keys and the Unicode for each as its value. 
But for converting emojis to words, we need that dictionary in reverse, with the Unicode as the key and the emoji word as the value.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c8ea7d3\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5715\" alt=\"\" data-id=\"5715\" width=\"626\" data-init-width=\"1726\" height=\"457\" data-init-height=\"1260\" title=\"15 Emojis to words example\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/15-Emojis-to-words-example.png?resize=626%2C457&amp;ssl=1\" data-width=\"626\" data-height=\"457\" data-css=\"tve-u-1748c8ed6e4\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"wp-caption-text thrv-inline-text\">Emojis To Words Example<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h3 class=\"\" id=\"t-1600081660725\">Converting Emoticons to words<\/h3>\n<p>The purpose of converting emoticons to words is the same as that of converting emojis to words. 
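Both conversions follow the same reversed-dictionary pattern; here is a sketch for emojis, with a tiny stand-in for the EMO_UNICODE dictionary:

```python
# A tiny stand-in for the EMO_UNICODE dictionary (emoji word -> emoji).
EMO_UNICODE = {
    ":fire:": "🔥",
    ":thumbs_up:": "👍",
}

# Reverse it: emoji character as key, descriptive word as value.
UNICODE_EMO = {emoji: word for word, emoji in EMO_UNICODE.items()}

def convert_emojis_to_words(text):
    """Replace each known emoji with its descriptive word."""
    for emoji_char, word in UNICODE_EMO.items():
        text = text.replace(emoji_char, word.strip(":").replace("_", " "))
    return text

print(convert_emojis_to_words("game is on 🔥"))
# -> "game is on fire"
```

The emoticon version is identical except that the reversed dictionary maps emoticon strings such as ":-)" to their descriptions.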
The only difference is that here we convert emoticons to words.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c92eae7\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5719\" alt=\"Emoticons to words example\" data-id=\"5719\" width=\"626\" data-init-width=\"1024\" height=\"449\" data-init-height=\"734\" title=\"Emoticons to words example\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/16-Emoticons-to-words-example-1024x734.png?resize=626%2C449&amp;ssl=1\" data-width=\"626\" data-height=\"449\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Emoticons to words example<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Implementation of converting emoticons to words<\/h4>\n<div>\nTake the EMOTICONS dictionary from this <a href=\"https:\/\/github.com\/NeelShah18\/emot\/blob\/master\/emot\/emo_unicode.py\" class=\"\">GitHub Repo<\/a>. 
\u00a0We saved that dictionary of emoticons in an <strong>emoticons_list<\/strong> Python file.\n<\/div>\n<p>In the code below, we import the <strong>EMOTICONS<\/strong> dictionary from that file.<\/p>\n<\/div>\n<h3 class=\"\" id=\"t-1600081660726\">Removing of Punctuations or Special Characters<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c966343\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5723\" alt=\"Removing of Punctuations or Special Characters\" data-id=\"5723\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Removing of Punctuations or Special Characters\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/17-Removing-of-Punctuations-or-Special-Characters.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Removing of Punctuations or Special Characters<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Punctuations or special characters are all characters except digits and alphabets. The list of all available special characters is [!&#8221;#$%&amp;'()*+,-.\/:;&lt;=&gt;?@[]^_`{|}~]. \u00a0<\/p>\n<p>It is better to remove or convert emoticons before removing punctuations or special characters. 
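With that ordering in mind, a minimal punctuation-removal sketch using Python's built-in string library:

```python
import string

def remove_punctuation(text):
    """Strip every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Hello!!! How are you??"))
# -> "Hello How are you"
```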
<\/p>\n<p>If we apply this technique before the emoticon-related techniques, we may lose emoticons from the text. So if we apply an emoticon technique, apply it before the punctuation removing technique. <\/p>\n<p>For example, if we remove the period using the punctuation removing technique from text like <strong>&#8220;money 20.98&#8221;<\/strong>, we will lose the period (.) between 20 &amp; 98, and the value completely loses its meaning. <\/p>\n<p>So we have to be careful when choosing which punctuation to remove.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c98729f\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5727\" alt=\"Removing of Punctuations or Special Characters example\" data-id=\"5727\" width=\"626\" data-init-width=\"1024\" height=\"396\" data-init-height=\"647\" title=\"Removing of Punctuations or Special Characters example\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/17-Removing-of-Punctuations-or-Special-Characters-example-1024x647.png?resize=626%2C396&amp;ssl=1\" data-width=\"626\" data-height=\"396\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Removing of Punctuations or Special Characters example<\/p>\n<\/div>\n<h4 class=\"\">Implementation of removing punctuations using the string library<\/h4>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h3 class=\"\" id=\"t-1600081660727\">Removing of Stopwords<\/h3>\n<p>Stopwords are common, irrelevant words from which we can&#8217;t get any useful information for our model or problem statement.<\/p>\n<p>A few stopwords are &#8220;a&#8221;, &#8220;an&#8221;, &#8220;the&#8221;, etc. \u00a0<\/p>\n<p>For example, we can ignore stop words when we work on sentiment analysis or text classification problems. But in the case of POS (<strong>Parts-Of-Speech<\/strong>) tagging or language translation, we have to consider keeping them, because stop words can give more information and useful context for our problem statement.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748c9c1151\">\n<span class=\"tve_image_frame\"><img class=\"tve_image wp-image-5738\" alt=\"Stopwords Example\" data-id=\"5738\" width=\"626\" data-init-width=\"1024\" height=\"117\" data-init-height=\"192\" title=\"Stopwords Example\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/18-Stopwords.png?resize=626%2C117&amp;ssl=1\" data-width=\"626\" data-height=\"117\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Stopwords Example<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We can import lists of stop words from different NLP-related libraries such as nltk, spacy, and gensim. 
<\/p>\n<p dir=\"ltr\">Let&#8217;s see how to remove stopwords from the text using the stopword lists from all three of these libraries.<\/p>\n<h4 class=\"\">Implementation of removing stopwords using all stop words from nltk, spacy, gensim<\/h4>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>In the code above, we take stopwords from different libraries: <strong>nltk, spacy, and gensim<\/strong>.\u00a0<\/p>\n<p>We then take the unique stopwords from all three lists. In <strong>remove_stopwords<\/strong>, we check whether each tokenized word is in the stopword list; if it is not, we append it to the output text.<\/p>\n<h3 class=\"\" id=\"t-1600081660728\">Removing of Frequent words<\/h3>\n<p dir=\"ltr\">In the above section, we removed stopwords. <\/p>\n<p dir=\"ltr\">Stopwords are common across the whole language, whereas frequent words are common within a particular domain. <\/p>\n<p dir=\"ltr\">If we are working on a problem statement in a specific field, we can ignore the common words of that domain, because those frequent words don&#8217;t give much information.<\/p>\n<h4 class=\"\">Implementation of frequent words removing<\/h4>\n<p dir=\"ltr\">Here we use the &#8220;<strong>Counter<\/strong>&#8221; class from the collections library to remove our corpus&#8217;s frequent words.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>In the above script, we defined two functions: one counts the frequent words, and the other removes them from our corpus.<\/p>\n<h3 class=\"\" id=\"t-1600081660729\">Removing of Rare words<\/h3>\n<p dir=\"ltr\">Removing rare words is similar to removing frequent words; here we remove the most infrequent words from the corpus.<\/p>\n<h4 class=\"\">Implementation of rare words removing<\/h4>\n<p dir=\"ltr\">In the below script, just as in the one above, we defined two functions: one finds the rare words and the other removes them. 
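Both the frequent-word and rare-word steps can be sketched with `collections.Counter`; the toy corpus, the top-1 cutoff, and the function names below are illustrative assumptions, not the repo's exact code.

```python
from collections import Counter

def get_word_counts(corpus):
    """Count every token across a list of sentences."""
    return Counter(word for sentence in corpus for word in sentence.split())

def remove_words(corpus, words_to_drop):
    """Drop the given words from every sentence in the corpus."""
    drop = set(words_to_drop)
    return [" ".join(w for w in sentence.split() if w not in drop)
            for sentence in corpus]

corpus = ["spam spam spam ham", "spam eggs ham", "spam ham toast"]
counts = get_word_counts(corpus)

frequent = [w for w, _ in counts.most_common(1)]   # the single most frequent word
rare = [w for w, c in counts.items() if c == 1]    # words that occur only once

print(remove_words(corpus, frequent))  # -> ['ham', 'eggs ham', 'ham toast']
print(remove_words(corpus, rare))      # -> ['spam spam spam ham', 'spam ham', 'spam ham']
```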
We take only ten rare words for this sample text; this number can be increased based on the size of our text corpus.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h3 id=\"t-1600081660730\" class=\"\">Removing single characters<\/h3>\n<p dir=\"ltr\">After performing all the text preprocessing techniques except extra-whitespace removal, it is better to remove any single characters still present in our corpus. We can remove them using regex.<\/p>\n<h4 class=\"\">Implementation of removing single characters<\/h4>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h3 id=\"t-1600081660731\" class=\"\">Removing Extra Whitespaces<\/h3>\n<p dir=\"ltr\">This is the last preprocessing technique. We cannot get any information from extra whitespace, so we can remove all additional whitespace such as one or more newlines, tabs, and extra spaces. <\/p>\n<p dir=\"ltr\">Our suggestion is to apply this technique last, after performing all the other text preprocessing techniques.<\/p>\n<h4 class=\"\">Implementation \u00a0of removing extra whitespaces<\/h4>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h2 class=\"\" id=\"t-1600081660732\">Process of applying all text preprocessing techniques with an Example\u00a0<\/h2>\n<p dir=\"ltr\">For this process, we provide complete Python code in our <a href=\"https:\/\/github.com\/saimadhu-polamuri\/DataAspirant_codes\/tree\/master\/text_preprocessing_techniques\/scripts\" target=\"_blank\" rel=\"noopener noreferrer\">dataaspirant github repo<\/a>. Download the archive, extract it, and locate the preprocessing.py file.<\/p>\n<p dir=\"ltr\">Then import the text preprocessing class from the preprocessing file. 
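Before moving on, the two regex-based steps above (removing single characters and extra whitespace) can be sketched like this; the function names are illustrative assumptions.

```python
import re

def remove_single_characters(text):
    # Drop standalone single letters, then tidy the spacing they leave behind.
    return re.sub(r"\s+", " ", re.sub(r"\b[a-zA-Z]\b", " ", text)).strip()

def remove_extra_whitespace(text):
    # Collapse runs of spaces, tabs and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(remove_single_characters("this is a sample b text"))      # -> "this is sample text"
print(remove_extra_whitespace("too   many \t spaces\n\nhere"))  # -> "too many spaces here"
```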
Now we will discuss how to use it.<\/p>\n<h4 class=\"\">Implementation of Complete preprocessing techniques\u00a0<\/h4>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1748cb238e3\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/19-Text-Preprocessing-Techniques-flow.png?resize=613%2C738&amp;ssl=1\" class=\"tve_image wp-image-5765\" alt=\"Text Preprocessing Techniques flow\" data-id=\"5765\" width=\"613\" data-init-width=\"850\" height=\"738\" data-init-height=\"1024\" title=\"Text Preprocessing Techniques flow\" loading=\"lazy\" data-width=\"613\" data-height=\"738\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-5765\" alt=\"Text Preprocessing Techniques flow\" data-id=\"5765\" width=\"613\" data-init-width=\"850\" height=\"738\" data-init-height=\"1024\" title=\"Text Preprocessing Techniques flow\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/19-Text-Preprocessing-Techniques-flow.png?resize=613%2C738&amp;ssl=1\" data-width=\"613\" data-height=\"738\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Text Preprocessing Techniques flow<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Below, we apply only a few text preprocessing techniques to show how to use the imported class.<\/p>\n<p dir=\"ltr\">Here we are taking the <strong>Sms_spam_or_not<\/strong> dataset. <\/p>\n<p dir=\"ltr\">From the dataset, we take the text column and convert it into a list. We then create an object of the preprocessing class imported from the preprocessing file. 
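The exact class, method, and short-form names live in the repo's preprocessing.py; the self-contained sketch below only mimics the interface described here (an object, a list of sentences, and a list of technique short-form codes), with made-up codes.

```python
import re

class TextPreprocessor:
    """Hypothetical mimic of the described interface: technique short-form
    codes map to functions and are applied in the order given."""

    techniques = {
        "lower": str.lower,                                # lowercasing
        "rp": lambda s: re.sub(r"[^\w\s]", " ", s),        # remove punctuation
        "rsc": lambda s: re.sub(r"\b[a-zA-Z]\b", " ", s),  # remove single characters
        "rew": lambda s: re.sub(r"\s+", " ", s).strip(),   # remove extra whitespace
    }

    def preprocessing(self, sentences, technique_list):
        for code in technique_list:
            sentences = [self.techniques[code](s) for s in sentences]
        return sentences

prep = TextPreprocessor()
print(prep.preprocessing(["Hello!!  This is   a TEST."],
                         ["lower", "rp", "rsc", "rew"]))  # -> ['hello this is test']
```

Because each technique is a pure string-to-string function, reordering the short-form list changes the result, which is exactly why the order of techniques matters.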
<\/p>\n<p dir=\"ltr\">To apply preprocessing techniques, pass a list of sentences and a list of techniques to the preprocessing function through this object. <\/p>\n<p dir=\"ltr\">We listed all the techniques with their short forms in the code comments. Pass the short forms of the desired techniques as the technique list.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h2 class=\"\" id=\"t-1600081660733\">Conclusion<\/h2>\n<p dir=\"ltr\">In this article, we explained the most popular text preprocessing techniques. We do not need to perform all of them; just download the file and import it into our code. <\/p>\n<p dir=\"ltr\">Call the function with a list of sentences and a list of text preprocessing techniques. Pay attention to which techniques we select, and to their order, because the result of preprocessing depends on the order in which the techniques are applied.<\/p>\n<\/div>\n<h4 class=\"\">Recommended Natural Language Processing Courses<\/h4>\n<div class=\"thrv_wrapper thrv-page-section thrv-lp-block\" data-inherit-lp-settings=\"1\" data-css=\"tve-u-1748bf69c7d\" data-keep-css_id=\"1\">\n<div class=\"tve-page-section-in tve_empty_dropzone  \" data-css=\"tve-u-17481b960b8\">\n<div class=\"thrv_wrapper thrv-columns dynamic-group-kbt3q0q7\" data-css=\"tve-u-17481b95e2b\">\n<div class=\"tcb-flex-row v-2 tcb--cols--3 tcb-medium-no-wrap tcb-mobile-wrap m-edit\" data-css=\"tve-u-1748bf69c7e\">\n<div class=\"tcb-flex-col\" data-css=\"tve-u-1748da97cd8\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-17481b95e2d\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-1748bf69c94\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-1748bf69c98\"><span class=\"tve_image_frame\"><img 
src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4068\" alt=\"nlp specialization\" data-id=\"4068\" width=\"172\" data-init-width=\"300\" height=\"172\" data-init-height=\"300\" title=\"NLP specialization\" loading=\"lazy\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-1748bf69c99\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4068\" alt=\"nlp specialization\" data-id=\"4068\" width=\"172\" data-init-width=\"300\" height=\"172\" data-init-height=\"300\" title=\"NLP specialization\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-1748bf69c99\" data-recalc-dims=\"1\"><br \/>\n<span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-1748bf69c80\">NLP Specialization with Python<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-17481b95e2d\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-1748bf69c95\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-1748bf69ca4\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4070\" alt=\"Natural-language-processing-classifiation-vector-spaces\" data-id=\"4070\" width=\"172\" data-init-width=\"300\" height=\"172\" 
data-init-height=\"300\" title=\"Natural-language-processing-2\" loading=\"lazy\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-1748bf69ca5\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4070\" alt=\"Natural-language-processing-classifiation-vector-spaces\" data-id=\"4070\" width=\"172\" data-init-width=\"300\" height=\"172\" data-init-height=\"300\" title=\"Natural-language-processing-2\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-1748bf69ca5\" data-recalc-dims=\"1\"><br \/>\n<span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-1748bf69c87\">NLP \u00a0Classification and vector spaces<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-17481b95e2d\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-1748bf69c97\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-1748bf69ca6\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4072\" alt=\"natural-language-processing-python\" data-id=\"4072\" width=\"172\" data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"natural-language-processing-python\" loading=\"lazy\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-1748bf69ca7\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4072\" alt=\"natural-language-processing-python\" data-id=\"4072\" width=\"172\" 
data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"natural-language-processing-python\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-1748bf69ca7\" data-recalc-dims=\"1\"><br \/>\n<span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-1748bf69c8e\">NLP Model Building With Python<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/dataaspirant.com\/nlp-text-preprocessing-techniques-implementation-python\/<\/p>\n","protected":false},"author":0,"featured_media":1535,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1534"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1534"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1534\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1535"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1534"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1534"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1534"}],"curies":[{"name":"wp",
"href":"https:\/\/api.w.org\/{rel}","templated":true}]}}