{"id":8228,"date":"2021-04-12T02:29:10","date_gmt":"2021-04-12T02:29:10","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/12\/bag-of-words-convert-text-into-vectors\/"},"modified":"2021-04-12T02:29:10","modified_gmt":"2021-04-12T02:29:10","slug":"bag-of-words-convert-text-into-vectors","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/12\/bag-of-words-convert-text-into-vectors\/","title":{"rendered":"Bag of Words: Convert text into vectors"},"content":{"rendered":"<div>\n<p>In this blog, we will study about the model that represents and converts text to numbers i.e. the Bag of Words (BOW). The bag-of-words model has seen great success in solving problems which includes language modeling and document classification as it is simple to understand and implement.<\/p>\n<p>After completing this particular blog, you all will have an overview of: What does the bag-of-words model mean by and why is its importance in representing text. How we can develop a bag-of-words model for a collection of documents. How to use the bag of words to prepare a vocabulary and deploy in a model using programming language.<\/p>\n<p>\u00a0<\/p>\n<p><em>The problem and its solution\u2026<\/em><\/p>\n<p>The biggest problem with modeling text is that it is unorganised, and most of the statistical algorithms, i.e., the machine learning and deep learning techniques prefer well defined numeric data. They cannot work with raw text directly, therefore we have to convert text into numbers.<\/p>\n<p>Word embeddings are commonly used in many Natural Language Processing (NLP) tasks because they are found to be useful representations of words and often lead to better performance in the various tasks performed. A huge number of approaches exist in this regard, among which some of the most widely used are Bag of Words, Fasttext, TF-IDF, Glove and word2vec. 
Several libraries, such as Scikit-Learn and NLTK, implement these techniques in a single line of code. But it is important to understand the working principle behind these word embedding techniques, and, as mentioned before, in this blog we will see how to implement Bag of Words. The best way to do so is to build the technique from scratch in Python. Before we start coding, let\u2019s try to understand the theory behind the approach.<\/p>\n<h3><em>Theory Behind Bag of Words Approach<\/em><\/h3>\n<p>In simple words, <a href=\"https:\/\/www.excelr.com\/blog\/data-science\/natural-language-processing\/implementation-of-bag-of-words-using-python\"><strong>Bag of words<\/strong><\/a> can be defined as a <a href=\"https:\/\/www.mygreatlearning.com\/blog\/natural-language-processing-tutorial\/\">Natural Language Processing<\/a> technique used for text modelling; in other words, it is a method of extracting features from text documents. It involves two things: first, a vocabulary of known words, and second, a measure of the presence of those known words.<\/p>\n<p>The process of converting text into numbers is called vectorization in machine learning. There are several ways to convert text into vectors, for example: counting the number of times each word appears in a document, or calculating the frequency with which each word appears in a document relative to all the words in that document.<\/p>\n<p><em>Understanding using an example<\/em><\/p>\n<p>To understand the bag of words approach, let\u2019s see how this technique converts text into vectors with the help of an example. 
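<\/p>
<p>The two counting strategies above can be sketched in a few lines of plain Python. This is a minimal illustrative sketch; the helper names <code>count_vector<\/code> and <code>freq_vector<\/code> are ours, not from any library:<\/p>

```python
from collections import Counter

def count_vector(tokens, vocab):
    # Raw counts: how many times each vocabulary word appears in the document.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def freq_vector(tokens, vocab):
    # Relative frequency: each count divided by the total number of words.
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocab]

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))            # ['cat', 'mat', 'on', 'sat', 'the']
print(count_vector(tokens, vocab))     # [1, 1, 1, 1, 2]
print(freq_vector(tokens, vocab))      # the same counts divided by 6
```

<p>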
Suppose we have a corpus with three sentences:<\/p>\n<ol>\n<li>\u201cI like to eat mangoes\u201d<\/li>\n<li>\u201cDid you like to eat jellies?\u201d<\/li>\n<li>\u201cI don\u2019t like to eat jellies\u201d<\/li>\n<\/ol>\n<p><strong>Step 1: First, we go through the three sentences above and list every word that appears; these words form our model vocabulary.<\/strong><\/p>\n<ol>\n<li>I<\/li>\n<li>like<\/li>\n<li>to<\/li>\n<li>eat<\/li>\n<li>mangoes<\/li>\n<li>Did<\/li>\n<li>you<\/li>\n<li>like<\/li>\n<li>to<\/li>\n<li>eat<\/li>\n<li>Jellies<\/li>\n<li>I<\/li>\n<li>don\u2019t<\/li>\n<li>like<\/li>\n<li>to<\/li>\n<li>eat<\/li>\n<li>jellies<\/li>\n<\/ol>\n<p><strong>Step 2: Let\u2019s find the frequency of each word without preprocessing our text.<\/strong><\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-1.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5507\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-1.png\" alt=\"\" width=\"462\" height=\"279\"><\/a><\/p>\n<p>But this is not the best way to build a bag of words. In the above example, \u201cJellies\u201d and \u201cjellies\u201d are counted as two different words even though they carry the same meaning. So, let us make some changes and see how we can build the bag of words more effectively by preprocessing our text.<\/p>\n<p><strong>Step 3: Let\u2019s find the frequency of each word after preprocessing our text.<\/strong> Preprocessing is important because it brings our text into a form that is easier to understand, predict and analyze for our task.<\/p>\n<p>First, we need to convert the sentences to lowercase, as case does not carry any information here. 
Then we remove any special characters or punctuation present in the document; otherwise they clutter the vocabulary.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-2.png\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-5506\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-2.png\" alt=\"\" width=\"662\" height=\"334\"><\/a><\/p>\n<p>From the above explanation, we can see that the major advantage of Bag of Words is that it is easy to understand and simple to implement on our datasets. But the approach has some disadvantages too:<\/p>\n<ol>\n<li>Bag of words leads to a high-dimensional feature vector due to the large size of the vocabulary.<\/li>\n<li>Bag of words assumes all words are independent of each other, i.e., it does not capture co-occurrence statistics between words.<\/li>\n<li>It leads to a highly sparse vector, as only the dimensions corresponding to words that occur in the sentence hold nonzero values.<\/li>\n<\/ol>\n<h3><strong><em>Bag of Words Model in Python Programming<\/em><\/strong><\/h3>\n<p>The first thing we need is a proper dataset for implementing our Bag of Words model. In the sections above, we manually created a bag of words model from three sentences. 
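<\/p>
<p>The manual procedure above (preprocess, build a vocabulary, count) can be sketched from scratch on the three example sentences. This is a minimal sketch, not the article\u2019s own code; the regex keeps apostrophes so that \u201cdon\u2019t\u201d survives as one token:<\/p>

```python
import re

corpus = [
    "I like to eat mangoes",
    "Did you like to eat jellies?",
    "I don't like to eat jellies",
]

def preprocess(sentence):
    # Lowercase, then drop everything except letters, apostrophes and spaces.
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z' ]", "", sentence)
    return sentence.split()

tokenized = [preprocess(s) for s in corpus]

# Vocabulary: every distinct word, in order of first appearance.
vocab = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocab:
            vocab.append(word)

# One count vector per sentence, one dimension per vocabulary word.
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]
print(vocab)       # ['i', 'like', 'to', 'eat', 'mangoes', 'did', 'you', 'jellies', "don't"]
print(vectors[0])  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

<p>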
Now, however, we shall work with a real corpus: the Wikipedia article \u2018<a href=\"https:\/\/en.wikipedia.org\/wiki\/Bag-of-words_model\">https:\/\/en.wikipedia.org\/wiki\/Bag-of-words_model<\/a>\u2018.<\/p>\n<p><strong>Step 1:<\/strong> The first step is to import the required libraries: nltk, numpy, random, string, bs4, urllib.request and re.<\/p>\n<p><strong>Step 2:<\/strong> Once we are done importing the libraries, we use the <a href=\"https:\/\/beautiful-soup-4.readthedocs.io\/en\/latest\/\">Beautifulsoup4<\/a> library to parse the data from Wikipedia. Along with that, we use <a href=\"https:\/\/stackabuse.com\/using-regex-for-text-manipulation-in-python\/\">Python\u2019s regex library<\/a>, re, for the preprocessing tasks on our document. So, we will scrape the Wikipedia article on Bag of Words.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-3-code.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5505\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-3-code.png\" alt=\"\" width=\"450\" height=\"205\"><\/a><\/p>\n<p><strong>Step 3:<\/strong> As we can observe in the above code snippet, we fetched the raw HTML of the Wikipedia article, filtered out the text within the paragraph tags and, finally, created a complete corpus by merging all the paragraphs.<\/p>\n<p><strong>Step 4:<\/strong> The next step is to split the corpus into individual sentences using the sent_tokenize function from the NLTK library.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-4-code.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5504\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-4-code.png\" alt=\"\" width=\"636\" height=\"235\"><\/a><\/p>\n<p><strong>Step 
5:<\/strong> Our text contains a number of punctuation marks that are unnecessary for our word frequency dictionary. In the code snippet below, we convert our text to lower case and then remove all punctuation, which leaves multiple empty spaces that can in turn be removed using regex.<\/p>\n<p><strong>Step 6:<\/strong> Once preprocessing is done, let\u2019s find out the number of sentences present in our corpus, and then print one sentence to see how it looks.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-5-code.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5503\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-5-code.png\" alt=\"\" width=\"644\" height=\"218\"><\/a><\/p>\n<p><strong>Step 7:<\/strong> We can observe that the text no longer contains any special characters or multiple empty spaces, so our corpus is ready. The next step is to tokenize each sentence in the corpus and create a dictionary containing each word and its corresponding frequency.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-6-code.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5502\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-6-code.png\" alt=\"\" width=\"675\" height=\"240\"><\/a><\/p>\n<p>As you can see above, we have created a dictionary called wordfreq. 
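<\/p>
<p>The dictionary-building step can be sketched as follows. This is a minimal stand-in, assuming <code>corpus<\/code> holds the preprocessed sentences (a tiny hard-coded corpus is used here, and <code>re.findall<\/code> stands in for NLTK\u2019s tokenizer); the article keeps the 200 most frequent words, done here with <code>heapq.nlargest<\/code>:<\/p>

```python
import heapq
import re

# Stand-in for the preprocessed Wikipedia sentences.
corpus = ["the bag of words model", "a bag of words"]

# Count how often each word occurs across the whole corpus.
wordfreq = {}
for sentence in corpus:
    for token in re.findall(r"\w+", sentence):
        if token not in wordfreq:
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

# Keep only the N most frequent words (the article uses N = 200).
most_freq = heapq.nlargest(3, wordfreq, key=wordfreq.get)
print(wordfreq)
print(most_freq)
```

<p>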
Next, we iterate through each word in each sentence and check whether it already exists in the wordfreq dictionary. If it does not, we add the word as a key with the value 1; if it does, we increment its count by 1.<\/p>\n<p><strong>Step 8:<\/strong> Our corpus has more than 500 distinct words in total, so we filter it down to the 200 most frequently occurring words using Python\u2019s heapq library.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-7-code.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5501\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-7-code.png\" alt=\"\" width=\"675\" height=\"251\"><\/a><br \/><strong>Step 9:<\/strong> Now comes the final step: converting the sentences in our corpus into their corresponding vector representations. Let\u2019s check the code snippet below to understand it. Our model is a list of lists, which can easily be converted to matrix form using this script:<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-8-code.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5500\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/03\/beg-of-words-python-8-code.png\" alt=\"\" width=\"693\" height=\"335\"><\/a><\/p>\n<div id=\"author-bio-box\">\n<h3><a href=\"https:\/\/data-science-blog.com\/en\/blog\/author\/ramtavva\/\" title=\"All posts by Ram Tavva\" rel=\"author\">Ram Tavva<\/a><\/h3>\n<div class=\"bio-gravatar\"><img loading=\"lazy\" src=\"https:\/\/secure.gravatar.com\/avatar\/31b0852adf5a757e05c737adce07f9b6?s=70&amp;d=mm&amp;r=g\" width=\"70\" height=\"70\" alt=\"Avatar\" class=\"avatar avatar-70 wp-user-avatar wp-user-avatar-70 photo avatar-default\"><\/div>\n<p><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"https:\/\/www.facebook.com\/ram.tavva\" class=\"bio-icon 
bio-icon-facebook\"><\/a><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"https:\/\/twitter.com\/ramtavva?s=09\" class=\"bio-icon bio-icon-twitter\"><\/a><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"https:\/\/www.facebook.com\/ram.tavva\" class=\"bio-icon bio-icon-linkedin\"><\/a><\/p>\n<p class=\"bio-description\">Senior Data Scientist and Alumnus of IIM-C (Indian Institute of Management &#8211; Kolkata) with over 25 years of professional experience. Specialized in Data Science, Artificial Intelligence, and Machine Learning.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/data-science-blog.com\/en\/blog\/2021\/04\/11\/bag-of-words-convert-text-into-vectors\/<\/p>\n","protected":false},"author":0,"featured_media":8229,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8228"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8228"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8228\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8229"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8228"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8228"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8228"}],"curies":[{"na
me":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}