{"id":452,"date":"2020-08-18T09:07:31","date_gmt":"2020-08-18T09:07:31","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/18\/most-popular-word-embedding-techniques-in-nlp\/"},"modified":"2020-08-18T09:07:31","modified_gmt":"2020-08-18T09:07:31","slug":"most-popular-word-embedding-techniques-in-nlp","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/18\/most-popular-word-embedding-techniques-in-nlp\/","title":{"rendered":"Most Popular Word Embedding Techniques In NLP"},"content":{"rendered":"<div id=\"tve_editor\" data-post-id=\"4823\">\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173fd7605f1\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/1-Popular-Word-Embedding-Techniques.png?resize=613%2C394&amp;ssl=1\" class=\"tve_image wp-image-4826\" alt=\"Popular Word Embedding Techniques\" data-id=\"4826\" width=\"613\" data-init-width=\"700\" height=\"394\" data-init-height=\"450\" title=\"Popular Word Embedding Techniques\" loading=\"lazy\" data-width=\"613\" data-height=\"394\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4826\" alt=\"Popular Word Embedding Techniques\" data-id=\"4826\" width=\"613\" data-init-width=\"700\" height=\"394\" data-init-height=\"450\" title=\"Popular Word Embedding Techniques\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/1-Popular-Word-Embedding-Techniques.png?resize=613%2C394&amp;ssl=1\" data-width=\"613\" data-height=\"394\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<p dir=\"ltr\">To build any model in <a href=\"https:\/\/dataaspirant.com\/category\/machine-learning-2\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" class=\"tve-froala\">machine learning<\/a> or deep learning, the final level data has to be in numerical form, because models don\u2019t understand text or image data directly like humans do.<\/p>\n<p dir=\"ltr\">So how <a href=\"https:\/\/dataaspirant.com\/category\/natural-language-processing\/\" target=\"_blank\" rel=\"noopener noreferrer\">natural language processing<\/a> (NLP) models learn patterns from text data <img src=\"https:\/\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif\" data-lazy-src=\"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/1f914.svg\" role=\"img\" class=\"emoji\" alt=\"\ud83e\udd14\"><\/p>\n<p><img role=\"img\" class=\"emoji\" alt=\"\ud83e\udd14\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/1f914.svg\"><br \/>\n?<\/p>\n<p dir=\"ltr\">We need smart ways to convert the text data into numerical data, which is called <strong>vectorization<\/strong> or in the NLP world, it is called <strong>word embeddings.<\/strong>\u00a0<\/p>\n<p dir=\"ltr\">Vectorization or word embedding is nothing but the process of converting text data to numerical vectors. Later the numerical vectors are used to build various machine learning models. In a way, we say this as <strong>extracting features<\/strong> from text to build multiple natural language processing models.<\/p>\n<p dir=\"ltr\">We have numerous ways to convert the text data to numerical vectors. 
<p>In this article, we will look at the different word embedding techniques in detail, with examples, and we will also learn how to implement them in Python.</p>
<p>Before we dive further, let's quickly see what you will learn in this blog post.</p>
<h2>Natural Language Processing (NLP)</h2>
<p>Natural Language Processing, NLP for short, is a subfield of <a href="https://dataaspirant.com/category/data-science-2/">data science</a>. With the ever-increasing amount of text data being captured, we need good methods to extract meaningful information from text, and Natural Language Processing is the subfield of data science devoted to exactly that. Using natural language processing techniques, we build text-related applications and automate text-processing tasks.</p>
<p>In technical terms, Natural Language Processing is the process of training machines to understand and generate language the way humans do. Based on these two tasks, NLP is further classified into:</p>
<ul>
<li>Natural Language Understanding (NLU)</li>
<li>Natural Language Generation (NLG)</li>
</ul>
<p>To get some motivation for working on natural language processing projects, let's look at a few applications that belong to NLP.</p>
<h3>Natural Language Processing (NLP) Applications</h3>
<p>Popular applications of NLP include sentiment analysis, machine translation, chatbots, and spam detection.</p>
<p>By now we clearly understand the need for word embeddings, so let's look at the popular word embedding techniques.</p>
<h3>Word embedding techniques</h3>
<p>Below are the popular and simple word embedding methods used to extract features from text:</p>
<ul>
<li>Bag of words</li>
<li>TF-IDF</li>
<li>Word2vec</li>
<li>GloVe embedding</li>
<li>FastText</li>
<li>ELMo (Embeddings from Language Models)</li>
</ul>
<p>In this article we will cover only the most popular of these techniques: bag of words, TF-IDF, and Word2vec.</p>
<p>The other, more advanced methods for converting text to numerical vector representations will be explained in upcoming articles.</p>
<h2>Bag of words</h2>
<p><em>[Image: bag of words]</em></p>
<p>The bag of words method is simple to understand and easy to implement. It is mostly used in language modeling and text classification tasks. The concept behind it is straightforward: we represent each sentence as a vector of the frequencies of the words occurring in that sentence.</p>
<p>Confusing?</p>
<p>Okay, let's walk step by step through how the bag of words approach works.</p>
<h3>Bag of words approach</h3>
<p>In this approach we perform two operations:</p>
<ol>
<li>Tokenization</li>
<li>Vector creation</li>
</ol>
<h4>Tokenization</h4>
<p>Tokenization is the process of dividing each sentence into <strong>words</strong> or smaller parts; each word or symbol is called a <strong>token</strong>. After tokenization we take the unique words from the corpus. Here <strong>corpus</strong> means the collection of tokens from all the documents we are considering for the bag of words creation.</p>
<h4>Create vectors for each sentence</h4>
<p>The size of each vector is equal to the number of unique words in the corpus. For each sentence, we fill each position of the vector with the frequency of the corresponding word in that sentence.</p>
<p>Let's understand this with an example:</p>
<ol>
<li>This pasta is very tasty and affordable.</li>
<li>This pasta is not tasty and is affordable.</li>
<li>This pasta is very very delicious.</li>
</ol>
<p>With these <strong>3 example sentences</strong>, our first step is to perform tokenization.</p>
<p>Before tokenization, we normalize the sentences by converting them all to the same case; here we convert all the words to lowercase.</p>
<h5>Output of sentences after converting to lowercase</h5>
<ul>
<li>this pasta is very tasty and affordable.</li>
<li>this pasta is not tasty and is affordable.</li>
<li>this pasta is very very delicious.</li>
</ul>
<p>Now we perform <strong>tokenization</strong>: dividing the sentences into words (stripping the punctuation) and creating a list of all the unique words, in <strong>alphabetical</strong> order.</p>
<p>We get the output below after the tokenization step:</p>
<blockquote><p>[<em>"affordable", "and", "delicious", "is", "not", "pasta", "tasty", "this", "very"</em>]</p></blockquote>
<p>Now what is our next step?</p>
<p>Creating a vector for each sentence from the word frequencies. The resulting matrix consists mostly of zeros, which is why it is called a <strong>sparse matrix</strong>. Below is the sparse matrix of the example sentences.</p>
<table>
<tr><th></th><th>affordable</th><th>and</th><th>delicious</th><th>is</th><th>not</th><th>pasta</th><th>tasty</th><th>this</th><th>very</th></tr>
<tr><td>Sentence 1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>Sentence 2</td><td>1</td><td>1</td><td>0</td><td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>
<tr><td>Sentence 3</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>2</td></tr>
</table>
<p>As we can see in the table above, every sentence is converted into a vector. Once sentences are vectors, we can also measure how similar they are.</p>
<p>How can we find similarities?</p>
<p>Just calculate the distance between any two sentence vectors using any distance measure, for example <a href="https://dataaspirant.com/five-most-popular-similarity-measures-implementation-in-python/">Euclidean distance</a>.</p>
<p>In the above example we took each single word as a feature; this is also called a 1-gram (unigram) representation. We can also take bigrams, trigrams, and so on.</p>
<p>The bigram representation of the first sentence looks like this:</p>
<ul>
<li>this, pasta</li>
<li>pasta, is</li>
<li>is, very</li>
<li>very, tasty</li>
<li>tasty, and</li>
<li>and, affordable</li>
</ul>
<p>In the same way we can take trigrams and, in general, n-grams, where n is the number of consecutive words taken together. However, the bag of words technique cannot capture any semantic meaning of, or relations between, words.</p>
<p>In the bag of words representation the sparse matrix contains mostly zeros, and its size grows with the total number of unique words in the corpus. In real-world applications a corpus contains thousands of words, so we need considerable resources to build analytics models with this technique on large datasets. This drawback is overcome by the word embedding techniques covered later. Now let's learn how to implement the bag of words technique in Python with sklearn.</p>
<h3>Implementation of bag of words with Python sklearn</h3>
<p><em>[Screenshot: implementation of bag of words with sklearn]</em></p>
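<p>The original shows this step as a screenshot; below is a minimal sketch of the same idea using scikit-learn's <strong>CountVectorizer</strong> (assuming scikit-learn 1.0 or newer for <em>get_feature_names_out</em>). The only inputs are the three example sentences from above.</p>
<pre>
from sklearn.feature_extraction.text import CountVectorizer

# The three example sentences from the text above
sentences = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is very very delicious.",
]

# CountVectorizer lowercases and tokenizes internally,
# then builds the sparse word-frequency matrix
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())
</pre>
<p>Each row of the printed array reproduces the sparse matrix table above.</p>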
data-id=\"4860\" width=\"229\" data-init-width=\"229\" height=\"91\" data-init-height=\"91\" title=\"bag of words output\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/5-bag-of-words-output.png?resize=229%2C91&amp;ssl=1\" data-width=\"229\" data-height=\"91\" data-recalc-dims=\"1\"><\/span><\/div>\n<h2 class=\"\" id=\"t-1597685144204\">TF-IDF<\/h2>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173fd8f0268\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/6-TF-IDF.png?resize=613%2C368&amp;ssl=1\" class=\"tve_image wp-image-4866\" alt=\"TF - IDF\" data-id=\"4866\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"TF - IDF\" loading=\"lazy\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4866\" alt=\"TF - IDF\" data-id=\"4866\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"TF - IDF\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/6-TF-IDF.png?resize=613%2C368&amp;ssl=1\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Another popular word embedding technique for extracting features from corpus or vocabulary is TF-IDF. This is a statistical method to find how important a word is to a document all over other documents. <\/p>\n<p dir=\"ltr\">Let me explain more details about this technique like what are TF and IDF full forms ? and also what is important and \u00a0what is the process of this technique ? etc.<\/p>\n<h3 id=\"t-1597685144205\" class=\"\">TF<\/h3>\n<p dir=\"ltr\">The full form of TF is <strong>Term Frequency<\/strong> (TF). In TF , we are giving some scoring for each word or token based on the frequency of that word. The frequency of a word is dependent on the length of the document. Means in large size of document a word occurs more than a small or medium size of the documents.\u00a0<\/p>\n<p dir=\"ltr\">So to overcome this problem we will divide the <strong>frequency of a word<\/strong> with the length of the document (total number of words) to normalize.By using this technique also, we are creating a sparse matrix with frequency of every word.<\/p>\n<p dir=\"ltr\">Formula to calculate Term Frequency (TF)<\/p>\n<blockquote class=\"\"><p>\n<strong>TF =<\/strong><br \/>\n<em>no. of times term occurrences in a document \/ total number of words in a document<\/em>\n<\/p><\/blockquote>\n<h3 id=\"t-1597685144206\" class=\"\">IDF<\/h3>\n<p dir=\"ltr\">The full form of IDF is <strong>Inverse Document Frequency<\/strong>. Here also we are assigning \u00a0a score value \u00a0to a word , this scoring value explains how a word is rare across all documents. 
<p>Rarer words get higher IDF scores.</p>
<p>Formula to calculate Inverse Document Frequency (IDF):</p>
<blockquote><p><strong>IDF</strong> = <em>log<sub>e</sub>(total number of documents / number of documents containing the term)</em></p></blockquote>
<p>The complete TF-IDF value is then:</p>
<blockquote><p><strong><em>TF-IDF = TF * IDF</em></strong></p></blockquote>
<p>The TF-IDF value of a word grows with its frequency within a document and shrinks with the number of documents it appears in. As with bag of words, this technique cannot capture any semantic meaning of words.</p>
<p>Nevertheless, it is widely used for document classification, and it has also been used successfully by search engines such as Google as a ranking factor for content.</p>
<p>Okay, the theory part of <strong>TF-IDF</strong> is complete. Now let's see how it works on an example, and then we will learn the implementation in Python.</p>
<h4>Example sentences</h4>
<ul>
<li>A: This pasta is very tasty and affordable.</li>
<li>B: This pasta is not tasty and is affordable.</li>
<li>C: This pasta is very very delicious.</li>
</ul>
<p>Let's consider each sentence as a document. Here too our first task is tokenization (dividing the sentences into words or tokens) and then taking the unique words.</p>
<p><em>[Image: tf-idf calculation. Using the IDF formula above: "this", "pasta", and "is" appear in all 3 documents, so their IDF is log(3/3) = 0; "and", "affordable", "tasty", and "very" each appear in 2 documents, giving log(3/2) ≈ 0.405; the rare words "not" and "delicious" appear in only 1 document, giving log(3/1) ≈ 1.099.]</em></p>
<p>For example, "very" occurs twice among the 6 words of document C, so its TF there is 2/6 ≈ 0.33 and its TF-IDF score is 0.33 * 0.405 ≈ 0.135.</p>
<p>From the table above we can observe that rarer words score higher than common words, which shows us the significance of each word in our corpus.</p>
<h3>Implementation of TF-IDF using sklearn</h3>
<p><em>[Screenshot: implementation of TF-IDF using sklearn]</em></p>
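<p>Again the original shows a screenshot; a minimal sketch with scikit-learn's <strong>TfidfVectorizer</strong> follows. Note that sklearn uses a smoothed IDF variant (log((1 + n) / (1 + df)) + 1) and L2-normalizes each row, so its exact numbers differ from the hand calculation above, although rare words still outscore common ones.</p>
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is very very delicious.",
]

# TfidfVectorizer tokenizes, computes TF and IDF, and multiplies them
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(3))
</pre>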
<h4>Output</h4>
<p><em>[Screenshot: tf-idf output]</em></p>
<h2>Word2vec</h2>
<p><em>[Image: word2vec. Image reference: https://devopedia.org]</em></p>
<p>The Word2vec model is used for learning vector representations of words called "word embeddings". Did you notice that with the previous methods we didn't get any semantic meaning out of the words of the corpus? Yet most NLP applications, such as sentiment classification and sarcasm detection, require the semantic meaning of a word and its semantic relationships with other words.</p>
<p>So can we get semantic meaning from words?</p>
<p>Yes, exactly: by using the word2vec technique we get what we want.</p>
<p>Word embeddings are capable of capturing semantic and syntactic relationships between words, as well as the context of words in a document.</p>
<p>Word2vec is a technique for learning such word embeddings.</p>
<p>Every word in a sentence depends on other words, so if we want to find similarities and relations between words, we have to capture these word dependencies.</p>
<p>With the <strong>bag of words</strong> and <strong>TF-IDF</strong> techniques we cannot capture the meaning of, or relations between, words from the vectors. Word2vec constructs vectors, called <strong>embeddings</strong>, that do.</p>
<p>The Word2vec model takes a <strong>large corpus</strong> as input and produces a vector space as output, typically of several hundred dimensions, with each word of the corpus assigned a vector in this space.</p>
<p>In this vector space, words that commonly share context in the corpus end up close to each other; a word vector is the position of the corresponding word in the vector space.</p>
<p>The Word2vec method learns all these kinds of relationships between words while building the model, using one of two methods:</p>
<ol>
<li>Skip-gram</li>
<li>CBOW (Continuous Bag of Words)</li>
</ol>
<p><em>[Image: CBOW and skip-gram word2vec. Image reference: https://community.alteryx.com]</em></p>
<p>There is one more thing we have to discuss here: window size. Do you remember the 1-gram (unigram), bigram, trigram, ..., n-gram representations of text we discussed for the bag of words technique?</p>
<p>This method follows the same idea, but here the number of surrounding words considered is called the <strong>window size</strong>.</p>
<p>The Word2vec model captures the relationships of words within the window using the skip-gram and CBOW methods.</p>
<p>What is the difference between these two methods? Do you want to know?</p>
<p>It is a really simple distinction. But before discussing the two methods, we should ask why we take a window at all: to define a center word and the context of that center word, since we cannot use the whole sentence as context at once.</p>
<h3>Skip-gram</h3>
<p>In this method, we take the center word of the <strong>window</strong> as input and the context words (its neighbours) as outputs: the Word2vec model predicts the context words from the center word.</p>
<p>Skip-gram works well with small datasets and identifies even rare words really well.</p>
<p><em>[Image: the architecture of skip-gram. Image reference: researchgate.net]</em></p>
<h3>Continuous Bag of Words</h3>
<p>CBOW is simply the reverse of the skip-gram method: here we take the context words as input and predict the center word within the window. The other difference from skip-gram is that it trains <strong>faster</strong> and produces <strong>better representations for frequent words</strong>.</p>
<p><em>[Image: continuous bag of words architecture. Image reference: researchgate.net]</em></p>
dynamic-group-kbuvrcvb\" data-css=\"tve-u-173ff79595a\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbuvrbgn\" data-css=\"tve-u-173ff79595b\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper thrv-styled_list tcb-icon-display dynamic-group-kbuvr94e\" data-icon-code=\"icon-check-circle-solid\" data-css=\"tve-u-173ff795961\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173ff795962\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173ff795964\">In this input is centre word and output is context words (neighbour words).<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173ff795965\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173ff795967\">Works well with small datasets.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173ff795968\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173ff79596a\">Skip-gram identifies rarer words better.<\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbuvrcvb\" data-css=\"tve-u-173ff79596e\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbuvrbgn\" data-css=\"tve-u-173ff79596f\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper thrv-styled_list dynamic-group-kbuvr94e\" data-icon-code=\"icon-times-circle-solid\" data-css=\"tve-u-173ff795975\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173ff795976\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173ff795978\">In this context or neighbor words are input and output is the center word.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173ff795979\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173ff79597b\">Works good with large datasets.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173ff79597c\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173ff79597e\">Better representation for frequent words than rarer.<\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h3 id=\"t-1597717516714\" class=\"\">Word2vec implementation<\/h3>\n<p dir=\"ltr\">Let\u2019s jump into the implementation part. 
<h3>Word2vec implementation</h3>
<p>Let's jump into the implementation part. Here we will see:</p>
<ol>
<li>How to build a word2vec model with these two methods</li>
<li>How to use pre-trained word embedding models:
<ol>
<li>Google word2vec</li>
<li>Stanford GloVe embeddings</li>
</ol>
</li>
</ol>
<h4>Building our word2vec model with custom text</h4>
<h4>Word2vec with gensim</h4>
<p>Here I am taking just a sample text file and will build a word2vec model using the gensim Python library.</p>
<h4>Required libraries</h4>
<ol>
<li>Gensim (<strong>pip install --upgrade gensim</strong>)</li>
<li>NLTK (<strong>pip install nltk</strong>)</li>
<li>Regex (the <strong>re</strong> module; it is part of the Python standard library, so no install is needed)</li>
</ol>
<p><em>[Screenshots: importing the required libraries and reading the text file, with their outputs]</em></p>
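<p>The screenshots are not reproduced here; below is a minimal sketch of the setup they show, assuming the corpus lives in a local file (the name <em>sample.txt</em> is a stand-in, since the article's actual file name is not visible):</p>
<pre>
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # tokenizer data used by NLTK's tokenizers

# 'sample.txt' is a stand-in name for the article's sample text file
with open('sample.txt', encoding='utf-8') as f:
    text = f.read()

# Split the raw text into a list of sentences
sentences = sent_tokenize(text)
print(len(sentences), sentences[:2])
</pre>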
title=\"reading text file\" loading=\"lazy\" data-width=\"613\" data-height=\"287\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4911\" alt=\"reading text file\" data-id=\"4911\" width=\"613\" data-init-width=\"1600\" height=\"287\" data-init-height=\"750\" title=\"reading text file\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/15-reading-text-file.png?resize=613%2C287&amp;ssl=1\" data-width=\"613\" data-height=\"287\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff830e5b\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/16_output.png?resize=613%2C49&amp;ssl=1\" class=\"tve_image wp-image-4912\" alt=\"output\" data-id=\"4912\" width=\"613\" data-init-width=\"1257\" height=\"49\" data-init-height=\"101\" title=\"output\" loading=\"lazy\" data-width=\"613\" data-height=\"49\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4912\" alt=\"output\" data-id=\"4912\" width=\"613\" data-init-width=\"1257\" height=\"49\" data-init-height=\"101\" title=\"output\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/16_output.png?resize=613%2C49&amp;ssl=1\" data-width=\"613\" data-height=\"49\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Now i am removing punctuations from all sentences. Because we can not get that much information from punctuations.But not all applications. <\/p>\n<p>For this sample example we <strong>don\u2019t need<\/strong> any punctuations , numbers, all these things so i will remove them with a <strong>regex pattern<\/strong>.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff84a2c4\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/17-removing-punctuations-from-sentences.png?resize=613%2C258&amp;ssl=1\" class=\"tve_image wp-image-4916\" alt data-id=\"4916\" width=\"613\" data-init-width=\"1654\" height=\"258\" data-init-height=\"696\" title=\"removing punctuations from sentences\" loading=\"lazy\" data-width=\"613\" data-height=\"258\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4916\" alt=\"\" data-id=\"4916\" width=\"613\" data-init-width=\"1654\" height=\"258\" data-init-height=\"696\" title=\"removing punctuations from sentences\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/17-removing-punctuations-from-sentences.png?resize=613%2C258&amp;ssl=1\" data-width=\"613\" data-height=\"258\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">removing punctuations from sentences<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff85a315\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/18_output.png?resize=613%2C59&amp;ssl=1\" class=\"tve_image wp-image-4918\" alt=\"output\" data-id=\"4918\" width=\"613\" data-init-width=\"1276\" height=\"59\" data-init-height=\"123\" 
title=\"output\" loading=\"lazy\" data-width=\"613\" data-height=\"59\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4918\" alt=\"output\" data-id=\"4918\" width=\"613\" data-init-width=\"1276\" height=\"59\" data-init-height=\"123\" title=\"output\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/18_output.png?resize=613%2C59&amp;ssl=1\" data-width=\"613\" data-height=\"59\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Now we have to apply <strong>tokenization<\/strong> to all sentences.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff86b474\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/19-apply-word-tokenization-on-sentences.png?resize=613%2C294&amp;ssl=1\" class=\"tve_image wp-image-4922\" alt=\"apply word tokenization on sentences\" data-id=\"4922\" width=\"613\" data-init-width=\"1564\" height=\"294\" data-init-height=\"750\" title=\"apply word tokenization on sentences\" loading=\"lazy\" data-width=\"613\" data-height=\"294\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4922\" alt=\"apply word tokenization on sentences\" data-id=\"4922\" width=\"613\" data-init-width=\"1564\" height=\"294\" data-init-height=\"750\" title=\"apply word tokenization on sentences\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/19-apply-word-tokenization-on-sentences.png?resize=613%2C294&amp;ssl=1\" data-width=\"613\" data-height=\"294\" data-recalc-dims=\"1\"><\/span><\/div>\n<h4 class=\"\">Output<\/h4>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff881ccc\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/20_output.png?resize=613%2C38&amp;ssl=1\" class=\"tve_image wp-image-4925\" alt=\"output\" data-id=\"4925\" width=\"613\" data-init-width=\"1330\" height=\"38\" data-init-height=\"82\" title=\"output\" loading=\"lazy\" data-width=\"613\" data-height=\"38\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4925\" alt=\"output\" data-id=\"4925\" width=\"613\" data-init-width=\"1330\" height=\"38\" data-init-height=\"82\" title=\"output\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/20_output.png?resize=613%2C38&amp;ssl=1\" data-width=\"613\" data-height=\"38\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We can give these tokenized sentences to word2vec as input to the word2vec model.<\/p>\n<h4 class=\"\">Building word2vec with CBOW method<\/h4>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff896387\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/21-Building-word2vec-with-CBOW-method.png?resize=613%2C449&amp;ssl=1\" class=\"tve_image wp-image-4928\" alt=\"Building word2vec with CBOW method\" data-id=\"4928\" width=\"613\" data-init-width=\"1762\" height=\"449\" data-init-height=\"1290\" title=\"Building word2vec 
with CBOW method\" loading=\"lazy\" data-width=\"613\" data-height=\"449\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4928\" alt=\"Building word2vec with CBOW method\" data-id=\"4928\" width=\"613\" data-init-width=\"1762\" height=\"449\" data-init-height=\"1290\" title=\"Building word2vec with CBOW method\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/21-Building-word2vec-with-CBOW-method.png?resize=613%2C449&amp;ssl=1\" data-width=\"613\" data-height=\"449\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Output<\/h4>\n<p><em>Total number of words\u00a0<\/em><\/p>\n<p><em>79<\/em><\/p>\n<p><em>array([-0.20608747, \u00a00.05975117], dtype=float32)<\/em><\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Word2vec model building is done.<\/p>\n<p dir=\"ltr\">So let\u2019s see how it looks like by using <strong>matplotlib<\/strong> for visualization.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff8bff41\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/22-visualize-CBOW-word2vec-model.png?resize=613%2C457&amp;ssl=1\" class=\"tve_image wp-image-4934\" alt=\"visualize CBOW word2vec model\" data-id=\"4934\" width=\"613\" data-init-width=\"1294\" height=\"457\" data-init-height=\"966\" title=\"visualize CBOW word2vec model\" loading=\"lazy\" data-width=\"613\" data-height=\"457\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4934\" alt=\"visualize CBOW word2vec model\" data-id=\"4934\" width=\"613\" data-init-width=\"1294\" height=\"457\" data-init-height=\"966\" title=\"visualize CBOW word2vec model\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/22-visualize-CBOW-word2vec-model.png?resize=613%2C457&amp;ssl=1\" data-width=\"613\" data-height=\"457\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff8ca760\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/23_output.png?resize=391%2C274&amp;ssl=1\" class=\"tve_image wp-image-4936\" alt=\"output\" data-id=\"4936\" width=\"391\" data-init-width=\"391\" height=\"274\" data-init-height=\"274\" title=\"output\" loading=\"lazy\" data-width=\"391\" data-height=\"274\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4936\" alt=\"output\" data-id=\"4936\" width=\"391\" data-init-width=\"391\" height=\"274\" data-init-height=\"274\" title=\"output\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/23_output.png?resize=391%2C274&amp;ssl=1\" data-width=\"391\" data-height=\"274\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We can see in the above figure , node , tree, random, words are close to each other and also the distance between movie and algorithm. 
<p><em>[Output: scatter plot of the word vectors]</em></p>
<p>We can see in the figure above that the words "node", "tree", and "random" lie close to each other, and that there is a clear distance between "movie" and "algorithm". We may not be able to observe many more patterns like this because of the dataset size; with a larger dataset, they become much clearer.</p>
<h4>Building word2vec with the skip-gram method</h4>
<p><em>[Screenshot: building word2vec with the skip-gram method, and its output]</em></p>
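<p>The only change from the CBOW sketch is the <em>sg</em> flag:</p>
<pre>
from gensim.models import Word2Vec  # as in the CBOW sketch

# Same pipeline as before; sg=1 switches training to skip-gram.
model_sg = Word2Vec(tokenized, vector_size=2, window=5, min_count=1, sg=1)

print(len(model_sg.wv))
print(model_sg.wv[model_sg.wv.index_to_key[0]])
</pre>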
data-height=\"457\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff91dc64\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/27_output.png?resize=435%2C274&amp;ssl=1\" class=\"tve_image wp-image-4946\" alt=\"output\" data-id=\"4946\" width=\"435\" data-init-width=\"435\" height=\"274\" data-init-height=\"274\" title=\"output\" loading=\"lazy\" data-width=\"435\" data-height=\"274\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4946\" alt=\"output\" data-id=\"4946\" width=\"435\" data-init-width=\"435\" height=\"274\" data-init-height=\"274\" title=\"output\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/27_output.png?resize=435%2C274&amp;ssl=1\" data-width=\"435\" data-height=\"274\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Same as CBOW visualization graph here also same thing happens, \u00a0node , tree, random, words are close to each other and also the distance between movie and algorithm.<\/p>\n<h2 class=\"\" id=\"t-1597717516715\">Word embedding model using Pre-trained models<\/h2>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff931b29\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/word2vect-2.png?resize=613%2C368&amp;ssl=1\" class=\"tve_image wp-image-4950\" alt=\"word2vect\" data-id=\"4950\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"word2vect\" loading=\"lazy\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4950\" alt=\"word2vect\" data-id=\"4950\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"word2vect\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/word2vect-2.png?resize=613%2C368&amp;ssl=1\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">If our \u00a0dataset size is small, then we can get too many words, and if we can&#8217;t provide more sentences, the model will not learn more from our dataset. Otherwise if we want to build a word2vec model with a large corpus then it will require more resources like time,memory etc. <\/p>\n<p dir=\"ltr\">So how can we build a better word embedding model ? don\u2019t worry , we can utilize already trained models. Here we are using 2 most popular pre-trained word embedding models. 
<p>We won't explain these pre-trained models in detail here, but we will show how to use them.</p>
<h3>Google word2vec</h3>
<p>We can download the Google word2vec pretrained model from this <a href="https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing">link</a>. It is a compressed file, so you have to extract it before using it in the script.</p>
<p><em>[Screenshot: load google word2vec pretrained model file]</em></p>
<p>We will see how word embeddings capture the relations between words with the classic example:</p>
<blockquote><p><em>king - man = ? - woman</em></p></blockquote>
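<p>A sketch of loading the extracted file with gensim and asking that analogy question (the file name below is the standard one from that download):</p>
<pre>
from gensim.models import KeyedVectors

# Load the extracted binary file; this needs several GB of RAM.
google_model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# king - man + woman = ?  (expected top answer: 'queen')
print(google_model.most_similar(positive=['king', 'woman'],
                                negative=['man'], topn=1))
</pre>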
<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We will see how word embeddings capture the relations between words with the classic example: king &#8211; man + woman = ?<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff96d73a\"><span class=\"tve_image_frame\"><img class=\"tve_image wp-image-4957\" alt=\"apply pretrained word2vec on an example\" width=\"613\" height=\"175\" title=\"apply pretrained word2vec on an example\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/29-apply-pretrained-word2vec-on-an-example.png?resize=613%2C175&amp;ssl=1\"><\/span><\/div>\n<h4 class=\"\">Output<\/h4>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff97bcd4\"><span class=\"tve_image_frame\"><img class=\"tve_image wp-image-4960\" alt=\"output\" width=\"613\" height=\"40\" title=\"output\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/30_output.png?resize=613%2C40&amp;ssl=1\"><\/span><\/div>
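<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Here is a copy-paste-able sketch of the same analogy query; the file path is again an assumption about where you extracted the model.<\/p>\n<pre>
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                          binary=True)

# king - man + woman: most_similar adds the 'positive' vectors,
# subtracts the 'negative' ones, and ranks the nearest words
result = model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)  # expected to rank 'queen' first with these vectors
<\/pre>\n<\/div>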
<div class=\"thrv_wrapper thrv_text_element\">\n<h3 id=\"t-1597717516717\" class=\"\">Stanford Glove Embeddings<\/h3>\n<p dir=\"ltr\">GloVe stands for Global Vectors for Word Representation.<\/p>\n<p>We can download this pretrained model from <a href=\"http:\/\/nlp.stanford.edu\/data\/glove.6B.zip\" class=\"\">this<\/a> link. This file is also compressed, so we have to extract it; after extraction you can see several files. The Glove embedding model provides files of different dimensions, as shown below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff99334d\"><span class=\"tve_image_frame\"><img class=\"tve_image wp-image-4964\" alt=\"glove file_extracting\" width=\"272\" height=\"103\" title=\"glove file_extracting\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/32-glove-file_extracting.png?resize=272%2C103&amp;ssl=1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Before gensim can load these vectors, we have one prerequisite task: convert the GloVe embedding file to word2vec format using the <strong>glove2word2vec()<\/strong> function. From those files, we take the 100-dimensional one, <strong>glove.6B.100d.txt<\/strong>.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff9ee353\"><span class=\"tve_image_frame\"><img class=\"tve_image wp-image-4970\" alt=\"load glove pretrained model and Apply on an example\" width=\"613\" height=\"426\" title=\"load glove pretrained model and Apply on an example\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/33-a-load_glove_pretrained_model_and_Apply_on_an_example.png?resize=613%2C426&amp;ssl=1\"><\/span><\/div>
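<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Putting those two steps together, here is a minimal sketch. The file names assume the extracted <strong>glove.6B<\/strong> archive is in the working directory; <strong>glove2word2vec()<\/strong> ships with gensim, although recent gensim versions can also load the raw GloVe file directly.<\/p>\n<pre>
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Add the word2vec header (vocabulary size, dimensions) to the raw GloVe file
glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.word2vec.txt')

# The converted file is plain text, hence binary=False
glove_model = KeyedVectors.load_word2vec_format('glove.6B.100d.word2vec.txt',
                                                binary=False)

# The same king - man + woman analogy works with GloVe vectors as well
print(glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
<\/pre>\n<\/div>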
<h4 class=\"\">Output<\/h4>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173ff9b75c0\"><span class=\"tve_image_frame\"><img class=\"tve_image wp-image-4968\" alt=\"output\" width=\"364\" height=\"37\" title=\"output\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/33_output.png?resize=364%2C37&amp;ssl=1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h2 id=\"t-1597717516718\" class=\"\">Conclusion<\/h2>\n<p dir=\"ltr\">We can use any of these text feature extraction methods, depending on our project requirements, because every method has its own advantages: Bag-of-Words suits text classification, TF-IDF works well for document classification, and if you want semantic relations between words, go with word2vec.<\/p>\n<p dir=\"ltr\">We can&#8217;t say blindly which type of feature extraction gives better results. In general, building word embeddings from our own dataset or corpus gives better results, but we don&#8217;t always have a large enough dataset; in that case, we can use pre-trained models with <strong>transfer learning<\/strong>.<\/p>\n<p dir=\"ltr\">We didn&#8217;t explain the <strong>transfer learning<\/strong> concept in this article; we will surely explain how to apply transfer learning to fine-tune pre-trained word embeddings on our own corpus in a future article.<\/p>\n<h4 class=\"\">Recommended NLP courses<\/h4>\n<ul class=\"\">\n<li class=\"\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/nlp-specialization\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">NLP Specialization with Python<\/a><\/li>\n<li class=\"\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/nlp-classification-vector-spaces\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">NLP Classification and Vector Spaces<\/a><\/li>\n<li class=\"\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/spacy-nlp-python-course\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">NLP Model Building With Python<\/a><\/li>\n<\/ul>\n<\/div>
<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/dataaspirant.com\/word-embedding-techniques-nlp\/<\/p>\n","protected":false},"author":0,"featured_media":453,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/452"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=452"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/452\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/453"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=452"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=452"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=452"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}