{"id":9,"date":"2020-08-04T11:58:38","date_gmt":"2020-08-04T11:58:38","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/04\/how-to-build-an-effective-email-spam-classification-model-with-spacy-python\/"},"modified":"2020-08-04T11:58:38","modified_gmt":"2020-08-04T11:58:38","slug":"how-to-build-an-effective-email-spam-classification-model-with-spacy-python","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/04\/how-to-build-an-effective-email-spam-classification-model-with-spacy-python\/","title":{"rendered":"How To Build an Effective Email Spam Classification model with Spacy Python"},"content":{"rendered":"<div id=\"tve_editor\" data-post-id=\"3956\" readability=\"156.47377622378\">\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1736045fb73\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Email-classifier-spacy-python-final.png?resize=613%2C368&amp;ssl=1\" class=\"tve_image wp-image-3913\" alt=\"Email classifier with space python\" data-id=\"3913\" width=\"613\" data-init-width=\"700\" height=\"368\" data-init-height=\"420\" title=\"Building Email Classifier with Spacy Python\" loading=\"lazy\" data-width=\"613\" data-height=\"368\" data-css=\"tve-u-17360460af7\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Email-classifier-spacy-python-final.png?w=700&amp;ssl=1 700w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Email-classifier-spacy-python-final.png?resize=300%2C180&amp;ssl=1 300w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-3913\" alt=\"Email classifier with space python\" data-id=\"3913\" width=\"613\" data-init-width=\"700\" height=\"368\" data-init-height=\"420\" 
title=\"Building Email Classifier with Spacy Python\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Email-classifier-spacy-python-final.png?resize=613%2C368&amp;ssl=1\" data-width=\"613\" data-height=\"368\" data-css=\"tve-u-17360460af7\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Email-classifier-spacy-python-final.png?w=700&amp;ssl=1 700w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Email-classifier-spacy-python-final.png?resize=300%2C180&amp;ssl=1 300w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><\/noscript><\/span>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\" readability=\"51.297184567258\">\n<p>The amount of data captured in different forms is growing rapidly; this includes numerical data, text data, image data, and so on.<\/p>\n<p>Numerical data is the main source for building various <a href=\"https:\/\/dataaspirant.com\/random-forest-classifier-python-scikit-learn\/\" target=\"_blank\" class=\"tve-froala fr-basic\" rel=\"noopener noreferrer\">machine learning<\/a> and statistical models, but with the increase in text data, people are using <a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/nlp-specialization\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" title=\"nlp-specialization\" class=\"thirstylink\" data-linkid=\"4065\">natural language processing<\/a> techniques to extract meaningful information from text data and gain insights that help in making key business decisions.<\/p>\n<p>Generating insights from text data is not as simple as generating insights from numerical data because text data is not structured. 
The first step in processing text data is to convert the unstructured text into structured data.<\/p>\n<p>Various Python libraries are available for working with text data, such as <strong>NLTK, spaCy, TextBlob<\/strong>.<\/p>\n<p>In this article, we are using the\u00a0<a href=\"http:\/\/spacy.io\/\" class=\"tve-froala\">spaCy<\/a> natural language Python library to build an email spam classification model that identifies whether an email is <strong>spam or not<\/strong> in just a few lines of code.<\/p>\n<\/div>\n<h2 class=\"\" id=\"t-1595062139712\"><strong>Email spam classification model building<\/strong><\/h2>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"36\">\n<p>We check our emails regularly, but not all the emails sent to our account appear in the inbox. Many of them go to the <strong>spam or junk<\/strong> folders.<\/p>\n<blockquote class=\"\"><p>Ever wondered how this happens?<\/p><\/blockquote>\n<p>How are emails classified and sent to the <strong>inbox or spam<\/strong> folder based on the email text?<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"51\">\n<p>Before an email reaches your inbox, Google uses its own <strong>email classifier,<\/strong> which decides whether the received email should be sent to the inbox or to spam.<\/p>\n<p>If you are still wondering how the email classifier works, don&#8217;t worry.<\/p>\n<p>In this article, we are going to build an <strong>email spam classifier<\/strong> in Python that classifies whether a given email is spam or not.<\/p>\n<p>There are a number of ways to build an email classifier using different Natural Language Processing algorithms; we can use scikit-learn or any other package. 
But in this article, we are going to use the spaCy library to build the email classifier.<\/p>\n<p>The main advantage of spaCy is that its code is <strong>well optimized<\/strong>, and it comes with many options that help us build a model in very little time and with minimal code.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"32\">\n<p>Without any delay, let\u2019s start building the email classification model now.<\/p>\n<\/div>\n<h3 class=\"\" id=\"t-1595062139713\"><strong>Load spam-ham email dataset<\/strong><\/h3>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"35\">\n<p>To build the email classifier, we are going to use the email spam dataset downloaded from the Kaggle NLP datasets.<\/p>\n<p>Let\u2019s load the dataset using the <strong>pandas<\/strong> read_csv method.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption tve_ea_thrive_zoom\" data-css=\"tve-u-1736121cc58\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=613%2C83&amp;ssl=1\" class=\"tve_image wp-image-3975 tve_evt_manager_listen tve_et_click\" alt=\"Email spam classification load dataset\" data-id=\"3975\" width=\"613\" 
data-init-width=\"2870\" height=\"83\" data-init-height=\"388\" title=\"email-spam-spacy-load-dataset-output\" loading=\"lazy\" data-width=\"613\" data-height=\"83\" data-tcb-events='__TCB_EVENT_[{\"t\":\"click\",\"a\":\"thrive_zoom\",\"c\":{\"id\":\"3975\",\"size\":\"full\"}}]_TNEVE_BCT__' srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?w=2870&amp;ssl=1 2870w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=300%2C41&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=1024%2C138&amp;ssl=1 1024w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=768%2C104&amp;ssl=1 768w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=1536%2C208&amp;ssl=1 1536w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=2048%2C277&amp;ssl=1 2048w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-3975 tve_evt_manager_listen tve_et_click\" alt=\"Email spam classification load dataset\" data-id=\"3975\" width=\"613\" data-init-width=\"2870\" height=\"83\" data-init-height=\"388\" title=\"email-spam-spacy-load-dataset-output\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=613%2C83&amp;ssl=1\" data-width=\"613\" data-height=\"83\" data-tcb-events='__TCB_EVENT_[{\"t\":\"click\",\"a\":\"thrive_zoom\",\"c\":{\"id\":\"3975\",\"size\":\"full\"}}]_TNEVE_BCT__' 
srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?w=2870&amp;ssl=1 2870w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=300%2C41&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=1024%2C138&amp;ssl=1 1024w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=768%2C104&amp;ssl=1 768w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=1536%2C208&amp;ssl=1 1536w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?resize=2048%2C277&amp;ssl=1 2048w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-spacy-load-dataset-output.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Email spam classification load dataset output<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"38\">\n<p>Let\u2019s quickly check the attributes of the loaded dataset. The dataset has 2 columns: one contains the label, the other contains the text. 
<\/p>\n<ul class=\"\">\n<li class=\"\">\n<strong>label:<\/strong> Identifies whether the text is spam or ham.<\/li>\n<li class=\"\">\n<strong>text: <\/strong>The email text<\/li>\n<\/ul>\n<p>Now let&#8217;s check the total number of observations in the loaded dataset and look at the spam and ham distribution in the data.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption tve_ea_thrive_zoom\" data-css=\"tve-u-17361757ebb\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=613%2C85&amp;ssl=1\" class=\"tve_image wp-image-3995 tve_evt_manager_listen tve_et_click\" alt=\"Email spam &amp; ham dataset size\" data-id=\"3995\" width=\"613\" data-init-width=\"2834\" height=\"85\" data-init-height=\"394\" title=\"email-spam-ham-data-size\" loading=\"lazy\" data-width=\"613\" data-height=\"85\" data-tcb-events='__TCB_EVENT_[{\"t\":\"click\",\"a\":\"thrive_zoom\",\"c\":{\"id\":\"3995\",\"size\":\"full\"}}]_TNEVE_BCT__' data-css=\"tve-u-1736175c66c\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?w=2834&amp;ssl=1 2834w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=300%2C42&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=1024%2C142&amp;ssl=1 1024w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=768%2C107&amp;ssl=1 768w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=1536%2C214&amp;ssl=1 1536w, 
https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=2048%2C285&amp;ssl=1 2048w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-3995 tve_evt_manager_listen tve_et_click\" alt=\"Email spam &amp; ham dataset size\" data-id=\"3995\" width=\"613\" data-init-width=\"2834\" height=\"85\" data-init-height=\"394\" title=\"email-spam-ham-data-size\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=613%2C85&amp;ssl=1\" data-width=\"613\" data-height=\"85\" data-tcb-events='__TCB_EVENT_[{\"t\":\"click\",\"a\":\"thrive_zoom\",\"c\":{\"id\":\"3995\",\"size\":\"full\"}}]_TNEVE_BCT__' data-css=\"tve-u-1736175c66c\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?w=2834&amp;ssl=1 2834w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=300%2C42&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=1024%2C142&amp;ssl=1 1024w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=768%2C107&amp;ssl=1 768w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=1536%2C214&amp;ssl=1 1536w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?resize=2048%2C285&amp;ssl=1 2048w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-data-size.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Email spam 
&amp; ham dataset size<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"32\">\n<p>We have <strong>5572<\/strong> observations in the loaded data.<\/p>\n<ul class=\"\">\n<li class=\"\">ham count: <strong>4825<\/strong>\n<\/li>\n<li class=\"\">spam count: <strong>747<\/strong>\n<\/li>\n<\/ul>\n<\/div>\n<div class=\"thrv_wrapper thrv-columns\">\n<div class=\"tcb-flex-row v-2 tcb--cols--1\">\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col\" readability=\"6\">\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-1736176bbc5\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=613%2C383&amp;ssl=1\" class=\"tve_image wp-image-3996\" alt=\"Spam and ham data percentage\" data-id=\"3996\" width=\"613\" data-init-width=\"1158\" height=\"383\" data-init-height=\"724\" title=\"spam-ham-percentages\" loading=\"lazy\" data-width=\"613\" data-height=\"383\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?w=1158&amp;ssl=1 1158w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=300%2C188&amp;ssl=1 300w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=1024%2C640&amp;ssl=1 1024w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=768%2C480&amp;ssl=1 768w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-3996\" alt=\"Spam and ham data percentage\" data-id=\"3996\" width=\"613\" data-init-width=\"1158\" height=\"383\" data-init-height=\"724\" title=\"spam-ham-percentages\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=613%2C383&amp;ssl=1\" data-width=\"613\" data-height=\"383\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?w=1158&amp;ssl=1 1158w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=300%2C188&amp;ssl=1 300w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=1024%2C640&amp;ssl=1 1024w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/spam-ham-percentages.png?resize=768%2C480&amp;ssl=1 768w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Spam and ham distribution<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"33\">\n<p>The above graph shows the ham and spam distribution. 
The distribution shows that the dataset contains <strong>far more ham<\/strong> than spam: ham makes up roughly 87% of the observations.<\/p>\n<\/div>\n<h3 id=\"t-1595062139714\" class=\"\">Create spaCy text categorization model pipeline<\/h3>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"33\">\n<p>To build models with spaCy, you can load one of the existing pipeline models, or you can create an empty model and add the modeling steps to it in a pipeline fashion.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"65\">\n<p>Let\u2019s understand the above code line by line.<\/p>\n<p>In the <strong>10th line<\/strong>, we create the empty model with spaCy, passing the language, which is English (en).<\/p>\n<p>In the next lines, we create a pipeline component that tells the model to perform text classification.<\/p>\n<p>In the config we specify <strong>exclusive classes,<\/strong> which means each email belongs to exactly one of the target classes, in our case <strong>spam or ham<\/strong>. 
We use the <strong>bow<\/strong> architecture, which means modeling the text with the bag-of-words approach.<\/p>\n<p>We could also use one of the other built-in spaCy architectures, but for this article we are using the bag-of-words approach.<\/p>\n<p>Next, we add the created <strong>text_cat<\/strong> pipeline to our empty model.<\/p>\n<p>At this stage we have a model that is set up to perform text classification using the bag-of-words approach.<\/p>\n<p>Next, we add the target classes spam and ham to the created text categorization model.<\/p>\n<p>In this article we\u2019re not going to apply any NLP-related data preprocessing techniques to the text.<\/p>\n<p>Generally, people do a number of things before building a text-related model, such as:<\/p>\n<ul class=\"\">\n<li class=\"\">Creating tokens<\/li>\n<li class=\"\">Cleaning tokens<\/li>\n<li class=\"\">Removing stop words<\/li>\n<li class=\"\">Lemmatization<\/li>\n<li class=\"\">Stemming, etc.<\/li>\n<\/ul>\n<p>But we are <strong>not performing<\/strong> any of these for this article. 
If you want, you can perform these steps before starting the modeling.<\/p>\n<\/div>\n<h3 id=\"t-1595062139715\" class=\"\">Split train and test datasets<\/h3>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"40\">\n<p>We will split the loaded data into two separate datasets.<\/p>\n<ul class=\"\">\n<li class=\"\">\n<strong>Train<\/strong> dataset: For training the text categorization model.<\/li>\n<li class=\"\">\n<strong>Test<\/strong> dataset: For validating the performance of the model.<\/li>\n<\/ul>\n<p>To split the data into these 2 datasets, we are using the train_test_split method from scikit-learn\u2019s model_selection module, in such a way that the test data will be 33% of the loaded data.<\/p>\n<p>As an illustration, let\u2019s say we have 100 records in the loaded dataset. If we specify the test size as 30%, then the train and test split method keeps 70 records for training and the remaining 30 records for testing.<\/p>\n<\/div>\n<h3 id=\"t-1595062139716\" class=\"\">Creating training model data<\/h3>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"41\">\n<p>Now we have two datasets: one for training and the other for testing our model.<\/p>\n<p>Unlike with scikit-learn models, you cannot pass the target to spaCy as a single column; we need to explicitly create the targets as boolean values for each class.<\/p>\n<p>That is, for each email text, we state for which class the target label is true and for which class it is false.<\/p>\n<p>Let\u2019s do that in the code below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"42\">\n<p>Basically, we are creating one-hot encodings for the target categories: we create two boolean labels and assign true to the actual label and false to the other label.<\/p>\n<p>Now we have the features and targets for training the model, but first we need to combine them into a single dataset to build the email classification model.<\/p>\n<p>We are taking features (email text), converted 
train labels (booleans) and joining them using the\u00a0<strong>zip method;<\/strong> we apply the same approach to both the training and test datasets.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"33\">\n<h2 class=\"\" id=\"t-1595062139717\">Creating a training function<\/h2>\n<p>Now we have both the train and test data to build the model. Next, let\u2019s create a <strong>train function<\/strong> that takes the input parameters below to build the model.<\/p>\n<ul class=\"\">\n<li class=\"\">\n<strong>model: <\/strong>The empty model we created<\/li>\n<li class=\"\">\n<strong>train data:<\/strong> The train data we created<\/li>\n<li class=\"\">\n<strong>optimizer:<\/strong> Optimizer (created before calling the train function)<\/li>\n<li class=\"\">\n<strong>batch size:<\/strong> Size of the batches<\/li>\n<li class=\"\">\n<strong>epochs:<\/strong> Number of epochs<\/li>\n<\/ul>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"36\">\n<p>Let&#8217;s understand the above code line by line.<\/p>\n<p>For each epoch, we shuffle the data using the random shuffle method and then create the batches. 
For each batch, we update the model using the optimizer and, at the end, capture the losses.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"35\">\n<p>To build the email classifier model, we create the optimizer and run the training function created above with a batch size of 5 for 10 epochs.<\/p>\n<p>Let\u2019s see what losses we get.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-17366913a38\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=613%2C122&amp;ssl=1\" class=\"tve_image wp-image-4025\" alt=\"email spam classifier losses\" data-id=\"4025\" width=\"613\" data-init-width=\"2862\" height=\"122\" data-init-height=\"570\" title=\"email-spam-ham-train-losses\" loading=\"lazy\" data-width=\"613\" data-height=\"122\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?w=2862&amp;ssl=1 2862w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=300%2C60&amp;ssl=1 300w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=1024%2C204&amp;ssl=1 1024w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=768%2C153&amp;ssl=1 768w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=1536%2C306&amp;ssl=1 1536w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=2048%2C408&amp;ssl=1 2048w, 
https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4025\" alt=\"email spam classifier losses\" data-id=\"4025\" width=\"613\" data-init-width=\"2862\" height=\"122\" data-init-height=\"570\" title=\"email-spam-ham-train-losses\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=613%2C122&amp;ssl=1\" data-width=\"613\" data-height=\"122\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?w=2862&amp;ssl=1 2862w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=300%2C60&amp;ssl=1 300w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=1024%2C204&amp;ssl=1 1024w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=768%2C153&amp;ssl=1 768w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=1536%2C306&amp;ssl=1 1536w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?resize=2048%2C408&amp;ssl=1 2048w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-train-losses.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">email spam classifier losses<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"36\">\n<h2 class=\"\" id=\"t-1595062139718\">Creating a prediction function<\/h2>\n<p>We have trained the model, so now we can check how well the model we built performs.<\/p>\n<p>For that, we need to create a 
prediction function. Before doing that, let\u2019s check what the model predicts for a given email text.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173669d3090\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=613%2C55&amp;ssl=1\" class=\"tve_image wp-image-4029\" alt=\"Email spacy spam classifier sample predictions \" data-id=\"4029\" width=\"613\" data-init-width=\"2880\" height=\"55\" data-init-height=\"260\" title=\"email-spam-ham-sample-predictions\" loading=\"lazy\" data-width=\"613\" data-height=\"55\" data-css=\"tve-u-173669d3f76\" srcset=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?w=2880&amp;ssl=1 2880w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=300%2C27&amp;ssl=1 300w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=1024%2C92&amp;ssl=1 1024w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=768%2C69&amp;ssl=1 768w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=1536%2C139&amp;ssl=1 1536w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=2048%2C185&amp;ssl=1 2048w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4029\" alt=\"Email spacy spam classifier sample predictions \" 
data-id=\"4029\" width=\"613\" data-init-width=\"2880\" height=\"55\" data-init-height=\"260\" title=\"email-spam-ham-sample-predictions\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=613%2C55&amp;ssl=1\" data-width=\"613\" data-height=\"55\" data-css=\"tve-u-173669d3f76\" srcset=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?w=2880&amp;ssl=1 2880w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=300%2C27&amp;ssl=1 300w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=1024%2C92&amp;ssl=1 1024w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=768%2C69&amp;ssl=1 768w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=1536%2C139&amp;ssl=1 1536w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?resize=2048%2C185&amp;ssl=1 2048w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-ham-sample-predictions.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Email spacy spam classifier sample predictions<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"38\">\n<p>For the above email text, the actual label is ham, and our model assigns a high probability of nearly <strong>99%<\/strong><br \/>\n<strong>for ham<\/strong> and 1% for spam. 
This means our model classifies this email text correctly.<\/p>\n<p>Now let\u2019s write a <strong>generalized function<\/strong> that takes the model and the email texts and predicts the outcome labels.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"45\">\n<p>The predictions function takes two parameters: one is the model, the other is the email text, which is basically the email content.<\/p>\n<p>In the <strong>4th<\/strong> line of the function, the email text is tokenized and the result is stored in docs.<\/p>\n<p>In the next lines, we get the <strong>textcat<\/strong> component we created and use the textcat object to predict the email class, ham or spam, for the text.<\/p>\n<p>The scores are basically the probabilities for both classes. To identify the label class, we take the class with the maximum probability using argmax, and then we return the predictions.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"36\">\n<h2 class=\"\" id=\"t-1595062139719\">Spam Email Classifier Model Evaluation<\/h2>\n<p>To measure the performance of the email classifier we built, we are going to use the accuracy and confusion matrix metrics.<\/p>\n<h3 class=\"\" id=\"t-1595175862402\">Accuracy<\/h3>\n<p>Now let\u2019s call the predictions function on the training dataset and the test dataset to measure the accuracy of the model.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"34\">\n<p>To calculate the accuracy, we are using scikit-learn\u2019s accuracy_score method. This method takes two parameters: the actual labels and the predicted labels.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173680e9cf0\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" 
data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=613%2C46&amp;ssl=1\" class=\"tve_image wp-image-4044\" alt=\"Email spam classifier accuracy\" data-id=\"4044\" width=\"613\" data-init-width=\"2874\" height=\"46\" data-init-height=\"214\" title=\"email-spam-classifier-accuracy\" loading=\"lazy\" data-width=\"613\" data-height=\"46\" data-css=\"tve-u-173680eb231\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?w=2874&amp;ssl=1 2874w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=300%2C22&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=1024%2C76&amp;ssl=1 1024w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=768%2C57&amp;ssl=1 768w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=1536%2C114&amp;ssl=1 1536w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=2048%2C152&amp;ssl=1 2048w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4044\" alt=\"Email spam classifier accuracy\" data-id=\"4044\" width=\"613\" data-init-width=\"2874\" height=\"46\" data-init-height=\"214\" title=\"email-spam-classifier-accuracy\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=613%2C46&amp;ssl=1\" data-width=\"613\" data-height=\"46\" data-css=\"tve-u-173680eb231\" 
srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?w=2874&amp;ssl=1 2874w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=300%2C22&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=1024%2C76&amp;ssl=1 1024w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=768%2C57&amp;ssl=1 768w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=1536%2C114&amp;ssl=1 1536w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?resize=2048%2C152&amp;ssl=1 2048w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/email-spam-classifier-accuracy.png?w=1380&amp;ssl=1 1380w\" sizes=\"(max-width: 613px) 100vw, 613px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Email spam classifier accuracy<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"49.162303664921\">\n<p>On the training dataset the model reaches <strong>99%<\/strong> accuracy, and on the test dataset it reaches <strong>98%<\/strong>.<\/p>\n<p>When evaluating any model, however, accuracy alone <strong>cannot<\/strong> tell us whether the model is actually good.<\/p>\n<p>Let me give you an example.<\/p>\n<p>Let\u2019s say we are predicting whether a bank customer will <strong>default<\/strong> or not. If we look at the data, the default rate will be very low, let\u2019s say 10%. 
Suppose you build a model to predict whether a customer is going to default, and you see an accuracy of 90%.<\/p>\n<blockquote class=\"\"><p>\n<strong>Do you think the model is good?<\/strong><\/p><\/blockquote>\n<p>Take a moment and think about that. The data itself says that 90% of people are <strong>not<\/strong> going to default and 10% are going to default. If we simply say that nobody is going to default, without building any machine learning model, the accuracy is already 90%.<\/p>\n<p>To handle this kind of issue we use other metrics that check whether the accuracy we got is reasonable. For that we can use the <a href=\"https:\/\/dataaspirant.com\/confusion-matrix-sklearn-python\/\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>confusion matrix<\/strong><\/a>.<\/p>\n<\/div>\n<h3 class=\"\" id=\"t-1595178620540\">Confusion matrix<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173682c08fd\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_train.png?resize=559%2C466&amp;ssl=1\" class=\"tve_image wp-image-4054\" alt=\"Confusion matrix on train dataset\" data-id=\"4054\" width=\"559\" data-init-width=\"559\" height=\"466\" data-init-height=\"466\" title=\"confusion_matrix_train\" loading=\"lazy\" data-width=\"559\" data-height=\"466\" data-css=\"tve-u-173682c1aaf\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_train.png?w=559&amp;ssl=1 559w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_train.png?resize=300%2C250&amp;ssl=1 300w\" sizes=\"(max-width: 559px) 100vw, 559px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4054\" alt=\"Confusion matrix on train dataset\" data-id=\"4054\" width=\"559\" 
data-init-width=\"559\" height=\"466\" data-init-height=\"466\" title=\"confusion_matrix_train\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_train.png?resize=559%2C466&amp;ssl=1\" data-width=\"559\" data-height=\"466\" data-css=\"tve-u-173682c1aaf\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_train.png?w=559&amp;ssl=1 559w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_train.png?resize=300%2C250&amp;ssl=1 300w\" sizes=\"(max-width: 559px) 100vw, 559px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Confusion matrix on train dataset<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173682cdd3d\" readability=\"7\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_test.png?resize=559%2C466&amp;ssl=1\" class=\"tve_image wp-image-4055\" alt=\"Confusion matrix on test dataset\" data-id=\"4055\" width=\"559\" data-init-width=\"559\" height=\"466\" data-init-height=\"466\" title=\"confusion_matrix_test\" loading=\"lazy\" data-width=\"559\" data-height=\"466\" data-css=\"tve-u-173682ceb25\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_test.png?w=559&amp;ssl=1 559w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_test.png?resize=300%2C250&amp;ssl=1 300w\" sizes=\"(max-width: 559px) 100vw, 559px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4055\" alt=\"Confusion matrix on test dataset\" data-id=\"4055\" width=\"559\" data-init-width=\"559\" height=\"466\" data-init-height=\"466\" title=\"confusion_matrix_test\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_test.png?resize=559%2C466&amp;ssl=1\" data-width=\"559\" data-height=\"466\" data-css=\"tve-u-173682ceb25\" srcset=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_test.png?w=559&amp;ssl=1 559w, https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/confusion_matrix_test.png?resize=300%2C250&amp;ssl=1 300w\" sizes=\"(max-width: 559px) 100vw, 559px\" data-recalc-dims=\"1\"><\/noscript><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Confusion matrix on test dataset<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\" readability=\"32.890070921986\">\n<p>The confusion matrix shows the correct predictions and the misclassifications for each target class. The confusion matrix plots above show this for the train and test datasets.<\/p>\n<h4 class=\"\">Complete code<\/h4>\n<p>You can fork the complete code for building the email spam classifier in our <a href=\"https:\/\/github.com\/saimadhu-polamuri\/DataAspirant_codes\/tree\/master\/spacy-email-classifier\" target=\"_blank\" class=\"tve-froala fr-basic\" rel=\"noopener noreferrer\">GitHub repository<\/a>.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" readability=\"40\">\n<h2 class=\"\" id=\"t-1595062139720\">What Next?<\/h2>\n<p>Now that we have built the email spam classifier, you can use the same pipeline to solve similar kinds of problems. All you need to do is change the inputs and outputs and apply a few natural language text preprocessing techniques before modeling. For example:<\/p>\n<ul class=\"\">\n<li>Sentiment analysis classifier<\/li>\n<li>Document topic identification<\/li>\n<li>Sports or business news articles classification<\/li>\n<\/ul>\n<h2 class=\"\" id=\"t-1595062139721\">Conclusion<\/h2>\n<p>In this article we learned how to build an email spam classifier with spaCy in a few lines of code. 
Don&#8217;t stop here; try using the same code to build other text classification models.<\/p>\n<h4 class=\"\">Recommended NLP courses<\/h4>\n<\/div>\n<div class=\"thrv_wrapper thrv-page-section thrv-lp-block\" data-inherit-lp-settings=\"1\" data-css=\"tve-u-1736ab1855f\" tcb-template-name=\"Product Highlight 03\" tcb-template-id=\"60907\" data-keep-css_id=\"1\">\n<div class=\"tve-page-section-in tve_empty_dropzone  \" data-css=\"tve-u-1736ab1892e\">\n<div class=\"thrv_wrapper thrv-columns dynamic-group-kbt3q0q7\" data-css=\"tve-u-1736ab18561\">\n<div class=\"tcb-flex-row v-2 tcb--cols--3 tcb-medium-no-wrap tcb-mobile-wrap m-edit\" data-css=\"tve-u-1736ab18562\">\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-1736ab18563\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-1736ab18564\" readability=\"22.28125\">\n<div class=\"tve-cb\" readability=\"5.03125\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z img_style_rounded_corners\" data-css=\"tve-u-1736ab1856f\">\n<span class=\"tve_image_frame\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/nlp-specialization\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" class=\"hasimg thirstylinkimg\" title=\"nlp-specialization\" data-linkid=\"4065\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4068\" alt=\"nlp specialization\" data-id=\"4068\" width=\"172\" data-init-width=\"300\" height=\"172\" data-init-height=\"300\" title=\"NLP specialization\" loading=\"lazy\" data-width=\"172\" data-height=\"172\" mt-d=\"0\" data-css=\"tve-u-1736ab18570\" center-v-d=\"false\" mt-t=\"-65\" mt-m=\"-86\" ml-d=\"0\" 
data-link-wrap=\"true\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?w=300&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?resize=150%2C150&amp;ssl=1 150w\" sizes=\"(max-width: 172px) 100vw, 172px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4068\" alt=\"nlp specialization\" data-id=\"4068\" width=\"172\" data-init-width=\"300\" height=\"172\" data-init-height=\"300\" title=\"NLP specialization\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" mt-d=\"0\" data-css=\"tve-u-1736ab18570\" center-v-d=\"false\" mt-t=\"-65\" mt-m=\"-86\" ml-d=\"0\" data-link-wrap=\"true\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?w=300&amp;ssl=1 300w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/nlp-spelization.jpg?resize=150%2C150&amp;ssl=1 150w\" sizes=\"(max-width: 172px) 100vw, 172px\" data-recalc-dims=\"1\"><\/noscript><\/a><span class=\"tve-image-overlay\"><\/span><\/span>\n<\/div>\n<h4 class=\"\" data-css=\"tve-u-1736ab18572\">NLP Specialization with Python\u00a0<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-1736ab18563\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-1736ab1857e\" readability=\"22.794117647059\">\n<div class=\"tve-cb\" readability=\"5.1470588235294\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z img_style_rounded_corners\" data-css=\"tve-u-1736ab1857f\">\n<span class=\"tve_image_frame\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/nlp-classification-vector-spaces\/\" target=\"_blank\" rel=\"nofollow 
noopener noreferrer\" class=\"hasimg thirstylinkimg\" title=\"nlp-classification-vector-spaces\" data-linkid=\"4066\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4070\" alt=\"Natural-language-processing-classifiation-vector-spaces\" data-id=\"4070\" width=\"172\" data-init-width=\"300\" height=\"172\" data-init-height=\"300\" title=\"Natural-language-processing-2\" loading=\"lazy\" data-width=\"172\" data-height=\"172\" mt-d=\"0\" data-css=\"tve-u-1736ab18580\" center-v-d=\"false\" mt-t=\"-65\" mt-m=\"-86\" ml-d=\"0\" data-link-wrap=\"true\" srcset=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?w=300&amp;ssl=1 300w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?resize=150%2C150&amp;ssl=1 150w\" sizes=\"(max-width: 172px) 100vw, 172px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4070\" alt=\"Natural-language-processing-classifiation-vector-spaces\" data-id=\"4070\" width=\"172\" data-init-width=\"300\" height=\"172\" data-init-height=\"300\" title=\"Natural-language-processing-2\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" mt-d=\"0\" data-css=\"tve-u-1736ab18580\" center-v-d=\"false\" mt-t=\"-65\" mt-m=\"-86\" ml-d=\"0\" data-link-wrap=\"true\" srcset=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?w=300&amp;ssl=1 300w, https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/Natural-language-processing-2.jpg?resize=150%2C150&amp;ssl=1 150w\" sizes=\"(max-width: 172px) 
100vw, 172px\" data-recalc-dims=\"1\"><\/noscript><\/a><span class=\"tve-image-overlay\"><\/span><\/span>\n<\/div>\n<h4 class=\"\" data-css=\"tve-u-1736ab18582\">NLP Classification and Vector spaces<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-1736ab18563\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-1736ab1858d\" readability=\"22\">\n<div class=\"tve-cb\" readability=\"4.9677419354839\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z img_style_rounded_corners\" data-css=\"tve-u-1736ab1858e\">\n<span class=\"tve_image_frame\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/spacy-nlp-python-course\/\" rel=\"nofollow noopener noreferrer\" target=\"_blank\" class=\"hasimg thirstylinkimg\" title=\"spacy-nlp-python-course\" data-linkid=\"4064\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4072\" alt=\"natural-language-processing-python\" data-id=\"4072\" width=\"172\" data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"natural-language-processing-python\" loading=\"lazy\" data-width=\"172\" data-height=\"172\" mt-d=\"0\" data-css=\"tve-u-1736ab1858f\" center-v-d=\"false\" mt-t=\"-65\" mt-m=\"-86\" ml-d=\"0\" data-link-wrap=\"true\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?resize=150%2C150&amp;ssl=1 150w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?zoom=2&amp;resize=172%2C172&amp;ssl=1 344w, 
https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?zoom=3&amp;resize=172%2C172&amp;ssl=1 516w\" sizes=\"(max-width: 172px) 100vw, 172px\" data-recalc-dims=\"1\"><noscript><img class=\"tve_image wp-image-4072\" alt=\"natural-language-processing-python\" data-id=\"4072\" width=\"172\" data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"natural-language-processing-python\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" mt-d=\"0\" data-css=\"tve-u-1736ab1858f\" center-v-d=\"false\" mt-t=\"-65\" mt-m=\"-86\" ml-d=\"0\" data-link-wrap=\"true\" srcset=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?resize=150%2C150&amp;ssl=1 150w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?zoom=2&amp;resize=172%2C172&amp;ssl=1 344w, https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/07\/natural-language-processing-python.jpg?zoom=3&amp;resize=172%2C172&amp;ssl=1 516w\" sizes=\"(max-width: 172px) 100vw, 172px\" data-recalc-dims=\"1\"><\/noscript><\/a><span class=\"tve-image-overlay\"><\/span><\/span>\n<\/div>\n<h4 class=\"\" data-css=\"tve-u-1736ab18591\">NLP Model Building With 
Python<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/dataaspirant.com\/build-email-spam-classification-model-spacy-python\/<\/p>\n","protected":false},"author":1,"featured_media":10,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/9"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=9"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/9\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/10"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=9"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=9"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=9"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}