{"id":8375,"date":"2021-07-20T01:33:07","date_gmt":"2021-07-20T01:33:07","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/07\/20\/building-a-named-entity-recognition-model-using-a-bilstm-crf-network\/"},"modified":"2021-07-20T01:33:07","modified_gmt":"2021-07-20T01:33:07","slug":"building-a-named-entity-recognition-model-using-a-bilstm-crf-network","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/07\/20\/building-a-named-entity-recognition-model-using-a-bilstm-crf-network\/","title":{"rendered":"Building a Named Entity Recognition model using a BiLSTM-CRF network"},"content":{"rendered":"<div>\n<p><em>In this blog post we present the Named Entity Recognition problem and show how a BiLSTM-CRF model can be fitted using a freely available annotated corpus and Keras. The model achieves relatively high accuracy and all data and code is freely available in the article.<\/em><\/p>\n<p>Named Entity Recognition (NER) is an NLP problem, which involves locating and classifying named entities (people, places, organizations etc.) mentioned in unstructured text. 
NER is used in many NLP applications, such as machine translation, information retrieval and chatbots.<\/p>\n<p>The categories that the named entities are classified into are predefined and often include locations, organisations, job types, personal names, times and others.<\/p>\n<p>An example of unstructured text presented to an NER system could be:<\/p>\n<p>\u201cPresident Joe Biden visits Europe in first presidential overseas trip\u201d<\/p>\n<p>After processing the input, an NER model could output something like this:<\/p>\n<p>[President]<sub>Title<\/sub> [Joe Biden]<sub>Name<\/sub> visits [Europe]<sub>Geography<\/sub> in first presidential overseas trip<\/p>\n<p>It follows from this example that the NER task can be broken down into two sub-tasks:<\/p>\n<ul>\n<li>first, we need to establish the boundaries of each entity (i.e. we need to determine which tokens of the input belong to it)<\/li>\n<li>second, we need to assign each entity to one of the predefined classes<\/li>\n<\/ul>\n<h2>Approaching a Named Entity Recognition (NER) problem<\/h2>\n<p>An NER problem can generally be approached in two different ways:<\/p>\n<ul>\n<li><strong>grammar-based techniques<\/strong> \u2013 This approach involves experienced linguists who manually define specific rules for entity recognition (e.g. if an entity name contains the token \u201cJohn\u201d it is a person, but if it also contains the token \u201cUniversity\u201d then it is an organisation). Such hand-crafted rules yield very high precision, but defining entity structures and capturing edge cases requires a tremendous amount of work. 
Another drawback is that keeping such a grammar-based system up to date requires constant, laborious manual intervention.<\/li>\n<li><strong>statistical model-based techniques<\/strong> \u2013 Using machine learning, we can streamline and simplify the process of building NER models, because this approach does not need a predefined exhaustive set of naming rules. The process of statistical learning can automatically extract such rules from a training dataset. Moreover, keeping the NER model up to date can also be performed in an automated fashion. The drawback of statistical model-based techniques is that the automated extraction of a comprehensive set of rules requires a large amount of labeled training data.<\/li>\n<\/ul>\n<h2>How to build a statistical Named Entity Recognition (NER) model<\/h2>\n<p>In this blog post we will focus on building a statistical NER model, using the freely available <a href=\"https:\/\/www.kaggle.com\/abhinavwalia95\/entity-annotated-corpus\/\">Annotated Corpus for Named Entity Recognition<\/a>. This dataset is based on the GMB (<a href=\"https:\/\/gmb.let.rug.nl\/\">Groningen Meaning Bank<\/a>) corpus, and has been tagged, annotated and built specifically to train a classifier to predict named entities such as name, location, etc. The tags used in the dataset follow the IOB format, which we cover in the next section.<\/p>\n<h3>The IOB format<\/h3>\n<p>Inside\u2013outside\u2013beginning (IOB) is a common format for tagging entities in computational linguistics, especially in an NER context. This scheme was initially proposed by Ramshaw and Marcus (1995), and the meaning of the IOB tags is as follows:<\/p>\n<ul>\n<li>the I-prefix indicates that the token is inside a chunk (e.g. 
a noun group, a verb group etc.)<\/li>\n<li>the O tag indicates that the token belongs to no chunk<\/li>\n<li>the B-prefix indicates that the token is at the beginning of a chunk; in the original scheme it is only used when a chunk immediately follows another chunk with no O tags between the two, whereas the dataset used in this post marks the first token of every chunk with a B tag<\/li>\n<\/ul>\n<p>The entity tags used in the sample dataset are as follows:<\/p>\n<figure class=\"wp-block-table aligncenter\">\n<table>\n<tbody>\n<tr>\n<td><strong>Tag<\/strong><\/td>\n<td><strong>Meaning<\/strong><\/td>\n<td><strong>Example<\/strong><\/td>\n<\/tr>\n<tr>\n<td>geo<\/td>\n<td>Geography<\/td>\n<td>Britain<\/td>\n<\/tr>\n<tr>\n<td>org<\/td>\n<td>Organisation<\/td>\n<td>IAEA<\/td>\n<\/tr>\n<tr>\n<td>per<\/td>\n<td>Person<\/td>\n<td>Thomas<\/td>\n<\/tr>\n<tr>\n<td>gpe<\/td>\n<td>Geopolitical Entity<\/td>\n<td>Pakistani<\/td>\n<\/tr>\n<tr>\n<td>tim<\/td>\n<td>Time<\/td>\n<td>Wednesday<\/td>\n<\/tr>\n<tr>\n<td>art<\/td>\n<td>Artifact<\/td>\n<td>Pentastar<\/td>\n<\/tr>\n<tr>\n<td>eve<\/td>\n<td>Event<\/td>\n<td>Armistice<\/td>\n<\/tr>\n<tr>\n<td>nat<\/td>\n<td>Natural Phenomenon<\/td>\n<td>H5N1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>The following example shows the application of IOB on the class labels:<\/p>\n<figure class=\"wp-block-table aligncenter\">\n<table>\n<tbody>\n<tr>\n<td><strong>Token<\/strong><\/td>\n<td><strong>Tag<\/strong><\/td>\n<td><strong>Meaning<\/strong><\/td>\n<\/tr>\n<tr>\n<td>George<\/td>\n<td>B-PER<\/td>\n<td>Beginning of a chunk (B tag), classified as Person<\/td>\n<\/tr>\n<tr>\n<td>is<\/td>\n<td>O<\/td>\n<td>token belongs to no chunk<\/td>\n<\/tr>\n<tr>\n<td>travelling<\/td>\n<td>O<\/td>\n<td>token belongs to no chunk<\/td>\n<\/tr>\n<tr>\n<td>to<\/td>\n<td>O<\/td>\n<td>token belongs to no chunk<\/td>\n<\/tr>\n<tr>\n<td>England<\/td>\n<td>B-GEO<\/td>\n<td>Beginning of a chunk (B tag), classified as Geography<\/td>\n<\/tr>\n<tr>\n<td>on<\/td>\n<td>O<\/td>\n<td>token belongs to no chunk<\/td>\n<\/tr>\n<tr>\n<td>Sunday<\/td>\n<td>B-TIM<\/td>\n<td>Beginning of a chunk, classified as 
Time<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h3>The CRF model<\/h3>\n<p>A conditional random field (CRF) is a statistical model well suited to NER problems, because it takes context into account. In other words, when a CRF model makes a prediction, it factors in the impact of neighbouring samples by modelling the prediction as a graphical model. For example, the linear-chain CRF is a popular type of CRF model, which assumes that the tag of the current word depends only on the tag of the previous word (this is somewhat similar to Hidden Markov Models, although a CRF is an undirected graphical model).<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/linear-crf-1024x367.png\" alt=\"A linear CRF is modelled as an undirected graph - there are n nodes representing the inputs, connected to n nodes above representing the outputs (y's). There are lateral undirected connections between the outputs but none between the inputs.\" class=\"wp-image-7609\" width=\"512\" height=\"184\"><figcaption>Figure 1: A simple linear-chain conditional random field model. The model takes an input sequence x (words) and a target sequence y (IOB tags)<\/figcaption><\/figure>\n<\/div>\n<p>One problem with linear-chain CRFs (Figure 1) is that they are capable of capturing the dependencies between labels in the forward direction only. If the model encounters an entity like \u201cJohns Hopkins University\u201d it will likely tag the Hopkins token as a name, because the model is \u201cblind\u201d to the University token that appears downstream. One way to resolve this challenge is to introduce a bidirectional LSTM (BiLSTM) network between the inputs (words) and the CRF. The bidirectional LSTM consists of two LSTM networks \u2013 one processes the input in the forward direction, and the other processes it in the backward direction. 
Combining the outputs of the two networks yields a context that provides information on samples surrounding each individual token. The output of the BiLSTM is then fed to a linear chain CRF, which can generate predictions using this improved context. This combination of CRF and BiLSTM is often referred to as a BiLSTM-CRF model (Lample et al 2016), and its architecture is shown in Figure 2.<\/p>\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/lstm-crf-1024x738.png\" alt=\"The lowest layer contains the inputs x1, x2, ..., xn. They are connected to nodes l1,l2, ..., ln above with lateral connections in the right direction (this is the left context). Above is a corresponding set of nodes r1,r2,...,rn with lateral connections pointing from right to left (this is the right context). Both the r and l nodes feed to a layer above with nodes c1,c2,...,cn that captures both the right and left context. The c-nodes then feed to the y1,y2,...,yn outputs. 
The c and y nodes represent a CRF, and the l and r nodes are the two LSTMs.\" class=\"wp-image-7610\" width=\"768\" height=\"554\"><figcaption>Figure 2 \u2013 Architecture of a BiLSTM-CRF model<\/figcaption><\/figure>\n<h3>Data exploration and preparation<\/h3>\n<p>We start by importing all the libraries needed for the ingestion, exploratory data analysis, and model building.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pickle\nimport operator \nimport re\nimport string\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nfrom plot_keras_history import plot_history\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import multilabel_confusion_matrix\nfrom keras_contrib.utils import save_load_utils\n\nfrom keras import layers \nfrom keras import optimizers\n\nfrom keras.models import Model\nfrom keras.models import Input\n\nfrom keras_contrib.layers import CRF\nfrom keras_contrib import losses\nfrom keras_contrib import metrics\n<\/pre>\n<\/div>\n<p>Next, we read and take a peek at the annotated dataset.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndata_df = pd.read_csv(\"dataset\/ner_dataset.csv\", encoding=\"iso-8859-1\", header=0)\ndata_df.head()\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"279\" height=\"183\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/ner_head.png\" alt=\"Table with the first 5 observations. 
The columns are Sentence #, Word, POS, Tag\" class=\"wp-image-7611\"><\/figure>\n<p>The meaning of the attributes is as follows:<\/p>\n<ul>\n<li>Sentence # \u2013 sentence ID<\/li>\n<li>Word \u2013 contains all words that form individual sentences<\/li>\n<li>POS \u2013 Part of Speech tag for each word, as defined in the <a href=\"https:\/\/www.sketchengine.eu\/penn-treebank-tagset\/\">Penn Treebank tagset<\/a><\/li>\n<li>Tag \u2013 IOB tag for each word<\/li>\n<\/ul>\n<p>Looking at the data, we see that the sentence ID is given only once per sentence (with the first word of the sentence), and the remaining values for the \u201cSentence #\u201d attribute are set to NaN. We will remedy this by repeating the ID for all remaining words, so that we can calculate meaningful statistics.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Forward-fill the sentence ID so that every word carries it\ndata_df = data_df.ffill()\n# Drop the 9-character \"Sentence:\" prefix and keep only the numeric ID\ndata_df[\"Sentence #\"] = data_df[\"Sentence #\"].apply(lambda s: s[9:])\ndata_df[\"Sentence #\"] = data_df[\"Sentence #\"].astype(\"int32\")\ndata_df.head()\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"279\" height=\"183\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/ner_head_fillna.png\" alt=\"The first 5 observations of the dataset. 
This is identical to the previous image, but has the missing values in the Sentence # column filled with the correct value.\" class=\"wp-image-7612\"><\/figure>\n<p>Now let\u2019s calculate some statistics about the data.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nprint(\"Total number of sentences in the dataset: {:,}\".format(data_df[\"Sentence #\"].nunique()))\nprint(\"Total words in the dataset: {:,}\".format(data_df.shape[0]))\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nTotal number of sentences in the dataset: 47,959\nTotal words in the dataset: 1,048,575\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndata_df[\"POS\"].value_counts().plot(kind=\"bar\", figsize=(10,5));\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"633\" height=\"331\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/ner_pos_barplot.png\" alt=\"A bar plot of the POS attribute.\" class=\"wp-image-7613\"><\/figure>\n<p>We notice that the top 5 parts of speech in the corpus are:<\/p>\n<ul>\n<li>NN \u2013 noun (e.g. table)<\/li>\n<li>NNP \u2013 proper noun (e.g. John)<\/li>\n<li>IN \u2013 preposition (e.g. in, of, like)<\/li>\n<li>DT \u2013 determiner (the)<\/li>\n<li>JJ \u2013 adjective (e.g. 
green)<\/li>\n<\/ul>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Plot the tag frequencies, leaving out the dominant O tag\ndata_df[data_df[\"Tag\"]!=\"O\"][\"Tag\"].value_counts().plot(kind=\"bar\", figsize=(10,5))\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"633\" height=\"331\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/ner_tags_barplot.png\" alt=\"Bar plot of the Tag attribute\" class=\"wp-image-7614\"><\/figure>\n<p>Based on the plot above we learn that the most frequent entity tags mark the beginning of a geography, time, organisation, or person.<\/p>\n<p>We can now look at the distribution of words per sentence.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nword_counts = data_df.groupby(\"Sentence #\")[\"Word\"].agg([\"count\"])\nword_counts = word_counts.rename(columns={\"count\": \"Word count\"})\nword_counts.hist(bins=50, figsize=(8,6));\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"500\" height=\"376\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/ner_wordcount_hist.png\" alt=\"Histogram of the word count per sentence.\" class=\"wp-image-7615\"><\/figure>\n<p>We see that the average sentence in the dataset contains about 21-22 words.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Select the column by name and cast to a plain Python int\nMAX_SENTENCE = int(word_counts[\"Word count\"].max())\nprint(\"Longest sentence in the corpus contains {} words.\".format(MAX_SENTENCE))\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nLongest sentence in the corpus contains 104 words.\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nlongest_sentence_id = word_counts[word_counts[\"Word 
count\"]==MAX_SENTENCE].index[0]\nprint(\"ID of the longest sentence is {}.\".format(longest_sentence_id))\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nID of the longest sentence is 22480.\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nlongest_sentence = data_df[data_df[\"Sentence #\"]==longest_sentence_id][\"Word\"].str.cat(sep=' ')\nprint(\"The longest sentence in the corpus is:\\n\")\nprint(longest_sentence)\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nThe longest sentence in the corpus is:\n\nFisheries in 2006 - 7 landed 1,26,976 metric tons , of which 82 % ( 1,04,586 tons ) was krill ( Euphausia superba ) and 9.5 % ( 12,027 tons ) Patagonian toothfish ( Dissostichus eleginoides - also known as Chilean sea bass ) , compared to 1,27,910 tons in 2005 - 6 of which 83 % ( 1,06,591 tons ) was krill and 9.7 % ( 12,396 tons ) Patagonian toothfish ( estimated fishing from the area covered by the Convention of the Conservation of Antarctic Marine Living Resources ( CCAMLR ) , which extends slightly beyond the Southern Ocean area ) .\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nall_words = list(set(data_df[\"Word\"].values))\nall_tags = list(set(data_df[\"Tag\"].values))\n\nprint(\"Number of unique words: {}\".format(data_df[\"Word\"].nunique()))\nprint(\"Number of unique tags : {}\".format(data_df[\"Tag\"].nunique()))\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nNumber of unique words: 35178\nNumber of unique tags : 17\n<\/pre>\n<\/div>\n<p>Now that we are slightly more familiar with the data, we can proceed with implementing the necessary 
feature engineering. The first step is to build a dictionary (word2index) that assigns a unique integer value to every word from the corpus. We also construct a reversed dictionary that maps indices to words (index2word).<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nword2index = {word: idx + 2 for idx, word in enumerate(all_words)}\nword2index[\"--UNKNOWN_WORD--\"]=0\nword2index[\"--PADDING--\"]=1\n\nindex2word = {idx: word for word, idx in word2index.items()}\n<\/pre>\n<\/div>\n<p>Let\u2019s look at the first 10 entries in the dictionary. Note that we have included 2 extra entries at the start \u2013 one for unknown words and one for padding.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfor k,v in sorted(word2index.items(), key=operator.itemgetter(1))[:10]:\n    print(k,v)\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n--UNKNOWN_WORD-- 0\n--PADDING-- 1\ntruck 2\n87.61 3\nHAMSAT 4\ngene 5\nNotre 6\nSamaraweera 7\nFrattini 8\nnine-member 9\n<\/pre>\n<\/div>\n<p>Let\u2019s confirm that the word-to-index and index-to-word mapping works as expected.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\ntest_word = \"Scotland\"\n\ntest_word_idx = word2index[test_word]\ntest_word_lookup = index2word[test_word_idx]\n\nprint(\"The index of the word {} is {}.\".format(test_word, test_word_idx))\nprint(\"The word with index {} is {}.\".format(test_word_idx, test_word_lookup))\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nThe index of the word Scotland is 15147.\nThe word with index 15147 is Scotland.\n<\/pre>\n<\/div>\n<p>Let\u2019s now build a similar dictionary for the various tags.<\/p>\n<div 
class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\ntag2index = {tag: idx + 1 for idx, tag in enumerate(all_tags)}\ntag2index[\"--PADDING--\"] = 0\n\nindex2tag = {idx: word for word, idx in tag2index.items()}\n<\/pre>\n<\/div>\n<p>Next, we write a custom function that will iterate over each sentence, and form a tuple consisting of each token, the part of speech the token represents, and its tag. We apply this function to the entire dataset and then see what the transformed version of the first sentence in the corpus looks like.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndef to_tuples(data):\n    iterator = zip(data[\"Word\"].values.tolist(),\n                   data[\"POS\"].values.tolist(),\n                   data[\"Tag\"].values.tolist())\n    \n    return [(word, pos, tag) for word, pos, tag in iterator]\n\nsentences = data_df.groupby(\"Sentence #\").apply(to_tuples).tolist()\n\nsentences[0]\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n[('Thousands', 'NNS', 'O'),\n ('of', 'IN', 'O'),\n ('demonstrators', 'NNS', 'O'),\n ('have', 'VBP', 'O'),\n ('marched', 'VBN', 'O'),\n ('through', 'IN', 'O'),\n ('London', 'NNP', 'B-geo'),\n ('to', 'TO', 'O'),\n ('protest', 'VB', 'O'),\n ('the', 'DT', 'O'),\n ('war', 'NN', 'O'),\n ('in', 'IN', 'O'),\n ('Iraq', 'NNP', 'B-geo'),\n ('and', 'CC', 'O'),\n ('demand', 'VB', 'O'),\n ('the', 'DT', 'O'),\n ('withdrawal', 'NN', 'O'),\n ('of', 'IN', 'O'),\n ('British', 'JJ', 'B-gpe'),\n ('troops', 'NNS', 'O'),\n ('from', 'IN', 'O'),\n ('that', 'DT', 'O'),\n ('country', 'NN', 'O'),\n ('.', '.', 'O')]\n<\/pre>\n<\/div>\n<p>We use this transformed dataset to extract the features (X) and labels (y) for the model. We can see what the first entries in X and y look like, after the two have been populated with words and tags. 
We can discard the part of speech data, as it is not needed for this specific implementation.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nX = [[word[0] for word in sentence] for sentence in sentences]\ny = [[word[2] for word in sentence] for sentence in sentences]\nprint(\"X[0]:\", X[0])\nprint(\"y[0]:\", y[0])\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nX[0]: ['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from', 'that', 'country', '.']\ny[0]: ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']\n<\/pre>\n<\/div>\n<p>We also need to replace each word with its corresponding index from the dictionary.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nX = [[word2index[word] for word in sentence] for sentence in X]\ny = [[tag2index[tag] for tag in sentence] for sentence in y]\nprint(\"X[0]:\", X[0])\nprint(\"y[0]:\", y[0])\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nX[0]: [19995, 10613, 3166, 12456, 20212, 9200, 27, 24381, 28637, 2438, 4123, 7420, 34783, 18714, 14183, 2438, 26166, 10613, 29344, 1617, 10068, 12996, 26619, 14571]\ny[0]: [17, 17, 17, 17, 17, 17, 4, 17, 17, 17, 17, 17, 4, 17, 17, 17, 17, 17, 13, 17, 17, 17, 17, 17]\n<\/pre>\n<\/div>\n<p>We see that the dataset has now been indexed. We also need to pad each sentence to the maximal sentence length in the corpus, as the LSTM model expects a fixed length input. 
This is where the extra \u201c--PADDING--\u201d key in the dictionary comes into play.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nX = [sentence + [word2index[\"--PADDING--\"]] * (MAX_SENTENCE - len(sentence)) for sentence in X]\ny = [sentence + [tag2index[\"--PADDING--\"]] * (MAX_SENTENCE - len(sentence)) for sentence in y]\nprint(\"X[0]:\", X[0])\nprint(\"y[0]:\", y[0])\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nX[0]: [19995, 10613, 3166, 12456, 20212, 9200, 27, 24381, 28637, 2438, 4123, 7420, 34783, 18714, 14183, 2438, 26166, 10613, 29344, 1617, 10068, 12996, 26619, 14571, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\ny[0]: [17, 17, 17, 17, 17, 17, 4, 17, 17, 17, 17, 17, 4, 17, 17, 17, 17, 17, 13, 17, 17, 17, 17, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n<\/pre>\n<\/div>\n<p>The last transformation we need to perform is to one-hot encode the labels:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nTAG_COUNT = len(tag2index)\ny = [np.eye(TAG_COUNT)[sentence] for sentence in y]\nprint(\"X[0]:\", X[0])\nprint(\"y[0]:\", y[0])\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nX[0]: [19995, 10613, 3166, 12456, 20212, 9200, 27, 24381, 28637, 2438, 4123, 7420, 34783, 18714, 14183, 2438, 26166, 10613, 29344, 1617, 10068, 12996, 26619, 14571, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\ny[0]: [[0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0. 1.]\n [0. 0. 0. ... 0. 0. 1.]\n ...\n [1. 0. 0. ... 0. 0. 0.]\n [1. 0. 0. ... 0. 0. 0.]\n [1. 0. 0. ... 0. 0. 0.]]\n<\/pre>\n<\/div>\n<p>Finally, we split the resulting dataset into a training and hold-out set, so that we can measure the performance of the classifier on unseen data.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1234)\n\nprint(\"Number of sentences in the training dataset: {}\".format(len(X_train)))\nprint(\"Number of sentences in the test dataset    : {}\".format(len(X_test)))\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nNumber of sentences in the training dataset: 43163\nNumber of sentences in the test dataset    : 4796\n<\/pre>\n<\/div>\n<p>We can also convert everything into NumPy arrays, as this makes feeding the data to the model simpler.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nX_train = np.array(X_train)\nX_test = np.array(X_test)\ny_train = np.array(y_train)\ny_test = np.array(y_test)\n<\/pre>\n<\/div>\n<h3>Modelling<\/h3>\n<p>We start by calculating the size of the vocabulary (WORD_COUNT), which includes the two extra entries for unknown words and padding. 
We also set the following model hyperparameters:<\/p>\n<ul>\n<li>DENSE_EMBEDDING \u2013 Dimension of the dense embedding<\/li>\n<li>LSTM_UNITS \u2013 Dimensionality of the LSTM output space<\/li>\n<li>LSTM_DROPOUT \u2013 Fraction of the LSTM units to drop for the linear transformation of the recurrent state<\/li>\n<li>DENSE_UNITS \u2013 Number of fully connected units for each temporal slice<\/li>\n<li>BATCH_SIZE \u2013 Number of samples in a training batch<\/li>\n<li>MAX_EPOCHS \u2013 Maximum number of training epochs<\/li>\n<\/ul>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nWORD_COUNT = len(index2word)\n\nDENSE_EMBEDDING = 50\nLSTM_UNITS = 50\nLSTM_DROPOUT = 0.1\nDENSE_UNITS = 100\nBATCH_SIZE = 256\nMAX_EPOCHS = 5\n<\/pre>\n<\/div>\n<p>We proceed by defining the architecture of the model. We add an input layer, an embedding layer (to transform the indexes into dense vectors), a bidirectional LSTM layer, and a time-distributed layer (to apply the dense output layer to each temporal slice). We then pipe this to a CRF layer, and finally construct the model by defining its input as the input layer and its output as the output of the CRF layer.<\/p>\n<p>We also set a loss function (for linear-chain conditional random fields this is simply the negative log-likelihood) and specify \u201caccuracy\u201d as the metric that we\u2019ll be monitoring. 
The optimiser is set to Adam (Kingma and Ba, 2015) with a learning rate of 0.001.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\ninput_layer = layers.Input(shape=(MAX_SENTENCE,))\nmodel = layers.Embedding(WORD_COUNT, DENSE_EMBEDDING, embeddings_initializer=\"uniform\", input_length=MAX_SENTENCE)(input_layer)\nmodel = layers.Bidirectional(layers.LSTM(LSTM_UNITS, recurrent_dropout=LSTM_DROPOUT, return_sequences=True))(model)\nmodel = layers.TimeDistributed(layers.Dense(DENSE_UNITS, activation=\"relu\"))(model)\n\ncrf_layer = CRF(units=TAG_COUNT)\noutput_layer = crf_layer(model)\n\nner_model = Model(input_layer, output_layer)\n\nloss = losses.crf_loss\nacc_metric = metrics.crf_accuracy\nopt = optimizers.Adam(lr=0.001)\n\nner_model.compile(optimizer=opt, loss=loss, metrics=[acc_metric])\n\nner_model.summary()\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nModel: \"model_1\"\n_________________________________________________________________\nLayer (type)                 Output Shape              Param #   \n=================================================================\ninput_1 (InputLayer)         (None, 104)               0         \n_________________________________________________________________\nembedding_1 (Embedding)      (None, 104, 50)           1759000   \n_________________________________________________________________\nbidirectional_1 (Bidirection (None, 104, 100)          40400     \n_________________________________________________________________\ntime_distributed_1 (TimeDist (None, 104, 100)          10100     \n_________________________________________________________________\ncrf_1 (CRF)                  (None, 104, 18)           2178      \n=================================================================\nTotal params: 1,811,678\nTrainable params: 1,811,678\nNon-trainable params: 
0\n_________________________________________________________________\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nhistory = ner_model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=MAX_EPOCHS, validation_split=0.1, verbose=2)\n<\/pre>\n<\/div>\n<p>Our model has 1.8 million parameters, so training is expected to take a while.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nTrain on 38846 samples, validate on 4317 samples\nEpoch 1\/5\n - 117s - loss: 0.4906 - crf_accuracy: 0.8804 - val_loss: 0.1613 - val_crf_accuracy: 0.9666\nEpoch 2\/5\n - 115s - loss: 0.1438 - crf_accuracy: 0.9673 - val_loss: 0.1042 - val_crf_accuracy: 0.9679\nEpoch 3\/5\n - 115s - loss: 0.0746 - crf_accuracy: 0.9765 - val_loss: 0.0579 - val_crf_accuracy: 0.9825\nEpoch 4\/5\n - 115s - loss: 0.0451 - crf_accuracy: 0.9868 - val_loss: 0.0390 - val_crf_accuracy: 0.9889\nEpoch 5\/5\n - 115s - loss: 0.0314 - crf_accuracy: 0.9909 - val_loss: 0.0316 - val_crf_accuracy: 0.9908\n<\/pre>\n<\/div>\n<h3>Evaluation and testing<\/h3>\n<p>We can inspect the loss and accuracy plots from the model training. They both look acceptable and it doesn\u2019t appear that the model is overfitting. The model training could definitely benefit from some hyperparameter optimisation, but this type of fine-tuning is out of scope for this post.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nplot_history(history.history)\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"750\" height=\"366\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/07\/ner_history.png\" alt=\"Training loss and accuracy of the model over the training epochs. Last train loss value: 0.0314, last test loss value: 0.0316. Last train accuracy value: 0.99. 
Last test accuracy value: 0.99\" class=\"wp-image-7616\"><\/figure>\n<p>We can also test how well the model generalises by measuring the prediction accuracy on the hold-out set.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\ny_pred = ner_model.predict(X_test)\ny_pred = np.argmax(y_pred, axis=2)\n\ny_test = np.argmax(y_test, axis=2)\n\naccuracy = (y_pred == y_test).mean()\n\nprint(\"Accuracy: {:.4f}\\n\".format(accuracy))\n<\/pre>\n<\/div>\n<p>The model appears to be doing quite well; however, this is slightly misleading. This is a highly imbalanced dataset because of the very high number of O-tags present in the training and test data, and there is further imbalance among the remaining tag classes. A better approach is to construct a confusion matrix for each tag and judge the model performance based on those. We can write a simple Python function to assist with inspecting the confusion matrices for individual tags. 
We use two randomly selected tags to give us a sense of what the confusion matrices for individual tags would look like.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\ndef tag_conf_matrix(cm, tagid):\n    tag_name = index2tag[tagid]\n    print(\"Tag name: {}\".format(tag_name))\n    print(cm[tagid])\n    tn, fp, fn, tp = cm[tagid].ravel()\n    tag_acc = (tp + tn) \/ (tn + fp + fn + tp)\n    print(\"Tag accuracy: {:.3f} \\n\".format(tag_acc))\n\nmatrix = multilabel_confusion_matrix(y_test.flatten(), y_pred.flatten())\n\ntag_conf_matrix(matrix, 8)\ntag_conf_matrix(matrix, 14)\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nTag name: B-per\n[[496974    185]\n [   441   1184]]\nTag accuracy: 0.999 \n\nTag name: I-art\n[[498750      0]\n [    34      0]]\nTag accuracy: 1.000 \n<\/pre>\n<\/div>\n<p>Note that the I-art matrix contains no true positives: all 34 I-art tokens in the test set are misclassified, yet the tag accuracy still rounds to 1.000 because of the very large number of true negatives. Per-tag accuracy therefore suffers from the same imbalance, and the raw counts in each matrix are more informative than the accuracy figure.<\/p>\n<p>Finally, we run a manual test by constructing a sample sentence and getting predictions for the detected entities. We tokenize, pad, and convert all words to indices. Then we call the model and print the predicted tags. 
The sentence we use for our test is \u201cPresident Obama became the first sitting American president to visit Hiroshima\u201d.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nsentence = \"President Obama became the first sitting American president to visit Hiroshima\"\n\nre_tok = re.compile(f\"([{string.punctuation}\u201c\u201d\u00a8\u00ab\u00bb\u00ae\u00b4\u00b7\u00ba\u00bd\u00be\u00bf\u00a1\u00a7\u00a3\u20a4\u2018\u2019])\")\nsentence = re_tok.sub(r' \\1 ', sentence).split()\n\npadded_sentence = sentence + [word2index[\"--PADDING--\"]] * (MAX_SENTENCE - len(sentence))\npadded_sentence = [word2index.get(w, 0) for w in padded_sentence]\n\npred = ner_model.predict(np.array([padded_sentence]))\npred = np.argmax(pred, axis=-1)\n\nretval = \"\"\nfor w, p in zip(sentence, pred[0]):\n    retval = retval + \"{:15}: {:5}\".format(w, index2tag[p]) + \"\\n\"\n\nprint(retval)\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nPresident      : B-per\nObama          : I-per\nbecame         : O    \nthe            : O    \nfirst          : O    \nsitting        : O    \nAmerican       : B-gpe\npresident      : O    \nto             : O    \nvisit          : O    \nHiroshima      : B-geo\n<\/pre>\n<\/div>\n<p>In this blog post we covered use cases and challenges around Named Entity Recognition, and we presented a possible solution using a BiLSTM-CRF model. 
The fitted model reaches roughly 99% token-level accuracy on the validation data, although this figure is inflated by the prevalence of O-tags and should be read alongside the per-tag confusion matrices.<\/p>\n<p>Ramshaw and Marcus, Text Chunking using Transformation-Based Learning, 1995, <a href=\"https:\/\/arxiv.org\/abs\/cmp-lg\/9505040\">arXiv:cmp-lg\/9505040<\/a>.<\/p>\n<p>Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer, Neural Architectures for Named Entity Recognition, 2016, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260-270, <a href=\"https:\/\/www.aclweb.org\/anthology\/N16-1030\/\">https:\/\/www.aclweb.org\/anthology\/N16-1030\/<\/a>.<\/p>\n<p>Diederik P. Kingma and Jimmy Ba, Adam: A Method for Stochastic Optimization, 3rd International Conference for Learning Representations, San Diego, 2015, <a href=\"https:\/\/arxiv.org\/abs\/1412.6980\">https:\/\/arxiv.org\/abs\/1412.6980<\/a>.<\/p>\n
<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/blog.dominodatalab.com\/named-entity-recognition-ner-challenges-and-model\/<\/p>\n","protected":false},"author":0,"featured_media":8376,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8375"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8375"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8375\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8376"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8375"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8375"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8375"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}