{"id":8135,"date":"2021-03-08T04:35:23","date_gmt":"2021-03-08T04:35:23","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/03\/08\/seq2seq-models-and-simple-attention-mechanism-backbones-of-nlp-tasks\/"},"modified":"2021-03-08T04:35:23","modified_gmt":"2021-03-08T04:35:23","slug":"seq2seq-models-and-simple-attention-mechanism-backbones-of-nlp-tasks","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/03\/08\/seq2seq-models-and-simple-attention-mechanism-backbones-of-nlp-tasks\/","title":{"rendered":"Seq2seq models and simple attention mechanism: backbones of NLP tasks"},"content":{"rendered":"<div>\n<p>This is the second article of my article series <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/12\/30\/transformer\/\">\u201cInstructions on Transformer for people outside NLP field, but with examples of NLP.\u201d<\/a><\/p>\n<h3>1 Machine translation and seq2seq models<\/h3>\n<p>I think machine translation is one of the most iconic and commercialized tasks of NLP. With modern machine translation you can translate relatively complicated sentences, if you tolerate some grammatical errors. As I mentioned in the third article of my series on RNN, research on machine translation had already started in the early 1950s, and its focus was translation between English and Russian, highly motivated by the Cold War. In the initial phase, machine translation was rule-based, much like how most students learn in their foreign language classes. Researchers just implemented a lot of rules for translations. In the next phase, machine translation was statistics-based, and it achieved better performance by using statistics for constructing sentences. At any rate, both of them relied heavily on feature engineering, I mean, you needed to consider numerous rules of translation and manually implement them. After those endeavors of machine translation, neural machine translation appeared. 
The advent of neural machine translation was an earthshaking change in the machine translation field. Neural machine translation soon outperformed the conventional techniques, and it is still state of the art. Some of you might have felt that machine translation became more or less reliable around that time.<\/p>\n<div id=\"attachment_5453\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-5453\" loading=\"lazy\" class=\"wp-image-5453 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/Latin_lesson-1030x625.png\" alt=\"\" width=\"1030\" height=\"625\"><\/p>\n<p id=\"caption-attachment-5453\" class=\"wp-caption-text\">Source: Monty Python\u2019s Life of Brian (1979)<\/p>\n<\/div>\n<p>I think you have learnt at least one foreign or classical language in school. I don\u2019t know how good you were at the classes, but I think you had to learn some conjugations, and I believe that was tiresome for most students. For example, as a foreigner, I still cannot use \u201cder\u201d, \u201cdie\u201d, \u201cdas\u201d properly. Some of my friends recommended that I not worry about them for the time being while I speak, but I usually care about grammar very much. But this method of learning a language is close to rule-based machine translation, and modern neural machine translation basically does not rely on such rules.<\/p>\n<p>As far as I understand, machine translation is pattern recognition learned from a large corpus. Basically no one explicitly teaches computers how grammar works. Machine translation learns a very complicated mapping from a source language to a target language, based on a lot of examples of word or sentence pairs. I am not sure, but this might be close to how bilingual kids learn how the two languages are related. 
You do not need to guide the translator to learn specific grammatical rules.<\/p>\n<div id=\"attachment_5452\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-5452\" loading=\"lazy\" class=\"wp-image-5452 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/Italian_lesson_2-1030x329.png\" alt=\"\" width=\"1030\" height=\"329\"><\/p>\n<p id=\"caption-attachment-5452\" class=\"wp-caption-text\">Source: Monty Python\u2019s Flying Circus (1969)<\/p>\n<\/div>\n<p>Since machine translation does not rely on manually programming grammatical rules, basically you do not need to prepare another specific network architecture for another pair of languages. The same method can be applied to any pair of languages, as long as you have a large enough corpus for that pair. You do not have to think about translation rules between other pairs of languages.<\/p>\n<div id=\"attachment_5451\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-5451\" loading=\"lazy\" class=\"wp-image-5451 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/Italian_lesson-1030x336.png\" alt=\"\" width=\"1030\" height=\"336\"><\/p>\n<p id=\"caption-attachment-5451\" class=\"wp-caption-text\">Source: Monty Python\u2019s Flying Circus (1969)<\/p>\n<\/div>\n<p>*I do not follow the cutting edge studies on machine translation, so I am not sure, but I guess there are some heuristic methods for machine translation. That is, designing a network depending on the pair of languages could be effective. When it comes to grammatical word order, English and Japanese have totally different structures, I mean English is basically SVO and Japanese is basically SOV. In many cases, the structures of sentences with the same meaning in the two languages are almost like reflections in a mirror. A lot of languages have similar structures to English, even in Asia, for example Chinese. 
On the other hand, relatively few languages, for example Korean and Turkish, have Japanese-like structures. I guess there would be some grammatical-structure-aware machine translation networks.<\/p>\n<p>Not only machine translation, but also several other NLP tasks, such as summarization and question answering, use a model named <em><strong>seq2seq model (sequence to sequence model)<\/strong><\/em>. Like many other deep learning models, seq2seq models are composed of an encoder and a decoder. In the case of seq2seq models, you use RNNs in both the encoder and decoder parts. For the RNN cells, you usually use a gated RNN such as LSTM or GRU because simple RNNs would suffer from the vanishing gradient problem when inputs or outputs are long, and those in translation tasks are long enough. In the encoder part, you just pass input sentences. To be exact, you input them from the first time step to the last time step, every time giving an output, and passing information to the next cell via recurrent connections.<\/p>\n<p>*I think you would be confused without some understanding of how RNNs propagate forward. You do not need to understand this part that much if you just want to learn Transformer. In order to learn the Transformer model, the attention mechanism, which I explain in the next section, is more important. If you want to know how basic RNNs work, <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/06\/17\/simple-rnn-the-first-foothold-for-understanding-lstm\/\">an article of mine<\/a> should help you.<\/p>\n<p>*In the encoder part of the figure below, the cells also propagate information backward. I assumed an encoder part with bidirectional RNNs, and they \u201cforward propagate\u201d information backwards. But in the code below, we do not consider such a complex situation. 
Please just keep in mind that a seq2seq model could use bidirectional RNNs.<\/p>\n<p>At the last time step in the encoder part, you pass the hidden state of the RNN to the decoder part, which I show as a yellow cell in the figure below, and the yellow cell\/layer is the initial hidden layer of the first RNN cell of the decoder part. Just like normal RNNs, the decoder part starts giving out outputs, and passing information via recurrent connections. At every time step you choose a token to give out from the vocabulary you use in the task. That means, each cell of the decoder RNN does a classification task and decides which word to write out at the time step. Also, very importantly, in the decoder part, the output at one time step is the input at the next time step, as I show as dotted lines in the figure below.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5450 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/general_seq2seq-1030x531.png\" alt=\"\" width=\"907\" height=\"468\"><\/p>\n<p>*The translation algorithm I explained depends on greedy decoding, which has to decide a token at every time step. However, it is easy to imagine that that is not how you translate a sentence. You usually erase the earlier words or you construct some possibilities in your mind. Actually, for better translations you would need decoding strategies such as beam search, but it is out of the scope of at least this article. Thus we are going to make a very simplified translator based on greedy decoding.<\/p>\n<h3>2 Learning by making<\/h3>\n<p>*It would take some hours on your computer to train <a href=\"https:\/\/github.com\/YasuThompson\/Transformer_blog_codes\/blob\/main\/rnn_translation_attention_modified.ipynb\">the translator<\/a> if you do not use a GPU. 
I recommend you start running it first and continue reading this article in the meantime.<\/p>\n<p>Seq2seq models do not have very complicated structures, and for now you just need to understand the points I mentioned above. Rather than just formulating the models, I think it would be better to understand this model by actually writing code. If you copy and paste the code from <a href=\"https:\/\/github.com\/YasuThompson\/Transformer_blog_codes\/blob\/main\/rnn_translation_attention_modified.ipynb\">this GitHub page<\/a> or <a href=\"https:\/\/www.tensorflow.org\/tutorials\/text\/nmt_with_attention\">the official TensorFlow tutorial<\/a> and install the necessary libraries, it will start training a seq2seq model for a Spanish-English translator. In <a href=\"https:\/\/github.com\/YasuThompson\/Transformer_blog_codes\/blob\/main\/rnn_translation_attention_modified.ipynb\">the GitHub page<\/a>, I just added comments to the code in the official tutorial so that it is more understandable. If you can understand the code in the tutorial without difficulty, I have to say this article itself is not suited to your level. Otherwise, I am going to help you understand the tutorial with my original figures. I made this article so that it would help you read the next article. If you have no idea what an RNN is, at least the second article of my RNN series should be helpful to some extent.<\/p>\n<p>*If you try to read the whole article series of mine on RNN, I think you should get prepared. I mean, you should prepare some pieces of paper and a pen. It would be nice if you have a stock of coffee and snacks. 
Though I do not think you have to do that to read this article.<\/p>\n<h4>2.1 The corpus and datasets<\/h4>\n<p>In the code in <a href=\"https:\/\/github.com\/YasuThompson\/Transformer_blog_codes\/blob\/main\/rnn_translation_attention_modified.ipynb\">the GitHub page<\/a>, please ignore the part sandwiched by \u201c######\u201d.\u00a0 Handling language data is not the focus of this article. All you have to know is that the code below first creates datasets from the Spanish-English corpus in <a href=\"http:\/\/www.manythings.org\/anki\/\">http:\/\/www.manythings.org\/anki\/<\/a> , and you get datasets for training the translator as the tensors below.<\/p>\n<p>Each token is encoded as an integer, as in the code below; thus after encoding, the Spanish sentence \u201cTodo sobre mi madre.\u201d is [1, 74, 514, 19, 237, 3, 2].<\/p>\n<h4>2.2 The encoder<\/h4>\n<p>The encoder part is relatively simple. All you have to keep in mind is that you put in input sentences, and pass the hidden layer of the last cell to the decoder part. To be more concrete, an RNN cell receives an input word at every time step, and gives out an output vector at each time step, passing hidden states to the next cell. You make a chain of RNN cells by this process, like in the figure below. In this case \u201ctime steps\u201d means the indexes of the order of the words. If you more or less understand how RNNs work, I think this is nothing difficult. The encoder part passes the hidden state, which is in yellow in the figure below, to the decoder part.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5449 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/rnn_translation_encoder-1030x423.png\" alt=\"\" width=\"813\" height=\"334\"><\/p>\n<p>Let\u2019s see how encoders are implemented in the code below. We use a type of RNN named GRU (Gated Recurrent Unit). GRU is simpler than LSTM (Long Short-Term Memory). 
One GRU cell gets an input at every time step, and passes one hidden state via recurrent connections. Like LSTM, GRU is a gated RNN, so it can mitigate vanishing gradient problems. GRU was invented after LSTM to reduce computation costs. At time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\"> one GRU cell gets an input <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25bd838f69c67f952fda444255f1db1d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{x}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"26\"> and passes its hidden state\/vector <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-752de8888e722fbdf743becd356428e4_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{h}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"26\"> to the next cell like the figure below. But in the implementation, you put in the whole input sentence as a 16-dimensional vector whose elements are integers, as you saw in the figure in the last subsection 2.1. That means, the \u2018Encoder\u2019 class in the implementation below makes a chain of 16 GRU cells every time you put in an input sentence in Spanish, even if the input sentence has fewer than 16 tokens.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5448 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/GRU_cell-1030x399.png\" alt=\"\" width=\"559\" height=\"212\"><\/p>\n<p>*<strong>TO BE\u00a0 VERY HONEST<\/strong>, I am not sure why the encoder part of\u00a0 seq2seq models is implemented this way in the code below. 
In the implementation below, the total number of time steps in the encoder part is fixed to 16. If input sentences have fewer than 16 tokens, it seems the RNN cells get no inputs after the time step of the token \u201c&lt;end&gt;\u201d. As far as I could check, if RNN cells get no inputs, they keep giving out similar 1024-d vectors. I think in this implementation, RNN cells after the &lt;end&gt; token, which I showed as the dotted RNN cells in the figure above, do not change the hidden state much. And the encoder part passes the hidden state of the 16th RNN cell, which is in yellow, to the decoder.<\/p>\n<h4>2.3 The decoder<\/h4>\n<p>The decoder part is also not that hard to understand. As I briefly explained in the last section, you initialize the first cell of the decoder, using the hidden layer of the last cell of the encoder. During decoding, I mean while writing a translation, at the beginning you put the token \u201c&lt;start&gt;\u201d as the first input of the decoder. Given the input \u201c&lt;start&gt;\u201d, the first cell outputs \u201call\u201d in the example in the figure below, and the output \u201call\u201d is the input of the next cell. The output of the next cell \u201cabout\u201d is also passed to the next cell, and you repeat this till the decoder gives out the token \u201c&lt;end&gt;\u201d.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-5447 \" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/rnn_decoder-1030x655.png\" alt=\"\" width=\"681\" height=\"433\"><\/p>\n<p>A more important point is how to get losses in the decoder part during training. We use a technique named teacher forcing when training the decoder part of a seq2seq model. This is also quite simple: you just have to make sure you input the correct answer to the RNN cells, regardless of the outputs generated by the cell at the last time step. 
You force the decoder to get the correct input at every time step, and that is what teacher forcing is all about.<\/p>\n<p>You can see how the decoder part and teacher forcing are implemented in the code below. You have to keep in mind that unlike the \u2018Encoder\u2019 class, you put a token into the \u2018Decoder\u2019 class at every time step. To be exact, you also need the outputs of the encoder part to calculate attentions in the decoder part. I am going to explain that in the next subsection.<\/p>\n<h4>2.4 Attention mechanism<\/h4>\n<p>I think you have learned at least one foreign language, and usually you have to translate some sentences. Remember the process of writing a translation of a sentence in another language. Imagine that you are about to write a new word after writing some. If you are not used to translations in the language, you must have cared about which parts of the original sentence correspond to the very new word you are going to write. You have to pay \u201cattention\u201d to the original sentence. This is what the attention mechanism is all about.<\/p>\n<p>*I would like you to pay \u201cattention\u201d to this section. As you can see from the fact that the original paper on the Transformer model is named \u201cAttention Is All You Need,\u201d the attention mechanism is a crucial idea of Transformer.<\/p>\n<p>In the decoder part you initialize the hidden layer with the last hidden layer of the encoder, and its first input is \u201c&lt;start&gt;\u201d.\u00a0 The decoder part starts decoding, as I explained in the last subsection. If you use an attention mechanism in the seq2seq model, you calculate attentions at every time step.\u00a0 Let\u2019s consider an example in the figure below, where the next input in the decoder is \u201cmy\u201d, and given the token \u201cmy\u201d, the GRU cell calculates a hidden state at the time step. 
The hidden state is the \u201cquery\u201d in this case, and you compare the \u201cquery\u201d with the 6 outputs of the encoder, which are \u201ckeys\u201d. You get weights\/scores, I mean \u201cattentions\u201d, which are shown as the histogram in the figure below.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5444 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/rnn_attention_mechanism_2-1030x727.png\" alt=\"\" width=\"884\" height=\"624\">And you reweight the \u201cvalues\u201d with the weights in the histogram. In this case the \u201cvalues\u201d are the outputs of the encoder themselves. You use the reweighted \u201cvalues\u201d to calculate the hidden state of the decoder at the time step again. And you use the hidden state updated by the attentions to predict the next word.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5443 size-large aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/rnn_attention_mechanism-1030x493.png\" alt=\"\" width=\"1030\" height=\"493\"><\/p>\n<p>*In the implementation, however, the size of the output of the \u2018Encoder\u2019 class is always (16, 1024). You calculate attentions for all those 16 output vectors, but virtually only the first 6 1024-d output vectors are important.<\/p>\n<p>Summing up the points I have explained, you compare the \u201cquery\u201d with the \u201ckeys\u201d and get scores\/weights for the \u201cvalues.\u201d Each score\/weight is, in short, the relevance between the \u201cquery\u201d and each \u201ckey\u201d. And you reweight the \u201cvalues\u201d with the scores\/weights.\u00a0 In the case of the attention mechanism in this article, we can say that \u201cvalues\u201d and \u201ckeys\u201d are the same. 
You would also see that more clearly in the implementation below.<\/p>\n<p>You especially have to pay attention to the terms \u201cquery\u201d, \u201ckey\u201d, and \u201cvalue.\u201d \u201cKeys\u201d and \u201cvalues\u201d are basically in the same language, and in the case above, they are in Spanish. \u201cQueries\u201d and \u201ckeys\u201d can be in either different languages or the same language. In the example above, the \u201cquery\u201d is in English, and the \u201ckeys\u201d are in Spanish.<\/p>\n<p>You can compare a \u201cquery\u201d with \u201ckeys\u201d in various ways. The implementation uses the one called\u00a0 Bahdanau\u2019s additive style, and in Transformer, you use more straightforward ways. You do not have to care about how Bahdanau\u2019s additive style calculates those attentions. It is much more important to learn the relations of \u201cqueries\u201d, \u201ckeys\u201d, and \u201cvalues\u201d for now.<\/p>\n<p>*A problem is that Bahdanau\u2019s additive style is slightly different from the figure above. It seems that in Bahdanau\u2019s additive style, at the time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\"> in the decoder part, the query is the hidden state at the time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25d023a88c1d1a280c7bb696a442216a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t-1)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"49\">. 
You would notice that if you look closely at the implementation below. As you can see in the figure above, you have to calculate the hidden state of the decoder cell two times at the time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\">: first in order to generate a \u201cquery\u201d, second in order to predict the translated word at the time step. That would not be so computationally efficient, and I guess that is why Bahdanau\u2019s additive style uses the hidden layer at the last time step as a query rather than calculating hidden layers twice.<\/p>\n<h4>2.5 Translating and displaying attentions<\/h4>\n<p>After training the translator for 20 epochs, I could translate Spanish sentences, and the implementation also displays attention scores between the input and output sentences. For example the translations of the inputs \u201cTodo sobre mi madre.\u201d and \u201cHable con ella.\u201d were \u201call about my mother .\u201d and \u201ci talked to her .\u201d respectively, and the results seem fine. One powerful advantage of using an attention mechanism is that you can display this type of word alignment, I mean correspondences of words between sentences, easily as in the heat maps below. 
The yellow parts show high attention scores, and you can see that the distributions of relatively high scores are more or less diagonal, which implies that English and Spanish have similar word orders.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5445 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/almodovar_1-e1611837659493-1030x575.png\" alt=\"\" width=\"797\" height=\"445\"><\/p>\n<p>For other inputs like \u201cMujeres al borde de un ataque de nervios.\u201d or \u201cVolver.\u201d, the translations are not good.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5446 size-large aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/02\/almodovar_2-e1611837723402-1030x575.png\" alt=\"\" width=\"1030\" height=\"575\"><\/p>\n<p>You might have noticed there is one big problem in this implementation: you can use only the words that appeared in the corpus. And actually I had to manually add some pairs of sentences with the word \u201cborde\u201d to the corpus to get the translation in the figure.<\/p>\n<p><em>* I make study materials on<\/em><em> machine learning, sponsored by <a href=\"https:\/\/www.datanomiq.de\/\">DATANOMIQ<\/a>. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). 
And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.<\/em><\/p>\n<h4><\/h4>\n<div id=\"author-bio-box\">\n<h3><a href=\"https:\/\/data-science-blog.com\/en\/blog\/author\/yasuto\/\" title=\"All posts by Yasuto Tamura\" rel=\"author\">Yasuto Tamura<\/a><\/h3>\n<div class=\"bio-gravatar\"><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/03\/yasuto-tamura-80x80.png\" width=\"70\" height=\"70\" alt=\"Yasuto Tamura\" class=\"avatar avatar-70 wp-user-avatar wp-user-avatar-70 alignnone photo\"><\/div>\n<p><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"http:\/\/www.datanomiq.de\" class=\"bio-icon bio-icon-website\"><\/a><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"https:\/\/www.linkedin.com\/in\/yasuto-tamura-7689b418b\/\" class=\"bio-icon bio-icon-linkedin\"><\/a><\/p>\n<p class=\"bio-description\">Data Science Intern at <a href=\"http:\/\/www.datanomiq.io\">DATANOMIQ<\/a>.<br \/>\nMajoring in computer science. Currently studying mathematical sides of deep learning, such as densely connected layers, CNN, RNN, autoencoders, and making study materials on them. 
Also started aiming at Bayesian deep learning algorithms.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/data-science-blog.com\/en\/blog\/2021\/02\/17\/sequence-to-sequence-models-back-bones-of-various-nlp-tasks\/<\/p>\n","protected":false},"author":0,"featured_media":8136,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8135"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8135"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8135\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8136"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8135"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}