{"id":553,"date":"2020-08-21T11:48:17","date_gmt":"2020-08-21T11:48:17","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/21\/understanding-lstm-forward-propagation-in-two-ways\/"},"modified":"2020-08-21T11:48:17","modified_gmt":"2020-08-21T11:48:17","slug":"understanding-lstm-forward-propagation-in-two-ways","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/21\/understanding-lstm-forward-propagation-in-two-ways\/","title":{"rendered":"Understanding LSTM forward propagation in two ways"},"content":{"rendered":"<div>\n<p>*This article is only for the sake of understanding the equations on the second page of the paper named<a href=\"https:\/\/arxiv.org\/abs\/1503.04069\"> \u201cLSTM: A Search Space Odyssey\u201d.<\/a> If you have no trouble understanding the equations of LSTM forward propagation, I recommend you skip this article and go to the next article.<\/p>\n<h3><strong>1. Preface<\/strong><\/h3>\n<p>I heard that in Western culture, smart people write textbooks so that other normal people can understand difficult stuff, which is why textbooks in Western countries tend to be bulky, but also not as difficult as they look. In Asian culture, on the other hand, smart people write puzzling texts on esoteric topics, and normal people have to struggle to understand what these noble people wanted to say. Publishers also require the authors to keep the texts as short as possible, so even though the textbooks are thin, students usually have to read them several times because they are too abstract.<\/p>\n<p>Both styles have pros and cons, and usually I prefer Japanese textbooks because they are concise; sometimes it is annoying to read long Western-style texts that pile up concrete, straightforward examples to reach one conclusion. But the problem is that when it comes to explaining LSTM, almost all the textbooks are like the Asian-style ones. 
Every study material seems to skip the proper steps necessary for \u201cnormal people\u201d to understand its algorithm. But after actually making concrete slides on the mathematics of LSTM, I understood why: if you write down all the equations of LSTM forward\/back propagation, the result is massive, and I actually had to make 100 pages of animated PowerPoint slides to make it understandable to people like me.<\/p>\n<p>I already had a nagging feeling: \u201cdoes it really help to understand only LSTM with this precision? I should do more practical coding.\u201d For example, <span class=\"st\">Fran\u00e7ois Chollet, the developer of Keras, says the following in his book.<\/span><img loading=\"lazy\" class=\"alignnone wp-image-5117 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/shut_up_and_code-1030x316.png\" alt=\"\" width=\"1030\" height=\"316\"><\/p>\n<p>For me that sounds like \u201cWe have already implemented RNNs for you, so just shut up and use Tensorflow\/Keras.\u201d Indeed, I have never cared about the architecture of my MacBook Air, but I just use it every day, so I think he has a point. To make matters worse, for me, a promising algorithm called <em><strong>Transformer<\/strong><\/em> seems to be replacing LSTM in <em><strong>natural language processing<\/strong><\/em>. But in this article series and in my PowerPoint slides, I tried to explain as much as possible, contrary to his advice.<\/p>\n<p>But I think, or rather hope, it is still meaningful to understand this 23-year-old algorithm, which is as old as me. 
I think LSTM defined a generation of algorithms for sequence data, and in fact <a href=\"https:\/\/www.iarai.ac.at\/news\/sepp-hochreiter-receives-ieee-cis-neural-networks-pioneer-award-2021\/?fbclid=IwAR27cwT5MfCw4Tqzs3MX_W9eahYDcIFuoGymATDR1A-gbtVmDpb8ExfQ87A\">Sepp Hochreiter, the inventor of LSTM, received the IEEE CIS Neural Networks Pioneer Award 2021 for his work<\/a>.<\/p>\n<p>I hope those who study sequence data processing in the future will come to this article series and study the basics of RNNs, just as I also study classical machine learning algorithms.<\/p>\n<p><span>\u00a0*In this article \u201cDensely Connected Layers\u201d is written as \u201cDCL,\u201d and \u201cConvolutional Neural Network\u201d as \u201cCNN.\u201d<\/span><\/p>\n<h3><strong>2. Why LSTM?<\/strong><\/h3>\n<p>First of all, let\u2019s take a brief look at what I said about the structures of RNNs in <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/06\/01\/prerequisites-for-understanding-rnn-at-a-more-mathematical-level\/\">the first<\/a> and <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/06\/17\/simple-rnn-the-first-foothold-for-understanding-lstm\/\">the second article<\/a>. A simple RNN is basically a densely connected network with a few layers. But the RNN gets an input at every time step and gives out an output at that time step. Part of the information in the middle layer is passed on to the next time step, where the RNN again gets an input and gives out an output. 
Therefore, if you focus on its recurrent connections, a simple RNN virtually behaves almost the same way as a densely connected network with many layers during forward\/back propagation.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-5118 \" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/simple_rnn_vanishing_gradient-1030x787.png\" alt=\"\" width=\"561\" height=\"380\"><\/p>\n<p>That is why simple RNNs suffer from vanishing\/exploding gradient problems, where gradients exponentially vanish or explode as they are multiplied many times through many layers during back propagation. To be exact, this problem deserves a more rigorous treatment, as you can see in <a href=\"http:\/\/proceedings.mlr.press\/v28\/pascanu13.pdf\">this paper.<\/a> But for now, please at least keep in mind that when you calculate a gradient of an error function with respect to the parameters of simple neural networks, you have to multiply parameters many times as below, and this type of calculation usually leads to the vanishing\/exploding gradient problem. <img loading=\"lazy\" class=\"aligncenter wp-image-5041 \" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/vanishing_gradient_equation-1030x215.png\" alt=\"\" width=\"636\" height=\"133\"><\/p>\n<p>LSTM was invented as a way to tackle such problems, as I mentioned in <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/07\/16\/a-brief-history-of-neural-nets-everything-you-should-know-before-learning-lstm\/\">the last article<\/a>.<\/p>\n<h3><strong>3. 
How to display LSTM<\/strong><\/h3>\n<p>I would like you to just go to image search on Google, Bing, or Yahoo!, and type in \u201cLSTM.\u201d You will find many figures, but basically LSTM charts are roughly classified into two types: in this article I call them the \u201cSpace Odyssey type\u201d and the \u201celectronic circuit type,\u201d and in conclusion, I highly recommend you understand LSTM as the \u201celectronic circuit type.\u201d<\/p>\n<p><em>*I just randomly came up with the terms \u201cSpace Odyssey type\u201d and \u201celectronic circuit type\u201d because the former is used in the paper I mentioned, and the latter looks like an electronic circuit to me. You do not have to take these names seriously.<\/em><\/p>\n<p>However, note that not all the well-made explanations on LSTM use the \u201celectronic circuit type,\u201d and I am sure you will sometimes have to understand LSTM as the \u201cSpace Odyssey type.\u201d And the paper \u201c<a href=\"https:\/\/arxiv.org\/abs\/1503.04069\">LSTM: A Search Space Odyssey<\/a>,\u201d from which I learned a lot about LSTM, also adopts the \u201cSpace Odyssey type.\u201d<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_comparison.png\"><img loading=\"lazy\" class=\"aligncenter size-large wp-image-5046\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_comparison-1-1030x506.png\" alt=\"LSTM architecture visualization\" width=\"1030\" height=\"506\"><\/a><\/p>\n<p>The main reason why I recommend the \u201celectronic circuit type\u201d is that its behavior looks closer to that of simple RNNs, which you will have seen if you read my former articles.<\/p>\n<p>*The two types look different, but of course they describe the same thing.<\/p>\n<p>If you have some understanding of DCLs, it should not be so hard to understand how simple RNNs work, because simple RNNs are mainly composed of linear 
connections of neurons and weights, whose structures are almost the same everywhere. Basically they have only straightforward linear connections, as you can see below.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/wp-content\/uploads\/2020\/07\/linear_algebra_1.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-9681\" src=\"https:\/\/data-science-blog.com\/wp-content\/uploads\/2020\/07\/linear_algebra_1-1030x465.png\" alt=\"\" width=\"345\" height=\"156\"><\/a><\/p>\n<p>But from now on, I would like you to give up the idea that LSTM is composed of connections of neurons like in the head image of this article series. Drawn that way, an LSTM block would be chaotic, and I do not want to make a figure of it in PowerPoint. In short, sooner or later you have to understand the equations of LSTM.<\/p>\n<h3><strong>4. Forward propagation of LSTM in \u201celectronic circuit type\u201d<\/strong><\/h3>\n<p>*For further understanding of the mathematics of LSTM forward\/back propagation, I recommend you<a href=\"https:\/\/www.linkedin.com\/learning\/powerpoint-2016-visualisierung?trk=slideshare_sv_learning&amp;originalSubdomain=de\"> download my slides<\/a>.<\/p>\n<p>The behavior of an LSTM block is quite similar to that of a simple RNN block: a block gets an input at every time step and receives information from the block of the last time step via recurrent connections, and it passes information on to the next block.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/rnn_comparison.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5042\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/rnn_comparison-1030x619.png\" alt=\"\" width=\"470\" height=\"255\"><\/a><\/p>\n<p>Let\u2019s look at the simplified architecture of an LSTM block. 
First of all, you should keep in mind that an LSTM block has two streams of information: one going through all the <em><strong>gates<\/strong><\/em>, and one going through the <em><strong>cell connections<\/strong><\/em>, the \u201chighway\u201d of the LSTM block. For simplicity, we will see the architecture of an LSTM block without <strong><em>peephole connections<\/em><\/strong>, the lines in blue. The flow of information through the cell connections is relatively uninterrupted. This helps LSTMs retain information for a long time.<\/p>\n<p><strong><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_1.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5039 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_1-1030x443.png\" alt=\"\" width=\"1030\" height=\"443\"><\/a><\/strong><\/p>\n<p>In an LSTM block, the input and the output of the former time step separately go through sections named \u201cgates\u201d: the input gate, the forget gate, the output gate, and the block input. The outputs of the forget gate, the input gate, and the block input join the highway of cell connections to renew the value of the cell.<\/p>\n<p><em>*The two small dots on the cell connections are the \u201con-ramps\u201d of the cell connection highway. <\/em><\/p>\n<p><em>*You will see the terms \u201cinput gate,\u201d \u201cforget gate,\u201d and \u201coutput gate\u201d almost everywhere, but what the \u201cblock input\u201d is called depends on the textbook.<\/em><\/p>\n<p>Let\u2019s look at the structure of an LSTM block a bit more concretely. 
An LSTM block at the time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\"> gets <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-28b7d29a6043feca92592bde198078ba_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"43\">, the output at the last time step,\u00a0 and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ff25c11691c3a02797e1a6e395102950_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"41\">, the information of the cell at the time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25d023a88c1d1a280c7bb696a442216a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t-1)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"49\">, via recurrent connections. 
The block at time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\"> gets the input <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25bd838f69c67f952fda444255f1db1d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{x}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"26\">, and it separately goes through each gate, together with <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-28b7d29a6043feca92592bde198078ba_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"43\">. After some calculations and activation, each gate gives out an output. The outputs of the forget gate, the input gate, the block input, and the output gate are respectively <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a7d0005f1334917bb7705916d21ab171_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{f}^{(t)}, boldsymbol{i}^{(t)}, boldsymbol{z}^{(t)}, boldsymbol{o}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"126\">. 
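The four gate computations can be sketched in numpy as below. This is a minimal illustration under assumptions of my own: the input weight matrices W, recurrent weight matrices R, and biases b per gate are my naming for this sketch, not taken from the paper, and peephole connections are left out. Each gate applies a linear map to x^(t) and y^(t-1), adds a bias, and squashes the result with a sigmoid or tanh.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_gates(x_t, y_prev, W, R, b):
    """One time step of the four gate computations (no peephole connections).

    W, R, b are dicts keyed by gate name -- hypothetical names for illustration.
    """
    f = sigmoid(W["f"] @ x_t + R["f"] @ y_prev + b["f"])  # forget gate, values in [0, 1]
    i = sigmoid(W["i"] @ x_t + R["i"] @ y_prev + b["i"])  # input gate, values in [0, 1]
    z = np.tanh(W["z"] @ x_t + R["z"] @ y_prev + b["z"])  # block input, values in [-1, 1]
    o = sigmoid(W["o"] @ x_t + R["o"] @ y_prev + b["o"])  # output gate, values in [0, 1]
    return f, i, z, o
```

Note that all four gates see exactly the same inputs; only their weights (and activation for the block input) differ.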
The outputs of the gates are mixed with <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ff25c11691c3a02797e1a6e395102950_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"41\"> and the LSTM block gives out an output <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-900a416bfac3667287a4ae3c0961946c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"25\">, and gives <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-900a416bfac3667287a4ae3c0961946c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"25\"> and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ba3b1b56a6139747edae1fda32addc87_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"23\"> to the next LSTM block via recurrent connections.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_8.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5036 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_8-1030x536.png\" alt=\"\" width=\"1030\" height=\"536\"><\/a><\/p>\n<p>You calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a7d0005f1334917bb7705916d21ab171_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{f}^{(t)}, boldsymbol{i}^{(t)}, boldsymbol{z}^{(t)}, boldsymbol{o}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" 
height=\"21\" width=\"126\"> as below.<\/p>\n<p>*You have to keep in mind that the equations above do not include peephole connections, which I am going to show with blue lines in the end.<\/p>\n<p>The equations above are quite straightforward if you understand forward propagation of simple neural networks. You add linear products of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25bd838f69c67f952fda444255f1db1d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{x}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"26\"> and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-28b7d29a6043feca92592bde198078ba_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"43\"> with different weights in each gate. What makes LSTMs different from simple RNNs is how the outputs of the gates are mixed with the cell connections. In order to explain that, I need to introduce a mathematical operator called the <em><strong>Hadamard product<\/strong><\/em>, which is denoted <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-1ce4c504ce60d246c1f2812a10c3a0b3_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"odot\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"13\">. This is a very simple operator. 
This operator produces the elementwise product of two vectors or matrices of identical shape.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/Hadamard_product.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5037\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/Hadamard_product-1030x301.png\" alt=\"\" width=\"441\" height=\"89\"><\/a><\/p>\n<p>With this Hadamard product operator, the renewed cell and the output are calculated as below.<\/p>\n<ul>\n<li><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-2caa576d424731d9fce9234634c1bd2c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)} = boldsymbol{z}^{(t)}odot boldsymbol{i}^{(t)} + boldsymbol{c}^{(t-1)} odot boldsymbol{f}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"234\"><\/li>\n<li><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-df63aa9070886dfdbb9cdb8b75a727fc_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)} = boldsymbol{o}^{(t)} odot tanh(boldsymbol{c}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"175\"><\/li>\n<\/ul>\n<p>The values of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a7d0005f1334917bb7705916d21ab171_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{f}^{(t)}, boldsymbol{i}^{(t)}, boldsymbol{z}^{(t)}, boldsymbol{o}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"126\"> are compressed into the range of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-caffaae885a1287e3dfc31bfb1cd0694_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"[0, 1]\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"32\"> or 
<img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-96395345c57f8928c42918c656dd1364_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"[-1, 1]\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"45\"> with activation functions. You can see that the input gate and the block input give new information to the cell. The part <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-e7cd17a7fecfb6dcfdf465206992cb68_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t-1)} odot boldsymbol{f}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"91\"> means that the output of the forget gate \u201cforgets\u201d the cell of the last time step by multiplying it elementwise by values from 0 to 1. And the cell <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ba3b1b56a6139747edae1fda32addc87_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"23\"> is activated with <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-5d0093c6dc60c8e1bac2d27cd80c273d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"tanh()\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"49\"> and the output of the output gate \u201csuppresses\u201d the activated value of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ba3b1b56a6139747edae1fda32addc87_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"23\">. In other words, the output gate decides how much information to give out as an output of the LSTM block. 
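These two update equations can be sketched in numpy in a few lines. This is a minimal illustration (the function name is mine), and `*` on numpy arrays is exactly the elementwise Hadamard product.

```python
import numpy as np

def lstm_cell_update(f, i, z, o, c_prev):
    """Renew the cell and compute the block output from the gate outputs."""
    c = z * i + c_prev * f   # c^(t) = z^(t) (.) i^(t) + c^(t-1) (.) f^(t)
    y = o * np.tanh(c)       # y^(t) = o^(t) (.) tanh(c^(t))
    return c, y
```

For instance, a forget gate of all ones combined with an input gate of all zeros leaves the old cell untouched, while an output gate of all zeros makes the block give out nothing.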
The output of every gate depends on the input <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25bd838f69c67f952fda444255f1db1d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{x}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"26\"> and the recurrent connection <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-28b7d29a6043feca92592bde198078ba_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"43\">. That means an LSTM block learns to forget the cell of the last time step, to renew the cell, and to suppress the output. To put it in an extreme way, if all the outputs of every gate are always <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-24c57fb7cb6119756d7282ef2a4cbdc3_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(1, 1, \u20261)^T\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"89\">, LSTMs forget nothing, retain the information of the inputs at every time step, and give out everything. And if all the outputs of every gate are always <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d5139ace50b16c68d73a00086c4001e7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(0, 0, \u20260)^T\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"89\">, LSTMs forget everything, receive no inputs, and give out nothing.<\/p>\n<p>This model has one problem: the outputs of each gate do not directly depend on the information in the cell. 
To solve this problem, some LSTM models introduce flows of information from the cell to each gate, shown as blue lines in the figure below.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_7-1.png\"><img loading=\"lazy\" class=\"aligncenter size-large wp-image-5123\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_7-1-1030x498.png\" alt=\"LSTM inner architecture\" width=\"1030\" height=\"498\"><\/a><\/p>\n<p>Which LSTM model you get, for example with or without peephole connections, depends on the library you use, and the model I have shown is one standard LSTM structure. However, no matter how complicated the structure of an LSTM block looks, you usually cover it with a black box, as below, and show its behavior in a very simplified way.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/LSTM_blackbox-1.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5122 size-medium\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/LSTM_blackbox-1-296x300.png\" alt=\"\" width=\"296\" height=\"300\"><\/a><\/p>\n<h3><strong>5. Space Odyssey type<\/strong><\/h3>\n<p>I personally think there is no advantage in understanding how LSTMs work with this Space Odyssey type of chart, but in several cases you will have to use this type of chart. So I will briefly explain how to look at that type of chart, based on the understanding of LSTMs you have gained through this article.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_9.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5035\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/lstm_9-1030x754.png\" alt=\"\" width=\"709\" height=\"519\"><\/a><\/p>\n<p>In a Space Odyssey type of LSTM chart, the cell is at the center. 
An electronic circuit type of chart shows the flow of information of the cell as an uninterrupted \u201chighway\u201d in an LSTM block. In a Space Odyssey type of chart, on the other hand, the information of the cell circulates at the center. And each gate gets the information of the cell through peephole connections, along with <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25bd838f69c67f952fda444255f1db1d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{x}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"26\">, the input at the time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\">, and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-28b7d29a6043feca92592bde198078ba_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"43\">, the output at the last time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25d023a88c1d1a280c7bb696a442216a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t-1)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"49\">, which came through recurrent connections. In a Space Odyssey type of chart, you can see more clearly that the information of the cell goes to each gate through the peephole connections in blue. Each gate calculates its output.<\/p>\n<p>Just as in the charts you have seen, the dotted lines denote the information from the past. 
First, the information of the cell at the time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25d023a88c1d1a280c7bb696a442216a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t-1)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"49\"> goes to the forget gate and gets mixed with the output of the forget gate. In this process the cell is partly \u201cforgotten.\u201d Next, the outputs of the input gate and the block input are mixed to generate part of the new value of the cell at time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\">. And the partly \u201cforgotten\u201d <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ff25c11691c3a02797e1a6e395102950_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"41\"> goes back to the center of the block, where it is mixed with the output of the input gate and the block input. That is how <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ba3b1b56a6139747edae1fda32addc87_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"23\"> is renewed. And the value of the new cell flows to the top of the chart, being mixed with the output of the output gate. 
Or you can also say the information of the new cell is \u201csuppressed\u201d by the output gate.<\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/space_odyssey_flow_chart.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5119 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/08\/space_odyssey_flow_chart-1030x426.png\" alt=\"\" width=\"1030\" height=\"426\"><\/a><\/p>\n<p>I have finished the first four articles of this article series, and finally I am going to write about back propagation of LSTM in the next article. I have to say that everything I have written so far is groundwork for the next article, and for my long, long PowerPoint slides.<\/p>\n<p>[References]<\/p>\n<p>[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutn\u00edk, Bas R. Steunebrink, J\u00fcrgen Schmidhuber, \u201cLSTM: A Search Space Odyssey,\u201d (2017)<\/p>\n<p>[2] Fran\u00e7ois Chollet, Deep Learning with Python, (2018), Manning, pp. 202-204<\/p>\n<p>[3] \u201cSepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021\u201d, Institute of Advanced Research in Artificial Intelligence, (2020)<br \/>URL: https:\/\/www.iarai.ac.at\/news\/sepp-hochreiter-receives-ieee-cis-neural-networks-pioneer-award-2021\/?fbclid=IwAR27cwT5MfCw4Tqzs3MX_W9eahYDcIFuoGymATDR1A-gbtVmDpb8ExfQ87A<\/p>\n<p>[4] Oketani Takayuki, \u201cMachine Learning Professional Series: Deep Learning,\u201d (2015), pp. 120-125<br \/>\u5ca1\u8c37\u8cb4\u4e4b \u8457, \u300c\u6a5f\u68b0\u5b66\u7fd2\u30d7\u30ed\u30d5\u30a7\u30c3\u30b7\u30e7\u30ca\u30eb\u30b7\u30ea\u30fc\u30ba \u6df1\u5c64\u5b66\u7fd2\u300d, (2015), pp. 120-125<\/p>\n<p>[5] Harada Tatsuya, \u201cMachine Learning Professional Series: Image Recognition,\u201d (2017), pp. 252-257<br \/>\u539f\u7530\u9054\u4e5f \u8457, \u300c\u6a5f\u68b0\u5b66\u7fd2\u30d7\u30ed\u30d5\u30a7\u30c3\u30b7\u30e7\u30ca\u30eb\u30b7\u30ea\u30fc\u30ba \u753b\u50cf\u8a8d\u8b58\u300d, (2017), pp. 
252-257<\/p>\n<p>[6] \u201cUnderstandable LSTM ~ With the Current Trends,\u201d Qiita, (2015)<br \/>\u300c\u308f\u304b\u308bLSTM \uff5e \u6700\u8fd1\u306e\u52d5\u5411\u3068\u5171\u306b\u300d, Qiita, (2015)<br \/>URL: https:\/\/qiita.com\/t_Signull\/items\/21b82be280b46f467d1b<\/p>\n<div id=\"author-bio-box\">\n<h3><a href=\"https:\/\/data-science-blog.com\/en\/blog\/author\/yasuto\/\" title=\"All posts by Yasuto Tamura\" rel=\"author\">Yasuto Tamura<\/a><\/h3>\n<div class=\"bio-gravatar\"><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/03\/yasuto-tamura-80x80.png\" width=\"70\" height=\"70\" alt=\"Yasuto Tamura\" class=\"avatar avatar-70 wp-user-avatar wp-user-avatar-70 alignnone photo\"><\/div>\n<p><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"http:\/\/www.datanomiq.de\" class=\"bio-icon bio-icon-website\"><\/a><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"https:\/\/www.linkedin.com\/in\/yasuto-tamura-7689b418b\/\" class=\"bio-icon bio-icon-linkedin\"><\/a><\/p>\n<p class=\"bio-description\">Data Science Intern at <a href=\"http:\/\/www.datanomiq.io\">DATANOMIQ<\/a>.<br \/>\nMajoring in computer science. Currently studying mathematical sides of deep learning, such as densely connected layers, CNN, RNN, autoencoders, and making study materials on them. 
Also started aiming at Bayesian deep learning algorithms.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/data-science-blog.com\/en\/blog\/2020\/08\/21\/into-gated-rnn-the-childhoods-end-of-simple-rnn\/<\/p>\n","protected":false},"author":0,"featured_media":554,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/553"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=553"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/553\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/554"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=553"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=553"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=553"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}