{"id":1109,"date":"2020-09-07T14:16:00","date_gmt":"2020-09-07T14:16:00","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/07\/lstm-back-propagation-following-the-flows-of-variables\/"},"modified":"2020-09-07T14:16:00","modified_gmt":"2020-09-07T14:16:00","slug":"lstm-back-propagation-following-the-flows-of-variables","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/07\/lstm-back-propagation-following-the-flows-of-variables\/","title":{"rendered":"LSTM back propagation: following the flows of variables"},"content":{"rendered":"<div>\n<p>First of all, the summary of this article is: <em>please just download the <a href=\"https:\/\/de.slideshare.net\/YasutoTamura1\/precise-lstm-algorithm-237074916\">Power Point slides which I made<\/a> and patiently follow the equations.<\/em><\/p>\n<p>I am not supposed to use so much mathematics when I write articles on Data Science Blog. However, using too little mathematics when I talk about LSTM backprop is like writing German without ever caring about \u201cder,\u201d \u201cdie,\u201d and \u201cdas,\u201d speaking hardly any English in English classes (which most high school English teachers in Japan do), or writing Japanese without using any Chinese characters (which looks like terrible handwriting by a drug addict). In short, that is ridiculous. At the same time, a blog post that spells out all the precise equations of LSTM backprop is not a comfortable thing to read. 
So basically the whole of this article is an advertisement for <a href=\"https:\/\/de.slideshare.net\/YasutoTamura1\/precise-lstm-algorithm-237074916\">my PowerPoint slides<\/a>, sponsored by <a href=\"https:\/\/www.datanomiq.de\/\">DATANOMIQ<\/a>, and I can just give you some tips here for getting ready for the most tiresome part of understanding LSTM.<\/p>\n<p>*This article is the fifth article of \u201c<a href=\"https:\/\/data-science-blog.com\/blog\/2020\/05\/01\/recurrent-neural-network\/\">A gentle introduction to the tiresome part of understanding RNN<\/a>.\u201d<\/p>\n<p><span>\u00a0*In this article \u201cDensely Connected Layers\u201d is written as \u201cDCL,\u201d and \u201cConvolutional Neural Network\u201d as \u201cCNN.\u201d<\/span><\/p>\n<h3>1. Chain rules<\/h3>\n<p>This article is virtually an article on chain rules of differentiation. Even if you have a clear understanding of chain rules, I recommend taking a look at this section. If you have ever written down all the equations of back propagation of a DCL, you will have seen what chain rules are. Even simple chain rules for backprop of a normal DCL can be difficult for some people, but when it comes to backprop of LSTM, it is pure torture. I think using graphical models would help you understand what chain rules are like. Graphical models are basically used to describe the relations of variables and functions in probabilistic models, so to be exact I am going to use \u201csomething like graphical models\u201d in this article. Not that this is a common way to explain chain rules.<\/p>\n<p>First, let\u2019s think about the simplest type of chain rule. 
Assume that you have a function <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-42987712926c99f2c7438a971b810ac3_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"f=f(x)=f(x(y))\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"149\">, and that the relations of the functions are displayed as the graphical model on the left side of the figure below. Variables are a type of function, so you should think that every node in graphical models denotes a function. The purple arrows on the right side of the chart show how information propagates in differentiation.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5173 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/chain_rule_1-1030x236.png\" alt=\"\" width=\"833\" height=\"166\"><\/p>\n<p>Next, assume that you have a function <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9c09a708375fde2676da319bcdfe8b24_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"f\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"10\"> which has two variables <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-01a7b7b5dca66cb33a1207e1f39c1140_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"x_1\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"16\"> and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-f1cd6be340b4fce14489cf5b565a169e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"x_2\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"17\">. 
And both of these variables are in turn functions of two variables <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d48db79635c1ec05bd332325e278f268_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"y_1\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"15\"> and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-445c2fb0b8c763c14a5736dc7f43a558_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"y_2\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"16\">. When you take the partial differentiation of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9c09a708375fde2676da319bcdfe8b24_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"f\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"10\"> with respect to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d48db79635c1ec05bd332325e278f268_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"y_1\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"15\"> or <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-445c2fb0b8c763c14a5736dc7f43a558_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"y_2\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"16\">, the formula is a little tricky. Let\u2019s think about how to calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-8015cf37969558194915efeeee3c4b20_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial f}{partial y_1}\" title=\"Rendered by QuickLaTeX.com\" height=\"27\" width=\"22\">. 
The variable <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d48db79635c1ec05bd332325e278f268_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"y_1\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"15\"> propagates to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9c09a708375fde2676da319bcdfe8b24_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"f\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"10\"> via <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-01a7b7b5dca66cb33a1207e1f39c1140_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"x_1\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"16\"> and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-f1cd6be340b4fce14489cf5b565a169e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"x_2\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"17\">. In this case the partial differentiation has two terms, as below.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5172 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/chain_rule_2-1030x340.png\" alt=\"\" width=\"736\" height=\"173\"><\/p>\n<p>In chain rules, you have to think about all the routes through which a variable can propagate. 
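<\/p>
<p>As a sanity check, this two-route chain rule can be verified numerically. The following Python sketch uses arbitrarily chosen example functions (they are not from this article) and compares the analytic derivative, built from the two routes, with a finite-difference estimate:<\/p>

```python
import math

# Arbitrary example functions, wired like the figure:
# y1 and y2 feed x1 and x2, and both x1 and x2 feed f.
def x1(y1, y2):
    return y1 * y2

def x2(y1, y2):
    return y1 + y2

def f(y1, y2):
    return math.sin(x1(y1, y2)) + x2(y1, y2) ** 2

y1, y2 = 0.7, -1.3

# Chain rule with two routes:
# df/dy1 = (df/dx1)(dx1/dy1) + (df/dx2)(dx2/dy1)
df_dx1 = math.cos(x1(y1, y2))  # derivative of sin(x1) with respect to x1
df_dx2 = 2.0 * x2(y1, y2)      # derivative of x2**2 with respect to x2
dx1_dy1 = y2                   # derivative of y1*y2 with respect to y1
dx2_dy1 = 1.0                  # derivative of y1+y2 with respect to y1
analytic = df_dx1 * dx1_dy1 + df_dx2 * dx2_dy1

# Central finite-difference estimate of df/dy1
eps = 1e-6
numeric = (f(y1 + eps, y2) - f(y1 - eps, y2)) / (2.0 * eps)
print(analytic, numeric)
```

<p>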
If you generalize chain rules, the result looks like the figure below, and you need to understand chain rules in this way to understand any type of back propagation.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5171 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/chain_rule_3-1030x398.png\" alt=\"\" width=\"769\" height=\"293\"><\/p>\n<p>The figure above shows that if you calculate the partial differentiation of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9c09a708375fde2676da319bcdfe8b24_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"f\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"10\"> with respect to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb3c186e5c65fcd066bb23dec8f4e48a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"y_i\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"14\">, the partial differentiation has <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-b170995d512c659d8668b4e42e1fef6b_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"n\" title=\"Rendered by QuickLaTeX.com\" height=\"8\" width=\"11\"> terms in total because <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb3c186e5c65fcd066bb23dec8f4e48a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"y_i\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"14\"> propagates to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9c09a708375fde2676da319bcdfe8b24_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"f\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"10\"> via <img loading=\"lazy\" 
src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-b170995d512c659d8668b4e42e1fef6b_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"n\" title=\"Rendered by QuickLaTeX.com\" height=\"8\" width=\"11\"> variables. In order to understand backprop of LSTM, you constantly have to care about the flows of variables, which I showed as the purple arrows above.<\/p>\n<h3>2. Chain rules in LSTM<\/h3>\n<p>I would like you to remember the figure below, which I used in <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/06\/17\/simple-rnn-the-first-foothold-for-understanding-lstm\/\">the second article<\/a> to show how errors propagate backward during backprop of simple RNNs. After forward propagation, first of all, you need to calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6e8ba594c3d76da5e76e43867e740a93_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J}{partial boldsymbol{theta}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"27\" width=\"31\">, the gradients of the error function with respect to parameters, at every time step. 
But you have to be careful that even though these gradients depend on time steps, the parameters <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ddf302e914ec9de856caff481f22374e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{theta}\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"10\"> do not depend on time steps.<\/p>\n<p>*As I mentioned in <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/06\/17\/simple-rnn-the-first-foothold-for-understanding-lstm\/\">the second article<\/a>, I personally think <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6e8ba594c3d76da5e76e43867e740a93_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J}{partial boldsymbol{theta}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"27\" width=\"31\"> should rather be denoted as <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a1d14e31d7f56e3caeb9fd2207d28582_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(frac{partial J}{partial boldsymbol{theta}})^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"23\" width=\"48\"> because parameters themselves do not depend on time. <a href=\"https:\/\/www.deeplearningbook.org\/contents\/rnn.html\">The textbook by MIT press<\/a> also partly uses the former notation. 
And you are likely to encounter this type of notation, so I think it is not bad to get ready for both.<\/p>\n<p>The errors at time step <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-30883a806cb6a9b63288b239836e4842_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(t)\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"18\"> propagate backward to all the <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-76e4154bff3b7c1f63f61fe19ba305e5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{h} ^{(s)}, (s leq t)\" title=\"Rendered by QuickLaTeX.com\" height=\"22\" width=\"88\">. Conversely, in order to calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6e8ba594c3d76da5e76e43867e740a93_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J}{partial boldsymbol{theta}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"27\" width=\"31\">, you need the errors flowing from <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-f7bc1db2f42c333f622b568b27c2662b_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J^{(s)},  (s geq t)\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"87\">. In the chart, you need the purple arrows of errors for the gradient in the purple frame, the orange arrows for the gradients in the orange frame, and the red arrows for the gradients in the red frame. 
And you need to sum up <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6e8ba594c3d76da5e76e43867e740a93_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J}{partial boldsymbol{theta}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"27\" width=\"31\"> to calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-83bbe007788cee105e7594ede049b667_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J}{partial boldsymbol{theta}} = sum_{t}{frac{partial J}{partial boldsymbol{theta}^{(t)}}}\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"103\">, and you need this gradient <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-827da51489999abbf32e4928f59cb202_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J}{partial boldsymbol{theta}}\" title=\"Rendered by QuickLaTeX.com\" height=\"24\" width=\"17\"> to renew the parameters once.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-9673 aligncenter\" src=\"https:\/\/data-science-blog.com\/wp-content\/uploads\/2020\/07\/simple_rnn_backprop_flow_3-1030x744.png\" alt=\"\" width=\"480\" height=\"386\"><\/p>\n<p>At an RNN block level, the flows of errors and how to renew parameters are the same in LSTM backprop, but the flow of errors inside each block is much more complicated. 
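<\/p>
<p>The rule <em>sum the per-time-step gradients to get the gradient for the shared parameters<\/em> can be illustrated with a deliberately tiny toy model (a scalar linear recurrence made up for illustration, not an actual LSTM). Pretending that the shared parameter is an independent copy at each time step yields the per-time-step gradients, and their sum matches a finite-difference estimate of the true gradient:<\/p>

```python
# Toy model: h_t = theta * h_{t-1} + x_t, loss J = h_T ** 2,
# where the single scalar parameter theta is shared across all time steps.
theta = 0.8
xs = [0.5, -0.2, 1.0, 0.3]

# Forward pass, keeping every hidden state (hs[t] is h_t, with h_0 = 0)
hs = [0.0]
for x in xs:
    hs.append(theta * hs[-1] + x)
h_T, T = hs[-1], len(xs)

# (dJ/dtheta)^(t): treat the copy of theta used at step t as independent.
# Then dh_T/dtheta_t = theta**(T - t) * h_{t-1}, and dJ/dh_T = 2 * h_T.
per_step = [2.0 * h_T * theta ** (T - t) * hs[t - 1] for t in range(1, T + 1)]
total = sum(per_step)

# Finite-difference check with respect to the shared parameter
def loss(th):
    h = 0.0
    for x in xs:
        h = th * h + x
    return h ** 2

eps = 1e-6
numeric = (loss(theta + eps) - loss(theta - eps)) / (2.0 * eps)
print(total, numeric)
```

<p>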
And in this article and <a href=\"https:\/\/de.slideshare.net\/YasutoTamura1\/precise-lstm-algorithm-237074916\">my PowerPoint slides<\/a>, I use a special notation to denote errors: <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb1fc2f7f8e710041263aae53d7ae8cc_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta star  ^{(t)}= frac{partial J^{(t)}}{partial star}\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"91\"><\/p>\n<p>* Again, please be careful about what <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-0bc9ede4118f4296c0f32dc0479da6bd_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta star  ^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"32\"> means. Neurons depend on time steps, but parameters do not depend on time steps. So if <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ce5ce4f1ed0448af8dbee67daac86254_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"star\" title=\"Rendered by QuickLaTeX.com\" height=\"9\" width=\"9\"> denotes neurons, <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-075dc3785bb0ad075d479e526a4429d0_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta star  ^{(t)}= frac{partial J}{ partial star ^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"89\">, but when <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ce5ce4f1ed0448af8dbee67daac86254_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"star\" title=\"Rendered by QuickLaTeX.com\" height=\"9\" width=\"9\"> denotes parameters, <img loading=\"lazy\" 
src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c29f384b948d07608095fbd80e9b5141_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta star  ^{(t)}= frac{partial J^{(t)}}{ partial star}\" title=\"Rendered by QuickLaTeX.com\" height=\"26\" width=\"91\"> should rather be denoted as <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-3e52a46ee6ee3535b73ab0f159b8e9ac_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta star  ^{(t)}= (frac{partial J}{ partial star ^{(t)}})\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"104\">. In the <a href=\"https:\/\/arxiv.org\/abs\/1503.04069\">Space Odyssey paper<\/a>,\u00a0 <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d85e106fc7ceaf73112db540ca6104d9_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{star}\" title=\"Rendered by QuickLaTeX.com\" height=\"9\" width=\"10\"> are not used as parameters, but in <a href=\"https:\/\/de.slideshare.net\/YasutoTamura1\/precise-lstm-algorithm-237074916\">my PowerPoint slides<\/a> and some other materials, <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d85e106fc7ceaf73112db540ca6104d9_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{star}\" title=\"Rendered by QuickLaTeX.com\" height=\"9\" width=\"10\"> are also used as parameters.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5164 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_backprop_flow-1030x517.png\" alt=\"\" width=\"702\" height=\"302\"><\/p>\n<p>As I wrote in the last article, you calculate <img loading=\"lazy\" 
src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a7d0005f1334917bb7705916d21ab171_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{f}^{(t)}, boldsymbol{i}^{(t)}, boldsymbol{z}^{(t)}, boldsymbol{o}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"126\"> as below. Unlike the last article, I have also added the terms of the peephole connections in the equations below, and I have also added the variables <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-605144bca07f1a8afff0eee255d54e35_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"bar{boldsymbol{f}^{(t)}}, bar{boldsymbol{i}^{(t)}}, bar{boldsymbol{z}^{(t)}}, bar{boldsymbol{o}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"24\" width=\"126\"> for convenience.<\/p>\n<p>With the Hadamard product operator, the renewed cell and the output are calculated as below.<\/p>\n<ul>\n<li><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-2caa576d424731d9fce9234634c1bd2c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)} = boldsymbol{z}^{(t)}odot boldsymbol{i}^{(t)} + boldsymbol{c}^{(t-1)} odot boldsymbol{f}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"234\"><\/li>\n<li><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-df63aa9070886dfdbb9cdb8b75a727fc_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)} = boldsymbol{o}^{(t)} odot tanh(boldsymbol{c}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"175\"><\/li>\n<\/ul>\n<p>In this article I would rather give instructions on how to read my PowerPoint slides. 
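<\/p>
<p>Written as code, the renewed cell and the output above look like the following minimal NumPy sketch of a single forward step. The gate pre-activations follow the common peephole formulation (the input and forget gates peek at the previous cell state, the output gate at the renewed one); the dictionaries W, R, p, b mirror the parameter notation used in this article, but the concrete shapes and random values are placeholders:<\/p>

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, y_prev, c_prev, W, R, p, b):
    # Gate pre-activations (the bar variables). The peephole weights p
    # let the input and forget gates peek at c_prev, and the output
    # gate at the renewed cell c.
    z_bar = W["z"] @ x + R["z"] @ y_prev + b["z"]
    i_bar = W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"]
    f_bar = W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"]
    z, i, f = np.tanh(z_bar), sigmoid(i_bar), sigmoid(f_bar)

    c = z * i + c_prev * f          # renewed cell (Hadamard products)
    o_bar = W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"]
    o = sigmoid(o_bar)
    y = o * np.tanh(c)              # block output
    return y, c

# Smoke test with random placeholder weights
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.standard_normal((d_h, d_in)) for k in "zifo"}
R = {k: rng.standard_normal((d_h, d_h)) for k in "zifo"}
p = {k: rng.standard_normal(d_h) for k in "ifo"}
b = {k: np.zeros(d_h) for k in "zifo"}
y, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, R, p, b)
print(y.shape, c.shape)
```

<p>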
Just as in general backprop, you need to calculate the gradients of error functions with respect to parameters, such as <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6590bef3999183c5edd13f424b50b1c0_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{W}_{star}, delta boldsymbol{R}_{star}, delta boldsymbol{p}_{star}, delta boldsymbol{b}_{star}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"146\">, where <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ce5ce4f1ed0448af8dbee67daac86254_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"star\" title=\"Rendered by QuickLaTeX.com\" height=\"9\" width=\"9\"> is one of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-8a164636fe6cbffb6574c369c9421124_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"{z, in, for, out }\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"117\">. 
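<\/p>
<p>Once the per-time-step gate errors (derived in the rest of this article) are available, these parameter gradients are accumulated over time steps with outer products. A sketch under that assumption, for a single gate; the function and variable names here are placeholders, and the peephole gradients would accumulate analogously with element-wise products against the relevant cell states:<\/p>

```python
import numpy as np

def accumulate_param_grads(d_bar_seq, x_seq, y_seq):
    # d_bar_seq[t]: error of the gate pre-activation at step t, shape (d_h,)
    # x_seq[t]:     input x^(t), shape (d_in,)
    # y_seq[t]:     block output y^(t); y_seq[t-1] is the recurrent input
    d_h, d_in = d_bar_seq[0].shape[0], x_seq[0].shape[0]
    dW = np.zeros((d_h, d_in))
    dR = np.zeros((d_h, d_h))
    db = np.zeros(d_h)
    for t in range(len(x_seq)):
        y_prev = y_seq[t - 1] if t > 0 else np.zeros(d_h)
        dW += np.outer(d_bar_seq[t], x_seq[t])  # contribution to delta W
        dR += np.outer(d_bar_seq[t], y_prev)    # contribution to delta R
        db += d_bar_seq[t]                      # contribution to delta b
    return dW, dR, db
```

<p>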
And just as in backprop of simple RNNs, in order to calculate gradients with respect to parameters, you need to calculate the errors of neurons, that is, the gradients of error functions with respect to neurons, such as <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-eda486d1ddfcf867d7568bc6c0f6f1ff_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{f}^{(t)}, delta boldsymbol{i}^{(t)}, delta boldsymbol{z}^{(t)}, delta boldsymbol{o}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"21\" width=\"161\">.<\/p>\n<p>*Again and again, keep in mind that neurons depend on time steps, but parameters do not depend on time steps.<\/p>\n<p>When you calculate gradients with respect to neurons, you can first calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-1c1842d21823c343ea4e98397a97bd3f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"34\">, but the equation for this error is the most difficult, so I recommend putting it aside for now. After calculating <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-1c1842d21823c343ea4e98397a97bd3f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"34\">, you can next calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6516389d215081f326bc71b50de16830_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{boldsymbol{o}}^{(t)}= frac{partial J^{(t)}}{ partial bar{boldsymbol{o}}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"28\" width=\"93\">. 
If you see the LSTM block below as the kind of graphical model which I introduced, the information of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c816c064e0cebe2b032d7df89c1b0174_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"bar{boldsymbol{o}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"24\"> flows along the purple arrows. That means <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c816c064e0cebe2b032d7df89c1b0174_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"bar{boldsymbol{o}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"24\"> affects <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ad28cd88e33b638907ffae49cfe60953_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"11\"> only via <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-900a416bfac3667287a4ae3c0961946c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"25\">, and this structure is the same as that of the first graphical model which I introduced above. 
And if you calculate the error of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c816c064e0cebe2b032d7df89c1b0174_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"bar{boldsymbol{o}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"24\"> element-wise, you get the equation <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-76ffabe8fc7f42e6247717717c474701_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{o}_{k}^{(t)}=frac{partial J}{partial bar{o}_{k}^{(t)}}= frac{partial J}{partial y_{k}^{(t)}} frac{partial y_{k}^{(t)}}{partial bar{o}_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"181\">.<\/p>\n<p>*Here <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-3422b6bb5c160593658b7c39425d9880_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"k\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"9\"> is the index of an element of the vectors. If you can calculate element-wise gradients, it is easy to restate them as differentiation of vectors and matrices.<\/p>\n<h3><img loading=\"lazy\" class=\"wp-image-5168 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_backprop_1-1030x612.png\" alt=\"\" width=\"591\" height=\"351\"><\/h3>\n<p>Next you can calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-5025269c02bd807fec139273a6550ea5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"32\">, and chain rules are very important in this process. 
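<\/p>
<p>This element-wise equation can be verified numerically: with y_k = sigmoid(o_bar_k) * tanh(c_k), the local factor is dy_k/do_bar_k = tanh(c_k) * sigmoid(o_bar_k) * (1 - sigmoid(o_bar_k)). A small Python sketch for one element, with an arbitrarily chosen example loss J = y_k ** 2:<\/p>

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

c = -0.9      # example value of the cell state element c_k, held fixed
o_bar = 0.4   # example value of the output-gate pre-activation element

def y_of(o_bar):
    return sigmoid(o_bar) * math.tanh(c)

y = y_of(o_bar)
delta_y = 2.0 * y  # dJ/dy for the example loss J = y ** 2

# Element-wise chain rule: delta o_bar = (dJ/dy) * (dy/do_bar)
o = sigmoid(o_bar)
delta_o_bar = delta_y * math.tanh(c) * o * (1.0 - o)

# Central finite-difference check of dJ/do_bar
eps = 1e-6
numeric = (y_of(o_bar + eps) ** 2 - y_of(o_bar - eps) ** 2) / (2.0 * eps)
print(delta_o_bar, numeric)
```

<p>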
The flow of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-5025269c02bd807fec139273a6550ea5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"32\"> to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ad28cd88e33b638907ffae49cfe60953_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"11\"> can be roughly divided into two streams: one flows to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ad28cd88e33b638907ffae49cfe60953_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"11\"> via <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ccc200cf3fee7d89f8718272b98046a2_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"23\">, and the other flows to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ad28cd88e33b638907ffae49cfe60953_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"11\"> via <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-21ba017076150d373e28ec49601b7649_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t+1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"40\">. 
And the stream from <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-f668567e4b72cd5fd6c4e91d5addcf8a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"22\"> to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ccc200cf3fee7d89f8718272b98046a2_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"23\"> also has two branches: the one via <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c816c064e0cebe2b032d7df89c1b0174_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"bar{boldsymbol{o}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"24\"> and the one which flows directly into <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ccc200cf3fee7d89f8718272b98046a2_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"23\">. 
Likewise, the stream from <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-f668567e4b72cd5fd6c4e91d5addcf8a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"22\"> to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-21ba017076150d373e28ec49601b7649_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t+1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"40\"> also has three branches: the ones via <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-5adf595a2aaf627b9e1bbde5bb7fa4f3_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"bar{boldsymbol{f}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"24\" width=\"25\"> and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9de800951f86b37e2f240f523fa08392_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"bar{boldsymbol{i}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"21\">, and the one which flows directly into <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-21ba017076150d373e28ec49601b7649_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t+1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"40\">.<\/p>\n<h3><img loading=\"lazy\" class=\"alignnone wp-image-5169 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_backprop_2-1030x302.png\" alt=\"\" width=\"1030\" height=\"302\"><\/h3>\n<p>If you see these flows as a graphical model, that would be like the figure below.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5163 aligncenter\" 
src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_graphical_model-1030x650.png\" alt=\"\" width=\"671\" height=\"427\"><\/p>\n<p>According to this graphical model, you can calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-18a48b22afddacedbcfeae71e7daceae_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{c} ^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"32\"> element-wise as below.<img loading=\"lazy\" class=\"aligncenter wp-image-5160 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_gradient_c-1030x216.png\" alt=\"\" width=\"1030\" height=\"216\"><\/p>\n<p>* <strong>TO BE VERY HONEST<\/strong> I still do not fully understand why we can apply chain rules like above to calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-5025269c02bd807fec139273a6550ea5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{c}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"32\">. When you apply the formula of chain rules, which I showed in the first section, to this case, you have to be careful of where to apply partial differential operators <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-05222da6d9ebfb44198b7ed146e557ab_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial}{ partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"32\" width=\"29\">. 
In the case above, in the part <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c007ca1718038c2237f9499a21a17cd4_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial y_{k}^{(t)}}{partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"31\"> the partial differential operator only affects <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25e83f708c801db90933192953889801_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"tanh(c_{k}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"74\"> of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c2d99ebf50e94c3632939097c15b4ad9_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"o_{k}^{(t)} cdot tanh(c_{k}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"112\">, and in the part <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-b38757a852b120de147ddf1b8cdeba62_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial c_{k}^{(t+1)}}{partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"44\">, the partial differential operator <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bfb56446fb4e7f95cd304a6c3f5bfccf_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial}{partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"32\" width=\"29\"> only affects the part <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-46fc3df8b21342513237767a35016935_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"c_{k}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"22\"> of the term 
<img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb1e2336c91962d635847eed2d830a9d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"c^{t}_{k} cdot f_{k}^{(t+1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"71\">. In the <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-116afac756e06ace0b23799ff51beae7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial bar{o}_{k}^{(t)}}{partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"30\"> part, only <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a462db0de2f17084226c6689ba5b1e3a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(p_{out})_{k} cdot c_{k}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"85\">,\u00a0 in the <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-36d7e4f380eedda11bcf39a7ec07c292_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial bar{i}_{k}^{(t+1)}}{partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"43\"> part, only <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-7e4a6fca210e128c149353cfdd9cb5ce_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(p_{in})_{k} cdot c_{k}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"79\">, and in the <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-78be56e1ae7dea51f2796ceb8f812e1d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial bar{f}_{k}^{(t+1)}}{partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"46\"> part, only <img loading=\"lazy\" 
src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-7e4a6fca210e128c149353cfdd9cb5ce_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"(p_{in})_{k} cdot c_{k}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"79\">. But some other parts, which are not affected by <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-05222da6d9ebfb44198b7ed146e557ab_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial}{ partial c_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"32\" width=\"29\"> are also functions of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-46fc3df8b21342513237767a35016935_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"c_{k}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"22\">. For example <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-25f722ad49c8a4cf6f95c35c025fad2e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"o_{k}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"23\"> of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c2d99ebf50e94c3632939097c15b4ad9_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"o_{k}^{(t)} cdot tanh(c_{k}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"112\"> is also a function of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-46fc3df8b21342513237767a35016935_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"c_{k}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"22\">. 
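*A note I am adding here, which you should double-check yourself: what backprop actually computes is a <em>total<\/em> derivative, and each partial-derivative factor may hold the other arguments fixed precisely because their dependence on the variable is carried by the other summands. In the simplest two-branch case:

```latex
% If J depends on c only through two intermediate quantities u(c) and v(c)
% (e.g. u = \tanh(c_k^{(t)}) and v = \bar{o}_k^{(t)}), the total derivative is
\frac{\mathrm{d}J}{\mathrm{d}c}
  = \frac{\partial J}{\partial u}\,\frac{\mathrm{d}u}{\mathrm{d}c}
  + \frac{\partial J}{\partial v}\,\frac{\mathrm{d}v}{\mathrm{d}c}
```

Each summand treats the other argument as a constant, so the dependence of the other parts on the cell state is not lost; it is exactly what the other summands of the sum carry.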
And I am still not sure about the logic behind where those partial differential operators should apply.<\/p>\n<p>*But at least, these are the only decent equations for LSTM backprop which I could find, and <a href=\"https:\/\/arxiv.org\/abs\/1503.04069\">a frequently cited paper on LSTM<\/a> uses an implementation based on these equations. Computer science is more about practical skills than rigid mathematical logic. If you have any comments or advice on this point, please let me know.<\/p>\n<p>Calculating <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-462a890e897bf5e3fffe1440f52a06b4_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{boldsymbol{f}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"24\" width=\"35\">, <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ea04d7338876b7701aac1785efdaad50_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{boldsymbol{i}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"30\">, and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-e38c947d73ea02c2daea4ab2f8472455_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{boldsymbol{z}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"33\"> is about as straightforward as calculating <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-063c40296706144713b8ab3b6d8b0d40_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{boldsymbol{o}}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"33\">. They all use the first type of chain rule in the first section. 
Thereby you can get these gradients: <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-0e24a19ee62977ce6c479e1d636842a1_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{f}_{k}^{(t)}=frac{partial J}{ partial bar{f}_{k}^{(t)}} =frac{partial J}{partial c_{k}^{(t)}} frac{partial c_{k}^{(t)}}{ partial bar{f}_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"183\">, <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6227fb7e4e503fe28e97a8be81e85bb6_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{i}_{k}^{(t)}=frac{partial J}{partial bar{i}_{k}^{(t)}} =frac{partial J}{partial c_{k}^{(t)}} frac{partial c_{k}^{(t)}}{ partial bar{i}_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"173\">, and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-e4fa2c41b2d020fc4363e95e8fa36175_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta bar{z}_{k}^{(t)}=frac{partial J}{partial bar{z}_{k}^{(t)}} =frac{partial J}{partial c_{k}^{(t)}} frac{partial c_{k}^{(t)}}{ partial bar{i}_{k}^{(t)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"40\" width=\"179\">.<\/p>\n<h3><img loading=\"lazy\" class=\"alignnone wp-image-5162 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_backprop_flow_2-1030x610.png\" alt=\"\" width=\"1030\" height=\"610\"><\/h3>\n<p>All the gradients which we have calculated use the error <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-1c1842d21823c343ea4e98397a97bd3f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"34\">, but when it comes to calculating <img loading=\"lazy\" 
src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-1c1842d21823c343ea4e98397a97bd3f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"34\">\u2026.. I can only say \u201cPlease be patient. I did my best in <a href=\"https:\/\/de.slideshare.net\/YasutoTamura1\/precise-lstm-algorithm-237074916?fbclid=IwAR2TtEmqzgwvJeifUFPabijeE4YVez_gYnJoiHK2G1yRDub1IA-_PXJsr5w\">my PowerPoint slides<\/a> to explain that.\u201d It is not the kind of process I want to explain on WordPress. In conclusion you get an error like this: <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-13a209fe5c44748c15c3f2d2e5bbac24_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"delta boldsymbol{y}^{(t)}=frac{partial J^{(t)}}{partial boldsymbol{y}^{(t)}} + boldsymbol{R}_{for}^{T} delta bar{boldsymbol{f}}^{(t+1)} + boldsymbol{R}_{in}^{T}delta bar{boldsymbol{i}}^{(t+1)} + boldsymbol{R}_{out}^{T}delta bar{boldsymbol{o}}^{(t+1)} + boldsymbol{R}_{z}^{T}delta bar{boldsymbol{z}}^{(t+1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"31\" width=\"520\">, and the flows of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-900a416bfac3667287a4ae3c0961946c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{y}^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"25\"> are as below.<\/p>\n<h3><img loading=\"lazy\" class=\"alignnone wp-image-5170 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_backprop_6-1030x303.png\" alt=\"\" width=\"1030\" height=\"303\"><\/h3>\n<p>Combining the gradients we have obtained so far, we can calculate the gradients with respect to the parameters. 
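As a rough illustration of how these deltas fit together, here is a minimal NumPy sketch of one backward step through a peephole LSTM cell. All the array names are my own, and I assume that `delta_y` already contains the recurrent terms flowing back from step t+1 (the recurrence shown just above), so take it as a sketch of the equations, not as the reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_backward_step(delta_y, c_prev, c, bar_z, bar_i, bar_f, bar_o,
                       p_in, p_for, p_out,
                       delta_c_next, f_next, delta_bar_i_next, delta_bar_f_next):
    """One backward step through a peephole LSTM cell at time t.

    delta_y is assumed to already include the R^T-weighted deltas
    flowing back from step t+1.
    """
    z, i, f, o = np.tanh(bar_z), sigmoid(bar_i), sigmoid(bar_f), sigmoid(bar_o)
    # delta of the output-gate pre-activation
    delta_bar_o = delta_y * np.tanh(c) * o * (1.0 - o)
    # delta of the cell state: one summand per branch leaving c^{(t)}
    delta_c = (delta_y * o * (1.0 - np.tanh(c) ** 2)   # via y^{(t)}
               + p_out * delta_bar_o                   # via bar_o^{(t)} (peephole)
               + p_in * delta_bar_i_next               # via bar_i^{(t+1)} (peephole)
               + p_for * delta_bar_f_next              # via bar_f^{(t+1)} (peephole)
               + delta_c_next * f_next)                # directly into c^{(t+1)}
    # deltas of the remaining gate pre-activations
    delta_bar_f = delta_c * c_prev * f * (1.0 - f)
    delta_bar_i = delta_c * z * i * (1.0 - i)
    delta_bar_z = delta_c * i * (1.0 - np.tanh(bar_z) ** 2)
    return delta_c, delta_bar_z, delta_bar_i, delta_bar_f, delta_bar_o
```

A cheap sanity check of the wiring: if `delta_y` and all the deltas coming from step t+1 are zero, every returned delta is zero as well.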
For concrete results, please check <a href=\"https:\/\/arxiv.org\/abs\/1503.04069\">the Space Odyssey paper<\/a> or my <a href=\"https:\/\/de.slideshare.net\/YasutoTamura1\/precise-lstm-algorithm-237074916?fbclid=IwAR3Iak_DO1rOdyYCehhAXSwGCwaKSImhNfwjG0CSTfe0dSQgcOvL1abIklI\">PowerPoint slide<\/a>.<\/p>\n<h3>3. How LSTMs tackle exploding\/vanishing gradient problems<\/h3>\n<p>*If you are allergic to mathematics, you should skip both this section and <a href=\"https:\/\/de.slideshare.net\/YasutoTamura1\/precise-lstm-algorithm-237074916?fbclid=IwAR2TtEmqzgwvJeifUFPabijeE4YVez_gYnJoiHK2G1yRDub1IA-_PXJsr5w\">my PowerPoint slide<\/a>.<\/p>\n<p>*Part of this section is more or less subjective, so if you really want to know how LSTMs mitigate the problems, I highly recommend that you also refer to other materials. But at least I did my best for this article.<\/p>\n<p>LSTMs do not completely solve vanishing gradient problems; they mitigate vanishing\/exploding gradient problems. I am going to roughly explain why they can tackle those problems. I think you can find many explanations of that topic, but many of them seem to have some mathematical mistakes (even the slides used in a lecture at Stanford University), and I could only partly agree with some statements. I also could not find any papers or materials which show the whole picture of how LSTMs can tackle those problems. So in this article I am only going to give an outline of the most mainstream way to explain this topic.<\/p>\n<p>First let\u2019s see how gradients actually \u201cvanish\u201d or \u201cexplode\u201d in simple RNNs. 
As I showed in the second article of this series, simple RNNs propagate forward according to the equations below.<\/p>\n<p>And at every time step, you get an error function <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-0d7127db95a4f1d7589e644567bd7933_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"25\">. Let\u2019s consider the gradient of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-0d7127db95a4f1d7589e644567bd7933_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"25\"> with respect to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-76bea701c45ebccbe8a77c1de7ece86e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{h}^{(k)}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"29\">, that is, the error flowing from <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-0d7127db95a4f1d7589e644567bd7933_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J^{(t)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"25\"> to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-76bea701c45ebccbe8a77c1de7ece86e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{h}^{(k)}\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"29\">. 
This error is the one mainly used to calculate the gradients of the parameters.<\/p>\n<h3><img loading=\"lazy\" class=\"wp-image-5161 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/vanishing_gradient.png\" alt=\"\" width=\"256\" height=\"652\"><\/h3>\n<p>If you calculate this error more concretely, <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-32a5ca728d535a9519fb31537bfee8cf_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J^{(t)}}{partial boldsymbol{h}^{(k)}} = frac{partial J^{(t)}}{partial boldsymbol{h}^{(t)}} frac{partial boldsymbol{h}^{(t)}}{partial boldsymbol{h}^{(t-1)}} cdots frac{partial boldsymbol{h}^{(k+2)}}{partial boldsymbol{h}^{(k+1)}} frac{partial boldsymbol{h}^{(k+1)}}{partial boldsymbol{h}^{(k)}} = frac{partial J^{(t)}}{partial boldsymbol{h}^{(t)}} prod_{k&lt; s leq t} frac{partial boldsymbol{h}^{(s)}}{partial boldsymbol{h}^{(s-1)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"29\" width=\"453\">, where <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d1fdcbd312ad77529e7efbf44673e0b9_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial boldsymbol{h}^{(s)}}{partial boldsymbol{h}^{(s-1)}} = boldsymbol{W} ^T cdot diag(g'(boldsymbol{b} + boldsymbol{W}cdot boldsymbol{h}^{(s-1)} + boldsymbol{U}cdot boldsymbol{x}^{(s)})) = boldsymbol{W} ^T cdot diag(g'(boldsymbol{a}^{(s)}))\" title=\"Rendered by QuickLaTeX.com\" height=\"29\" width=\"561\">.<\/p>\n<p>* If you see the figure as a type of graphical model, you should be able to understand why chain rules can be applied as in the equation above.<\/p>\n<p>*According to this paper, <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-73d8ac6fc005bc3e81ded76a085865a1_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial 
boldsymbol{h}^{(s)}}{partial boldsymbol{h}^{(s-1)}}  = boldsymbol{W} ^T cdot diag(g'(boldsymbol{a}^{(s)}))\" title=\"Rendered by QuickLaTeX.com\" height=\"29\" width=\"223\">, but it seems that many study materials and web sites are mistaken on this point.<\/p>\n<p>Hence <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a928e6c04d919a6d4cddff9e887d7cd7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J^{(t)}}{partial boldsymbol{h}^{(k)}} = frac{partial J^{(t)}}{partial boldsymbol{h}^{(t)}} prod_{k&lt; s leq t} boldsymbol{W} ^T cdot diag(g'(boldsymbol{a}^{(s)})) = frac{partial J^{(t)}}{partial boldsymbol{h}^{(t)}} (boldsymbol{W} ^T )^{(t - k)} prod_{k&lt; s leq t} diag(g'(boldsymbol{a}^{(s)}))\" title=\"Rendered by QuickLaTeX.com\" height=\"58\" width=\"580\">. If you take norms of the members, you get an inequality <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-87b4691702ec5df7b7bedfa4d11237b7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"leftlVert frac{partial J^{(t)}}{partial boldsymbol{h}^{(k)}} rightrVert leq leftlVert frac{partial J^{(t)}}{partial boldsymbol{h}^{(t)}} rightrVert leftlVert boldsymbol{W} ^T rightrVert ^{(t - k)} prod_{k&lt; s leq t} leftlVert diag(g'(boldsymbol{a}^{(s)}))rightrVert\" title=\"Rendered by QuickLaTeX.com\" height=\"33\" width=\"411\">. 
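You can watch this effect numerically. The sketch below (the dimensions, the tanh nonlinearity, and all names are my choices for illustration) runs a simple RNN forward and accumulates the product of the hidden-to-hidden Jacobians; with a weight matrix whose spectral norm is well below 1, the product collapses toward zero after a few dozen steps, and scaling the weights up flips it toward explosion as long as the activations do not saturate:

```python
import numpy as np

def jacobian_product(W, U, b, xs, h0):
    """Forward-propagate a tanh RNN and accumulate the product of the
    hidden-to-hidden Jacobians dh^{(s)}/dh^{(s-1)} = W^T diag(g'(a^{(s)}))."""
    h, J = h0, np.eye(len(h0))
    for x in xs:
        a = b + W @ h + U @ x                  # pre-activation a^{(s)}
        h = np.tanh(a)                         # h^{(s)} = g(a^{(s)})
        J = W.T @ np.diag(1.0 - h ** 2) @ J    # tanh'(a) = 1 - tanh(a)^2
    return J

rng = np.random.default_rng(0)
n, T = 8, 50
xs = rng.standard_normal((T, n))
h0 = rng.standard_normal(n)
U, b = rng.standard_normal((n, n)), np.zeros(n)

W_small = 0.1 * rng.standard_normal((n, n))    # spectral norm well below 1
print(np.linalg.norm(jacobian_product(W_small, U, b, xs, h0)))  # vanishingly small
```

Since every factor's norm is bounded by the norm of W times 1 (the tanh derivative never exceeds 1), the 50-step product is tiny here, which is exactly what the inequality above predicts.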
I will not go into detail anymore, but it is known that, according to this inequality, the repeated multiplication of the weight matrix makes the gradient either converge exponentially to 0 or explode to infinity.<\/p>\n<p>We have seen that the error <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-88103bd107f4afd5451a4d47f873457d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J^{(t)}}{partial boldsymbol{h}^{(k)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"29\" width=\"35\"> is the main factor causing vanishing\/exploding gradient problems. In the case of LSTM, <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-65ac120c3b1b97173c1a75f28361402e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J^{(t)}}{partial boldsymbol{c}^{(k)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"28\" width=\"32\"> is the equivalent. For simplicity, let\u2019s calculate only <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb49e897707270c74d54f8dfcba1f1a6_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial boldsymbol{c}^{(t)}}{partial boldsymbol{c}^{(t-1)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"28\" width=\"46\">, which is equivalent to <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-75bbbe9acca968892eec7a9f13d1b4d1_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial boldsymbol{h}^{(t)}}{partial boldsymbol{h}^{(t-1)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"29\" width=\"48\"> of simple RNN backprop.<\/p>\n<p><img loading=\"lazy\" class=\"alignnone wp-image-5165 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_vanishing_gradient_2-1030x252.png\" alt=\"\" width=\"1030\" height=\"252\"><\/p>\n<p>* Just as I noted above, you 
have to be careful of which part the partial differential operator <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a5edbc43d5110f89da2432e6e101d24e_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial}{partial boldsymbol{c}^{(t-1)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"46\"> affects in the chain rule above. That is, you need to calculate <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-4e4983958bb4003a46878e12eb647148_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial}{partial boldsymbol{c}^{(t-1)}} (boldsymbol{c}^{(t-1)} odot boldsymbol{f}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"25\" width=\"153\">, and the partial differential operator only affects <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ff25c11691c3a02797e1a6e395102950_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{c}^{(t-1)}\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"41\">. I think this is not strictly correct mathematical notation, but please forgive me for doing this for convenience.<\/p>\n<p>If you continue calculating the equation above more concretely, you get the equation below.<\/p>\n<p><img loading=\"lazy\" class=\"alignnone wp-image-5166 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/lstm_vanishing_gradient-1030x239.png\" alt=\"\" width=\"1030\" height=\"239\"><\/p>\n<p>I cannot mathematically explain why, but it is known that this characteristic of the gradients of LSTM backprop mitigates the vanishing\/exploding gradient problem. 
We have seen that if you take a norm of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-88103bd107f4afd5451a4d47f873457d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial J^{(t)}}{partial boldsymbol{h}^{(k)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"29\" width=\"35\">, it is equal to or smaller than the repeated multiplication of the norm of the same weight matrix, and that soon leads to vanishing\/exploding gradient problems. But according to the equation above, even if you take a norm of the repeatedly multiplied <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb49e897707270c74d54f8dfcba1f1a6_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial boldsymbol{c}^{(t)}}{partial boldsymbol{c}^{(t-1)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"28\" width=\"46\">, its norm cannot be bounded by a simple value like the repeated multiplication of the norm of the same weight matrix. The outputs of each gate differ from time step to time step, and that adjusts the value of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb49e897707270c74d54f8dfcba1f1a6_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial boldsymbol{c}^{(t)}}{partial boldsymbol{c}^{(t-1)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"28\" width=\"46\">.<\/p>\n<p>*I personally guess the item <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9b4b653d0cdeec1f352d350f8495e80d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"diag(boldsymbol{f}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"22\" width=\"75\"> is especially effective. 
The unaffected value of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-9b4b653d0cdeec1f352d350f8495e80d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"diag(boldsymbol{f}^{(t)})\" title=\"Rendered by QuickLaTeX.com\" height=\"22\" width=\"75\"> can directly adjust the value of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-bb49e897707270c74d54f8dfcba1f1a6_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{partial boldsymbol{c}^{(t)}}{partial boldsymbol{c}^{(t-1)}}\" title=\"Rendered by QuickLaTeX.com\" height=\"28\" width=\"46\">. And as a matter of fact, it is known that the performance of LSTMs drops the most when you get rid of forget gates.<\/p>\n<p>When it comes to tackling exploding gradient problems, there is a much easier technique called <em><strong>gradient clipping<\/strong><\/em>. This algorithm is very simple: you just have to adjust the size of the gradient so that its norm stays under a threshold at every time step. Imagine that you decide in which direction to move by calculating gradients, but when the footstep is going to be too big, you just shrink the footstep to the threshold size you have set. 
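The rule described above takes only a few lines of NumPy. This is my own sketch (the function name and the example threshold are mine): if the norm of the gradient exceeds the threshold, rescale the gradient so its norm equals the threshold, leaving the direction unchanged:

```python
import numpy as np

def clip_gradient(g, threshold):
    """Norm-based gradient clipping: if ||g|| exceeds the threshold,
    rescale g so that its norm equals the threshold (direction unchanged)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])      # norm 5.0: too big a "footstep"
print(clip_gradient(g, 1.0))  # [0.6 0.8] -- same direction, norm 1.0
```

Gradients already smaller than the threshold pass through untouched, so clipping only kicks in at the steep cliffs discussed below.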
In pseudocode, you can write the gradient clipping part with only two lines, as below.<\/p>\n<p><img loading=\"lazy\" class=\"size-medium wp-image-5158 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/gradient_clipping_2-300x103.png\" alt=\"\" width=\"300\" height=\"103\"><\/p>\n<p>*<img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ec71a16513c25a3dbfb483225da4aba8_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{g}\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"10\"> is the gradient at the time step, and <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-c3f45801ee34a6a07b9e73c78e692b64_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"threshold\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"76\"> is the maximum size of the \u201cstep.\u201d<\/p>\n<p>The figure below, cited from the Deep Learning textbook from MIT Press, is a good and straightforward explanation of gradient clipping. It is known that a strongly nonlinear function, such as the error function of an RNN, can have very steep or flat areas. 
If you artificially visualize the idea in 3-dimensional space, as the surface of a loss function <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ad28cd88e33b638907ffae49cfe60953_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"11\"> with two variables <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-45450b1aad2fb461335dcb0951d18dc7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"w, b\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"29\">, that means the loss function <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-ad28cd88e33b638907ffae49cfe60953_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"J\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"11\"> has flat areas and very steep cliffs like in the figure. Without gradient clipping, at the left side, you can see that the black dot all of a sudden climbs the cliff and could jump to an unexpected area. But with gradient clipping, you avoid such \u201cbig jumps\u201d on error functions.<\/p>\n<div id=\"attachment_5159\" class=\"wp-caption aligncenter\">\n<p><img aria-describedby=\"caption-attachment-5159\" loading=\"lazy\" class=\"wp-image-5159 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/09\/gradient_clipping-1030x451.png\" alt=\"\" width=\"1030\" height=\"451\"><\/p>\n<p id=\"caption-attachment-5159\" class=\"wp-caption-text\">Source: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, 409p<\/p>\n<\/div>\n<p>\u00a0<\/p>\n<p>I am glad that I have finally finished this article series. I am not sure how many of the readers would have read through all of the articles in this series, including my PowerPoint slides. 
I bet that is not so many. I spent a great deal of my time on making this content, but sadly, even while I was studying LSTM, it was becoming old-fashioned, at least in the natural language processing (NLP) field: a very promising algorithm named <em><strong>Transformer<\/strong><\/em> has been replacing LSTM. Deep learning is a very fast-changing field. I also would like to make illustrative introductions on the attention mechanism in NLP, from the seq2seq model to the Transformer. And I think LSTM would still remain one of the algorithms in sequence data processing, such as the <em><strong>Hidden Markov model<\/strong><\/em> or the <strong><em>particle filter<\/em><\/strong>.<\/p>\n<div id=\"author-bio-box\">\n<h3><a href=\"https:\/\/data-science-blog.com\/en\/blog\/author\/yasuto\/\" title=\"All posts by Yasuto Tamura\" rel=\"author\">Yasuto Tamura<\/a><\/h3>\n<div class=\"bio-gravatar\"><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/03\/yasuto-tamura-80x80.png\" width=\"70\" height=\"70\" alt=\"Yasuto Tamura\" class=\"avatar avatar-70 wp-user-avatar wp-user-avatar-70 alignnone photo\"><\/div>\n<p><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"http:\/\/www.datanomiq.de\" class=\"bio-icon bio-icon-website\"><\/a><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"https:\/\/www.linkedin.com\/in\/yasuto-tamura-7689b418b\/\" class=\"bio-icon bio-icon-linkedin\"><\/a><\/p>\n<p class=\"bio-description\">Data Science Intern at <a href=\"http:\/\/www.datanomiq.io\">DATANOMIQ<\/a>.<br \/>\nMajoring in computer science. Currently studying mathematical sides of deep learning, such as densely connected layers, CNN, RNN, autoencoders, and making study materials on them. 
Also started aiming at Bayesian deep learning algorithms.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/data-science-blog.com\/en\/blog\/2020\/09\/07\/back-propagation-of-lstm\/<\/p>\n","protected":false},"author":0,"featured_media":1110,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1109"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1109"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1109\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1110"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1109"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1109"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1109"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}