{"id":8190,"date":"2021-04-07T21:34:10","date_gmt":"2021-04-07T21:34:10","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/07\/multi-head-attention-mechanism-queries-keys-and-values-over-and-over-again\/"},"modified":"2021-04-07T21:34:10","modified_gmt":"2021-04-07T21:34:10","slug":"multi-head-attention-mechanism-queries-keys-and-values-over-and-over-again","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/07\/multi-head-attention-mechanism-queries-keys-and-values-over-and-over-again\/","title":{"rendered":"Multi-head attention mechanism: \u201cqueries\u201d, \u201ckeys\u201d, and \u201cvalues,\u201d over and over again"},"content":{"rendered":"<div>\n<p>This is the third article of my article series named <a href=\"https:\/\/data-science-blog.com\/blog\/2020\/12\/30\/transformer\/\">\u201cInstructions on Transformer for people outside NLP field, but with examples of NLP.\u201d<\/a><\/p>\n<p>In <a href=\"https:\/\/data-science-blog.com\/blog\/2021\/02\/17\/sequence-to-sequence-models-back-bones-of-various-nlp-tasks\/\">the last article<\/a>, I explained how the attention mechanism works in simple seq2seq models with RNNs: it basically calculates the correspondence of the hidden state at every time step with all the outputs of the encoder. However, I would say the attention mechanisms of RNN seq2seq models use only one standard for comparing tokens. Using only one standard is not enough for understanding languages, especially when you learn a foreign language. You would sometimes find it difficult to explain how to translate a word in your language into another language. Even if a pair of languages are very similar to each other, translating between them cannot be a simple switching of vocabulary. Usually a single token in one language is related to several tokens in the other language, and vice versa. 
How they correspond to each other depends on several criteria, for example \u201cwhat\u201d, \u201cwho\u201d, \u201cwhen\u201d, \u201cwhere\u201d, \u201cwhy\u201d, and \u201chow\u201d. It is easy to imagine that you should compare tokens on several criteria.<\/p>\n<p>The Transformer model was first introduced in the paper \u201cAttention Is All You Need,\u201d and from the title you can easily see that the attention mechanism plays an important role in this model. When you learn about the Transformer model, you will see the figure below, which is used in the original paper.\u00a0 This is the simplified overall structure of one layer of the Transformer model, and you stack this layer N times. In one layer of the Transformer, there are three <em><strong>multi-head attention<\/strong><\/em> blocks, which are displayed as boxes in orange. These are the very parts which compare the tokens on several standards. I made the head image of this article series inspired by this multi-head attention mechanism. <img loading=\"lazy\" class=\"wp-image-5532 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/img_original_paper-726x1030.png\" alt=\"\" width=\"307\" height=\"493\"><\/p>\n<p>The figure below is also from the original paper on the Transformer. If you can understand how the multi-head attention mechanism works from the explanations in the paper, and if you have no trouble understanding the code in <a href=\"https:\/\/www.tensorflow.org\/tutorials\/text\/transformer\">the official Tensorflow tutorial<\/a>, I have to say this article is not for you. However, I bet that is not true of the majority of people, and at least I need one article to clearly explain how multi-head attention works. Please keep in mind that this article covers only the architectures in the two figures below. 
However, multi-head attention mechanisms are crucial components of the Transformer model, and throughout this article you will not only see how they work but also gain a little control over them at an implementation level.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5531 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/mha_img_original.png\" alt=\"\" width=\"485\" height=\"277\"><\/p>\n<h3>1 Multi-head attention mechanism<\/h3>\n<p>When you learn the Transformer model, I recommend paying attention to multi-head attention first. And when you learn multi-head attention, before seeing what scaled dot-product attention is, you should understand the whole structure of multi-head attention, which is on the right side of the figure above. In order to calculate attentions with a \u201cquery\u201d, as I said in the last article, \u201cyou compare the \u2018query\u2019 with the \u2018keys\u2019 and get scores\/weights for the \u2018values.\u2019 Each score\/weight is in short the relevance between the \u2018query\u2019 and each \u2018key\u2019. And you reweight the \u2018values\u2019 with the scores\/weights, and take the summation of the reweighted \u2018values\u2019.\u201d Sooner or later, you will notice that I am just repeating these phrases over and over again throughout this article, in several ways.<\/p>\n<p>*Even if you are not sure what \u201creweighting\u201d means in this context, please keep reading. I think you will see what it means little by little, especially in the next section.<\/p>\n<p>The overall process of calculating multi-head attention, displayed in the figure above, is as follows (please just keep reading, and do not think too much): first you split the V: \u201cvalues\u201d, K: \u201ckeys\u201d, and Q: \u201cqueries\u201d, and second you transform those divided \u201cvalues\u201d, \u201ckeys\u201d, and \u201cqueries\u201d with densely connected layers (\u201cLinear\u201d in the figure). 
Next you calculate attention weights, reweight the \u201cvalues\u201d, take the summation of the reweighted \u201cvalues\u201d, and concatenate the resulting summations. At the end you pass the concatenated \u201cvalues\u201d through another densely connected layer. The mechanism of scaled dot-product attention is just a matter of how to concretely calculate those attentions and reweight the \u201cvalues\u201d.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5529 alignleft\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/transformer_2.png\" alt=\"\" width=\"453\" height=\"374\"><\/p>\n<p>*In the last article I briefly mentioned that \u201ckeys\u201d and \u201cqueries\u201d can be in the same language. They can even be the same sentence in the same language, and in this case the resulting attentions are called <em><strong>self-attentions<\/strong><\/em>, which we are mainly going to see. I think most people calculate \u201cself-attentions\u201d unconsciously when they speak. You constantly care about what \u201cshe\u201d, \u201cit\u201d, \u201cthe\u201d, or \u201cthat\u201d refers to in your own sentences, and we can say self-attention is how this everyday process is implemented.<\/p>\n<p>Let\u2019s see the whole process of calculating multi-head attention at a slightly abstract level. From now on, we consider an example of calculating multi-head self-attentions, where the input is the sentence \u201cAnthony Hopkins admired Michael Bay as a great director.\u201d In this example, the number of tokens is 9, and each token is encoded as a 512-dimensional embedding vector. And the number of heads is 8. 
In this case, as you can see in the figure below, the input sentence \u201cAnthony Hopkins admired Michael Bay as a great director.\u201d is implemented as a <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-2feca5c110739284e239dd1b5bc1d276_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"9times 512\" title=\"Rendered by QuickLaTeX.com\" height=\"13\" width=\"57\"> matrix. You first split each token into 8 vectors of <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-757a5f2d57db49761c1cec7e1b090e21_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"512\/8=64\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"86\"> dimensions each, as I colored in the figure below. In other words, the input matrix is divided into 8 colored chunks, which are all <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a1ea08b304c79652d5abbbe55585a3de_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"9times 64\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"49\"> matrices, but each colored matrix expresses the same sentence. And you calculate self-attentions of the input sentence independently in the 8 heads, and you reweight the \u201cvalues\u201d according to the attentions\/weights. After this, you stack the sums of the reweighted \u201cvalues\u201d in each colored head, and you concatenate the stacked tokens of all the colored heads. The size of each colored chunk does not change even after reweighting the tokens. According to Ashish Vaswani, one of the inventors of the Transformer model, each head compares \u201cqueries\u201d and \u201ckeys\u201d on its own standard. 
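<\/p>
<p>*As a rough sketch of this splitting (my own addition, in NumPy; the official tutorial does the equivalent with tf.reshape() and tf.transpose(), and the variable names here are made up for illustration):<\/p>

```python
import numpy as np

d_model, num_heads, seq_len = 512, 8, 9
depth = d_model // num_heads  # 512 / 8 = 64 dimensions per head

# The embedded input sentence: a 9 x 512 matrix, one row per token.
rng = np.random.default_rng(0)
embedded_sentence = rng.normal(size=(seq_len, d_model))

# Split each 512-dimensional token into 8 chunks of 64 dimensions,
# giving 8 "colored" 9 x 64 matrices, one per head.
heads = embedded_sentence.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)

print(heads.shape)  # (8, 9, 64)
```

<p>*Each of the 8 resulting (9, 64) matrices still expresses the same 9-token sentence, just in a 64-dimensional slice of the embedding space.<\/p>
<p>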
If a Transformer model has 4 layers with 8-head multi-head attention, its encoder alone has <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-eda58bd568e57f1c7fae29e627ad5172_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"4times 8 = 32\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"80\"> heads, so the encoder learns the relations of the tokens of the input on 32 different standards.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-5519 \" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/mha_3-1030x608.png\" alt=\"\" width=\"842\" height=\"497\"><\/p>\n<p>I think you now have a rough insight into how you calculate multi-head attentions. In the next section I am going to explain the process of reweighting the tokens, that is, I am finally going to explain what those colorful lines in the head image of this article series are.<\/p>\n<p>*Each head is randomly initialized, so the heads learn to compare tokens with different criteria. The standards might be straightforward like \u201cwhat\u201d or \u201cwho\u201d, or maybe much more complicated. In attention mechanisms in deep learning, you do not need feature engineering for setting such standards.<\/p>\n<h3>2 Calculating attentions and reweighting \u201cvalues\u201d<\/h3>\n<p>If you have read the last article, or if you understand the attention mechanism to some extent, you should already know that the attention mechanism calculates attentions, or the relevance between \u201cqueries\u201d and \u201ckeys.\u201d In the last article, I showed the idea of weights as a histogram, and in that case the \u201cquery\u201d was the hidden state of the decoder at every time step, whereas the \u201ckeys\u201d were the outputs of the encoder. 
In this section, I am going to explain the attention mechanism in a more abstract way, and we consider comparing more general \u201ctokens\u201d rather than the concrete outputs of certain networks. In this section each <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a935a6ae352397cdde28cd5115cc275a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"[ cdots ]\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"30\"> denotes a token, which is usually an embedding vector in practice.<\/p>\n<p>Please remember this mantra of the attention mechanism: \u201cyou compare the \u2018query\u2019 with the \u2018keys\u2019 and get scores\/weights for the \u2018values.\u2019 Each score\/weight is in short the relevance between the \u2018query\u2019 and each \u2018key\u2019. And you reweight the \u2018values\u2019 with the scores\/weights, and take the summation of the reweighted \u2018values\u2019.\u201d The figure below shows an overview of a case where \u201cMichael\u201d is a query. In this case you compare the query with the \u201ckeys\u201d, that is, the tokens of the input sentence \u201cAnthony Hopkins admired Michael Bay as a great director.\u201d, and you get a histogram of attentions\/weights. Importantly, the sum of the weights is 1. With the attentions you have just calculated, you can reweight the \u201cvalues,\u201d which also denote the same input sentence. After that you can finally take a summation of the reweighted values. 
And you use this summation.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5520 size-large aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/qkv-1030x370.png\" alt=\"\" width=\"1030\" height=\"370\">*I have been repeating the phrase \u201creweighting \u2018values\u2019 with attentions,\u201d but in practice you calculate the sum of those reweighted \u201cvalues.\u201d<\/p>\n<p>Assume that compared to the \u201cquery\u201d token \u201cMichael\u201d, the weights of the \u201ckey\u201d tokens \u201cAnthony\u201d, \u201cHopkins\u201d, \u201cadmired\u201d, \u201cMichael\u201d, \u201cBay\u201d, \u201cas\u201d, \u201ca\u201d, \u201cgreat\u201d, and \u201cdirector.\u201d are respectively 0.06, 0.09, 0.05, 0.25, 0.18, 0.06, 0.09, 0.06, 0.15. In this case the sum of the reweighted tokens is 0.06 \u00d7 \u201cAnthony\u201d + 0.09 \u00d7 \u201cHopkins\u201d + 0.05 \u00d7 \u201cadmired\u201d + 0.25 \u00d7 \u201cMichael\u201d + 0.18 \u00d7 \u201cBay\u201d + 0.06 \u00d7 \u201cas\u201d + 0.09 \u00d7 \u201ca\u201d + 0.06 \u00d7 \u201cgreat\u201d + 0.15 \u00d7 \u201cdirector.\u201d, and this sum is what we actually use.<\/p>\n<p>*Of course the tokens are embedding vectors in practice. 
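<\/p>
<p>*To make this concrete, here is a toy NumPy sketch (my own addition): random 64-dimensional vectors stand in for the embedding vectors, and the weights are the hypothetical ones from above.<\/p>

```python
import numpy as np

tokens = ['Anthony', 'Hopkins', 'admired', 'Michael', 'Bay',
          'as', 'a', 'great', 'director.']
# Random 64-dimensional vectors standing in for the embedding vectors.
rng = np.random.default_rng(0)
values = {token: rng.normal(size=64) for token in tokens}

# The hypothetical weights of the query "Michael" against each key.
weights = [0.06, 0.09, 0.05, 0.25, 0.18, 0.06, 0.09, 0.06, 0.15]

# Reweight the "values" and take their summation: this single
# 64-dimensional vector is what we actually use for the query "Michael".
summation = sum(w * values[t] for w, t in zip(weights, tokens))
print(summation.shape)  # (64,)
```

<p>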
You calculate the reweighted vectors in an actual implementation.<\/p>\n<p>You repeat this process for all the \u201cqueries.\u201d As you can see in the figure below, you get 9 summations of reweighted \u201cvalues\u201d because you use every token of the input sentence \u201cAnthony Hopkins admired Michael Bay as a great director.\u201d as a \u201cquery.\u201d You stack the sums of reweighted \u201cvalues\u201d like the matrix in purple in the figure below, and this is the output of one head of multi-head attention.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-5518 \" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/qkv_2-1030x430.png\" alt=\"\" width=\"777\" height=\"324\"><\/p>\n<h3>3 Scaled dot-product<\/h3>\n<p>This section is only a matter of linear algebra. Maybe it is not even as sophisticated as linear algebra. You just have to do lots of Excel-like operations. <a href=\"http:\/\/jalammar.github.io\/illustrated-transformer\/\">A tutorial on the Transformer by Jay Alammar<\/a> is also very nice study material for understanding this topic with simpler examples. I tried my best so that you can clearly understand multi-head attention at a more mathematical level, and all you need to know in order to read this section is how to calculate products of matrices or vectors, which you would see in the first few pages of textbooks on linear algebra.<\/p>\n<p>We have seen that in order to calculate multi-head attentions, we prepare 8 sets of \u201cqueries\u201d, \u201ckeys\u201d, and \u201cvalues\u201d, which I showed in 8 different colors in the figure in the first section. 
We calculate attentions and reweight \u201cvalues\u201d independently in 8 different heads, and in each head the reweighted \u201cvalues\u201d are calculated with this very simple formula of scaled dot-product attention: <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-3c3acfdb7d9300502df999eaa0f90fbb_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"Attention(boldsymbol{Q}, boldsymbol{K}, boldsymbol{V})\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"156\"><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-6f5d25980577d9673be9b1b577dae16c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"=softmax(frac{boldsymbol{Q} boldsymbol{K} ^T}{sqrt{d}_k})boldsymbol{V}\" title=\"Rendered by QuickLaTeX.com\" height=\"31\" width=\"157\">. Let\u2019s take an example of calculating a scaled dot-product in the blue head.<\/p>\n<p>On the left side of the figure below is a figure from the original paper on the Transformer, which explains one head of multi-head attention. If you have read this article so far, the figure on the right side should be more straightforward to understand. You divide the input sentence into 8 chunks of matrices, and you independently put those chunks into the eight heads. 
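<\/p>
<p>*Here is a small NumPy sketch of the formula above (my own addition; the tutorial implements the same thing in Tensorflow, with the batching and masking that I omit here):<\/p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability; rows sum to 1 afterwards.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = k.shape[-1]
    attention_weights = softmax(q @ k.T / np.sqrt(d_k))
    return attention_weights @ v, attention_weights

# One head of our example: 9 tokens, 64 dimensions per head.
rng = np.random.default_rng(0)
q = rng.normal(size=(9, 64))
k = rng.normal(size=(9, 64))
v = rng.normal(size=(9, 64))
output, attention_weights = scaled_dot_product_attention(q, k, v)
print(output.shape, attention_weights.shape)  # (9, 64) (9, 9)
```

<p>*Each row of attention_weights sums to 1; the whole matrix is the 9 \u00d7 9 heat map of self-attentions we will see below.<\/p>
<p>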
In one head, you convert the input matrix with three different fully connected layers, shown as \u201cLinear\u201d in the figure below, and prepare three matrices <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-d8ccb151617e5e8b8a9c1a3a6523ba9f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"Q, K, V\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"59\">, which are \u201cqueries\u201d, \u201ckeys\u201d, and \u201cvalues\u201d respectively.<\/p>\n<p>*Whichever color an attention head is in, the process is the same.<\/p>\n<p>*You divide <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-85573ac20c58f9ec09e55a891874bafe_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{boldsymbol{Q} boldsymbol{K}} ^T\" title=\"Rendered by QuickLaTeX.com\" height=\"23\" width=\"27\"> by <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-0d04611f3b79835dbd23a32df7422176_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"sqrt{d}_k\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"30\"> in the formula. According to the original paper, this re-scaling was found to be effective. 
I am not going to discuss why in this article.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5522 aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/scaled_dot_product-1030x351.png\" alt=\"\" width=\"848\" height=\"289\"><\/p>\n<p>As you can see in the figure below, calculating <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-3c3acfdb7d9300502df999eaa0f90fbb_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"Attention(boldsymbol{Q}, boldsymbol{K}, boldsymbol{V})\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"156\"> is virtually just multiplying three matrices of the same size (only K is transposed, though). The resulting <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a1ea08b304c79652d5abbbe55585a3de_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"9times 64\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"49\"> matrix is the output of the head.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5525 size-large aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/scaled_dot_attention-1030x233.png\" alt=\"\" width=\"1030\" height=\"233\"><\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-414d6095c26ca73dbbf949ef385965a5_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"softmax(frac{boldsymbol{Q} boldsymbol{K} ^T}{sqrt{d}_k})\" title=\"Rendered by QuickLaTeX.com\" height=\"31\" width=\"121\"> is calculated as in the figure below. 
The softmax function normalizes each row of the re-scaled product <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-e3566cc725985884087a49aafbf24600_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"frac{boldsymbol{Q} boldsymbol{K} ^T}{sqrt{d}_k}\" title=\"Rendered by QuickLaTeX.com\" height=\"31\" width=\"36\">, and the resulting <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-215e21edfefe35cdf483574f7dcaffc7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"9times 9\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"40\"> matrix is a kind of heat map of self-attentions.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-5524 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/self_attention_map-1030x238.png\" alt=\"\" width=\"1030\" height=\"238\"><\/p>\n<p>The process of comparing one \u201cquery\u201d with the \u201ckeys\u201d is done with a simple multiplication of a vector and a matrix, as you can see in the figure below. You get a histogram of attentions for each query, and the resulting 9-dimensional vector is a list of attentions\/weights, shown as a list of blue circles in the figure below. That means that in the Transformer model you can compare a \u201cquery\u201d and a \u201ckey\u201d just by calculating an inner product. 
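<\/p>
<p>*In NumPy, this comparison of one query with all the keys is a one-liner (my own toy sketch with random vectors):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(9, 64))  # one row per "key" token
query = rng.normal(size=64)      # e.g. the token "Michael"

# Comparing the query with all 9 keys at once is one matrix-vector
# product: each entry is the inner product of the query with one key.
scores = keys @ query / np.sqrt(64)

# Normalizing with a softmax gives the histogram of attentions/weights
# (the row of blue circles in the figure).
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()
print(weights.shape)  # (9,); the entries sum to 1
```

<p>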
After re-scaling the vectors by dividing them by <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-0887ac312e8d3a39ae4869b2e0e0c478_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"sqrt{d_k}\" title=\"Rendered by QuickLaTeX.com\" height=\"18\" width=\"32\"> and normalizing them with a softmax function, you stack those vectors, and the stacked vectors form the heat map of attentions.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5526 size-full aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/mha_query_and_keys-1.png\" alt=\"\" width=\"2743\" height=\"889\"><\/p>\n<p>You can reweight the \u201cvalues\u201d with the heat map of self-attentions with a simple multiplication. It is more straightforward if you consider the transposed scaled dot-product <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-5ac3aa184a448b7347fb992d657de8f3_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{V}^T cdot softmax(frac{boldsymbol{Q} boldsymbol{K} ^T}{sqrt{d}_k})^T\" title=\"Rendered by QuickLaTeX.com\" height=\"31\" width=\"171\">. 
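<\/p>
<p>*This rewriting is just the identity (AB)^T = B^T A^T, which you can also check numerically (my own sketch):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
# A random row-normalized 9 x 9 matrix standing in for softmax(QK^T / sqrt(d_k)).
attention_map = rng.random((9, 9))
attention_map = attention_map / attention_map.sum(axis=-1, keepdims=True)
v = rng.normal(size=(9, 64))  # the "values"

usual = attention_map @ v           # softmax(...) V
transposed = v.T @ attention_map.T  # V^T softmax(...)^T

# The transposed scaled dot-product is exactly the transpose of the usual one.
print(np.allclose(transposed, usual.T))  # True
```

<p>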
This should also be easy to understand if you know the basics of linear algebra.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-5527 size-large aligncenter\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/scaled_dot_product_transposed-1030x304.png\" alt=\"\" width=\"1030\" height=\"304\"><\/p>\n<p>One column of the resulting matrix (<img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-5ac3aa184a448b7347fb992d657de8f3_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"boldsymbol{V}^T cdot softmax(frac{boldsymbol{Q} boldsymbol{K} ^T}{sqrt{d}_k})^T\" title=\"Rendered by QuickLaTeX.com\" height=\"31\" width=\"171\">) can be calculated with a simple multiplication of a matrix and a vector, as you can see in the figure below. This corresponds to the process of \u201ctaking a summation of reweighted \u2018values\u2019,\u201d which I have been repeating. And I would like you to remember that you got those weights (blue circles) by comparing a \u201cquery\u201d with the \u201ckeys.\u201d<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-5517 size-large\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/mha_reweighting-1030x295.png\" alt=\"\" width=\"1030\" height=\"295\"><\/p>\n<p>Again and again, let\u2019s repeat the mantra of the attention mechanism together: \u201cyou compare the \u2018query\u2019 with the \u2018keys\u2019 and get scores\/weights for the \u2018values.\u2019 Each score\/weight is in short the relevance between the \u2018query\u2019 and each \u2018key\u2019. 
And you reweight the \u2018values\u2019 with the scores\/weights, and take the summation of the reweighted \u2018values\u2019.\u201d If you have been patient enough to follow my explanations, I bet you have got a clear view of how the multi-head attention mechanism works.<\/p>\n<p>We have been seeing the case of the blue head, but you can do exactly the same procedure in every head at the same time, and this is what enables the parallelization of the multi-head attention mechanism. You concatenate the outputs of all the heads, and you put the concatenated matrix through a fully connected layer. <img loading=\"lazy\" class=\"aligncenter wp-image-5515 \" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/mha_visualization-930x1030.png\" alt=\"\" width=\"489\" height=\"595\"><\/p>\n<p>If you have been reading this article from the beginning, I think this section also shows the same idea which I have repeated, and I bet you now have a clearer view of how the multi-head attention mechanism works. In the next section we are going to see how this is implemented.<\/p>\n<h3>4 Tensorflow implementation of multi-head attention<\/h3>\n<p>Let\u2019s see how multi-head attention is implemented in the <a href=\"https:\/\/www.tensorflow.org\/tutorials\/text\/transformer\">Tensorflow official tutorial<\/a>. If you have read through this article so far, this should not be so difficult. I also added code for displaying heat maps of self-attentions. With the code on <a href=\"https:\/\/github.com\/YasuThompson\/Transformer_blog_codes\/blob\/main\/examine_self_attention.ipynb\">this Github page<\/a>, you can display self-attention heat maps for any input sentence in English.<\/p>\n<p>The multi-head attention mechanism is implemented as below. 
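<\/p>
<p>*I cannot paste the whole tutorial here, so below is my own condensed NumPy sketch of the same structure (random matrices stand in for the four trainable \u201cLinear\u201d layers; the real tutorial code is a Tensorflow\/Keras class and additionally handles batching and masking):<\/p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V, applied independently in each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])
    attention_weights = softmax(scores)
    return attention_weights @ v, attention_weights

class MultiHeadAttention:
    def __init__(self, d_model=512, num_heads=8, seed=0):
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        rng = np.random.default_rng(seed)
        # Stand-ins for the four trainable "Linear" layers.
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4))

    def split_heads(self, x):
        # (seq_len, d_model) -> (num_heads, seq_len, depth)
        seq_len = x.shape[0]
        return x.reshape(seq_len, self.num_heads, self.depth).transpose(1, 0, 2)

    def __call__(self, v, k, q):
        q = self.split_heads(q @ self.wq)
        k = self.split_heads(k @ self.wk)
        v = self.split_heads(v @ self.wv)
        heads, attention_weights = scaled_dot_product_attention(q, k, v)
        # (num_heads, seq_len, depth) -> (seq_len, d_model): concatenate the heads.
        concat = heads.transpose(1, 0, 2).reshape(q.shape[1], -1)
        return concat @ self.wo, attention_weights

x = np.random.default_rng(1).normal(size=(9, 512))  # the embedded sentence
mha = MultiHeadAttention()
out, attn = mha(x, x, x)  # self-attention: V, K, Q are all the same sentence
print(out.shape, attn.shape)  # (9, 512) (8, 9, 9)
```

<p>*Calling mha(x, x, x) with the same (9, 512) matrix as \u201cvalues\u201d, \u201ckeys\u201d, and \u201cqueries\u201d is exactly the self-attention case we have been considering: the output is again (9, 512), and attn holds the 8 heat maps of size 9 \u00d7 9.<\/p>
<p>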
If you understand Python and Tensorflow to some extent, I think this part is relatively easy.\u00a0 The multi-head attention part is implemented as a class because you need to train the weights of some fully connected layers, whereas scaled dot-product attention is just a function.<\/p>\n<p>*I am going to explain the create_padding_mask() and create_look_ahead_mask() functions in upcoming articles. You do not need them this time.<\/p>\n<p>Let\u2019s see a case of using the multi-head attention mechanism on a (1, 9, 512) sized input tensor, just as we have been considering throughout this article. The first axis of (1, 9, 512) corresponds to the batch size, so this tensor is virtually a (9, 512) sized tensor, and this means the input is composed of 9 512-dimensional vectors. In the results below, you can see how the shape of the input tensor changes after each step of calculating multi-head attention. Also you can see that the output of the multi-head attention has the same shape as the input, and you get a <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-215e21edfefe35cdf483574f7dcaffc7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"9times 9\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"40\"> matrix of attention heat maps for each attention head.<\/p>\n<p>I guess the most complicated part of this implementation is the split_heads() function, especially if you do not understand tensor arithmetic. This part corresponds to splitting the input tensor into the 8 differently colored matrices in one of the figures above. If you cannot understand what is going on in the function, I recommend you prepare a sample tensor as below.<\/p>\n<p>This is just a simple (1, 9, 512) sized tensor with sequential integer elements. The first row (1, 2, \u2026, 512) corresponds to the first input token, and (4097, 4098, \u2026, 4608) to the last one. 
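<\/p>
<p>*A sketch of such a sample tensor and of the splitting (my NumPy rendering of the tutorial\u2019s tf.reshape() and tf.transpose(..., perm=[0, 2, 1, 3]) operations; treat the details as approximate):<\/p>

```python
import numpy as np

# A (1, 9, 512) sized sample tensor with sequential integer elements:
# the first token is (1, 2, ..., 512), the last is (4097, ..., 4608).
sample_sentence = np.arange(1, 9 * 512 + 1).reshape(1, 9, 512)

# The same operations as the tutorial's split_heads():
# reshape to (1, 9, 8, 64), then transpose the head and token axes.
split = sample_sentence.reshape(1, 9, 8, 64).transpose(0, 2, 1, 3)

print(split.shape)         # (1, 8, 9, 64)
print(split[0][0][0][:3])  # [1 2 3]: the first token's slice in the first (blue) head
```

<p>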
You should try converting this sample tensor to see how multi-head attention is implemented. For example, you can try reshaping and transposing it.<\/p>\n<p>These operations correspond to splitting the input into 8 heads, whose sizes are all (9, 64). The second axis of the resulting (1, 8, 9, 64) tensor corresponds to the index of the heads. Thus sample_sentence[0][0] corresponds to the first head, the blue <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-a1ea08b304c79652d5abbbe55585a3de_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"9times 64\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"49\"> matrix. Some Tensorflow functions enable linear calculations in each attention head independently, as in the codes of the tutorial.<\/p>\n<p>Very importantly, we have only been considering the case of calculating self-attentions, where all \u201cqueries\u201d, \u201ckeys\u201d, and \u201cvalues\u201d come from the same sentence in the same language. However, as I showed in the last article, in translation tasks \u201cqueries\u201d are usually in a different language from \u201ckeys\u201d and \u201cvalues\u201d, while \u201ckeys\u201d and \u201cvalues\u201d are in the same language. And as you can imagine, usually \u201cqueries\u201d have a different number of tokens from \u201ckeys\u201d or \u201cvalues.\u201d You also need to understand this case, which is not calculating self-attentions. If you have followed this article so far, this case is not that hard for you. 
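<\/p>
<p>*Even then the mantra stays the same; only the shapes change. A quick NumPy sketch of one head with a different number of \u201cquery\u201d tokens (my own addition):<\/p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(12, 64))  # 12 "query" tokens (e.g. the target language)
k = rng.normal(size=(9, 64))   # 9 "key" tokens (e.g. the source language)
v = rng.normal(size=(9, 64))   # 9 "value" tokens, same sentence as the keys

attention_weights = softmax(q @ k.T / np.sqrt(64))  # (12, 9): one histogram per query
output = attention_weights @ v                      # (12, 64): one summation per query
print(attention_weights.shape, output.shape)  # (12, 9) (12, 64)
```

<p>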
Let\u2019s briefly see an example where the input sentence in the source language is composed of 9 tokens, while the output is composed of 12 tokens.<\/p>\n<p>As I mentioned, one of the outputs of the multi-head attention class is a <img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/ql-cache\/quicklatex.com-215e21edfefe35cdf483574f7dcaffc7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"9times 9\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"40\"> matrix of attention heat maps, which I displayed as a matrix composed of blue circles in the last section. To the implementation in the Tensorflow official tutorial, I have added code to display actual heat maps for any input sentence in English.<\/p>\n<p>*If you want to try displaying them by yourself, download or just copy and paste the codes on <a href=\"https:\/\/github.com\/YasuThompson\/Transformer_blog_codes\/blob\/main\/show_self_attention_en_es.ipynb\">this Github page<\/a>. Please make a \u201cdatasets\u201d directory in the same directory as the code. Please download \u201cspa-eng.zip\u201d from <a href=\"http:\/\/www.manythings.org\/anki\/\">this page<\/a>, and unzip it. After that please put \u201cspa.txt\u201d in the \u201cdatasets\u201d directory. Also, please download the \u201ccheckpoints_en_es\u201d folder from <a href=\"https:\/\/drive.google.com\/drive\/folders\/1MAENmyL9Hq8a0B1N7XoNcm-8UbJMyZ2v?usp=sharing\">this link<\/a>, and place the folder in the same directory as the file from the Github page. In the upcoming articles, you will need similar steps to run my codes.<\/p>\n<p>After running the codes on the Github page, you can display heat maps of self-attentions. 
Let\u2019s input the sentence \u201cAnthony Hopkins admired Michael Bay as a great director.\u201d You would get heat maps like this.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-5516 \" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2021\/04\/mha_Michael_Bay_0-1030x455.png\" alt=\"\" width=\"1030\" height=\"455\"><\/p>\n<p>In fact, my toy implementation cannot handle proper nouns such as \u201cAnthony\u201d or \u201cMichael.\u201d So let\u2019s consider the simple input sentence \u201cHe admired her as a great director.\u201d In each layer, you get 8 self-attention heat maps.<\/p>\n<p>I think we can see some tendencies in those heat maps. The heat maps in the early layers, which are close to the input, are blurry. In the deeper layers, the distributions of the heat maps come to concentrate more or less along the diagonal. At the end, presumably they learn to pay attention to the start and the end of sentences.<\/p>\n<p>You have finally finished reading this article. Congratulations.<\/p>\n<p>You should be proud of having been patient, and you have passed the most tiresome part of learning the Transformer model. You must be ready to make a toy English-German translator in the upcoming articles. Also I am sure you have understood that Michael Bay is a great director, no matter what people say.<\/p>\n<p>*Hannibal Lecter, I mean Anthony Hopkins, also wrote a letter to the staff of \u201cBreaking Bad,\u201d and he told them the TV show let him regain his passion. He seems to admire people all around, and I am a little worried that he might be getting <span class=\"answer\">senile<\/span>. He played the role of a father forgetting his daughter in his new film \u201cThe Father.\u201d I must see it to check whether that is really acting, or not.<\/p>\n<h3>[References]<\/h3>\n<p>[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. 
Gomez, Lukasz Kaiser, Illia Polosukhin, \u201cAttention Is All You Need\u201d (2017)<\/p>\n<p>[2] \u201cTransformer model for language understanding,\u201d Tensorflow Core<br \/>https:\/\/www.tensorflow.org\/overview<\/p>\n<p>[3] \u201cNeural machine translation with attention,\u201d Tensorflow Core<br \/>https:\/\/www.tensorflow.org\/tutorials\/text\/nmt_with_attention<\/p>\n<p>[4] Jay Alammar, \u201cThe Illustrated Transformer\u201d<br \/>http:\/\/jalammar.github.io\/illustrated-transformer\/<\/p>\n<p>[5] \u201cStanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 \u2013 Transformers and Self-Attention,\u201d stanfordonline, (2019)<br \/>https:\/\/www.youtube.com\/watch?v=5vcj8kSwBCY<\/p>\n<p>[6] Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, \u201cMachine Learning Professional Series: Natural Language Processing with Deep Learning,\u201d (2017), pp. 91-94<br \/>\u576a\u4e95\u7950\u592a\u3001\u6d77\u91ce\u88d5\u4e5f\u3001\u9234\u6728\u6f64 \u8457, \u300c\u6a5f\u68b0\u5b66\u7fd2\u30d7\u30ed\u30d5\u30a7\u30c3\u30b7\u30e7\u30ca\u30eb\u30b7\u30ea\u30fc\u30ba \u6df1\u5c64\u5b66\u7fd2\u306b\u3088\u308b\u81ea\u7136\u8a00\u8a9e\u51e6\u7406\u300d, (2017), pp. 191-193<\/p>\n<p>[7] \u201cStanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 \u2013 Translation, Seq2Seq, Attention,\u201d stanfordonline, (2019)<br \/>https:\/\/www.youtube.com\/watch?v=XXtpJxZBa2c<\/p>\n<p>[8] Rosemary Rossi, \u201cAnthony Hopkins Compares \u2018Genius\u2019 Michael Bay to Spielberg, Scorsese,\u201d yahoo! entertainment, (2017)<br \/>https:\/\/www.yahoo.com\/entertainment\/anthony-hopkins-transformers-director-michael-bay-guy-genius-010058439.html<\/p>\n<p><em>* I make study materials on machine learning, sponsored by <a href=\"https:\/\/www.datanomiq.de\/\">DATANOMIQ<\/a>. I do my best to make my content as straightforward but also as precise as possible. I include all of my reference sources. 
If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.<\/em><\/p>\n<div id=\"author-bio-box\">\n<h3><a href=\"https:\/\/data-science-blog.com\/en\/blog\/author\/yasuto\/\" title=\"All posts by Yasuto Tamura\" rel=\"author\">Yasuto Tamura<\/a><\/h3>\n<div class=\"bio-gravatar\"><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/03\/yasuto-tamura-80x80.png\" width=\"70\" height=\"70\" alt=\"Yasuto Tamura\" class=\"avatar avatar-70 wp-user-avatar wp-user-avatar-70 alignnone photo\"><\/div>\n<p><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"http:\/\/www.datanomiq.de\" class=\"bio-icon bio-icon-website\"><\/a><a target=\"_blank\" rel=\"nofollow noopener noreferrer\" href=\"https:\/\/www.linkedin.com\/in\/yasuto-tamura-7689b418b\/\" class=\"bio-icon bio-icon-linkedin\"><\/a><\/p>\n<p class=\"bio-description\">Data Science Intern at <a href=\"http:\/\/www.datanomiq.io\">DATANOMIQ<\/a>.<br \/>\nMajoring in computer science. Currently studying mathematical sides of deep learning, such as densely connected layers, CNN, RNN, autoencoders, and making study materials on them. 
Also started aiming at Bayesian deep learning algorithms.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/data-science-blog.com\/en\/blog\/2021\/04\/07\/multi-head-attention-mechanism\/<\/p>\n","protected":false},"author":0,"featured_media":8191,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8190"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8190"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8190\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8191"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8190"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8190"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8190"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}