{"id":151,"date":"2023-11-22T11:51:43","date_gmt":"2023-11-22T11:51:43","guid":{"rendered":"https:\/\/gpt-jordan.com\/?p=151"},"modified":"2023-11-22T11:51:43","modified_gmt":"2023-11-22T11:51:43","slug":"the-illustrated-transformer","status":"publish","type":"post","link":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/","title":{"rendered":"The Illustrated Transformer From Scratch"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Transformers are a very exciting family of machine learning architectures. Many good tutorials exist (e.g. [1, 2]) but in the last few years, transformers have mostly become simpler, so that it is now much more straightforward to explain how modern architectures work. This post is an attempt to explain directly how modern transformers work, and why, without some of the historical baggage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I will assume a basic understanding of neural networks and backpropagation. If you\u2019d like to brush up,&nbsp;<a href=\"https:\/\/mlvu.github.io\/lecture06\/\">this lecture<\/a>&nbsp;will give you the basics of neural networks and&nbsp;<a href=\"https:\/\/mlvu.github.io\/lecture07\/\">this one<\/a>&nbsp;will explain how these principles are applied in modern deep learning systems.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A&nbsp;<a href=\"https:\/\/pytorch.org\/tutorials\/beginner\/deep_learning_60min_blitz.html\">working knowledge of Pytorch<\/a>&nbsp;is required to understand the programming examples, but these can also be safely skipped.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"self-attention\">Self-attention<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The fundamental operation of any transformer architecture is the&nbsp;<em>self-attention operation<\/em>.We&#8217;ll explain where the name &#8220;self-attention&#8221; comes from later. 
For now, don&#8217;t read too much into it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let\u2019s call the input vectors&nbsp;\ud835\udc31<sub>1<\/sub>, \ud835\udc31<sub>2<\/sub>, \u2026, \ud835\udc31<sub>t<\/sub>&nbsp;and the corresponding output vectors&nbsp;\ud835\udc32<sub>1<\/sub>, \ud835\udc32<sub>2<\/sub>, \u2026, \ud835\udc32<sub>t<\/sub>. The vectors all have dimension&nbsp;k.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To produce output vector&nbsp;\ud835\udc32<sub>i<\/sub>, the self-attention operation simply takes&nbsp;<em>a weighted average over all the input vectors<\/em>:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud835\udc32<sub>i<\/sub> = \u2211<sub>j<\/sub> w<sub>ij<\/sub> \ud835\udc31<sub>j<\/sub>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here,&nbsp;j&nbsp;indexes over the whole sequence and the weights sum to one over all&nbsp;j. The weight&nbsp;w<sub>ij<\/sub>&nbsp;is not a parameter, as in a normal neural net, but is&nbsp;<em>derived<\/em>&nbsp;from a function over&nbsp;\ud835\udc31<sub>i<\/sub>&nbsp;and&nbsp;\ud835\udc31<sub>j<\/sub>. The simplest option for this function is the dot product:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">w\u2032<sub>ij<\/sub> = \ud835\udc31<sub>i<\/sub><sup>T<\/sup>\ud835\udc31<sub>j<\/sub>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note that&nbsp;\ud835\udc31<sub>i<\/sub>&nbsp;is the input vector at the same position as the current output vector&nbsp;\ud835\udc32<sub>i<\/sub>. 
For the next output vector, we get an entirely new series of dot products, and a different weighted sum.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The dot product gives us a value anywhere between negative and positive infinity, so we apply a softmax to map the values to&nbsp;[0,1]&nbsp;and to ensure that they sum to 1 over the whole sequence:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">w<sub>ij<\/sub> = exp(w\u2032<sub>ij<\/sub>) \/ \u2211<sub>j<\/sub> exp(w\u2032<sub>ij<\/sub>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And that\u2019s the basic operation of self attention.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/self-attention.svg\" alt=\"\"\/><figcaption class=\"wp-element-caption\">A visual illustration of basic self-attention. Note that the softmax operation over the&nbsp;weights&nbsp;is not illustrated.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A few other ingredients are needed for a complete transformer, which we\u2019ll discuss later, but this is the fundamental operation. More importantly, this is the only operation in the whole architecture that propagates information&nbsp;<em>between<\/em>&nbsp;vectors. Every other operation in the transformer is applied to each vector in the input sequence without interactions between vectors.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"understanding-why-self-attention-works\">Understanding why self-attention works<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Despite its simplicity, it\u2019s not immediately obvious why self-attention should work so well. 
To build up some intuition, let\u2019s look first at the standard approach to&nbsp;<em>movie recommendation<\/em>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s say you run a movie rental business and you have some movies, and some users, and you would like to recommend movies to your users that they are likely to enjoy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One way to go about this, is to create manual features for your movies, such as how much romance there is in the movie, and how much action, and then to design corresponding features for your users: how much they enjoy romantic movies and how much they enjoy action-based movies. If you did this, the dot product between the two feature vectors would give you a score for how well the attributes of the movie match what the user enjoys.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/dot-product.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If the signs of a feature match for the user and the movie\u2014the movie is romantic and the user loves romance or the movie is&nbsp;<em>unromantic<\/em>&nbsp;and the user hates romance\u2014then the resulting dot product gets a positive term for that feature. If the signs don\u2019t match\u2014the movie is romantic and the user hates romance or vice versa\u2014the corresponding term is negative.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Furthermore, the&nbsp;<em>magnitudes<\/em>&nbsp;of the features indicate how much the feature should contribute to the total score: a movie may be a little romantic, but not in a noticeable way, or a user may simply prefer no romance, but be largely ambivalent.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Of course, gathering such features is not practical. 
Annotating a database of millions of movies is very costly, and annotating users with their likes and dislikes is pretty much impossible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What happens instead is that we make the movie features and user features&nbsp;<em>parameters<\/em>&nbsp;of the model. We then ask users for a small number of movies that they like and we optimize the user features and movie features so that their dot product matches the known likes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Even though we don\u2019t tell the model what any of the features should mean, in practice, it turns out that after training the features do actually reflect meaningful semantics about the movie content.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/movie-features.svg\" alt=\"\"\/><figcaption class=\"wp-element-caption\">The first two learned features from a basic matrix factorization model. The model had no access to any information about the content of the movies, only which users liked them. Note that movies are arranged from low-brow to high-brow horizontally, and from mainstream to quirky vertically. From [4].<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">See&nbsp;<a href=\"https:\/\/mlvu.github.io\/lecture12\/\">this lecture<\/a>&nbsp;for more details on recommender systems. For now, this suffices as an explanation of how the dot product helps us to represent objects and their relations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is the basic principle at work in the self-attention. Let\u2019s say we are faced with a sequence of words. To apply self-attention, we simply assign each word&nbsp;t&nbsp;in our vocabulary an&nbsp;<em>embedding vector<\/em>&nbsp;\ud835\udc2ft&nbsp;(the values of which we\u2019ll learn). This is what\u2019s known as an&nbsp;<em>embedding layer<\/em>&nbsp;in sequence modeling. 
It turns the word sequence&nbsp;the, cat, walks, on, the, street&nbsp;into the vector sequence<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud835\udc2f<sub>the<\/sub>, \ud835\udc2f<sub>cat<\/sub>, \ud835\udc2f<sub>walks<\/sub>, \ud835\udc2f<sub>on<\/sub>, \ud835\udc2f<sub>the<\/sub>, \ud835\udc2f<sub>street<\/sub>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If we feed this sequence into a self-attention layer, the output is another sequence of vectors<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud835\udc32<sub>the<\/sub>, \ud835\udc32<sub>cat<\/sub>, \ud835\udc32<sub>walks<\/sub>, \ud835\udc32<sub>on<\/sub>, \ud835\udc32<sub>the<\/sub>, \ud835\udc32<sub>street<\/sub>, where&nbsp;\ud835\udc32<sub>cat<\/sub>&nbsp;is a weighted sum over all the embedding vectors in the first sequence, weighted by their (normalized) dot-product with&nbsp;\ud835\udc2f<sub>cat<\/sub>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since we are&nbsp;<em>learning<\/em>&nbsp;what the values in&nbsp;\ud835\udc2f<sub>t<\/sub>&nbsp;should be, how &#8220;related&#8221; two words are is entirely determined by the task. In most cases, the definite article&nbsp;the&nbsp;is not very relevant to the interpretation of the other words in the sentence; therefore, we will likely end up with an embedding&nbsp;\ud835\udc2f<sub>the<\/sub>&nbsp;that has a low or negative dot product with all other words. On the other hand, to interpret what&nbsp;walks&nbsp;means in this sentence, it&#8217;s very helpful to work out&nbsp;<em>who<\/em>&nbsp;is doing the walking. This is likely expressed by a noun, so for nouns like&nbsp;cat&nbsp;and verbs like&nbsp;walks, we will likely learn embeddings&nbsp;\ud835\udc2f<sub>cat<\/sub>&nbsp;and&nbsp;\ud835\udc2f<sub>walks<\/sub>&nbsp;that have a high, positive dot product together.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is the basic intuition behind self-attention. 
The dot product expresses how related two vectors in the input sequence are, with \u201crelated\u201d defined by the learning task, and the output vectors are weighted sums over the whole input sequence, with the weights determined by these dot products.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Before we move on, it\u2019s worthwhile to note the following properties, which are unusual for a sequence-to-sequence operation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>There are no parameters (yet). What the basic self-attention actually does is entirely determined by whatever mechanism creates the input sequence. Upstream mechanisms, like an embedding layer, drive the self-attention by learning representations with particular dot products (although we\u2019ll add a few parameters later).<\/li>\n\n\n\n<li>Self attention sees its input as a\u00a0<em>set<\/em>, not a sequence. If we permute the input sequence, the output sequence will be exactly the same, except permuted also (i.e. self-attention is\u00a0<em>permutation equivariant<\/em>). We will mitigate this somewhat when we build the full transformer, but the self-attention by itself actually\u00a0<em>ignores<\/em>\u00a0the sequential nature of the input.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"in-pytorch-basic-self-attention\">In Pytorch: basic self-attention<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">What I cannot create, I do not understand, as Feynman said. So we\u2019ll build a simple transformer as we go along. We\u2019ll start by implementing this basic self-attention operation in Pytorch.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first thing we should do is work out how to express the self attention in matrix multiplications. 
A naive implementation that loops over all vectors to compute the weights and outputs would be much too slow.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ll represent the input, a sequence of&nbsp;t&nbsp;vectors of dimension&nbsp;k&nbsp;as a&nbsp;t&nbsp;by&nbsp;k&nbsp;matrix&nbsp;\ud835\udc17. Including a minibatch dimension&nbsp;b, gives us an input tensor of size&nbsp;(b,t,k).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The set of all raw dot products&nbsp;w\u2032ij&nbsp;forms a matrix, which we can compute simply by multiplying&nbsp;\ud835\udc17&nbsp;by its transpose:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\nimport torch.nn.functional as F\n\n# assume we have some tensor x with size (b, t, k)\nx = ...\n\nraw_weights = torch.bmm(x, x.transpose(1, 2))\n# - torch.bmm is a batched matrix multiplication. It\n#   applies matrix multiplication over batches of\n#   matrices.<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then, to turn the raw weights&nbsp;w\u2032ij&nbsp;into positive values that sum to one, we apply a&nbsp;<em>row-wise<\/em>&nbsp;softmax:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>weights = F.softmax(raw_weights, dim=2)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, to compute the output sequence, we just multiply the weight matrix by&nbsp;\ud835\udc17. This results in a batch of output matrices&nbsp;\ud835\udc18&nbsp;of size&nbsp;<code>(b, t, k)<\/code>&nbsp;whose rows are weighted sums over the rows of&nbsp;\ud835\udc17.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>y = torch.bmm(weights, x)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That\u2019s all. 
Two matrix multiplications and one softmax gives us a basic self-attention.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"additional-tricks\">Additional tricks<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The actual self-attention used in modern transformers relies on three additional tricks.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"1-queries-keys-and-values\">1) Queries, keys and values<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Every input vector&nbsp;\ud835\udc31i&nbsp;is used in three different ways in the self attention operation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It is compared to every other vector to establish the weights for its own output\u00a0\ud835\udc32i<\/li>\n\n\n\n<li>It is compared to every other vector to establish the weights for the output of the\u00a0j-th vector\u00a0\ud835\udc32j<\/li>\n\n\n\n<li>It is used as part of the weighted sum to compute each output vector once the weights have been established.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">These roles are often called the&nbsp;<strong>query<\/strong>, the&nbsp;<strong>key<\/strong>&nbsp;and the&nbsp;<strong>value<\/strong>&nbsp;(we&#8217;ll explain where these names come from later). In the basic self-attention we&#8217;ve seen so far, each input vector must play all three roles. We make its life a little easier by deriving new vectors for each role, by applying a linear transformation to the original input vector. 
In other words, we add three&nbsp;k\u00d7k&nbsp;weight matrices&nbsp;\ud835\udc16<sub>q<\/sub>,&nbsp;\ud835\udc16<sub>k<\/sub>, \ud835\udc16<sub>v<\/sub>&nbsp;and compute three linear transformations of each&nbsp;\ud835\udc31<sub>i<\/sub>, for the three different parts of the self attention:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud835\udc2a<sub>i<\/sub> = \ud835\udc16<sub>q<\/sub>\ud835\udc31<sub>i<\/sub><br>\ud835\udc24<sub>i<\/sub> = \ud835\udc16<sub>k<\/sub>\ud835\udc31<sub>i<\/sub><br>\ud835\udc2f<sub>i<\/sub> = \ud835\udc16<sub>v<\/sub>\ud835\udc31<sub>i<\/sub><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">w\u2032<sub>ij<\/sub> = \ud835\udc2a<sub>i<\/sub><sup>T<\/sup>\ud835\udc24<sub>j<\/sub><br>w<sub>ij<\/sub> = softmax(w\u2032<sub>ij<\/sub>)<br>\ud835\udc32<sub>i<\/sub> = \u2211<sub>j<\/sub> w<sub>ij<\/sub> \ud835\udc2f<sub>j<\/sub>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This gives the self-attention layer some controllable parameters, and allows it to modify the incoming vectors to suit the three roles they must play.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/key-query-value.svg\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Illustration of the self-attention with&nbsp;key,&nbsp;query&nbsp;and&nbsp;value&nbsp;transformations.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"2-scaling-the-dot-product\">2) Scaling the dot product<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The softmax function can be sensitive to very large input values. These kill the gradient, and slow down learning, or cause it to stop altogether. Since the average value of the dot product grows with the embedding dimension&nbsp;k, it helps to scale the dot product back a little to stop the inputs to the softmax function from growing too large:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">w\u2032<sub>ij<\/sub> = \ud835\udc2a<sub>i<\/sub><sup>T<\/sup>\ud835\udc24<sub>j<\/sub> \/ \u221ak<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Why&nbsp;\u221ak? Imagine a vector in&nbsp;\u211d<sup>k<\/sup>&nbsp;with values all&nbsp;c. Its Euclidean length is&nbsp;\u221ak\u00b7c. 
Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vectors.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"3-multi-head-attention\">3) Multi-head attention<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we must account for the fact that a word can mean different things to different neighbours. Consider the following example.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">mary, gave, roses, to, susan<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We see that the word&nbsp;gave&nbsp;has different relations to different parts of the sentence.&nbsp;mary&nbsp;expresses who\u2019s doing the giving,&nbsp;roses&nbsp;expresses what\u2019s being given, and&nbsp;susan&nbsp;expresses who the recipient is.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In a single self-attention operation, all this information just gets summed together. The inputs&nbsp;\ud835\udc31<sub>mary<\/sub>&nbsp;and&nbsp;\ud835\udc31<sub>susan<\/sub>&nbsp;can influence the output&nbsp;\ud835\udc32<sub>gave<\/sub>&nbsp;by different amounts, depending on their dot-product with&nbsp;\ud835\udc31<sub>gave<\/sub>, but they can\u2019t influence it&nbsp;<em>in different ways<\/em>. If, for instance, we want the information about who gave the roses and who received them to end up in different parts of&nbsp;\ud835\udc32<sub>gave<\/sub>, we need a little more flexibility. (This leaves aside how we figure out who gave the roses. We can do that based on prior knowledge about Mary and Susan, encoded in the embeddings. We can also look at the order of the words, but we&#8217;ll look at how to achieve that later.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We can give the self attention greater power of discrimination, by combining several self-attention mechanisms (which we&#8217;ll index with&nbsp;r), each with different matrices&nbsp;\ud835\udc16<sub>q<\/sub><sup>r<\/sup>,&nbsp;\ud835\udc16<sub>k<\/sub><sup>r<\/sup>, \ud835\udc16<sub>v<\/sub><sup>r<\/sup>. 
These are called&nbsp;<em>attention heads<\/em>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For input&nbsp;\ud835\udc31<sub>i<\/sub>&nbsp;each attention head produces a different output vector&nbsp;\ud835\udc32<sub>i<\/sub><sup>r<\/sup>. We concatenate these, and pass them through a linear transformation to reduce the dimension back to&nbsp;k.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Efficient multi-head self-attention.<\/strong>&nbsp;The simplest way to understand multi-head self-attention is to see it as a small number of copies of the self-attention mechanism applied in parallel, each with its own key, value and query transformation. This works well, but for&nbsp;h&nbsp;heads, the self-attention operation is&nbsp;h&nbsp;times as slow.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It turns out we can have our cake and eat it too: there is a way to implement multi-head self-attention so that it is roughly as fast as the single-head version, but we still get the benefit of having different self-attention operations in parallel. To accomplish this, each head receives low-dimensional keys, queries and values. If the input vector has&nbsp;k=256&nbsp;dimensions, and we have&nbsp;h=4&nbsp;attention heads, we multiply the input vectors by a&nbsp;256\u00d764&nbsp;matrix to project them down to a sequence of 64-dimensional vectors. For every head, we do this 3 times: for the keys, the queries and the values.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the whole process illustrated in one image.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/multi-head.svg\" alt=\"A diagram showing the operation of multi-head self-attention.\"\/><figcaption class=\"wp-element-caption\">The basic idea of multi-head self-attention with 4 heads. 
To get our&nbsp;keys,&nbsp;queries&nbsp;and&nbsp;values, we project the input down to vector sequences of smaller dimension.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This requires&nbsp;3h&nbsp;matrices of size&nbsp;k&nbsp;by&nbsp;k\/h. In total, this gives us&nbsp;3h \u00b7 k \u00b7 (k\/h) = 3k<sup>2<\/sup>&nbsp;parameters to compute the inputs to the multi-head self-attention: the same as we had for the single-head self-attention. The only difference is the matrix&nbsp;\ud835\udc16<sub>o<\/sub>, used at the end of the multi-head self attention. This adds&nbsp;k<sup>2<\/sup>&nbsp;parameters compared to the single-head version. In most transformers, the first thing that happens after each self attention is a feed-forward layer, so this may not be strictly necessary. I&#8217;ve never seen a proper ablation to test whether&nbsp;\ud835\udc16<sub>o<\/sub>&nbsp;can be removed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We can even implement this with just three&nbsp;k\u00d7k&nbsp;matrix multiplications as in the single-head self-attention. The only extra operation we need is to slice the resulting sequence of vectors into chunks.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/kqv-computation.svg\" alt=\"A diagram showing the efficient computation of key query and value matrices in multi-head self-attention.\"\/><figcaption class=\"wp-element-caption\">To compute multi-head attention efficiently, we combine the computation of the projections down to a lower dimensional representation and the computations of the keys, queries and values into three&nbsp;k\u00d7k&nbsp;matrices.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"in-pytorch-complete-self-attention\">In Pytorch: complete self-attention<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s now implement a self-attention module with all the bells and whistles. 
We\u2019ll package it into a Pytorch module, so we can reuse it later.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><strong>import<\/strong> torch\n<strong>from<\/strong> torch <strong>import<\/strong> nn\n<strong>import<\/strong> torch.nn.functional <strong>as<\/strong> F\n\n<strong>class<\/strong> <strong>SelfAttention<\/strong>(nn.Module):\n  <strong>def<\/strong> <strong>__init__<\/strong>(self, k, heads=4, mask=False):\n\n    super().__init__()\n\n    <strong>assert<\/strong> k % heads == 0\n\n    self.k, self.heads = k, heads\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Note the assert: the embedding dimension needs to be divisible by the number of heads.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we set up some linear transformations with&nbsp;<code>k<\/code>&nbsp;by&nbsp;<code>k<\/code>&nbsp;matrices. The&nbsp;<code>nn.Linear<\/code>&nbsp;module with the bias disabled gives us such a projection, and provides a reasonable initialization for us.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\n    # These compute the queries, keys and values for all\n    # heads\n    self.tokeys    = nn.Linear(k, k, bias=False)\n    self.toqueries = nn.Linear(k, k, bias=False)\n    self.tovalues  = nn.Linear(k, k, bias=False)\n\n    # This will be applied after the multi-head self-attention operation.\n    self.unifyheads = nn.Linear(k, k)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We can now implement the computation of the self-attention (the module\u2019s&nbsp;<code>forward<\/code>&nbsp;function). 
First, we compute the queries, keys and values for all heads:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>  <strong>def<\/strong> <strong>forward<\/strong>(self, x):\n\n    b, t, k = x.size()\n    h = self.heads\n\n    queries = self.toqueries(x)\n    keys    = self.tokeys(x)\n    values  = self.tovalues(x)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This gives us three vector sequences of the full embedding dimension&nbsp;<code>k<\/code>. As we saw above, we can now cut these into&nbsp;<code>h<\/code>&nbsp;chunks. We can do this with a simple view operation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\n    s = k \/\/ h\n\n    keys    = keys.view(b, t, h, s)\n    queries = queries.view(b, t, h, s)\n    values  = values.view(b, t, h, s)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This simply reshapes the tensors to add a dimension that iterates over the heads. For a single vector in our sequence you can think of it as reshaping a vector of dimension&nbsp;<code>k<\/code>&nbsp;into a matrix of&nbsp;<code>h<\/code>&nbsp;by&nbsp;<code>k\/\/h<\/code>:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/reshape.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we need to compute the dot products. This is the same operation for every head, so we fold the heads into the batch dimension. This ensures that we can use&nbsp;<code>torch.bmm()<\/code>&nbsp;as before, and the whole collection of keys, queries and values will just be seen as a slightly larger batch.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since the head and batch dimension are not next to each other, we need to transpose before we reshape. 
(This is costly, but it seems to be unavoidable.)<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>    # - fold heads into the batch dimension\n    keys = keys.transpose(1, 2).contiguous().view(b * h, t, s)\n    queries = queries.transpose(1, 2).contiguous().view(b * h, t, s)\n    values = values.transpose(1, 2).contiguous().view(b * h, t, s)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You can avoid these calls to&nbsp;<code>contiguous()<\/code>&nbsp;by using&nbsp;<code>reshape()<\/code>&nbsp;instead of&nbsp;<code>view()<\/code>&nbsp;but I prefer to make it explicit when we are copying a tensor, and when we are just viewing it. See&nbsp;<a href=\"https:\/\/github.com\/mlvu\/worksheets\/blob\/master\/Worksheet%205%2C%20Pytorch.ipynb\">this notebook<\/a>&nbsp;for an explanation of the difference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As before, the dot products can be computed in a single matrix multiplication, but now between the queries and the keys.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\n    # Get dot product of queries and keys\n    dot = torch.bmm(queries, keys.transpose(1, 2))\n    # -- dot has size (b*h, t, t) containing raw weights\n\n    # scale the dot product\n    dot = dot \/ (s ** (1\/2))\n\n    # normalize\n    dot = F.softmax(dot, dim=2)\n    # - dot now contains row-wise normalized weights\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We apply the self attention weights&nbsp;<code>dot<\/code>&nbsp;to the values, giving us the output for each attention head:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\n    # apply the self attention to the values\n    out = torch.bmm(dot, values).view(b, h, t, s)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To unify the attention heads, we transpose again, so that the head dimension and the embedding dimension are next to each other, and reshape to get concatenated vectors of dimension&nbsp;<code>k<\/code>. 
We then pass these through the&nbsp;<code>unifyheads<\/code>&nbsp;layer for a final projection.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>    # swap h, t back, unify heads\n    out = out.transpose(1, 2).contiguous().view(b, t, s * h)\n    \n    <strong>return<\/strong> self.unifyheads(out)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And there you have it: multi-head, scaled dot-product self attention. You can see&nbsp;<a href=\"https:\/\/github.com\/pbloem\/former\/blob\/b438731ceeaf6c468f8b961bb07c2adde3b54a9f\/former\/modules.py#L10\">the complete implementation here<\/a>. The implementation can be made more concise using&nbsp;<a href=\"https:\/\/rockt.github.io\/2018\/04\/30\/einsum\">einsum notation<\/a>&nbsp;(see an example&nbsp;<a href=\"https:\/\/github.com\/pbloem\/former\/issues\/4\">here<\/a>).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"building-transformers\">Building&nbsp;<em>transformers<\/em><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A transformer is not just a self-attention layer, it is an&nbsp;<em>architecture<\/em>. It\u2019s not quite clear what does and doesn\u2019t qualify as a transformer, but here we\u2019ll use the following definition:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Any architecture designed to process a connected set of units\u2014such as the tokens in a sequence or the pixels in an image\u2014where the only interaction between units is through self-attention.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">As with other mechanisms, like convolutions, a more or less standard approach has emerged for how to build self-attention layers up into a larger network. 
The first step is to wrap the self-attention into a&nbsp;<em>block<\/em>&nbsp;that we can repeat.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"the-transformer-block\">The transformer block<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">There are some variations on how to build a basic transformer block, but most of them are structured roughly like this:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/transformer-block.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">That is, the block applies, in sequence:&nbsp;a self attention layer,&nbsp;layer normalization,&nbsp;a feed forward layer&nbsp;(a single&nbsp;MLP&nbsp;applied independently to each vector), and&nbsp;another layer normalization.&nbsp;Residual connections&nbsp;are added around both the self-attention and the feed-forward layer, before each normalization. The order of the various components is not set in stone; the important thing is to combine self-attention with a local feedforward, and to add normalization and residual connections. Normalization and residual connections are standard tricks used to help deep neural networks train faster and more accurately. 
The layer normalization is applied over the embedding dimension only.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s what the transformer block looks like in pytorch.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><strong>class<\/strong> <strong>TransformerBlock<\/strong>(<strong>nn<\/strong>.<strong>Module<\/strong>):\n  <strong>def<\/strong> <strong>__init__<\/strong>(<strong>self<\/strong>, k, heads):\n    <strong>super<\/strong>().__init__()\n\n    <strong>self<\/strong>.attention = SelfAttention(k, heads=heads)\n\n    <strong>self<\/strong>.norm1 = nn.LayerNorm(k)\n    <strong>self<\/strong>.norm2 = nn.LayerNorm(k)\n\n    <strong>self<\/strong>.ff = nn.Sequential(\n      nn.Linear(k, 4 * k),\n      nn.ReLU(),\n      nn.Linear(4 * k, k))\n\n  <strong>def<\/strong> <strong>forward<\/strong>(<strong>self<\/strong>, x):\n    attended = <strong>self<\/strong>.attention(x)\n    x = <strong>self<\/strong>.norm1(attended + x)\n\n    fedforward = <strong>self<\/strong>.ff(x)\n    <strong>return<\/strong> <strong>self<\/strong>.norm2(fedforward + x)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ve made the relatively arbitrary choice of making the hidden layer of the feedforward 4 times as big as the input and output. Smaller values may work as well, and save memory, but it should be bigger than the input\/output layers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"classification-transformer\">Classification transformer<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The simplest transformer we can build is a&nbsp;<em>sequence classifier<\/em>. 
We\u2019ll use the&nbsp;IMDb&nbsp;sentiment classification dataset: the instances are movie reviews, tokenized into sequences of words, and the classification labels are&nbsp;<code>positive<\/code>&nbsp;and&nbsp;<code>negative<\/code>&nbsp;(indicating whether the review was positive or negative about the movie).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The heart of the architecture will simply be a large chain of transformer blocks. All we need to do is work out how to feed it the input sequences, and how to transform the final output sequence into a single classification.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The whole experiment can be&nbsp;<a href=\"https:\/\/github.com\/pbloem\/former\/blob\/master\/experiments\/classify.py\">found here<\/a>. We won\u2019t deal with the data wrangling in this blog post. Follow the links in the code to see how the data is loaded and prepared.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"output-producing-a-classification\">Output: producing a classification<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The most common way to build a sequence classifier out of sequence-to-sequence layers is to apply global average pooling to the final output sequence, and to map the result to a softmaxed class vector.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/classifier.svg\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Overview of a simple sequence classification transformer. The output sequence is&nbsp;averaged&nbsp;to produce a single vector representing the whole sequence. This vector is projected down to a vector with one element per class and softmaxed to produce probabilities.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"input-using-the-positions\">Input: using the positions<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ve already discussed the principle of an embedding layer. 
This is what we\u2019ll use to represent the words.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, as we\u2019ve also mentioned already, we\u2019re stacking permutation equivariant layers, and the final global average pooling is permutation&nbsp;<em>in<\/em>variant, so the network as a whole is also permutation invariant. Put more simply: if we shuffle up the words in the sentence, we get the exact same classification, whatever weights we learn. Clearly, we want our state-of-the-art language model to have at least some sensitivity to word order, so this needs to be fixed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The solution is simple: we create a second vector of equal length, that represents the position of the word in the current sentence, and add this to the word embedding. There are two options. <strong>Position embeddings.<\/strong> We simply\u00a0<em>embed<\/em>\u00a0the positions like we did the words. Just like we created embedding vectors\u00a0\ud835\udc2f<sub>cat<\/sub>\u00a0and\u00a0\ud835\udc2f<sub>susan<\/sub>, we create embedding vectors\u00a0\ud835\udc2f<sub>12<\/sub>\u00a0and\u00a0\ud835\udc2f<sub>25<\/sub>, up to however long we expect sequences to get. The drawback is that we have to see sequences of every length during training, otherwise the relevant position embeddings don&#8217;t get trained. The benefit is that it works pretty well, and it&#8217;s easy to implement. <strong>Position encodings.<\/strong> Position encodings work in the same way as embeddings, except that we don&#8217;t\u00a0<em>learn<\/em>\u00a0the position vectors, we just choose some function\u00a0f: \u2115 \u2192 \u211d<sup>k<\/sup>\u00a0to map the positions to real valued vectors, and let the network figure out how to interpret these encodings. The benefit is that for a well chosen function, the network should be able to deal with sequences that are longer than those it&#8217;s seen during training (it&#8217;s unlikely to perform well on them, but at least we can check). 
The drawbacks are that the choice of encoding function is a complicated hyperparameter, and it complicates the implementation a little.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For the sake of simplicity, we\u2019ll use position embeddings in our implementation.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"pytorch\">Pytorch<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the complete text classification transformer in pytorch.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><strong>class<\/strong> <strong>Transformer<\/strong>(nn.Module):\n    <strong>def<\/strong> <strong>__init__<\/strong>(self, k, heads, depth, seq_length, num_tokens, num_classes):\n        super().__init__()\n\n        self.num_tokens = num_tokens\n        self.token_emb = nn.Embedding(num_tokens, k)\n        self.pos_emb = nn.Embedding(seq_length, k)\n\n        # The sequence of transformer blocks that does all the\n        # heavy lifting\n        tblocks = &#91;]\n        <strong>for<\/strong> i <strong>in<\/strong> range(depth):\n            tblocks.append(TransformerBlock(k=k, heads=heads))\n        self.tblocks = nn.Sequential(*tblocks)\n\n        # Maps the final output sequence to class logits\n        self.toprobs = nn.Linear(k, num_classes)\n\n    <strong>def<\/strong> <strong>forward<\/strong>(self, x):\n        \"\"\"\n        :param x: A (b, t) tensor of integer values representing\n                  words (in some predetermined vocabulary).\n        :return: A (b, c) tensor of log-probabilities over the\n                 classes (where c is the nr. 
of classes).\n        \"\"\"\n        # generate token embeddings\n        tokens = self.token_emb(x)\n        b, t, k = tokens.size()\n\n        # generate position embeddings (on the same device as the input)\n        positions = torch.arange(t, device=x.device)\n        positions = self.pos_emb(positions)&#91;None, :, :].expand(b, t, k)\n\n        x = tokens + positions\n        x = self.tblocks(x)\n\n        # Average-pool over the t dimension and project to class\n        # probabilities\n        x = self.toprobs(x.mean(dim=1))\n        <strong>return<\/strong> F.log_softmax(x, dim=1)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">At depth 6, with a maximum sequence length of 512, this transformer achieves an accuracy of about 85%, competitive with results from RNN models, and much faster to train. To see the real near-human performance of transformers, we\u2019d need to train a much deeper model on much more data. More about how to do that later.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"text-generation-transformer\">Text generation transformer<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The next trick we\u2019ll try is an&nbsp;<em>autoregressive<\/em>&nbsp;model. We\u2019ll train a&nbsp;<em>character<\/em>&nbsp;level transformer to predict the next character in a sequence. The training regime is simple (and has been around&nbsp;<a href=\"http:\/\/karpathy.github.io\/2015\/05\/21\/rnn-effectiveness\/\">for far longer than transformers have<\/a>). We give the sequence-to-sequence model a sequence, and we ask it to predict the next character at each point in the sequence. 
In other words, the target output is the same sequence shifted one character to the left:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/generator.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">With RNNs this is all we need to do, since they cannot look forward into the input sequence: output&nbsp;i&nbsp;depends only on inputs&nbsp;0&nbsp;to&nbsp;i. With a transformer, the output depends on the entire input sequence, so prediction of the next character becomes trivially easy: the model can just retrieve it from the input.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To use self-attention as an autoregressive model, we\u2019ll need to ensure that it cannot look forward into the sequence. We do this by applying a mask to the matrix of dot products, before the softmax is applied. This mask disables all elements above the diagonal of the matrix.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/masked-attention.svg\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Masking the self attention, to ensure that elements can only attend to input elements that precede them in the sequence. Note that the multiplication symbol is slightly misleading: we actually set the masked out elements (the white squares) to&nbsp;\u2212\u221e<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Since we want these elements to be zero after the softmax, we set them to&nbsp;\u2212\u221e. 
Here\u2019s how that looks in pytorch:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>dot = torch.bmm(queries, keys.transpose(1, 2))\n\nindices = torch.triu_indices(t, t, offset=1)\ndot&#91;:, indices&#91;0], indices&#91;1]] = float('-inf')\n\ndot = F.softmax(dot, dim=2)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">After we\u2019ve handicapped the self-attention module like this, the model can no longer look forward in the sequence.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We train on the standard&nbsp;<code>enwik8<\/code>&nbsp;dataset (taken from the&nbsp;<a href=\"http:\/\/prize.hutter1.net\/\">Hutter prize<\/a>), which contains&nbsp;10<sup>8<\/sup>&nbsp;characters of Wikipedia text (including markup). During training, we generate batches by randomly sampling subsequences from the data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We train on sequences of length 256, using a model of 12 transformer blocks and an embedding dimension of 256. After about 24 hours training on an RTX 2080Ti (some 170K batches of size 32), we let the model generate from&nbsp;a 256-character seed: for each character, we feed it the preceding 256 characters, and look at what it predicts for the next character (the last output vector). We sample from that with a&nbsp;<a href=\"https:\/\/towardsdatascience.com\/how-to-sample-from-language-models-682bceb97277\">temperature<\/a>&nbsp;of 0.5, and move to the next character.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The output looks like this:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1228X Human &amp; Rousseau. Because many of his stories were originally published in long-forgotten magazines and journals, there are a number of [[anthology|anthologies]] by different collators each containing a different selection. His original books have been considered an anthologie in the [[Middle Ages]], and were likely to be one of the most common in the [[Indian Ocean]] in the [[1st century]]. 
As a result of his death, the Bible was recognised as a counter-attack by the [[Gospel of Matthew]] (1177-1133), and the [[Saxony|Saxons]] of the [[Isle of Matthew]] (1100-1138), the third was a topic of the [[Saxony|Saxon]] throne, and the [[Roman Empire|Roman]] troops of [[Antiochia]] (1145-1148). The [[Roman Empire|Romans]] resigned in [[1148]] and [[1148]] began to collapse. The [[Saxony|Saxons]] of the [[Battle of Valasander]] reported the y<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note that the Wikipedia link tag syntax is correctly used, and that the text inside the links represents reasonable subjects for links. Most importantly, note that there is a rough thematic consistency; the generated text keeps on the subject of the bible, and the Roman empire, using different related terms at different points. While this is far from the performance of a model like&nbsp;<a href=\"https:\/\/openai.com\/blog\/better-language-models\/\">GPT-2<\/a>, the benefits over a similar RNN model are clear already: faster training (a similar RNN model would take many days to train) and better long-term coherence. In case you&#8217;re curious, the Battle of Valasander seems to be an invention of the network.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At this point, the model achieves a compression of 1.343 bits per byte on the validation set, which is not too far off the state of the art of 0.93 bits per byte, achieved by the GPT-2 model (described below).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"design-considerations\">Design considerations<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To understand why transformers are set up this way, it helps to understand the basic design considerations that went into them. 
The main point of the transformer was to overcome the problems of the previous state-of-the-art architecture, the RNN (usually an LSTM or a GRU).&nbsp;<a href=\"https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/\">Unrolled<\/a>, an RNN looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/recurrent-connection.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The big weakness here is the&nbsp;recurrent connection. While this allows information to propagate along the sequence, it also means that we cannot compute the cell at time step&nbsp;i&nbsp;until we\u2019ve computed the cell at time step&nbsp;i\u22121. Contrast this with a 1D convolution:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/convolutional-connection.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In this model, every output vector can be computed in parallel with every other output vector. This makes convolutions much faster. The drawback with convolutions, however, is that they\u2019re severely limited in modeling&nbsp;<em>long range dependencies<\/em>. In one convolution layer, only words that are closer together than the kernel size can interact with each other. For longer dependencies we need to stack many convolutions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The transformer is an attempt to capture the best of both worlds. Transformers can model dependencies over the whole range of the input sequence just as easily as they can for words that are next to each other (in fact, without the position vectors, they can\u2019t even tell the difference). And yet, there are no recurrent connections, so the whole model can be computed in a very efficient feedforward fashion.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The rest of the design of the transformer is based primarily on one consideration: depth. 
Most choices follow from the desire to train big stacks of transformer blocks. Note for instance that there are only two places in the transformer where non-linearities occur: the softmax in the self-attention and the&nbsp;ReLU&nbsp;in the feedforward layer. The rest of the model is entirely composed of linear transformations, which perfectly preserve the gradient. I suppose the layer normalization is also nonlinear, but that is one nonlinearity that actually helps to keep the gradient stable as it propagates back down the network.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"historical-baggage\">Historical baggage<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If you\u2019ve read other introductions to transformers, you may have noticed that they contain some bits I\u2019ve skipped. I think these are not necessary to understand modern transformers. They are, however, helpful to understand some of the terminology and some of the writing&nbsp;<em>about<\/em>&nbsp;modern transformers. Here are the most important ones.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"why-is-it-called-self-attention\">Why is it called self-<em>attention<\/em>?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Before self-attention was first presented, sequence models consisted mostly of recurrent networks or convolutions stacked together. At some point, it was discovered that these models could be helped by adding&nbsp;<em>attention mechanisms<\/em>: instead of feeding the output sequence of the previous layer directly to the input of the next, an intermediate mechanism was introduced, that decided which elements of the input were relevant for a particular word of the output.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The general mechanism was as follows. We call the input the&nbsp;<strong>values<\/strong>. Some (trainable) mechanism assigns a&nbsp;<strong>key<\/strong>&nbsp;to each value. 
Then to each output, some other mechanism assigns a&nbsp;<strong>query<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These names derive from the data structure of a key-value store. In that case we expect only one item in our store to have a key that matches the query, which is returned when the query is executed. Attention is a softened version of this:&nbsp;<em>every<\/em>&nbsp;key in the store matches the query to some extent. All are returned, and we take a sum, weighted by the extent to which each key matches the query.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The great breakthrough of self-attention was that attention by itself is a strong enough mechanism to do all the learning.&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention is all you need<\/a>, as the authors put it. The key, query and value are all the same vectors (with minor linear transformations). They&nbsp;<em>attend to themselves<\/em>&nbsp;and stacking such self-attention provides sufficient nonlinearity and representational power to learn very complicated functions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"the-original-transformer-encoders-and-decoders\">The original transformer: encoders and decoders<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">But the authors did not dispense with all the complexity of contemporary sequence modeling. The standard structure of sequence-to-sequence models in those days was an encoder-decoder architecture, with&nbsp;<a href=\"https:\/\/blog.keras.io\/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html\">teacher forcing<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/peterbloem.nl\/files\/transformers\/encoder-decoder.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<em>encoder<\/em>&nbsp;takes the input sequence and maps it to a&nbsp;<em>latent<\/em>&nbsp;representation of the whole sequence. 
This can be either a sequence of latent vectors, or a single one as in the image above. This vector is then passed to a&nbsp;<em>decoder<\/em>&nbsp;which unpacks it to the desired target sequence (for instance, the same sentence in another language).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Teacher forcing<\/em>&nbsp;refers to the technique of also allowing the decoder access to the input sentence, but in an autoregressive fashion. That is, the decoder generates the output sentence word for word based both on the latent vector and the words it has already generated. This takes some of the pressure off the latent representation: the decoder can use word-for-word sampling to take care of the low-level structure like syntax and grammar and use the latent vector to capture more high-level semantic structure. Decoding twice with the same latent representation would, ideally, give you two different sentences with the same meaning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In later transformers, like BERT and GPT-2, the encoder\/decoder configuration was entirely dispensed with. 
A simple stack of transformer blocks was found to be sufficient to achieve state of the art in many sequence based tasks. This approach is sometimes called a decoder-only transformer (for an autoregressive model) or an encoder-only transformer (for a model without masking).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"modern-transformers\">Modern transformers<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s a small selection of some modern transformers and their most characteristic details.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"bert\"><a href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT<\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">BERT&nbsp;was one of the first models to show that transformers could reach human-level performance on a variety of language based tasks: question answering, sentiment classification or classifying whether two sentences naturally follow one another.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">BERT consists of a simple stack of transformer blocks, of the type we\u2019ve described above. This stack is&nbsp;<em>pre-trained<\/em>&nbsp;on a large general-domain corpus consisting of 800M words from English books (modern work, from unpublished authors), and 2.5B words of text from English Wikipedia articles (without markup).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Pretraining is done through two tasks. <strong>Masking.<\/strong> A certain number of words in the input sequence are masked out, replaced with a random word, or kept as is. The model is then asked to predict, for these words, what the original words were. Note that the model doesn&#8217;t need to predict the entire denoised sentence, just the modified words. Since the model doesn&#8217;t know which words it will be asked about, it learns a representation for every word in the sequence. <strong>Next sequence classification.<\/strong> Two sequences of about 256 words are sampled that either (a) follow each other directly in the corpus, or (b) are both taken from random places. 
The model must then predict whether a or b is the case.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">BERT uses WordPiece tokenization, which is somewhere in between word-level and character level sequences. It breaks words like&nbsp;walking&nbsp;up into the tokens&nbsp;walk&nbsp;and&nbsp;##ing. This allows the model to make some inferences based on word structure: two verbs ending in -ing have similar grammatical functions, and two verbs starting with walk- have similar semantic function.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The input is prepended with a special&nbsp;&lt;cls&gt;&nbsp;token. The output vector corresponding to this token is used as a sentence representation in sequence classification tasks like the next sentence classification (as opposed to the global average pooling over all vectors that we used in our classification model above).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">After pretraining, a single task-specific layer is placed after the body of transformer blocks, which maps the general purpose representation to a task specific output. For classification tasks, this simply maps the first output token to softmax probabilities over the classes. For more complex tasks, a final sequence-to-sequence layer is designed specifically for the task.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The whole model is then re-trained to finetune the model for the specific task at hand.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In an ablation experiment, the authors show that the largest improvement as compared to previous models comes from the bidirectional nature of BERT. That is, previous models like GPT used an autoregressive mask, which allowed attention only over previous tokens. 
The fact that in BERT all attention is over the whole sequence is the main cause of the improved performance. This is why the B in BERT stands for &#8220;bidirectional&#8221;.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The largest BERT model uses 24 transformer blocks, an embedding dimension of 1024 and 16 attention heads, resulting in 340M parameters.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"gpt-2\"><a href=\"https:\/\/openai.com\/blog\/better-language-models\/\">GPT-2<\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">GPT-2 is the first transformer model that actually made it into&nbsp;<a href=\"https:\/\/www.bbc.com\/news\/technology-47249163\">the mainstream news<\/a>, after the controversial decision by OpenAI not to release the full model. The reason was that GPT-2 could generate sufficiently believable text that large-scale fake news campaigns of the kind seen in the 2016 US presidential election would become effectively a one-person job.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first trick that the authors of GPT-2 employed was to create a new high-quality dataset. While BERT used high-quality data, its sources (lovingly crafted books and well-edited Wikipedia articles) had a certain lack of diversity in the writing style. To collect more diverse data without sacrificing quality the authors used links posted on the social media site&nbsp;<em>Reddit<\/em>&nbsp;to gather a large collection of writing with a certain minimum level of social support (expressed on Reddit as&nbsp;<em>karma<\/em>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">GPT-2 is fundamentally a language&nbsp;<em>generation<\/em>&nbsp;model, so it uses masked self-attention like we did in our model above. 
It uses byte-pair encoding to tokenize the language, which, like the WordPiece encoding, breaks words up into tokens that are slightly larger than single characters but smaller than entire words.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">GPT-2 is built very much like our text generation model above, with only small differences in layer order and added tricks to train at greater depths. The largest model uses 48 transformer blocks, a sequence length of 1024 and an embedding dimension of 1600, resulting in 1.5B parameters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The authors show state-of-the-art performance on many tasks. On the Wikipedia compression task that we tried above, they achieve 0.93 bits per byte.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"transformer-xl\"><a href=\"https:\/\/arxiv.org\/abs\/1901.02860\">Transformer-XL<\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">While the transformer represents a massive leap forward in modeling long-range dependency, the models we have seen so far are still fundamentally limited by the size of the input. Since the size of the dot-product matrix grows quadratically in the sequence length, this quickly becomes the bottleneck as we try to extend the length of the input sequence. Transformer-XL is one of the first successful transformer models to tackle this problem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">During training, a long sequence of text (longer than the model could deal with) is broken up into shorter segments. Each segment is processed in sequence, with self-attention computed over the tokens in the current segment&nbsp;<em>and the previous segment<\/em>. Gradients are only computed over the current segment, but information still propagates as the segment window moves through the text. In theory, at layer&nbsp;n, information may be used from&nbsp;n&nbsp;segments ago. A similar trick in RNN training is called truncated backpropagation through time. 
We feed the model a very long sequence, but backpropagate only over part of it. The first part of the sequence, for which no gradients are computed, still influences the values of the hidden states in the part for which they are.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To make this work, the authors had to let go of the standard position encoding\/embedding scheme. Since the position encoding is&nbsp;<em>absolute<\/em>, it would change for each segment and not lead to a consistent embedding over the whole sequence. Instead they use a&nbsp;<em>relative<\/em>&nbsp;encoding. For each output vector, a different sequence of position vectors is used that denotes not the absolute position, but the distance to the current output.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This requires moving the position encoding into the attention mechanism (which is detailed in the paper). One benefit is that the resulting transformer will likely generalize much better to sequences of unseen length.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"sparse-transformers\"><a href=\"https:\/\/openai.com\/blog\/sparse-transformer\/\">Sparse transformers<\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Sparse transformers tackle the problem of quadratic memory use head-on. Instead of computing a dense matrix of attention weights (which grows quadratically), they compute the self-attention only for particular pairs of input tokens, resulting in a&nbsp;<em>sparse<\/em>&nbsp;attention matrix, with only&nbsp;n\u221an&nbsp;explicit elements.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This allows models with very large context sizes, for instance for generative modeling over images, with large dependencies between pixels. The tradeoff is that the sparsity structure is not learned, so by the choice of sparse matrix, we are disabling some interactions between input tokens that might otherwise have been useful. 
However, two units that are not directly related may still interact in higher layers of the transformer (similar to the way a convolutional net builds up a larger receptive field with more convolutional layers).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Beyond the simple benefit of training transformers with very large sequence lengths, the sparse transformer also allows a very elegant way of designing an inductive bias. We take our input as a collection of units (words, characters, pixels in an image, nodes in a graph) and we specify, through the sparsity of the attention matrix, which units we believe to be related. The rest is just a matter of building the transformer up as deep as it will go and seeing if it trains.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"going-big\">Going big<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The big bottleneck in training transformers is the matrix of dot products in the self-attention. For a sequence length&nbsp;t, this is a dense matrix containing&nbsp;t<sup>2<\/sup>&nbsp;elements. At standard 32-bit precision, and with&nbsp;t=1000,&nbsp;a batch of 16 such matrices takes up about 64MB of memory. Since we need to store at least four of them for each head of each self-attention operation (before and after softmax, plus their gradients), that limits us to at most twelve 4-head layers in a standard 12GB GPU. In practice, we get even less, since the inputs and outputs also take up a lot of memory (although the dot product dominates).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And yet models reported in the literature contain much longer sequence lengths, with 48 or more layers, using dense dot product matrices. These models are trained on clusters, of course, but a single GPU is still required to do a single forward\/backward pass. How do we fit such humongous transformers into 12GB of memory? There are three main tricks. <strong>Half precision.<\/strong> On modern GPUs and on TPUs, tensor computations can be done efficiently on 16-bit float tensors. 
This isn&#8217;t quite as simple as just setting the dtype of the tensor to\u00a0<code>torch.float16<\/code>. For some parts of the network, like the loss, 32 bit precision is required. But most of this can be handled with relative ease by\u00a0<a href=\"https:\/\/github.com\/NVIDIA\/apex\">existing libraries<\/a>. Practically, this doubles your effective memory. <strong>Gradient accumulation.<\/strong> For a large model, we may only be able to perform a forward\/backward pass on a single instance. Batch size 1 is not likely to lead to stable learning. Luckily, we can perform a single forward\/backward for each instance in a larger batch, and simply sum the gradients we find (this is a consequence of the\u00a0<a href=\"https:\/\/mlvu.github.io\/lecture07\/#slide-024\">multivariate chain rule<\/a>). When we hit the end of the batch, we do a single step of gradient descent, and zero out the gradient. In Pytorch this is particularly easy: you know that\u00a0<code>optimizer.zero_grad()<\/code>\u00a0call in your training loop that seems so superfluous? If you don&#8217;t make that call, the new gradients are simply added to the old ones. <strong>Gradient checkpointing.<\/strong> If your model is so big that even a single forward\/backward won&#8217;t fit in memory, you can trade off even more computation for memory efficiency. In gradient checkpointing, you separate your model into sections. For each section, you do a separate forward\/backward to compute the gradients, without retaining the intermediate values for the rest. Pytorch has\u00a0<a href=\"https:\/\/pytorch.org\/docs\/stable\/checkpoint.html\">special utilities<\/a>\u00a0for gradient checkpointing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The transformer may well be the simplest machine learning architecture to dominate the field in decades. 
There are good reasons to start paying attention to them if you haven\u2019t been already.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First,&nbsp;<strong>the current performance limit is purely in the hardware<\/strong>. Unlike with convolutions or LSTMs, the current limitations on what transformers can do are entirely determined by how big a model we can fit in GPU memory and how much data we can push through it in a reasonable amount of time. I have no doubt that we will eventually hit the point where more layers and more data won\u2019t help anymore, but we don\u2019t seem to have reached that point yet.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second,&nbsp;<strong>transformers are extremely generic<\/strong>. So far, the big successes have been in language modelling, with some more modest achievements in image and music analysis, but the transformer has a level of generality that is waiting to be exploited. The basic transformer is a&nbsp;<em>set-to-set<\/em>&nbsp;model. So long as your data is a set of units, you can apply a transformer. Anything else you know about your data (like local structure) you can add by means of position embeddings, or by manipulating the structure of the attention matrix (making it sparse, or masking out parts).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is particularly useful in multi-modal learning. We could easily combine a captioned image into a set of pixels and characters and design some clever embeddings and sparsity structure to help the model figure out how to combine and align the two. 
If we combine the entirety of our knowledge about our domain into a relational structure like a multi-modal knowledge graph (as discussed in [3]), simple transformer blocks could be employed to propagate information between multi-modal units and to align them, with the sparsity structure providing control over which units directly interact.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So far, transformers are still primarily seen as language models. I expect that in time, we\u2019ll see them adopted much more in other domains, not just to increase performance, but to simplify existing models, and to allow practitioners more intuitive control over their models\u2019 inductive biases.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"references\">References<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"http:\/\/jalammar.github.io\/illustrated-transformer\/\">The illustrated transformer<\/a>, Jay Alammar<\/li>\n\n\n\n<li><a href=\"http:\/\/nlp.seas.harvard.edu\/2018\/04\/03\/attention.html\">The annotated transformer<\/a>, Alexander Rush<\/li>\n\n\n\n<li><a href=\"https:\/\/content.iospress.com\/articles\/data-science\/ds007\">The knowledge graph as the default data model for learning on heterogeneous knowledge<\/a>, Xander Wilcke, Peter Bloem, Victor de Boer<\/li>\n\n\n\n<li><a href=\"https:\/\/datajobs.com\/data-science-repo\/Recommender-Systems-[Netflix].pdf\">Matrix factorization techniques for recommender systems<\/a>, Yehuda Koren et al.<\/li>\n\n\n\n<li><a href=\"https:\/\/jalammar.github.io\/visual-interactive-guide-basics-n...\">A Visual and Interactive Guide to the Basics of Neural Networks<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/jalammar.github.io\/feedforward-neural-networks-visua...\">A Visual And Interactive Look at Basic Neural Network Math<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/jalammar.github.io\/visualizing-neural-machine-transl...\">Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) 
<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/jalammar.github.io\/visualizing-neural-machine-transl...\">&#8211;\u00a0The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)<\/a><\/li>\n\n\n\n<li>The Illustrated GPT-2 (Visualizing Transformer Language Models) <\/li>\n\n\n\n<li><a href=\"https:\/\/jalammar.github.io\/how-gpt3-works-visualizations-ani...\">How GPT3 Works &#8211; Visualizations and Animations<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/jalammar.github.io\/illustrated-retrieval-transformer...\">The Illustrated Retrieval Transformer<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/jalammar.github.io\/illustrated-stable-diffusion\/\">The Illustrated Stable Diffusion<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/pbloem\/former\">Code on github<\/a><\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Lecture 12\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/videoseries?list=PLIXJ-Sacf8u60G1TwcznBmK6rEL3gmZmV\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Transformers are a very exciting family of machine learning architectures. Many good tutorials exist (e.g. 
[1, 2]) but in the last few years, transformers have mostly become simpler, so that [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":720,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_EventAllDay":false,"_EventTimezone":"","_EventStartDate":"","_EventEndDate":"","_EventStartDateUTC":"","_EventEndDateUTC":"","_EventShowMap":false,"_EventShowMapLink":false,"_EventURL":"","_EventCost":"","_EventCostDescription":"","_EventCurrencySymbol":"","_EventCurrencyCode":"","_EventCurrencyPosition":"","_EventDateTimeSeparator":"","_EventTimeRangeSeparator":"","_EventOrganizerID":[],"_EventVenueID":[],"_OrganizerEmail":"","_OrganizerPhone":"","_OrganizerWebsite":"","_VenueAddress":"","_VenueCity":"","_VenueCountry":"","_VenueProvince":"","_VenueState":"","_VenueZip":"","_VenuePhone":"","_VenueURL":"","_VenueStateProvince":"","_VenueLat":"","_VenueLng":"","_VenueShowMap":false,"_VenueShowMapLink":false,"footnotes":""},"categories":[6],"tags":[],"class_list":["post-151","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The Illustrated Transformer From Scratch - Innovative Digital Transformation Jordan<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The Illustrated Transformer From Scratch - Innovative Digital Transformation Jordan\" \/>\n<meta property=\"og:description\" content=\"Transformers are a very exciting family of machine learning architectures. 
Many good tutorials exist (e.g. [1, 2]) but in the last few years, transformers have mostly become simpler, so that [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/\" \/>\n<meta property=\"og:site_name\" content=\"Innovative Digital Transformation Jordan\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-22T11:51:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/peterbloem.nl\/files\/transformers\/self-attention.svg\" \/>\n<meta name=\"author\" content=\"Editor\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Editor\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"41 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/\"},\"author\":{\"name\":\"Editor\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/person\/260ed75841f0c76d83d1b2bc3121f6f6\"},\"headline\":\"The Illustrated Transformer From 
Scratch\",\"datePublished\":\"2023-11-22T11:51:43+00:00\",\"dateModified\":\"2023-11-22T11:51:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/\"},\"wordCount\":7683,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#organization\"},\"image\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png\",\"articleSection\":[\"AI\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/\",\"url\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/\",\"name\":\"The Illustrated Transformer From Scratch - Innovative Digital Transformation 
Jordan\",\"isPartOf\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png\",\"datePublished\":\"2023-11-22T11:51:43+00:00\",\"dateModified\":\"2023-11-22T11:51:43+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage\",\"url\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png\",\"contentUrl\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png\",\"width\":1415,\"height\":804},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"The Illustrated Transformer From Scratch\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#website\",\"url\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/\",\"name\":\"Innovative Digital Transformation Jordan\",\"description\":\"Improve Your Life with 
NetBookLM\",\"publisher\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#organization\",\"name\":\"Innovative Digital Transformation Jordan\",\"url\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2024\/09\/cropped-cropped-Designer-1.jpeg\",\"contentUrl\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2024\/09\/cropped-cropped-Designer-1.jpeg\",\"width\":70,\"height\":70,\"caption\":\"Innovative Digital Transformation Jordan\"},\"image\":{\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/person\/260ed75841f0c76d83d1b2bc3121f6f6\",\"name\":\"Editor\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"Editor\"},\"url\":\"https:\/\/idtjo.hosting.acm.org\/wordpress\/author\/editor\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"The Illustrated Transformer From Scratch - Innovative Digital Transformation Jordan","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/","og_locale":"en_US","og_type":"article","og_title":"The Illustrated Transformer From Scratch - Innovative Digital Transformation Jordan","og_description":"Transformers are a very exciting family of machine learning architectures. Many good tutorials exist (e.g. [1, 2]) but in the last few years, transformers have mostly become simpler, so that [&hellip;]","og_url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/","og_site_name":"Innovative Digital Transformation Jordan","article_published_time":"2023-11-22T11:51:43+00:00","og_image":[{"url":"https:\/\/peterbloem.nl\/files\/transformers\/self-attention.svg"}],"author":"Editor","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Editor","Est. 
reading time":"41 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#article","isPartOf":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/"},"author":{"name":"Editor","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/person\/260ed75841f0c76d83d1b2bc3121f6f6"},"headline":"The Illustrated Transformer From Scratch","datePublished":"2023-11-22T11:51:43+00:00","dateModified":"2023-11-22T11:51:43+00:00","mainEntityOfPage":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/"},"wordCount":7683,"commentCount":0,"publisher":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#organization"},"image":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage"},"thumbnailUrl":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png","articleSection":["AI"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/","url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/","name":"The Illustrated Transformer From Scratch - Innovative Digital Transformation 
Jordan","isPartOf":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#website"},"primaryImageOfPage":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage"},"image":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage"},"thumbnailUrl":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png","datePublished":"2023-11-22T11:51:43+00:00","dateModified":"2023-11-22T11:51:43+00:00","breadcrumb":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#primaryimage","url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png","contentUrl":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png","width":1415,"height":804},{"@type":"BreadcrumbList","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/the-illustrated-transformer\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/idtjo.hosting.acm.org\/wordpress\/"},{"@type":"ListItem","position":2,"name":"The Illustrated Transformer From Scratch"}]},{"@type":"WebSite","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#website","url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/","name":"Innovative Digital Transformation Jordan","description":"Improve Your Life with 
NetBookLM","publisher":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/idtjo.hosting.acm.org\/wordpress\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#organization","name":"Innovative Digital Transformation Jordan","url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/logo\/image\/","url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2024\/09\/cropped-cropped-Designer-1.jpeg","contentUrl":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2024\/09\/cropped-cropped-Designer-1.jpeg","width":70,"height":70,"caption":"Innovative Digital Transformation 
Jordan"},"image":{"@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/person\/260ed75841f0c76d83d1b2bc3121f6f6","name":"Editor","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/idtjo.hosting.acm.org\/wordpress\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"Editor"},"url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/author\/editor\/"}]}},"jetpack_featured_media_url":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-content\/uploads\/2023\/11\/transformer_resideual_layer_norm_3.png","_links":{"self":[{"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/posts\/151","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/comments?post=151"}],"version-history":[{"count":0,"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/posts\/151\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/media\/720"}],"wp:attachment":[{"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/media?parent=151"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/categories?post=151"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/idtjo.hosting.acm.org\/wordpress\/wp-json\/wp\/v2\/tags?post=151"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}