Transformers, a type of neural network that has gained wide popularity, were introduced in the paper ‘Attention is all you need’ by Google in 2017. The Transformer architecture relies entirely on the attention mechanism to draw global dependencies between input and output. Because it dispenses with recurrence, the Transformer allows for significantly more parallelization than recurrent models.
In sequence-to-sequence problems, RNNs were traditionally used in an encoder-decoder architecture. The limitation of the RNN seq2seq model becomes evident when working with long sequences: the model’s ability to retain information from the first elements fades, and the context is lost. In the encoder, the hidden state at every step is associated with a certain word in the input sentence, usually one of the most recent. Therefore, if the decoder only accesses the last hidden state of the encoder, it will lose relevant information about the first elements of the sequence. The attention mechanism was introduced to deal with this limitation. Instead of using only the last encoder state, each step of the decoder looks at all the states of the encoder. This is what attention does: it extracts information from the whole sequence as a weighted sum of all the encoder states. This allows the decoder to give importance to a certain element of the input for each element of the output. But this approach still has an important limitation: each sequence must be processed one element at a time. Because of this fundamental constraint of sequential computation, the network cannot be parallelized across time steps, so training on a huge corpus or on long sequences is time consuming and computationally inefficient. The Transformer architecture overcomes this.
The Transformer architecture keeps the encoder-decoder framework that was part of the original attention networks: given an input sequence, create an encoding of it based on the context, and decode that context-based encoding into the output sequence. The Transformer architecture is shown in Fig. 1.
Given a sequence of tokens x1, x2, x3, … , the input sequence corresponds to an embedding of each of these tokens. This embedding could be something as simple as one-hot encodings. Here we use an embedding layer to learn the embeddings.
Positional encoding lets the network know the positions of the tokens in the sequence. This is necessary because attention processes all tokens at once, so the model has no other way to capture the sequential nature of the data.
Thus, given an embedding for token x at position i, a positional encoding for the i’th position is added to that embedding. Hence, assuming $x_1, x_2, \dots, x_n$ are the tokens, $e_1, e_2, \dots, e_n$ are the embeddings obtained after passing the tokens through the embedding layer, and $p_1, p_2, \dots, p_n$ are the positional embeddings, the input to the multi-head attention would be $e_1 + p_1, e_2 + p_2, \dots, e_n + p_n$.
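As a minimal sketch of this addition, the NumPy snippet below builds positional encodings with the sinusoidal formula from the original paper and adds them to the token embeddings. The sinusoidal choice, the sizes, the random embedding table and the token ids are all assumptions made for illustration; a learned positional embedding is another common option.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings as in the original paper (an assumption here).
    d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 50, 8, 5               # hypothetical sizes

# Random table standing in for a learned embedding layer.
embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([3, 14, 15, 9, 2])               # hypothetical token ids

e = embedding_table[token_ids]                        # token embeddings
p = sinusoidal_positional_encoding(seq_len, d_model)  # positional encodings
attention_input = e + p                               # input to multi-head attention
```

Because the encoding depends only on the position, the same token receives a different input vector at each position in the sequence.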
The encoder layer maps the input sequence into an abstract continuous representation that holds the learned information for that entire sequence. From Fig. 1, we see that it contains two sub-modules: multi-headed attention, followed by a fully connected network. There is also a residual connection around each of the two sub-layers, followed by a layer normalization.
The self-attention layer tries to encode each word based on the other words in the sentence. It measures the encoding of the word against the encoding of every other word and produces a new encoding. Here we need to introduce three elements: Queries, Keys and Values. Given an embedding x, the layer learns three separate smaller embeddings from it: a query, a key and a value. They have the same number of dimensions. The same embedding is fed to three linear layers whose weight matrices Wq, Wk and Wv are learnt during training. Thus the incoming token embedding is multiplied by these three matrices to get the Query, Key and Value. This is shown in Fig. 2.
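The projection step can be sketched in a few lines of NumPy. The matrices are randomly initialised here purely for illustration (in a real model they are learned), and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8      # hypothetical sizes

# One row per token: embedding plus positional encoding.
X = rng.normal(size=(seq_len, d_model))

# Stand-ins for the learned weight matrices Wq, Wk and Wv.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# The same input is multiplied by each matrix to get Query, Key and Value.
Q = X @ W_q
K = X @ W_k
V = X @ W_v
```

Each row of Q, K and V belongs to the token in the same row of X.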
Now let’s see what is done after obtaining the Query, Key and Value. Refer to Fig. 3.
We use the Query (Q), Key (K) and Value (V) to calculate the attention scores. First, the dot product of the first Query vector with the Key vector of each word in the sequence is computed, one word after another. The same is then done for the second Query vector, and so on for the remaining ones. This gives us a score matrix with one row per Query and one column per Key. The scores measure how much focus to place on the other words of the input sequence with respect to the word at a certain position. So each word has a score for every other word in the sequence. The higher the score, the more focus. This is how the queries are mapped to the keys.
Next we scale the scores in order to have more stable gradients, as multiplying values can have exploding effects. This is done by dividing the score matrix by the square root of the dimension of the query and key vectors. The scaled matrix is then passed through a softmax to get the attention weights. Each weight is a value between 0 and 1, and the weights for a given query sum to 1. The softmax amplifies the higher scores and compresses the lower ones, which allows the model to be more confident about which words to attend to.
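Once we obtain the softmax output, this matrix is multiplied with the Value matrix. The higher softmax weights preserve the values of the words the model has learnt are more important, while the lower weights drown out the irrelevant words. The output is then fed into a linear layer for further processing. The equation for the entire process is given by
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the query and key vectors.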
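The whole sequence of steps (dot products, scaling, softmax, multiplication with the values) can be sketched as a single NumPy function. The random Q, K and V and their sizes are placeholders; the final linear layer is left out to keep the sketch short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                     # one row per query, one column per key
    scaled = scores / np.sqrt(d_k)       # scale for more stable gradients
    weights = softmax(scaled, axis=-1)   # attention weights, each row sums to 1
    return weights @ V, weights          # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                      # hypothetical sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

output, weights = scaled_dot_product_attention(Q, K, V)
```

Row i of `weights` tells how much the word at position i attends to every word in the sequence.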
Multi-head attention provides more discriminative power to self-attention. It is applied by dividing the word vectors into a fixed number (N, the number of heads) of chunks and applying self-attention to the corresponding chunks, using Q, K and V sub-matrices. This produces N different output score matrices. In theory, each head would learn something different, therefore giving the encoder model more representation power. Thus multi-headed attention is a module in the transformer network that computes the attention weights for the input and produces an output vector with encoded information on how each word should attend to all other words in the sequence. Multi-head attention is shown in Fig. 4.
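A rough NumPy sketch of the chunking idea follows. It projects the input, splits Q, K and V into per-head chunks, runs scaled dot-product attention in every head, then concatenates the heads and applies an output projection. The output matrix `W_o`, the random weights and the sizes are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads        # size of each chunk

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently in every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # (num_heads, seq_len, seq_len)
    heads = weights @ V                  # (num_heads, seq_len, d_head)

    # Concatenate the heads back into (seq_len, d_model) and project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2    # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mha_output, head_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each head gets its own score matrix, so `head_weights` holds one attention pattern per head.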
Residual Connection, Layer Normalisation and Feed-Forward Network
The output from the multi-headed attention is added to the original positional input embedding. This is called a residual connection. The output of the residual connection goes through a layer normalization. The normalized residual output gets projected through a pointwise feed-forward network for further processing. The pointwise feed-forward network is a couple of linear layers with a ReLU activation in between. The output of that is then again added to the input of the pointwise feed-forward network and further normalized.
The residual connections help the network train by allowing gradients to flow through the network directly. The layer normalizations stabilize the network, which substantially reduces the training time. The pointwise feed-forward layer projects the attention outputs, potentially giving them a richer representation.
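The two "add and normalize" steps around the feed-forward network can be sketched as below. This is a simplified illustration: the layer normalization omits the learnable gain and bias that real implementations usually include, the attention output is a random stand-in, and all sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # (Learnable gain and bias are omitted in this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU activation in between.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32                # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))          # positional input embedding
attn_out = rng.normal(size=(seq_len, d_model))   # stand-in for the attention output

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

h = layer_norm(x + attn_out)                                  # residual, then normalize
encoder_out = layer_norm(h + feed_forward(h, W1, b1, W2, b2))  # second residual, normalize
```

The shape stays (seq_len, d_model) throughout, which is what makes it possible to add a sub-layer's input to its output.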
The Transformer decoder is shown in Fig. 5.
The decoder has a structure similar to that of the encoder. It has two multi-headed attention layers and a pointwise feed-forward layer, with a residual connection and layer normalization after each sub-layer. These sub-layers behave similarly to the layers in the encoder, but each multi-headed attention layer has a different job. The decoder is capped off with a linear layer that acts as a classifier, and a softmax to get the word probabilities.
Decoder Input embeddings and Positional embeddings
The decoder’s input is prepared in the same way as the encoder’s. The input tokens go through an embedding layer to obtain the embeddings, and the positional details are added using a positional encoding layer. As in the encoder, the input and positional embeddings are added and fed into the first multi-head attention layer, which computes the attention scores for the decoder’s input.
Decoder’s Masked Multi-head attention
This attention layer behaves slightly differently from the encoder’s attention layer. Here we use a mask to prevent the decoder from looking at future tokens. The mask is added after scaling the scores and before calculating the softmax. Let’s take a look at how it is done.
The mask is a matrix of the same size as the attention scores, filled with 0’s and negative infinities. When you add the mask to the scaled attention scores, you get a matrix of the scores with the top right triangle filled with negative infinities. The reason for the mask is that once you take the softmax of the masked scores, the negative infinities get zeroed out, leaving zero attention weights for future tokens.
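The masking step can be sketched in NumPy as follows. The scaled scores are random placeholders and the sequence length is hypothetical; only the construction of the mask and its effect on the softmax matter here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4                                        # hypothetical length
rng = np.random.default_rng(0)
scaled_scores = rng.normal(size=(seq_len, seq_len))  # stand-in for scaled Q.K^T

# True above the diagonal, i.e. wherever a query would look at a future token.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
mask = np.where(future, -np.inf, 0.0)              # 0's and negative infinities

masked_scores = scaled_scores + mask               # top right triangle becomes -inf
masked_weights = softmax(masked_scores, axis=-1)   # exp(-inf) = 0 for future tokens
```

The first token can only attend to itself, so the first row of `masked_weights` is 1 followed by zeros, and every row still sums to 1.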
This masking is the only difference in how the attention scores are calculated in the first multi-headed attention layer. This layer still has multiple heads, and the mask is applied in each of them before their outputs are concatenated and fed through a linear layer for further processing. The output of the first multi-headed attention is a masked output vector with information on how the model should attend to the decoder’s input. A single head of the masked multi-head attention is shown in Fig. 6.
Decoder’s Second Multi-Head Attention and Pointwise Feed Forward Layer, Linear Classifier and Final Softmax Output
For the second multi-headed attention layer, the encoder’s outputs are the keys and the values, and the outputs of the first multi-headed attention layer are the queries. This process matches the encoder’s input to the decoder’s input, allowing the decoder to decide which encoder input is relevant to put a focus on. The output of the second multi-headed attention goes through a pointwise feed-forward layer for further processing.
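A single-head NumPy sketch of this encoder-decoder attention is given below. It follows the formulation of the original paper, where the queries come from the decoder and the keys and values come from the encoder output. The random weights, the stand-in inputs and the sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    Q = decoder_states @ W_q             # queries from the decoder
    K = encoder_outputs @ W_k            # keys from the encoder
    V = encoder_outputs @ W_v            # values from the encoder
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)   # (target_len, source_len)
    return weights @ V, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 6, 3, 8      # hypothetical sizes
encoder_outputs = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))  # output of the masked attention
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

cross_out, cross_weights = cross_attention(
    decoder_states, encoder_outputs, W_q, W_k, W_v
)
```

No mask is needed in this layer, since the decoder is allowed to look at every position of the input sequence.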
The output of the final feed-forward layer is fed into a final linear layer. This layer acts as a classifier. The classifier is as big as the number of classes we have, that is, one output per word in the vocabulary. The output of the classifier then gets fed into a softmax layer, which produces probability scores between 0 and 1. We take the index of the highest probability score, and that equals our predicted word.
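The classification step for one decoder position might look like the following sketch. The vocabulary size, the random classifier weights and the decoder output vector are placeholders, not values from the article's notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10                       # hypothetical sizes

decoder_output = rng.normal(size=(d_model,))      # vector for the latest position
W_vocab = rng.normal(size=(d_model, vocab_size))  # stand-in for the learned classifier
b_vocab = np.zeros(vocab_size)

logits = decoder_output @ W_vocab + b_vocab       # one score per class
probs = softmax(logits)                           # probabilities between 0 and 1
predicted_id = int(np.argmax(probs))              # index of the predicted word
```

Since the softmax preserves the ordering of the scores, taking the argmax of `probs` picks the same index as taking the argmax of `logits`.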
The decoder then takes the output, adds it to the list of decoder inputs, and continues decoding until an end token is predicted. In our case, decoding stops when the highest probability prediction is the final class, which is assigned to the end token. The decoder can also be stacked N layers high, each layer taking in inputs from the encoder and the layers before it. By stacking the layers, the model can learn to extract and focus on different combinations of attention from its attention heads, potentially boosting its predictive power.
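The decoding loop itself can be sketched independently of the network. In the snippet below, `toy_next_token_probs` is a hypothetical stand-in for the full encoder-decoder forward pass, and the token ids, vocabulary size and `max_len` cap are assumptions for illustration.

```python
import numpy as np

START_ID, END_ID, VOCAB_SIZE = 0, 1, 6   # hypothetical token ids and vocabulary size

def toy_next_token_probs(decoder_inputs):
    """Hypothetical stand-in for the model's forward pass.
    Favours token len(decoder_inputs) + 1 until four tokens are present,
    then favours the end token."""
    probs = np.full(VOCAB_SIZE, 0.01)
    next_id = END_ID if len(decoder_inputs) >= 4 else len(decoder_inputs) + 1
    probs[next_id] = 1.0
    return probs / probs.sum()

def greedy_decode(next_token_probs, max_len=10):
    decoder_inputs = [START_ID]
    while len(decoder_inputs) < max_len:
        probs = next_token_probs(decoder_inputs)
        next_id = int(np.argmax(probs))  # take the highest probability class
        decoder_inputs.append(next_id)   # add it to the list of decoder inputs
        if next_id == END_ID:            # stop once the end token is predicted
            break
    return decoder_inputs

decoded = greedy_decode(toy_next_token_probs)
```

The `max_len` cap is a safeguard added in this sketch so that decoding terminates even if the end token is never predicted.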
The code is given at https://github.com/raghav-menon/END/blob/main/Session12/Session_12_HandsOn_VI.ipynb