ATTENTION MECHANISM - 🎼🎼 You Just Want Attention 🎼🎼

Ashis Kumar Panda
3 min read · May 17, 2020

The Attention Mechanism is the state-of-the-art approach in the world of NLP. It is being used to solve almost every major problem in NLP.

This blog post is inspired by the Rasa Algorithm Whiteboard series. Make sure to watch and subscribe to the channel.

Why Attention?

Before the Attention Mechanism came into existence, there were CNNs and RNNs. While CNNs aren't well suited for NLP tasks, RNNs have their own limitations: an RNN struggles to make connections between words in longer statements, because it gives more importance to words that are in close proximity. The Attention Mechanism, on the other hand, doesn't depend on the order of the words at all. Hence the Attention Mechanism was introduced to overcome these drawbacks.

Let's do a deep dive into the Attention Mechanism.

Let's take an example:

"Bank of the river." Here, "bank" refers to the river bank and not the bank where we deposit money. The model has to understand the connection between "river" and "bank" to make sense of that. In NLP, every word is represented by an embedding (a vector of numbers). The embedding for the money-deposit "bank" and the embedding for the "bank" of the river have to be different. How do we make sure the embeddings carry more contextual information? The answer is the "Attention Mechanism". To understand this better, let's focus on the diagram below.

Fig-1

Scores are calculated by taking the dot product of the word vectors (the sum of their element-wise products). This captures the semantic relationship between all the words in the sentence. Each score is then normalized, and the normalized scores are used to compute a better contextualized embedding, i.e. Y1, as a weighted sum of the word vectors. Similarly, the same steps can be followed to get Y2, Y3 and Y4. That's the whole story of the basic "Attention Mechanism".
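To make this concrete, here is a minimal sketch in NumPy of the re-weighting step just described. The 4-dimensional toy embeddings and the softmax normalization are my own assumptions for illustration; the numbers are made up and are not taken from the figure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One hypothetical embedding per word of "bank of the river", stacked as rows (v1..v4).
V = np.array([
    [0.9, 0.1, 0.3, 0.4],   # "bank"
    [0.2, 0.8, 0.1, 0.0],   # "of"
    [0.1, 0.7, 0.2, 0.1],   # "the"
    [0.8, 0.2, 0.4, 0.5],   # "river"
])

# Scores for the first word: dot product of v1 with every word vector.
scores = V @ V[0]            # shape (4,)

# Normalize the scores so they sum to 1.
weights = softmax(scores)

# Y1 is the weighted sum of all word vectors: a contextualized "bank".
Y1 = weights @ V
print(Y1)
```

Repeating the same steps with v2, v3 and v4 in place of v1 gives Y2, Y3 and Y4.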

But if we are going to use this for training, wouldn't it be good to have some trainable parameters, i.e. actual weights that can be updated during training? Say no more; check out the description of the image below.

Fig 2
  • The same Fig-1 has been redrawn here in a different way: the input embeddings can be seen at the bottom (v1, v2, v3, v4) and the output contextual embeddings at the top right corner (Y1, Y2, Y3, Y4). The calculation is shown only for the v3-to-Y3 path; the same approach applies to the others.
  • There is one more add-on to the Attention Mechanism: trainable weights. Three different weight matrices, named Keys (Mk), Queries (Mq) and Values (Mv), are introduced (marked in green). These parameters are learned/updated during training. The weights are multiplied with the corresponding embedding vectors, and the results are passed through the re-weighting process, which produces our final contextual embeddings (a small code sketch of this follows the list).
  • In the diagram below, the same idea is drawn in a more vectorized form, and the whole re-weighting block is represented as a Self-Attention Block.
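Here is a minimal sketch of the Self-Attention Block with trainable weights, assuming the same kind of toy embeddings as before. Mq, Mk and Mv are randomly initialised here just to show the shapes; in a real model they would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy embedding size

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(V, Mq, Mk, Mv):
    """Re-weighting with queries, keys and values, as in Fig 2."""
    Q = V @ Mq                          # queries
    K = V @ Mk                          # keys
    Vals = V @ Mv                       # values
    scores = Q @ K.T                    # one score per pair of words
    weights = softmax(scores, axis=-1)  # normalize per word
    return weights @ Vals               # contextual embeddings Y1..Y4

V = rng.normal(size=(4, d))             # v1..v4 stacked as rows
Mq, Mk, Mv = (rng.normal(size=(d, d)) for _ in range(3))

Y = self_attention(V, Mq, Mk, Mv)
print(Y.shape)                          # (4, 4): one row per Y1..Y4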

We can stack multiple such Self-Attention Blocks, depending on the task at hand, so as to get better contextual embeddings at the output, as shown in the image below.

Multiple self attention blocks
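Continuing the previous sketch (same V, d, rng and self_attention helper), stacking blocks is just feeding the output of one block into the next, with each block getting its own weights. The number of blocks here is an arbitrary choice for illustration.

```python
num_blocks = 3
X = V
for _ in range(num_blocks):
    # Each block has its own (hypothetical) trainable weights.
    Mq, Mk, Mv = (rng.normal(size=(d, d)) for _ in range(3))
    X = self_attention(X, Mq, Mk, Mv)

print(X.shape)   # still (4, 4): progressively more contextualized embeddings
```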

In the next blog post, we will discuss how to improve this attention mechanism using Multi-Headed Attention. Until then, goodbye!

Ashis Kumar Panda

https://www.buymeacoffee.com/AshisPanda .. Simplifying tough concepts in the Machine Learning domain, one at a time | Lifelong learner