Multi-Headed Attention Mechanism

Ashis Kumar Panda
2 min read · May 18, 2020

Improving the Self-Attention mechanism

In my last blog post, we discussed Self-Attention. I strongly recommend going through that before tackling the Multi-Headed Attention mechanism. Now, let's see how Multi-Headed Attention can help.

Say we have a sentence:

"I gave my dog Charlie some food." As we can see, there are multiple actions going on in it:

  • “I gave” is one action.
  • “to my dog Charlie” is the second action.
  • “What did I give? (some food)” is the third action.

To keep track of all these actions, we need Multi-Headed Attention.

As you can see in the above image, it is an extension of Self-Attention with multiple heads at the Keys, Queries, and Values blocks, which is why we concatenate the outputs of all the heads and pass them through a dense layer to get the final output. This multi-head mechanism is more efficient because it performs several attention computations in parallel. Earlier, in the Self-Attention mechanism, a single layer was supposed to capture all the actions going on in the sentence "I gave my dog Charlie some food." With Multi-Headed Attention, these actions are shared across multiple heads and captured better. Thanks to Rasa for their wonderful explanation of Multi-Head Attention.
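To make the idea concrete, here is a minimal PyTorch sketch of multi-headed self-attention (my own toy code, not Rasa's implementation): the input is projected into Queries, Keys, and Values, split into several heads that each run scaled dot-product attention in parallel, and the per-head outputs are concatenated and passed through a final dense layer. The sizes `d_model=64` and `num_heads=4` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for Queries, Keys and Values (all heads at once)
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Final dense layer applied after the heads are concatenated
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape

        # Project, then split into heads: (batch, num_heads, seq_len, d_head)
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed for every head in parallel
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        context = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads back together and pass through the dense layer
        context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(context)

# Example: a toy batch of one 7-token sentence with random 64-dim embeddings
x = torch.randn(1, 7, 64)
attn = MultiHeadSelfAttention(d_model=64, num_heads=4)
print(attn(x).shape)  # torch.Size([1, 7, 64])
```

Each head gets its own slice of the model dimension, so one head is free to focus on "who gave", another on "to whom", another on "what was given", and the final dense layer mixes these views back into a single representation.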

In my next blog post, we will discuss Transformers, in which Multi-Head Attention plays a crucial role. Until then, goodbye.


Ashis Kumar Panda

https://www.buymeacoffee.com/AshisPanda | Simplifying tough concepts in the Machine Learning domain, one at a time | Lifelong learner