This is a core component of the transformer.
Overview
Multi-head attention divides the model dimension of the queries, keys, and values ($Q$, $K$, $V$) by the number of heads, applies attention independently in each head, concatenates the results, and finally applies a learned linear transformation ($W^O$) to produce the final output.
Why Multiple Heads?
Input sequences contain various types of relationships and patterns:
- Head 1 might capture grammatical structure
- Head 2 might focus on semantic relationships
- Deeper heads might learn more complex, abstract patterns
The final learned parameter $W^O$ combines and refines these different aspects into a coherent representation.
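A minimal sketch of this split / per-head attention / concatenate / $W^O$ flow, assuming PyTorch with batch-first tensors and the weight matrices passed in as plain tensors (names here are illustrative, not from the original):

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (batch, seq_len, d_model)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads          # each head works on a slice of d_model

    # Project the inputs, then split d_model into (num_heads, d_head)
    def split_heads(t):
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention, run independently in every head
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (batch, heads, seq, seq)
    weights = F.softmax(scores, dim=-1)
    per_head = weights @ v                              # (batch, heads, seq, d_head)

    # Concatenate the heads back to d_model and mix them with W^O
    concat = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)
    return concat @ w_o
```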
Attention Mechanisms
One-Head Attention
Components and Dimensions
| Component | Shape | Description |
|---|---|---|
| Query (Q) | (seq_length, d_model) | Input sequence transformation |
| Key (K) | (seq_length, d_model) | Used to compute attention scores |
| Value (V) | (seq_length, d_model) | Used to compute final output |
Learnable Parameters
| Parameter | Shape | Purpose |
|---|---|---|
| $W^Q$ | (d_model, d_model) | Query transformation |
| $W^K$ | (d_model, d_model) | Key transformation |
| $W^V$ | (d_model, d_model) | Value transformation |
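One minimal way to realize these three parameter matrices, assuming PyTorch (`nn.Linear` without bias is just one common choice; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

d_model = 512                                  # illustrative model dimension
x = torch.randn(10, d_model)                   # (seq_length, d_model) input embeddings

w_q = nn.Linear(d_model, d_model, bias=False)  # W^Q
w_k = nn.Linear(d_model, d_model, bias=False)  # W^K
w_v = nn.Linear(d_model, d_model, bias=False)  # W^V

Q, K, V = w_q(x), w_k(x), w_v(x)               # each is (seq_length, d_model)
```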
Attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
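In code, the formula is only a few lines (a sketch assuming PyTorch; masking is covered later in the masking section):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # each row becomes a probability distribution
    return weights @ V                             # weighted sum of the values
```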
Process Steps
- Apply the linear transformations $W^Q$, $W^K$, $W^V$ to get $Q$, $K$, and $V$.
- Compute the dot product of $Q$ and $K^T$.
  - For self-attention: think of this as finding attention scores, i.e. relationships within the input sequence (e.g. grammar or more abstract ideas; hopefully the model learns something more complex as we add more heads).
  - For cross-attention: similar to self-attention, except $Q$ and $K$ come from different sequences, as in a translation task, so the attention scores relate the two sequences (e.g. ideas of how to translate Thai to English).
- Divide by $\sqrt{d_k}$: for numerical stability (keeps the dot products from growing too large before the softmax).
- Apply `softmax`: converts the attention scores into probabilities (values between 0 and 1 that sum to 1 across each row).
- Compute the dot product of the attention weights and $V$:
  - Think of it this way: once we know the relationships for each word (the attention scores), what exactly should be added to each embedding to reflect them? Multiplying the weights with $V$ produces those new representations.
- Now we have new representations for every word that can be added to the original embeddings (in the Add & Norm layer), giving better, more meaningful representations (a step-by-step sketch follows below).
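A tiny end-to-end walkthrough of the steps above, assuming PyTorch; the numbers and random weight matrices are purely illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 3, 4                      # e.g. the three tokens of "A blue cat"
x = torch.randn(seq_len, d_model)            # toy input embeddings

# Step 1: linear transformations (random matrices stand in for learned W^Q, W^K, W^V)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Steps 2-3: dot product of Q and K^T, scaled by sqrt(d_k)
scores = Q @ K.T / d_model ** 0.5            # (3, 3) attention scores

# Step 4: softmax turns every row into probabilities
weights = F.softmax(scores, dim=-1)
print(weights.sum(dim=-1))                   # each row sums to 1

# Step 5: weighted sum of V gives the new representations
new_repr = weights @ V                       # (3, 4); added back to x in Add & Norm
print(new_repr.shape)
```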
Practical Example
Consider the sequence “A blue cat”. When the model computes self-attention for the word “cat”, it first computes the attention scores and learns the idea that this is an unusual, blue cat; it then multiplies those scores with $V$ to get a new vector that reflects this knowledge. Adding that vector to the original embedding (done in the Add & Norm layer) gives a new representation whose meaning has shifted from “cat” to “blue cat”.
(Figure: the embedding's meaning before vs. after attention.)
Masking Types
Decoder Self-Attention Masking (causal mask)
[ 1 0 0 ] # First position can only look at itself
[ 1 1 0 ] # Second position can look at first and itself
[ 1 1 1 ] # Third position can look at everything up to itself
- When computing $QK^T$, set the upper triangle to `-inf` before the `softmax` (see the sketch below)
- Prevents model from seeing future tokens during training
- Essential for autoregressive generation
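A sketch of building and applying the causal mask, assuming PyTorch (`torch.triu` marks the future positions that get `-inf`; the scores tensor is a stand-in):

```python
import torch
import torch.nn.functional as F

seq_len = 3
scores = torch.randn(seq_len, seq_len)       # stands in for QK^T / sqrt(d_k) over 3 tokens

# True above the diagonal = future positions that must be hidden
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float('-inf'))

weights = F.softmax(masked_scores, dim=-1)
# Row i now puts zero weight on positions j > i, matching the 1/0 matrix above.
```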
Padding Masking
[ 1 1 1 0 0 ]
[ 1 1 1 0 0 ]
[ 1 1 1 0 0 ]
- Applied to both encoder and decoder
- Masks padding tokens to prevent them from contributing to attention (see the sketch below)
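A sketch of the padding mask, assuming PyTorch and that padding tokens have id 0 (both assumptions, not from the original):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0                                            # assumed padding token id
tokens = torch.tensor([[5, 2, 9, PAD_ID, PAD_ID]])    # (batch=1, seq_len=5), last two are padding
scores = torch.randn(1, 5, 5)                         # stands in for QK^T / sqrt(d_k)

# Hide the key positions that are padding, for every query position
pad_mask = (tokens == PAD_ID).unsqueeze(1)            # (1, 1, 5), broadcasts over queries
weights = F.softmax(scores.masked_fill(pad_mask, float('-inf')), dim=-1)
# Padding tokens now receive zero attention weight from every query.
```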