Transformers
Introduction
In the previous post, we saw how the attention mechanism works. While attention is the core of the transformer, the architecture entails much more than attention alone. The invention of the transformer was a major breakthrough in NLP, and it is the backbone of modern models like GPT-3 and GPT-4.
Workflow
After multi-head attention produces its outputs, the transformer layer still has two major stages left before the data moves to the next layer. Here’s the full post-attention flow, and what happens at each stage:
① Add & Norm (first time)
The raw attention output is added back to the original input token embeddings (the residual/skip connection), then LayerNorm is applied. The residual connection is crucial: it lets gradients flow directly back through the network during training without vanishing, and ensures the model can learn “do nothing” as a valid transformation.
② Feed-Forward Network (FFN)
This is applied independently to each token position. It’s two linear layers with a GELU nonlinearity between them. The first linear layer expands the dimension (typically by 4×, so from 512 → 2048), GELU introduces nonlinearity, then the second layer contracts back to the original size. The FFN is where most of a transformer’s parameters actually live — it’s where factual knowledge and patterns get stored.
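A quick parameter count makes the “most parameters live in the FFN” claim concrete. This sketch assumes the toy sizes mentioned above (\(d_{\text{model}} = 512\), \(d_{\text{ff}} = 2048\)) and counts only the per-layer weight matrices, ignoring LayerNorm and embedding parameters:

```python
# Rough per-layer parameter count, assuming d_model = 512 and d_ff = 2048.
d_model, d_ff = 512, 2048

# Attention projections: W^Q, W^K, W^V, W^O (all heads combined are d_model x d_model).
attn_params = 4 * d_model * d_model

# FFN: W1 (expand), b1, W2 (contract), b2.
ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model

print(attn_params, ffn_params)
```

With these sizes the FFN holds roughly twice as many parameters as the attention projections, which is why it dominates a transformer’s parameter budget.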
③ Add & Norm (second time)
Same pattern: the FFN output is added to the pre-FFN tensor (another residual connection), followed by another LayerNorm. This stabilizes the signal before it’s passed to the next layer.
The whole block then repeats N times (12 layers in GPT-2, 96 in GPT-3). The key insight is that attention handles communication between positions (which token should look at which), while the FFN handles computation per position (what to do with that information). They do fundamentally different jobs.
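The three stages above can be sketched end to end in NumPy. This is a minimal toy, not a real implementation: a single attention head stands in for multi-head attention, the weights are random, and the sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5   # toy sizes; real models use e.g. 512 and 2048

def layer_norm(z, gamma, beta, eps=1e-5):
    # Normalize each position over its feature dimension.
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps) * gamma + beta

def gelu(z):
    # Sigmoid approximation of GELU: z * sigmoid(1.702 z).
    return z / (1.0 + np.exp(-1.702 * z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Random toy weights; a single head stands in for full multi-head attention.
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
gamma1, beta1 = np.ones(d_model), np.zeros(d_model)
gamma2, beta2 = np.ones(d_model), np.zeros(d_model)

def block(x):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_model)) @ V   # attention output
    y = layer_norm(x + attn, gamma1, beta1)          # (1) Add & Norm
    ffn = gelu(y @ W1 + b1) @ W2 + b2                # (2) FFN, per position
    return layer_norm(y + ffn, gamma2, beta2)        # (3) Add & Norm

x = rng.standard_normal((seq_len, d_model))
out = block(x)
print(out.shape)   # same shape as the input, so blocks can be stacked
```

The output has the same shape as the input, which is exactly what lets the block repeat N times.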
Math Formulas
Multi-head attention output
The concatenated heads are projected back to model dimension:
\[\text{MHA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O\]
where each head is:
\[\text{head}_i = \text{Attention}(X W^Q_i,\, X W^K_i,\, X W^V_i) = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i\]
First Add & Norm
\[y = \text{LayerNorm}(x + \text{MHA}(x))\]
where LayerNorm normalizes over the feature dimension:
\[\text{LayerNorm}(z) = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta, \quad \mu = \frac{1}{d}\sum_{j=1}^d z_j, \quad \sigma^2 = \frac{1}{d}\sum_{j=1}^d (z_j - \mu)^2\]
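This formula translates directly to a few lines of NumPy. A minimal sketch (scale \(\gamma = 1\) and shift \(\beta = 0\), so the output should have mean ≈ 0 and standard deviation ≈ 1 along the feature dimension):

```python
import numpy as np

def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize over the last (feature) dimension.
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps) * gamma + beta

z = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(z)
print(out.mean(), out.std())   # approximately 0 and 1
```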
Feed-forward network
Applied independently at each position:
\[\text{FFN}(y) = \text{GELU}(y W_1 + b_1)\, W_2 + b_2\]
where \(W_1 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}}\), \(W_2 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}}\), and GELU is:
\[\text{GELU}(z) = z \cdot \Phi(z) \approx z \cdot \sigma(1.702\, z)\]
with \(\Phi\) the standard Gaussian CDF.
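The exact form and the sigmoid approximation can be compared numerically; the exact Gaussian CDF is available through the standard library’s `math.erf`:

```python
import math

def gelu_exact(z):
    # z * Phi(z), with Phi the standard Gaussian CDF via the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_sigmoid(z):
    # The sigmoid approximation: z * sigmoid(1.702 z).
    return z / (1.0 + math.exp(-1.702 * z))

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{z:+.1f}  exact={gelu_exact(z):+.4f}  approx={gelu_sigmoid(z):+.4f}")
```

The two agree closely near zero and diverge only slightly in the tails, which is why the cheap sigmoid form is a common substitute in practice.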
Second Add & Norm
\[z = \text{LayerNorm}(y + \text{FFN}(y))\]
Full single-layer composition
Putting it all together for layer \(\ell\):
\[y^{(\ell)} = \text{LayerNorm}\!\left(x^{(\ell)} + \text{MHA}^{(\ell)}(x^{(\ell)})\right)\]
\[x^{(\ell+1)} = \text{LayerNorm}\!\left(y^{(\ell)} + \text{FFN}^{(\ell)}(y^{(\ell)})\right)\]
Full stack
Starting from token embeddings \(x^{(0)} = \text{Embed}(t) + \text{PE}\), the output after \(N\) layers is:
\[x^{(N)} = f^{(N-1)} \circ f^{(N-2)} \circ \cdots \circ f^{(0)}\!\left(x^{(0)}\right), \quad x^{(\ell+1)} = f^{(\ell)}\!\left(x^{(\ell)}\right)\]
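The stack is just function composition in a loop. The toy sketch below (random weights, a single attention head standing in for MHA, and the sigmoid GELU approximation) builds each layer with its own fresh weights, so nothing is shared between layers:

```python
import numpy as np

rng = np.random.default_rng(2)
N, seq_len, d_model = 3, 4, 8   # toy sizes (real stacks: N = 12 in GPT-2, 96 in GPT-3)

def layer_norm(z, eps=1e-5):
    mu = z.mean(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(z.var(axis=-1, keepdims=True) + eps)

def gelu(z):
    return z / (1.0 + np.exp(-1.702 * z))   # sigmoid approximation of GELU

def make_layer():
    # Each call draws fresh weights: nothing is shared across layers.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
    W1 = rng.standard_normal((d_model, 4 * d_model)) * 0.1
    W2 = rng.standard_normal((4 * d_model, d_model)) * 0.1
    def f(x):
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        s = Q @ K.T / np.sqrt(d_model)
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)           # attention weights
        y = layer_norm(x + a @ V)                    # Add & Norm after attention
        return layer_norm(y + gelu(y @ W1) @ W2)     # FFN, then Add & Norm
    return f

layers = [make_layer() for _ in range(N)]
x = rng.standard_normal((seq_len, d_model))
for f in layers:                  # x^(l+1) = f^(l)(x^(l))
    x = f(x)
print(x.shape)   # shape preserved through the whole stack
```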
Note that attention is recomputed in every transformer layer: each layer is a complete, independent repetition of the full block:
\[x^{(\ell+1)} = f^{(\ell)}(x^{(\ell)}), \quad f^{(\ell)} = \text{AddNorm} \circ \text{FFN} \circ \text{AddNorm} \circ \text{MHA}^{(\ell)}\]
Each layer \(\ell\) has its own separate set of learned weight matrices:
\[W^Q_i{}^{(\ell)},\; W^K_i{}^{(\ell)},\; W^V_i{}^{(\ell)},\; W^O{}^{(\ell)},\; W_1{}^{(\ell)},\; W_2{}^{(\ell)}, \; \gamma^{(\ell)}, \; \beta^{(\ell)}\]
So no weights are shared across layers — layer 2’s attention is a completely different learned function from layer 1’s.
What makes this powerful is that each layer attends to different things. Research on attention patterns shows a rough hierarchy:
- Early layers — tend to capture local syntax, adjacent-token relationships, simple positional patterns
- Middle layers — coreference, grammatical structure, phrase-level patterns
- Later layers — task-specific, semantic, long-range dependencies
The output \(x^{(\ell)}\) passed into layer \(\ell+1\) is a progressively richer representation of each token — it has already been contextualized by all previous layers’ attention operations. So later layers are computing attention over increasingly abstract representations, not over the raw embeddings.