Transformer Architecture: Encoders

Peter N Robinson

Overview: Lecture 1

Game plan

This lecture is the first of three about the transformer architecture and selected applications. We will start with a top-level overview to provide some context and will then explain the mathematics of each section in detail. The practical sessions will involve the coding of a "toy" LLM.

Lecture 1

Foundations

  • Top-level overview
  • Embeddings
  • Positional encodings

Lecture 2

Architecture

  • Multi-Head Attention
  • Residual Connections
  • Layer Norm

Lecture 3

Applications

  • Training & Data
  • Fine-tuning
  • Toy LLM Coding

Section 1: Tokenization and Embeddings

  1. Tokenization and Embeddings
  1. Positional Encodings
  1. Multi-head attention
  1. Layer Normalization

Tokens

The need for tokenization

  • LLMs (and many other machine learning algorithms) process numbers rather than text
  • Therefore, to process text, it first must be converted to numbers.

Tokenization by OpenAI

Figure 1: Created with https://platform.openai.com/tokenizer
  • A token is a piece of text that can be represented by an integer.
  • Here, we see an example GPT subword tokenization of the word discombobulate:
    • dis: 4220
    • comb: 43606
    • ob: 630
    • ulate: 10111
  • The integer that is associated with the token is called token ID or token index.1

Tokenization

Definition

The process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words.

| Tokenization Method | Granularity | Example | Pros | Cons |
|---|---|---|---|---|
| Word | Full words | [“Machine”, “learning”] | High semantic clarity | Massive vocabulary; fails on “unknown” words |
| Subword | Meaningful chunks | [“ma”, “chine”, “learn”, “ing”] | Handles new words well | Slightly more complex to implement |
| Character | Individual letters | [“m”, “a”, “c”, “h”, “i”, “n”, “e”, “l”, “r”, “g”] | Tiny vocabulary; no “unknown” words | Loses semantic meaning; very long sequences |

Why not just use word or character embeddings?

Major problems with word embeddings

  • There are extremely many unique words in natural language
  • Words can have different lexical forms (run, runs, ran, running, …)
  • Out of vocabulary words cannot be represented
  • Limitations of character embeddings
    • Unicode currently has more than 150K characters (and growing!)
    • character-based embeddings ignore linguistic regularities (e.g., in English, ‘t’ is more often followed by ‘h’ than by ‘q’)
    • requires a lot of memory to cover any given text in the context (there are more character embeddings than subword embeddings and more subword embeddings than word embeddings)

Tokens are then converted to embeddings

  • LLMs do not use tokens directly.
  • Instead, an intermediate step converts input tokens to embeddings

Embeddings

  • Embeddings are dense numeric representations of tokens
  • Advantages over plain integer IDs:
    • Much more information about a token can be packed into a short vector of numbers
    • Semantic relations across tokens can be captured (similar tokens get similar vectors)
  • During training, an LLM learns the embeddings that represent the individual tokens
Figure 2: (a) coding tokens as embeddings

The Embedding Mapping

The embedding process transforms discrete tokens into a continuous vector space \(\mathcal{V} \to \mathbb{R}^d\).

\[\begin{equation} \mathbf{e}_i = \text{embed}(w_i) = \mathbf{x}_i^\top \mathbf{W}_e \end{equation}\]

Where:

  • One-hot vector: \(\mathbf{x}_i \in \{0,1\}^V\), where \(V\) is the vocabulary size.
  • Weight Matrix: \(\mathbf{W}_e \in \mathbb{R}^{V \times d}\) contains the learnable parameters.
  • Projection: The result \(\mathbf{e}_i\) is the \(i\)-th row of the matrix \(\mathbf{W}_e\).

\[ \mathbf{W}_e = \begin{bmatrix} \leftarrow & \mathbf{e}_0 & \rightarrow \\ \leftarrow & \mathbf{e}_1 & \rightarrow \\ & \vdots & \\ \leftarrow & \mathbf{e}_{V-1} & \rightarrow \end{bmatrix} \]
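
As a sanity check, here is a minimal numpy sketch (the vocabulary size, dimension, and token ID are made-up toy values) showing that multiplying a one-hot vector by \(\mathbf{W}_e\) is the same as looking up its \(i\)-th row:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                          # toy vocabulary size and embedding dimension
W_e = rng.normal(size=(V, d))         # learnable embedding matrix, V x d

token_id = 7                          # token index produced by the tokenizer
x = np.zeros(V)
x[token_id] = 1.0                     # one-hot vector x_i

e_matmul = x @ W_e                    # x_i^T W_e
e_lookup = W_e[token_id]              # direct row lookup (what frameworks actually do)

assert np.allclose(e_matmul, e_lookup)
print(e_lookup)
```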

Embeddings

  • \(\mathbf{x}_i^\top \mathbf{W}_e\) just picks out the \(i^{th}\) row of \(\mathbf{W}_e\).

  • The dimension of the embeddings is adjustable. For instance, SBERT has embeddings with a dimension of 768.

  • The embedding matrix is initialized to random values and learned during training by back-propagation

  • Todo: explain learning procedure (or refer to previous slides)

Section 2: Positional Encodings

  1. Tokenization and Embeddings
  1. Positional Encodings
  1. Multi-head attention
  1. Layer Normalization

RNNs vs Encoders

  • Recurrent neural networks process one token at a time (serial), whereas transformer-based encoders process an entire sequence of tokens at once (parallel).
  • Positional encodings are needed for the encoder to keep track of the relative and absolute positions of tokens
  • Todo: explain RNNs/exploding gradients in previous lecture

Figure 3: RNN vs Encoder

Positional information

  • Transformers process entire sequence in parallel
  • They lack inherent information about the positions of tokens
  • Positional encodings are added to embeddings to provide position information
  • Example: for the sentence “I am an automaton” we have four embedding vectors: \(\left[E_{\mathrm{I}}, E_{\mathrm{am}}, E_{\mathrm{an}}, E_{\mathrm{automaton}}\right]\)
  • We calculate four position encodings \(\left[P_0,P_1,P_2,P_3\right]\)
  • These are then combined by elementwise addition:

\[ \left[E_{\mathrm{I}}+P_0, E_{\mathrm{am}}+P_1,E_{\mathrm{an}}+P_2, E_{\mathrm{automaton}}+P_3\right] \]

Calculating positional encodings (PEs)

For even embedding dimensions (\(2i\)):

\[ \mathrm{PE}(pos,2i) = \sin\left(\dfrac{pos}{10000^{\frac{2i}{d}}}\right) \]

and for odd embedding dimensions (\(2i+1\)):

\[ \mathrm{PE}(pos,2i+1) = \cos\left(\dfrac{pos}{10000^{\frac{2i}{d}}}\right) \] Note that the index \(i\) runs over pairs of dimensions: dimensions \(2i\) and \(2i+1\) share the same frequency but use sine and cosine, respectively.

Understanding the PE Matrix (1)

A positional encoding has the same dimension as an embedding, and a value is calculated for each dimension using the sine and cosine formulas. (To keep the numbers readable, this worked example uses the constant 100 instead of 10,000.)

| Sequence | Index (k) | i=0 (sine) | i=0 (cosine) | i=1 (sine) | i=1 (cosine) |
|---|---|---|---|---|---|
| I | 0 | sin(0/1) = 0.00 | cos(0/1) = 1.00 | sin(0/10) = 0.00 | cos(0/10) = 1.00 |
| am | 1 | sin(1/1) = 0.84 | cos(1/1) = 0.54 | sin(1/10) = 0.10 | cos(1/10) = 1.00 |
| an | 2 | sin(2/1) = 0.91 | cos(2/1) = -0.42 | sin(2/10) = 0.20 | cos(2/10) = 0.98 |
| automaton | 3 | sin(3/1) = 0.14 | cos(3/1) = -0.99 | sin(3/10) = 0.30 | cos(3/10) = 0.96 |
  • dimension (\(d\)): 4 in this example, generally much larger (e.g., 768 for BERT)
  • example: first word (position – \(k=0\))
    • \(k=0\), \(i=0\), even dimension \(2i=0\): \(\mathrm{PE}(k,2i) = \sin\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \sin\left(\frac{0}{100^{\frac{0}{4}}}\right) = \sin(0)=0\)
    • \(k=0\), \(i=0\), odd dimension \(2i+1=1\): \(\mathrm{PE}(k,2i+1) = \cos\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \cos\left(\frac{0}{100^{\frac{0}{4}}}\right) = \cos(0)= 1\)
    • \(k=0\), \(i=1\), even dimension \(2i=2\): \(\mathrm{PE}(k,2i) = \sin\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \sin\left(\frac{0}{100^{\frac{2}{4}}}\right) =\sin(0)=0\)
    • \(k=0\), \(i=1\), odd dimension \(2i+1=3\): \(\mathrm{PE}(k,2i+1) = \cos\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \cos\left(\frac{0}{100^{\frac{2}{4}}}\right) = \cos(0)= 1\)

Understanding the PE Matrix (2)

A positional encoding has the same dimension as an embedding, and a value is calculated for each dimension using the sine and cosine formulas. (To keep the numbers readable, this worked example uses the constant 100 instead of 10,000.)

| Sequence | Index (k) | i=0 (sine) | i=0 (cosine) | i=1 (sine) | i=1 (cosine) |
|---|---|---|---|---|---|
| I | 0 | sin(0/1) = 0.00 | cos(0/1) = 1.00 | sin(0/10) = 0.00 | cos(0/10) = 1.00 |
| am | 1 | sin(1/1) = 0.84 | cos(1/1) = 0.54 | sin(1/10) = 0.10 | cos(1/10) = 1.00 |
| an | 2 | sin(2/1) = 0.91 | cos(2/1) = -0.42 | sin(2/10) = 0.20 | cos(2/10) = 0.98 |
| automaton | 3 | sin(3/1) = 0.14 | cos(3/1) = -0.99 | sin(3/10) = 0.30 | cos(3/10) = 0.96 |
  • dimension (\(d\)): 4 in this example, generally much larger (e.g., 768 for BERT)
  • example: second word (position – \(k=1\))
    • \(k=1\), \(i=0\), even dimension \(2i=0\): \(\mathrm{PE}(k,2i) = \sin\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \sin\left(\frac{1}{100^{\frac{0}{4}}}\right) = \sin(1)=0.8415\)
    • \(k=1\), \(i=0\), odd dimension \(2i+1=1\): \(\mathrm{PE}(k,2i+1) = \cos\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \cos\left(\frac{1}{100^{\frac{0}{4}}}\right) = \cos(1)= 0.5403\)
    • \(k=1\), \(i=1\), even dimension \(2i=2\): \(\mathrm{PE}(k,2i) = \sin\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \sin\left(\frac{1}{100^{\frac{2}{4}}}\right) =\sin(1/10)=0.0998\)
    • \(k=1\), \(i=1\), odd dimension \(2i+1=3\): \(\mathrm{PE}(k,2i+1) = \cos\left(\frac{k}{100^{\frac{2i}{d}}}\right) = \cos\left(\frac{1}{100^{\frac{2}{4}}}\right) = \cos(1/10)= 0.9950\)
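
The encoding can be computed in a few lines of numpy. The sketch below is one possible implementation; calling it with the illustrative constant \(n=100\) and \(d=4\) reproduces the toy table above (the original paper uses \(n=10{,}000\)):

```python
import numpy as np

def positional_encoding(seq_len, d, n=10000.0):
    """Return a (seq_len x d) matrix of sinusoidal positional encodings."""
    pe = np.zeros((seq_len, d))
    pos = np.arange(seq_len)[:, None]          # positions k = 0 .. seq_len-1
    i = np.arange(d // 2)[None, :]             # pair index i = 0 .. d/2-1
    angle = pos / n ** (2 * i / d)             # pos / n^(2i/d)
    pe[:, 0::2] = np.sin(angle)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions: cosine
    return pe

# Reproduce the 4-dimensional toy example ("I am an automaton", n = 100)
print(np.round(positional_encoding(4, 4, n=100.0), 2))
# [[ 0.    1.    0.    1.  ]
#  [ 0.84  0.54  0.1   1.  ]
#  [ 0.91 -0.42  0.2   0.98]
#  [ 0.14 -0.99  0.3   0.96]]
```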

Understanding the PE Matrix (3)

The denominator of the formula contains the “magic number” 10,000: \[ \mathrm{PE}(k,2i) = \sin\left(\frac{k}{10000^{\frac{2i}{d}}}\right) \]

  • A heuristic choice made by the authors of the original “Attention Is All You Need” paper.1
  • creates a geometric progression of “frequencies.”
  • At \(i=0\) (Smallest wavelength): The denominator is \(10000^0 = 1\). The wave oscillates very quickly.
  • At \(i = d_{model}/2\) (Largest wavelength): The denominator is \(10000^1 = 10000\). The wave oscillates very slowly.

Understanding the PE Matrix (4)

  • If the constant \(n\) were much smaller (e.g., 50), even the slowest sine and cosine waves would complete a full cycle after a few hundred tokens, so distant positions could end up with nearly identical encoding vectors.
  • \(n=10000\) ensures that for any sequence length a Transformer is likely to see (e.g., 512 to 8,192 tokens), every single position has a unique combination of high-frequency and low-frequency signals.

Properties of the Positional Encoding (1)

The Linearity Property (Relative Positioning)

For any fixed offset \(\Delta k\), the encoding at a shifted position can be represented as a linear transformation of the original:

\[PE_{pos + \Delta k} = \mathbf{M}_{\Delta k} \cdot PE_{pos}\]

To prove this, recall the Angle Addition Formulas that specify how to compute trig functions of sums or differences of angles.

  • \(\sin(\alpha + \beta) = \sin(\alpha)\cos(\beta)+\cos(\alpha)\sin(\beta)\)
  • \(\cos(\alpha + \beta) = \cos(\alpha)\cos(\beta)-\sin(\alpha)\sin(\beta)\)

To prove that \(PE_{pos+k}\) is a linear transformation of \(PE_{pos}\), we need to show that there exists a fixed matrix \(M_k\) (that only depends on the offset \(k\), not the position \(pos\)) such that: \[PE_{pos+k} = M_k \cdot PE_{pos}\]

Properties of the Positional Encoding (2)

For any dimension \(i\), the positional encoding is a pair of \((\sin, \cos)\) values. Let’s define \(\omega_i = \frac{1}{10000^{2i/d_{model}}}\) to simplify the notation. The encoding at position \(pos\) is:

\[\begin{bmatrix} \sin(\omega_i \cdot pos) \\ \cos(\omega_i \cdot pos) \end{bmatrix}\]

We want to find the encoding at \(pos + k\):

\[\sin(\omega_i(pos + k)) = \sin(\omega_i \cdot pos + \omega_i \cdot k)\]\[\cos(\omega_i(pos + k)) = \cos(\omega_i \cdot pos + \omega_i \cdot k)\]

Plugging into the Angle Addition Formulas:

\[\sin(\omega_i \cdot pos + \omega_i \cdot k) = \sin(\omega_i \cdot pos)\cos(\omega_i \cdot k) + \cos(\omega_i \cdot pos)\sin(\omega_i \cdot k)\]\[\cos(\omega_i \cdot pos + \omega_i \cdot k) = \cos(\omega_i \cdot pos)\cos(\omega_i \cdot k) - \sin(\omega_i \cdot pos)\sin(\omega_i \cdot k)\]

Properties of the Positional Encoding (3)

Notice that the terms \(\cos(\omega_i \cdot k)\) and \(\sin(\omega_i \cdot k)\) are constants once we pick a fixed distance \(k\). We can rewrite the equations above as a matrix multiplication:\[\begin{bmatrix} \sin(\omega_i(pos+k)) \\ \cos(\omega_i(pos+k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} \sin(\omega_i pos) \\ \cos(\omega_i pos) \end{bmatrix}\]

The matrix \(M_k=\begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}\) is a rotation matrix.

  • The “rotation” required to move from position 2 to position 5 is the exact same rotation required to move from position 100 to 103.
  • The neural network’s attention layers can easily “learn” this rotation. It can learn to say: “To find the noun that usually appears 3 words after this adjective, I just need to apply the ‘3-step’ rotation to my current query vector.”
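
A quick numerical sanity check of this property (a sketch with arbitrary toy values of \(d\), \(pos\), and \(k\)): for every frequency \(\omega_i\), applying the 2×2 rotation block \(M_k\) to the encoding of position \(pos\) reproduces the encoding of position \(pos+k\).

```python
import numpy as np

d, n = 8, 10000.0
omega = 1.0 / n ** (2 * np.arange(d // 2) / d)   # one frequency per (sin, cos) pair

def pe_pairs(pos):
    """Stack the (sin, cos) pair for every frequency at position `pos`."""
    return np.stack([np.sin(omega * pos), np.cos(omega * pos)], axis=-1)   # (d/2, 2)

pos, k = 100, 3
# One 2x2 rotation block per frequency: shape (d/2, 2, 2)
M_k = np.stack([np.stack([np.cos(omega * k),  np.sin(omega * k)], axis=-1),
                np.stack([-np.sin(omega * k), np.cos(omega * k)], axis=-1)], axis=-2)

shifted = np.einsum('fij,fj->fi', M_k, pe_pairs(pos))   # apply the rotation per frequency
assert np.allclose(shifted, pe_pairs(pos + k))          # identical to encoding position pos + k
```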

Distance (1)

  • In the Attention mechanism, the model calculates the similarity between tokens using a dot product.
  • Because of the way these waves overlap, the dot product between the encodings of two tokens decays as the distance between them increases
  • When we add the positional vector (\(P\)) to the word embedding (\(E\)), the resulting vector is \(X = E + P\).
  • When the attention mechanism calculates the dot product between a Query (\(Q\)) and a Key (\(K\)) a combination of semantics (word embeddings) and location (positional encodings) is taken into account.
  • To gain intuition, let’s ignore the weight matrices \(W_Q\) and \(W_K\) for now.

Distance (2)

  • The dot product between two tokens at positions \(i\) and \(j\) looks like this:

\[X_i\cdot X_j = (E_i + P_i) \cdot (E_j + P_j)\]

This results in four distinct terms

  1. \(E_i \cdot E_j\) (Content-Content): How much the meaning of word \(i\) relates to the meaning of word \(j\).
  2. \(E_i \cdot P_j\) (Content-Position): Does word \(i\) tend to look at position \(j\)? (e.g., “The” looking for a noun in the next slot).
  3. \(P_i \cdot E_j\) (Position-Content): Does position \(i\) look for specific types of words?
  4. \(P_i \cdot P_j\) (Position-Position): Pure distance component.

How Does the Model Separate “Meaning” from “Distance”?

  • High Dimensionality: In a 512 or 1024-dimensional space, there is a lot of “room.” The model can learn to use certain dimensions primarily for semantic meaning and other dimensions primarily for positional signal.1

  • Learned Projections: The weight matrices (\(W_Q, W_K, W_V\)) act as filters.2 Through training, the model learns to project the “summed” vector into a subspace where the positional information is amplified when it needs to know distance, and the semantic information is amplified when it needs to know meaning.

  • The cross-terms (like \(E_{v} \cdot P_w\)) tend to be very close to zero because “meaning” vectors and “position” vectors are nearly orthogonal in high-dimensional space

  • Let \(v\) and \(w\) be tokens with positional encodings \(P_v\) and \(P_w\). Then in most cases we have

\[(E_{v} + P_v) \cdot (E_{w} + P_w) \approx \underbrace{(E_{v} \cdot E_{w})}_{\text{Do these words relate?}} + \underbrace{(P_v \cdot P_w)}_{\text{Are they close together?}}\]

Rotary Position Embedding

  • Rotary Positional Embeddings (RoPE)
  • Instead of adding a “position vector” to the embedding, RoPE encodes the position by rotating the vector:
    • Token at Position 0: No rotation.
    • Token at Position 1: Rotate by \(1\theta\).
    • Token at Position 2: Rotate by \(2\theta\).
    • \(\ldots\)
  • We will not discuss this further here.
  • Approach adopted by many major LLMs including Llama (Meta), PaLM (Google), and Mistral.1
Figure 4: RoPE

Section 3: Multi-head attention

  1. Embeddings
  1. Positional Encodings
  1. Multi-head attention
  1. Layer Normalization

The Encoder: Multi-head attention

Each encoder layer has two sub-layers.

Step 1: Multi-head self-attention

  • Tokens “talk” to each other and exchange contextual information.

Step 2: Position-wise FFN

  • Tokens “think” individually.
  • Two linear transformations.
  • Non-linear activation (ReLU/GELU).

Notation

  • An input sequence \(\mathbf{X}\) is an \(n \times d\) matrix (\(n\) tokens, \(d\) features).

  • Each row is a token representation (the sum of its embedding and its positional encoding)

  • For each token \(x_i\) (a \(1 \times d\) row vector): \[\underbrace{q_i}_{1 \times 64} = \underbrace{x_i}_{1 \times 768} \times \underbrace{W^Q}_{768 \times 64}\]

  • For the entire sequence \(X\): \[\underbrace{Q}_{n \times 64} = \underbrace{X}_{n \times 768} \times \underbrace{W^Q}_{768 \times 64}, \quad \underbrace{K}_{n \times 64} = \underbrace{X}_{n \times 768} \times \underbrace{W^K}_{768 \times 64}, \quad \underbrace{V}_{n \times 64} = \underbrace{X}_{n \times 768} \times \underbrace{W^V}_{768 \times 64}\]
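
A shape-checking sketch of these projections (the sequence length and the random weight matrices are placeholders; \(d = 768\) and \(d_k = 64\) follow BERT-Base):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, d_k = 5, 768, 64                    # 5 tokens, BERT-Base sizes

X   = rng.normal(size=(n, d))             # embeddings + positional encodings
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # each n x 64
print(Q.shape, K.shape, V.shape)          # (5, 64) (5, 64) (5, 64)

# The query for a single token x_i is just the corresponding row of Q:
assert np.allclose(Q[2], X[2] @ W_Q)
```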

Learnable Weight Matrix (\(W^K\))

  • Learned weight matrix
  • Key vectors capture features of the token that are useful for the attention mechanism
  • Helps the model to assess the relevance of each word, such as “crane” within the context of the sentence “The wattled crane is a migratory bird” \[ W^K = \begin{bmatrix} w_{11} & w_{12} & \dots & w_{1d_k} \\ w_{21} & w_{22} & \dots & w_{2d_k} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d1} & w_{d2} & \dots & w_{dd_k} \end{bmatrix}_{d \times d_k} \]

Here, \(d \times d_k = 768\times 64\) in the original implementation of BERT.

The Resulting Key Matrix (\(K\))

\[ K = XW^K = \begin{bmatrix} k_{11} & k_{12} & \dots & k_{1d_k} \\ k_{21} & k_{22} & \dots & k_{2d_k} \\ \vdots & \vdots & \ddots & \vdots \\ k_{n1} & k_{n2} & \dots & k_{nd_k} \end{bmatrix}_{n \times d_k} \]

Each row of \(K\) is the “Key” for a specific token in the input.

The first scalar entry of this matrix corresponds to \[k_{1,1} = \sum_{j=1}^{768} x_{1,j} \cdot w_{j,1}\]

  • \(k_{1,1}\) is the first of 64 coordinates that describe token 1’s “Key identity.”
  • The entire Key vector for token 1 can be represented as: \[k_1 = [x_{1,1}, x_{1,2}, \dots, x_{1,768}] \begin{bmatrix} | & | & & | \\ w_{col1} & w_{col2} & \dots & w_{col64} \\ | & | & & | \end{bmatrix}\]

Nomenclature

  • The weight matrix columns (\(\mathbf{w}_j\)) are referred to as projection vectors.
  • Each one projects the 768-dimensional input onto a single axis of the 64-dimensional “Key space.” (or “Query space”)

Learnable Weight Matrix (\(W^Q\))

The query vectors are constructed in an analogous way. \[ Q = XW^Q = \begin{bmatrix} q_{11} & q_{12} & \dots & q_{1d_k} \\ q_{21} & q_{22} & \dots & q_{2d_k} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \dots & q_{nd_k} \end{bmatrix}_{n \times d_k} \]

\(W^K\) versus \(W^Q\)

The functional roles: they act as two different “filters” applied to the same input data:

  • \(W^K\) (The Key Matrix): Learns how to describe a word so it can be found. It extracts features that make a token an attractive match for others (e.g., “I am a verb,” “I am a plural noun,” “I am the subject”).

  • \(W^Q\) (The Query Matrix): Learns how to ask a question to find other words. It extracts features that define what the current token needs to understand its own context (e.g., “I am an adjective; where is the noun I modify?” or “I am the word ‘it’; what did I refer to earlier?”).

  • Todo: find reference/citation for above

Attention Scores

  • The attention mechanism compares every Query to every Key.

  • Recall that our query and key vectors are both row-vectors

  • This means that the dot product of the first Query and all Keys is simply \(q_1 K^\top\).

  • Putting everything together, the full attention operation is: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

  • We will go through this step-by-step

\(QK^\top\)

  • When we multiply \(Q\) and \(K^\top\), the “hidden” dimensions (64) cancel out
  • We are left with a square matrix representing the relationships between words:\[\underbrace{Q}_{n \times 64} \times \underbrace{K^\top}_{64 \times n} = \underbrace{\text{Scores}}_{n \times n}\]

\[ QK^\top = \begin{bmatrix} q_1 \cdot k_1 & q_1 \cdot k_2 & \dots & q_1 \cdot k_n \\ q_2 \cdot k_1 & q_2 \cdot k_2 & \dots & q_2 \cdot k_n \\ \vdots & \vdots & \ddots & \vdots \\ q_n \cdot k_1 & q_n \cdot k_2 & \dots & q_n \cdot k_n \end{bmatrix}_{n \times n} \]

Visualizing the \(n \times n\) Matrix

\[\underbrace{\text{Attention Scores}}_{n \times n} = \underbrace{ \begin{bmatrix} \dots & q_1 & \dots \\ \dots & q_2 & \dots \\ & \vdots & \\ \dots & q_n & \dots \end{bmatrix} }_{n \times 64} \times \underbrace{ \begin{bmatrix} \vdots & \vdots & & \vdots \\ k_1^\top & k_2^\top & \dots & k_n^\top \\ \vdots & \vdots & & \vdots \end{bmatrix} }_{64 \times n} \]

If we look at a single scalar in this resulting \(n \times n\) matrix, it is the dot product: \[\text{Score}_{i,j} = q_i \cdot k_j = \sum_{m=1}^{64} q_{i,m} k_{j,m}\]

  • \(i\): the query token.
  • \(j\): the key token.
  • If \(i\) aligns with \(j\), this score will be high.

The scaling factor

  • The scaling factor \(\frac{1}{\sqrt{d_k}}\) is needed to keep the raw scores, and hence the gradients of the softmax, in a reasonable range

  • Assume your \(Q\) and \(K\) components are independent random variables with a mean of 0 and a variance of 1.

  • When you take their dot product:

    • \(\mathbb{E}(q \cdot k)=0\quad\quad\) (Mean).
    • The Variance of \(q \cdot k\) becomes \(d_k\) (which is 64).
    • The high variance will often lead to very large positive numbers or very large negative numbers.

Proof

The dot product \(q \cdot k\) for \(d_k\) dimensions is defined as:

\[S = \sum_{i=1}^{d_k} q_i k_i\]

To find the variance of \(S\), we first look at the variance of a single term \(q_i k_i\): \[Var(q_i k_i) = E[(q_i k_i)^2] - (E[q_i k_i])^2\]

  • Mean of the product: since \(q_i\) and \(k_i\) are independent, \(E[q_i k_i] = E[q_i]E[k_i] = 0 \cdot 0 = 0\).
  • Expected value of the square: \(E[q_i^2 k_i^2] = E[q_i^2]E[k_i^2]\). Since \(Var(X) = E[X^2] - (E[X])^2\) and \(E[X]=0\), we have \(E[X^2] = Var(X) = 1\), so \(E[q_i^2 k_i^2] = 1 \cdot 1 = 1\).
  • Variance of one term: \(Var(q_i k_i) = 1 - 0 = 1\).

Now, because the components are independent, the variance of the sum is the sum of the variances:

\[Var(S) = Var\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} Var(q_i k_i) = \sum_{i=1}^{d_k} 1 = d_k\]

For BERT-Base, where \(d_k = 64\), the variance of the raw dot product is 64.
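
An empirical check of this derivation, assuming standard-normal components as above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 64, 100_000

q = rng.standard_normal((trials, d_k))
k = rng.standard_normal((trials, d_k))

scores = (q * k).sum(axis=1)              # raw dot products q . k
print(scores.var())                       # ~ 64  (= d_k)
print((scores / np.sqrt(d_k)).var())      # ~ 1   after scaling by 1/sqrt(d_k)
```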

Scaling

  • If we scale by \(\dfrac{1}{\sqrt{d_k}}\), the variance of the scores becomes 1, which mitigates the problem of exploding/vanishing gradients

Softmax

\[ \begin{bmatrix} 1.3 \\ 5.1\\ 2.2 \\ 0.7 \\ 1.1 \end{bmatrix} \rightarrow \dfrac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \rightarrow \begin{bmatrix} 0.02 \\ 0.90\\ 0.05 \\ 0.01 \\ 0.02 \end{bmatrix} \]

  • The softmax function transforms a set of inputs into a probability-like distribution
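
A minimal softmax sketch (subtracting the row maximum before exponentiating is a standard numerical-stability trick, not something the formula above requires):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stability: shift so the largest value is 0
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

print(np.round(softmax(np.array([1.3, 5.1, 2.2, 0.7, 1.1])), 2))
# [0.02 0.9  0.05 0.01 0.02]
```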

Softmax and attention

  • Each row in the \(QK^\top\) matrix represents a single word’s “view” of the entire sentence.
  • Row 1: How much Word 1 cares about [Word 1, Word 2, …, Word \(n\)].
  • Row 2: How much Word 2 cares about [Word 1, Word 2, …, Word \(n\)].
  • etc.
  • Since we want each word to “distribute” its attention across the sentence, the scores in each row must sum to 1.0.

\[ \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) \]

Softmax and attention (2)

  • In detail, the operation looks like this:

\[ \text{Softmax} \left( \frac{1}{\sqrt{d_k}} \times \underbrace{ \begin{bmatrix} q_1 \cdot k_1 & q_1 \cdot k_2 & \dots & q_1 \cdot k_n \\ q_2 \cdot k_1 & q_2 \cdot k_2 & \dots & q_2 \cdot k_n \\ \vdots & \vdots & \ddots & \vdots \\ q_n \cdot k_1 & q_n \cdot k_2 & \dots & q_n \cdot k_n \end{bmatrix} }_{\text{Raw Scores } (n \times n)} \right) = \underbrace{ \begin{bmatrix} \alpha_{1,1} & \alpha_{1,2} & \dots & \alpha_{1,n} \\ \alpha_{2,1} & \alpha_{2,2} & \dots & \alpha_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{n,1} & \alpha_{n,2} & \dots & \alpha_{n,n} \end{bmatrix} }_{\text{Attention Weights } (n \times n)} \]

In this resulting matrix:

  • \(\sum_{j=1}^{n} \alpha_{i,j} = 1\) (Each row sums to 1).
  • \(\alpha_{1,2}\) is the percentage of attention Word 1 pays to Word 2.

Value Vectors

  • Value vectors contain detailed information about each token
  • They are transformed embeddings that are weighted and summed based on the attention scores
  • Intuitively, the value vector contains the potential meanings of a word, and the attention scores determine which meanings are most relevant within a given context
  • We get the value vectors analogously from a learned \(W^V\)

\[ V = XW^V = \begin{bmatrix} v_{11} & v_{12} & \dots & v_{1d_k} \\ v_{21} & v_{22} & \dots & v_{2d_k} \\ \vdots & \vdots & \ddots & \vdots \\ v_{n1} & v_{n2} & \dots & v_{nd_k} \end{bmatrix}_{n \times d_k} \]

Calculating Values

Now that we have a matrix of Attention Weights (\(\alpha\)), we use them to take a weighted average of the information stored in the Value matrix (\(V\)).

  1. The Value Matrix (\(V\)): just like \(Q\) and \(K\), the Value matrix is created by projecting the input \(X\) through a learnable weight matrix \(W^V\): \[\underbrace{V}_{n \times 64} = \underbrace{X}_{n \times 768} \times \underbrace{W^V}_{768 \times 64}\]

The Value represents the actual “content” of the word.

The Weighted Sum

We take our \(n \times n\) attention weights and multiply them by our \(n \times 64\) Values.

\[\text{Output} = \underbrace{\text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)}_{n \times n} \times \underbrace{V}_{n \times 64}\]

To see what happens to Word 1, look at the first row of the multiplication:

\[\text{Output}_1 = \alpha_{1,1}v_1 + \alpha_{1,2}v_2 + \dots + \alpha_{1,n}v_n\]

  • The Final Output is an \(n \times 64\) matrix representing the context-aware version of each token.
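
Putting the pieces together, here is a sketch of single-head scaled dot-product attention on random toy data (the softmax helper is repeated so the block is self-contained):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention for one sequence X (n x d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each n x d_k
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # n x n raw scores
    alpha = softmax(scores, axis=-1)             # each row sums to 1
    return alpha @ V, alpha                      # context-aware outputs (n x d_k)

rng = np.random.default_rng(0)
n, d, d_k = 4, 768, 64
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) * 0.02 for _ in range(3))
out, alpha = attention(X, W_Q, W_K, W_V)
print(out.shape, alpha.sum(axis=1))              # (4, 64) [1. 1. 1. 1.]
```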

Multi-Head Assembly

In BERT-Base, this happens 12 times in parallel (12 heads). Each head produces an \(n \times 64\) output. We concatenate them all together:\[12 \times 64 = 768\]This brings us back to the original \(768\) dimensions, which is why the output can be passed directly into the next layer!
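
A sketch of this assembly step with random placeholder weights: 12 independent heads of width 64 are computed and concatenated back to 768 dimensions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, h, d_k = 4, 768, 12, 64                    # BERT-Base: 12 heads x 64 = 768
X = rng.normal(size=(n, d))

heads = []
for _ in range(h):
    W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) * 0.02 for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    alpha = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    heads.append(alpha @ V)                      # each head output: n x 64

multi_head = np.concatenate(heads, axis=-1)      # 12 x 64 concatenated back to 768
print(multi_head.shape)                          # (4, 768)
```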

  • Todo: Decide where to talk about multi-head stuff

Section 4: Layer Normalization

  1. Embeddings
  1. Positional Encodings
  1. Multi-head attention
  1. Layer Normalization

Layer Normalization (LN)


LN greatly contributes to stabilizing training and optimizing performance in deep learning models, especially in transformers. It works by normalizing the activations at each layer, ensuring a more consistent gradient flow during training.

Two flavors:

  • Post-LN transformer applies normalization after the addition of the layer’s output with the residual connection’s output.1
  • Pre-LN transformers [Xiong et al., 2020] apply normalization before self-attention and feed-forward layers. This configuration is used in modern architectures such as GPT, Llama, and Vision Transformers.2

LN step 1: Add

The “Add” part is a Residual Connection (also called a Skip Connection).

Add

Instead of a layer outputting \(Sublayer(X)\), it outputs:

\[\text{Output} = X + Sublayer(X)\]

  • strictly element-wise addition
  • For instance, token \(i\) and feature \(j\) (out of the 768 features):

\[\text{Output}_{i,j} = x_{i,j} + \text{Sublayer}(X)_{i,j}\]

  • Intuition. This operation can be regarded as \(X + f(X)\). During backpropagation, we calculate the first derivative as

\[\frac{\partial(X + f(X))}{\partial X} = 1 + f'(X)\]

  • Even if the sublayer’s gradient \(f'(X)\) is near zero (the “vanishing gradient” problem), the total gradient is still at least \(1\).1

LN step 2: Norm

Norm

The normalization step ensures that the values do not explode.1

\[ \text{LayerNorm}(Z) = \gamma \odot \left( \frac{Z - \mu}{\sqrt{\sigma^2 + \epsilon}} \right) + \beta\]

  • Mean (\(\mu\)): Average of the 768 values for a specific token.
  • Variance (\(\sigma^2\)): Corresponding variance.
  • Standardization: mean of 0 and a variance of 1. Before this step, values can be highly divergent which would lead to softmax saturation problems
  • \(\gamma\) and \(\beta\) are learnable parameters. They are vectors of size 768:

\[\text{Output}_j = (\gamma_j \cdot \hat{x}_j) + \beta_j\]

  • Before training, all \(\gamma\) values are initialized to 1 and all \(\beta\) values are initialized to 0.
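
A sketch of layer normalization over the feature dimension, with \(\gamma\) and \(\beta\) at their initial values (all ones and all zeros); \(\epsilon\) is the usual small constant that prevents division by zero:

```python
import numpy as np

def layer_norm(Z, gamma, beta, eps=1e-12):
    """Normalize each token (row) of Z over its d features, then rescale and shift."""
    mu = Z.mean(axis=-1, keepdims=True)          # per-token mean of the 768 values
    var = Z.var(axis=-1, keepdims=True)          # per-token variance
    Z_hat = (Z - mu) / np.sqrt(var + eps)        # standardized: mean 0, variance 1
    return gamma * Z_hat + beta

rng = np.random.default_rng(0)
n, d = 4, 768
Z = rng.normal(loc=5.0, scale=3.0, size=(n, d))  # stand-in for the residual sum X + Sublayer(X)
out = layer_norm(Z, gamma=np.ones(d), beta=np.zeros(d))
print(out.mean(axis=-1).round(6), out.var(axis=-1).round(6))   # ~0 and ~1 per token
```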

Section 5: Feed-Forward Network (FFN)

  1. Embeddings
  1. Positional Encodings
  1. Multi-head attention
  1. Layer Normalization
  1. FFN

FFN Expansion-Contraction

  • In BERT-Base, the FFN consists of two linear transformations with a non-linear activation (GELU) in between.
  • Expansion-Contraction architecture
  • Input \(x\): \(1 \times 768\)
  • \(W_1\): \(768 \times 3072\) (The “Expansion” weights)
  • \(b_1\): \(1 \times 3072\) (The first bias)
  • \(W_2\): \(3072 \times 768\) (The “Projection” weights)
  • \(b_2\): \(1 \times 768\) (The second bias)

FFN layer

\[FFN(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2\]

ReLU vs GeLU

activation functions

After the (linear) matrix multiplication, a non-linear activation function is applied; this makes the neural network a non-linear function and allows it to model complex behavior. ReLU is the most widely used activation function (simple, fast to compute).

Rectified Linear Unit (ReLU)

  • The Rectified Linear Unit (ReLU) outputs the input if positive; otherwise, it outputs zero.

\[ \mathrm{ReLU}(x) = \max(0,x) \]

The first derivative of ReLU is

\[ \dfrac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ \text{undefined, but usually set to zero in code} & x = 0 \end{cases} \]

  • the ReLU function may suffer from the dying ReLU problem, where a large fraction of the neurons can become inactive and unresponsive, hindering the learning process1
  • Leaky ReLU: \(f(x) = \max(\alpha \cdot x, x)\), where \(\alpha\) is a small value such as 0.01. Leaky ReLU enables a small, non-zero gradient for negative values.

Gaussian Error Linear Unit (GELU)

  • GELU has several desirable attributes: smoothness, differentiability, and ability to approximate the widely used ReLU function.
  • GELU is used in many tools including BERT and GPT
  • GELU weights the input by its magnitude

\[\text{GELU}(x) = x \Phi(x) \approx 0.5x \left( 1 + \tanh \left( \sqrt{\frac{2}{\pi}} (x + 0.044715x^3) \right) \right)\]

  • \(\Phi(x)\) is the Cumulative Distribution Function (CDF) of the standard normal distribution. The BERT paper uses the fast tanh approximation shown on the right-hand side of the equation above.1
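
A sketch comparing the exact definition \(x\Phi(x)\) (via the error function) with the tanh approximation above; the two agree to a few decimal places:

```python
import numpy as np
from math import erf, sqrt

def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF."""
    return np.array([v * 0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])

def gelu_tanh(x):
    """The tanh approximation used in BERT."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 9)
print(np.round(gelu_exact(x), 4))
print(np.round(gelu_tanh(x), 4))    # very close to the exact values
```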

BERT’s FFN Block

  • Coming back to the FFN architecture: BERT takes the \(768\)-dimensional vector from the Add & Norm layer, expands it into a \(3072\)-dimensional space, applies the GELU activation function, and then projects it back down to \(768\) dimensions for further processing

\[ \underbrace{1 \times 768}_{\text{Input}} \xrightarrow{W_1, b_1} \underbrace{1 \times 3072}_{\text{Linear Expansion}} \xrightarrow{\text{GELU}} \underbrace{1 \times 3072}_{\text{Non-linear Activation}} \xrightarrow{W_2, b_2} \underbrace{1 \times 768}_{\text{Output}} \]
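
A sketch of the expansion-contraction FFN with exactly these shapes (random weights stand in for the learned parameters; the GELU approximation is repeated so the block is self-contained):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = GELU(x W1 + b1) W2 + b2, applied to each token independently."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d, d_ff = 4, 768, 3072
W1, b1 = rng.normal(size=(d, d_ff)) * 0.02, np.zeros(d_ff)   # expansion: 768 -> 3072
W2, b2 = rng.normal(size=(d_ff, d)) * 0.02, np.zeros(d)      # projection: 3072 -> 768
X = rng.normal(size=(n, d))
print(ffn(X, W1, b1, W2, b2).shape)                          # (4, 768)
```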

  • Todo: explain FFN architecture based on BERT paper (what advantage is there for expanding, etc.?)

Section 6: Layer Normalization

  1. Embeddings
  1. Positional Encodings
  1. Multi-head attention
  1. Layer Normalization
  1. FFN
  1. Layer Normalization

Layer Normalization (2)

  • The First Add & Norm stabilized the Communication (Attention).
  • This Second Add & Norm stabilizes the Computation (FFN).

\[\text{Layer Output} = \text{LayerNorm} \left( \underbrace{X_{\text{post-attn}}}_{\text{The "Base" Knowledge}} + \underbrace{\text{FFN}(X_{\text{post-attn}})}_{\text{The "New" Insights}} \right)\]

Similar to the previous Add & Norm layer, this one has two purposes:

  • The “Add” (Residual): this ensures that even if the FFN didn’t learn anything useful for a particular word (like a common “the” or “a”), the original context from the attention layer is preserved perfectly.
  • The “Norm” (LayerNorm): this resets the “electrical signals” of the 768 features. It ensures that the numbers don’t get progressively larger as they move from Layer 1 to Layer 12.
  • The output of this final Norm is exactly the same shape as the input at the very beginning of the layer (\(n \times 768\)).
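
To close the loop, here is a compact sketch of one complete post-LN encoder layer with random placeholder weights. For brevity it uses a single attention head plus an output projection \(W^O\) (an assumption needed here to map the 64-dimensional head output back to 768 so the residual addition works); the real BERT layer uses 12 heads and learned \(\gamma, \beta\):

```python
import numpy as np

def softmax(z):  # row-wise, numerically stable
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

def layer_norm(Z, eps=1e-12):  # gamma = 1, beta = 0 (their initial values)
    return (Z - Z.mean(-1, keepdims=True)) / np.sqrt(Z.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def encoder_layer(X, p):
    # Sub-layer 1: self-attention, then Add & Norm
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V @ p["W_O"]
    X = layer_norm(X + attn)
    # Sub-layer 2: position-wise FFN, then Add & Norm
    ffn = gelu(X @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
n, d, d_k, d_ff = 4, 768, 64, 3072
p = {name: rng.normal(size=shape) * 0.02 for name, shape in
     [("W_Q", (d, d_k)), ("W_K", (d, d_k)), ("W_V", (d, d_k)), ("W_O", (d_k, d)),
      ("W_1", (d, d_ff)), ("W_2", (d_ff, d))]}
p["b_1"], p["b_2"] = np.zeros(d_ff), np.zeros(d)
X = rng.normal(size=(n, d))
print(encoder_layer(X, p).shape)    # (4, 768): same shape in and out, so layers can be stacked
```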

  • Todo: explain what is BERT specific here

Conclusion

::: {.callout-note icon=false title="Take home"}

This lecture introduced the Encoder architecture as it is implemented in BERT. In the next lecture, we will introduce the decoder architecture and discuss how these models are trained.

:::

  • Todo: Talk about some of the BERT use cases
  • Todo: Homework, using BERT to do something
  • Todo: Talk about some of the variations of BERT