Game plan
This lecture is the first of a series about the transformer architecture and selected applications. We will start with a top-level overview to provide some context and will then explain the mathematics of each component in detail. The practical sessions will involve the coding of a “toy” LLM.
Foundations
Architecture
Applications
The need for tokenization
Tokenization by OpenAI
Definition
The process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as individual characters or as long as whole words.
| Tokenization Method | Granularity | Example | Pros | Cons |
|---|---|---|---|---|
| Word | Full words | [“Machine”, “learning”] | High semantic clarity | Massive vocabulary; fails on “unknown” words |
| Subword | Meaningful chunks | [“ma”, “chine”, “learn”, “ing”] | Handles new words well | Slightly more complex to implement |
| Character | Individual letters | [“m”, “a”, “c”, “h”, “i”, “n”, “e”, “l”, “r”, “g”] | Tiny vocabulary; no “unknown” words | Loses semantic meaning; very long sequences |
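The table's granularities can be illustrated with a few lines of plain Python (the subword split below is hand-picked for illustration; real subword tokenizers such as BPE learn their chunks from data):

```python
text = "Machine learning"

# Word level: split on whitespace -> few tokens, but a massive vocabulary.
word_tokens = text.split()                  # ['Machine', 'learning']

# Character level: every character is a token -> tiny vocabulary, long sequences.
char_tokens = list(text)                    # ['M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', ...]

# Subword level: hand-crafted split, just to illustrate the granularity.
subword_tokens = ["ma", "chine", "learn", "ing"]

for toks in (word_tokens, char_tokens, subword_tokens):
    print(len(toks), toks)
```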
Major problems with word embeddings
image credit: https://github.com/minimaxir/char-embeddings
Embeddings
The embedding process transforms discrete tokens into a continuous vector space \(\mathcal{V} \to \mathbb{R}^d\).
\[\begin{equation} \mathbf{e}_i = \text{embed}(w_i) = \mathbf{x}_i^\top \mathbf{W}_e \end{equation}\]
Where:
\[ \mathbf{W}_e = \begin{bmatrix} \leftarrow & \mathbf{e}_0 & \rightarrow \\ \leftarrow & \mathbf{e}_1 & \rightarrow \\ & \vdots & \\ \leftarrow & \mathbf{e}_{V-1} & \rightarrow \end{bmatrix} \]
Here \(\mathbf{x}_i\) is the one-hot vector of token \(w_i\), so \(\mathbf{x}_i^\top \mathbf{W}_e\) just picks out the \(i^{th}\) row of \(\mathbf{W}_e\).
The dimension of the embeddings is adjustable. For instance, SBERT has embeddings with a dimension of 768.
The embedding matrix is initialized to random values and learned during training by back-propagation.
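A small numpy sketch (toy vocabulary size and dimension, not taken from any particular model) confirming that multiplication by a one-hot vector is just a row lookup:

```python
import numpy as np

V, d = 10, 4                        # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_e = rng.normal(size=(V, d))       # embedding matrix: random init, learned by back-propagation

i = 3                               # token id
x_i = np.zeros(V)
x_i[i] = 1.0                        # one-hot vector for token i

e_i = x_i @ W_e                     # embedding lookup written as a matrix product
assert np.allclose(e_i, W_e[i])     # ...which is exactly the i-th row of W_e
```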
Todo: explain learning procedure (or refer to previous slides)
Figure 3: RNN vs Encoder
\[ \left[E_{\mathrm{I}}+P_0, E_{\mathrm{am}}+P_1,E_{\mathrm{an}}+P_2, E_{\mathrm{automaton}}+P_3\right] \]
For even dimensions \(2i\) (dimensions \(0, 2, 4, \ldots\)):
\[ \mathrm{PE}(pos,2i) = \sin\left(\dfrac{pos}{100^{\frac{2i}{d}}}\right) \]
and for odd dimensions \(2i+1\) (dimensions \(1, 3, 5, \ldots\)):
\[ \mathrm{PE}(pos,2i+1) = \cos\left(\dfrac{pos}{100^{\frac{2i}{d}}}\right) \] Note that \(i\) is the pair index: dimensions \(2i\) and \(2i+1\) form a sine/cosine pair that shares the same frequency.
A positional encoding has the same dimension as an embedding, and calculates a value for each dimension using the sine and cosine formulas.
| Sequence | Index (k) | i=0 (sine) | i=0 (cosine) | i=1 (sine) | i=1 (cosine) |
|---|---|---|---|---|---|
| I | 0 | sin(0/1) = 0.00 | cos(0/1) = 1.00 | sin(0/10) = 0.00 | cos(0/10) = 1.00 |
| am | 1 | sin(1/1) = 0.84 | cos(1/1) = 0.54 | sin(1/10) = 0.10 | cos(1/10) = 1.00 |
| an | 2 | sin(2/1) = 0.91 | cos(2/1) = -0.42 | sin(2/10) = 0.20 | cos(2/10) = 0.98 |
| automaton | 3 | sin(3/1) = 0.14 | cos(3/1) = -0.99 | sin(3/10) = 0.30 | cos(3/10) = 0.96 |
Note: \(n=100\) is used for visualization; standard models typically use \(n=10,000\).
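The table can be reproduced with a short numpy sketch (the function name and signature below are ours, not from any library); setting \(n=100\) matches the visualization above, \(n=10{,}000\) gives the standard encoding:

```python
import numpy as np

def positional_encoding(seq_len, d, n=10_000):
    """Sinusoidal positional encodings: dimension 2i -> sine, dimension 2i+1 -> cosine."""
    pe = np.zeros((seq_len, d))
    k = np.arange(seq_len)[:, None]          # positions k = 0 .. seq_len-1
    i = np.arange(d // 2)[None, :]           # pair index i = 0 .. d/2-1
    angle = k / n ** (2 * i / d)
    pe[:, 0::2] = np.sin(angle)              # even dimensions
    pe[:, 1::2] = np.cos(angle)              # odd dimensions
    return pe

# n=100 and d=4 reproduce the values in the table above.
print(np.round(positional_encoding(4, 4, n=100), 2))
```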
The denominator of the formula contains the “magic number” 10,000: \[ \mathrm{PE}(k,2i) = \sin\left(\frac{k}{10000^{\frac{2i}{d}}}\right) \]
The Linearity Property (Relative Positioning)
For any fixed offset \(\Delta k\), the encoding at a shifted position can be represented as a linear transformation of the original:
\[PE_{pos + \Delta k} = \mathbf{M}_{\Delta k} \cdot PE_{pos}\]
To prove this, recall the Angle Addition Formulas that specify how to compute trig functions of sums or differences of angles.
To prove that \(PE_{pos+k}\) is a linear transformation of \(PE_{pos}\), we need to show that there exists a fixed matrix \(M_k\) (that only depends on the offset \(k\), not the position \(pos\)) such that: \[PE_{pos+k} = M_k \cdot PE_{pos}\]
For any dimension \(i\), the positional encoding is a pair of \((\sin, \cos)\) values. Let’s define \(\omega_i = \frac{1}{10000^{2i/d_{model}}}\) to simplify the notation. The encoding at position \(pos\) is:
\[\begin{bmatrix} \sin(\omega_i \cdot pos) \\ \cos(\omega_i \cdot pos) \end{bmatrix}\]
We want to find the encoding at \(pos + k\):
\[\sin(\omega_i(pos + k)) = \sin(\omega_i \cdot pos + \omega_i \cdot k)\]\[\cos(\omega_i(pos + k)) = \cos(\omega_i \cdot pos + \omega_i \cdot k)\]
Plugging into the Angle Addition Formulas:
\[\sin(\omega_i \cdot pos + \omega_i \cdot k) = \sin(\omega_i \cdot pos)\cos(\omega_i \cdot k) + \cos(\omega_i \cdot pos)\sin(\omega_i \cdot k)\]\[\cos(\omega_i \cdot pos + \omega_i \cdot k) = \cos(\omega_i \cdot pos)\cos(\omega_i \cdot k) - \sin(\omega_i \cdot pos)\sin(\omega_i \cdot k)\]
Notice that the terms \(\cos(\omega_i \cdot k)\) and \(\sin(\omega_i \cdot k)\) are constants once we pick a fixed distance \(k\). We can rewrite the equations above as a matrix multiplication:\[\begin{bmatrix} \sin(\omega_i(pos+k)) \\ \cos(\omega_i(pos+k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} \sin(\omega_i pos) \\ \cos(\omega_i pos) \end{bmatrix}\]
The matrix \(M_k=\begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}\) is a rotation matrix.
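The property is easy to verify numerically; a minimal sketch for a single \((\sin, \cos)\) pair (the values of \(pos\), \(k\), \(d\), and \(i\) are chosen arbitrarily):

```python
import numpy as np

d, i = 4, 1
omega = 1.0 / 10_000 ** (2 * i / d)                 # frequency of dimension pair i

def pe_pair(pos):
    """The (sin, cos) pair of dimension pair i at a given position."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

pos, k = 7, 3
M_k = np.array([[ np.cos(omega * k), np.sin(omega * k)],
                [-np.sin(omega * k), np.cos(omega * k)]])   # depends on the offset k only

assert np.allclose(pe_pair(pos + k), M_k @ pe_pair(pos))    # PE(pos+k) = M_k · PE(pos)
```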
\[X_i\cdot X_j = (E_i + P_i) \cdot (E_j + P_j)\]
This results in four distinct terms: \[X_i \cdot X_j = E_i \cdot E_j + E_i \cdot P_j + P_i \cdot E_j + P_i \cdot P_j\]
High Dimensionality: In a 512 or 1024-dimensional space, there is a lot of “room.” The model can learn to use certain dimensions primarily for semantic meaning and other dimensions primarily for positional signal.
Learned Projections: The weight matrices (\(W_Q, W_K, W_V\)) act as filters. Through training, the model learns to project the “summed” vector into a subspace where the positional information is amplified when it needs to know distance, and the semantic information is amplified when it needs to know meaning.
The cross-terms (like \(E_{v} \cdot P_w\)) tend to be very close to zero because “meaning” vectors and “position” vectors are nearly orthogonal in high-dimensional space.
Let \(E_v\) and \(P_v\) be the embedding and positional encoding of token \(v\), and \(E_w\) and \(P_w\) those of token \(w\). Then in most cases we have
\[(E_{v} + P_v) \cdot (E_{w} + P_w) \approx \underbrace{(E_{v} \cdot E_{w})}_{\text{Do these words relate?}} + \underbrace{(P_v \cdot P_w)}_{\text{Are they close together?}}\]
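A quick numpy illustration of why the cross-terms are small: two independent random vectors in 768 dimensions are nearly orthogonal (random vectors stand in here for the embedding and the positional encoding, which is a simplification):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
E_v = rng.normal(size=d) / np.sqrt(d)   # stand-in for a "meaning" vector
P_w = rng.normal(size=d) / np.sqrt(d)   # stand-in for a "position" vector

print(np.dot(E_v, E_v))                 # ~1: a vector aligns with itself
print(np.dot(E_v, P_w))                 # ~0: independent high-dim vectors are nearly orthogonal
```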
An input sequence \(\mathbf{X}\) is an \(n \times d\) matrix (\(n\) tokens, \(d\) features).
Each row is a token representation (the sum of its embedding and its positional encoding).
For each token \(x_i\) (a \(1 \times d\) row vector): \[\underbrace{q_i}_{1 \times 64} = \underbrace{x_i}_{1 \times 768} \times \underbrace{W^Q}_{768 \times 64}\]
For the entire sequence \(X\): \[\underbrace{Q}_{n \times 64} = \underbrace{X}_{n \times 768} \times \underbrace{W^Q}_{768 \times 64}, \quad \underbrace{K}_{n \times 64} = \underbrace{X}_{n \times 768} \times \underbrace{W^K}_{768 \times 64}, \quad \underbrace{V}_{n \times 64} = \underbrace{X}_{n \times 768} \times \underbrace{W^V}_{768 \times 64}\]
Here, \(d \times d_k = 768\times 64\) in the original implementation of BERT.
\[ K = XW^K = \begin{bmatrix} k_{11} & k_{12} & \dots & k_{1d_k} \\ k_{21} & k_{22} & \dots & k_{2d_k} \\ \vdots & \vdots & \ddots & \vdots \\ k_{n1} & k_{n2} & \dots & k_{nd_k} \end{bmatrix}_{n \times d_k} \]
Each row of \(K\) is the “Key” for a specific token in the input.
The first scalar entry of this matrix corresponds to \[k_{1,1} = \sum_{j=1}^{768} x_{1,j} \cdot w_{j,1}\]
The query vectors are constructed in an analogous way. \[ Q = XW^Q = \begin{bmatrix} q_{11} & q_{12} & \dots & q_{1d_k} \\ q_{21} & q_{22} & \dots & q_{2d_k} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \dots & q_{nd_k} \end{bmatrix}_{n \times d_k} \]
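As a shape check, the three projections can be sketched with random matrices (BERT-Base sizes; the weights are of course learned, not random):

```python
import numpy as np

n, d, d_k = 5, 768, 64                  # sequence length, model dim, head dim (BERT-Base)
rng = np.random.default_rng(0)

X   = rng.normal(size=(n, d))           # one row per token (embedding + positional encoding)
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # one query/key/value row per token
print(Q.shape, K.shape, V.shape)        # (5, 64) (5, 64) (5, 64)
```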
The Functional Roles
They act as two different “filters” applied to the same input data:
\(W^K\) (The Key Matrix): Learns how to describe a word so it can be found. It extracts features that make a token an attractive match for others (e.g., “I am a verb,” “I am a plural noun,” “I am the subject”).
\(W^Q\) (The Query Matrix): Learns how to ask a question to find other words. It extracts features that define what the current token needs to understand its own context (e.g., “I am an adjective; where is the noun I modify?” or “I am the word ‘it’; what did I refer to earlier?”).
Todo: find reference/citation for above
The attention mechanism compares every Query to every Key.
Recall that our query and key vectors are both row-vectors
This means that the dot product of the first Query and all Keys is simply \(q_1 K^\top\).
The full attention computation is: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]
We will go through this step-by-step
\[ QK^\top = \begin{bmatrix} q_1 \cdot k_1 & q_1 \cdot k_2 & \dots & q_1 \cdot k_n \\ q_2 \cdot k_1 & q_2 \cdot k_2 & \dots & q_2 \cdot k_n \\ \vdots & \vdots & \ddots & \vdots \\ q_n \cdot k_1 & q_n \cdot k_2 & \dots & q_n \cdot k_n \end{bmatrix}_{n \times n} \]
\[\underbrace{\text{Attention Scores}}_{n \times n} = \underbrace{ \begin{bmatrix} \dots & q_1 & \dots \\ \dots & q_2 & \dots \\ & \vdots & \\ \dots & q_n & \dots \end{bmatrix} }_{n \times 64} \times \underbrace{ \begin{bmatrix} \vdots & \vdots & & \vdots \\ k_1^\top & k_2^\top & \dots & k_n^\top \\ \vdots & \vdots & & \vdots \end{bmatrix} }_{64 \times n} \]
If we look at a single scalar in this resulting \(n \times n\) matrix, it is the dot product: \[\text{Score}_{i,j} = q_i \cdot k_j = \sum_{m=1}^{64} q_{i,m} k_{j,m}\]

- \(i\): the query token.
- \(j\): the key token.
- If the query of token \(i\) aligns with the key of token \(j\), this score will be high.
The scaling factor \(\frac{1}{\sqrt{d_k}}\) is needed to keep the dot products, and hence the gradients through the softmax, in a good range.
Assume your \(Q\) and \(K\) components are independent random variables with a mean of 0 and a variance of 1.
Their dot product \(q \cdot k\) over the \(d_k\) dimensions is defined as:
\[S = \sum_{i=1}^{d_k} q_i k_i\]
To find the variance of \(S\), we first look at the variance of a single term \(q_i k_i\): \[Var(q_i k_i) = E[(q_i k_i)^2] - (E[q_i k_i])^2\]

- Mean of the product: since \(q_i\) and \(k_i\) are independent, \(E[q_i k_i] = E[q_i]E[k_i] = 0 \cdot 0 = 0\).
- Variance of the product: with zero mean and independence, \(Var(q_i k_i) = E[q_i^2 k_i^2] = E[q_i^2]\,E[k_i^2] = 1 \cdot 1 = 1\).
Now, because the components are independent, the variance of the sum is the sum of the variances:
\[Var(S) = Var\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} Var(q_i k_i) = \sum_{i=1}^{d_k} 1 = d_k\]
For BERT-Base, where \(d_k = 64\), the variance of the raw dot product is 64.
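A short simulation (sample size chosen arbitrarily) confirms that the raw dot product has variance \(d_k\) and that dividing by \(\sqrt{d_k}\) brings it back to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 64, 100_000

q = rng.normal(size=(trials, d_k))      # components i.i.d. with mean 0, variance 1
k = rng.normal(size=(trials, d_k))
scores = (q * k).sum(axis=1)            # raw dot products

print(scores.var())                     # ~64, i.e. d_k
print((scores / np.sqrt(d_k)).var())    # ~1 after scaling by 1/sqrt(d_k)
```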
\[ \begin{bmatrix} 1.3 \\ 5.1\\ 2.2 \\ 0.7 \\ 1.1 \end{bmatrix} \rightarrow \dfrac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \rightarrow \begin{bmatrix} 0.02 \\ 0.90\\ 0.05 \\ 0.01 \\ 0.02 \end{bmatrix} \]
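The numbers above can be reproduced with a few lines of numpy (the max-subtraction is a standard trick for numerical stability, not part of the definition):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(np.round(softmax(np.array([1.3, 5.1, 2.2, 0.7, 1.1])), 2))
# [0.02 0.9  0.05 0.01 0.02]
```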
\[ \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) \]
\[ \text{Softmax} \left( \frac{1}{\sqrt{d_k}} \times \underbrace{ \begin{bmatrix} q_1 \cdot k_1 & q_1 \cdot k_2 & \dots & q_1 \cdot k_n \\ q_2 \cdot k_1 & q_2 \cdot k_2 & \dots & q_2 \cdot k_n \\ \vdots & \vdots & \ddots & \vdots \\ q_n \cdot k_1 & q_n \cdot k_2 & \dots & q_n \cdot k_n \end{bmatrix} }_{\text{Raw Scores } (n \times n)} \right) = \underbrace{ \begin{bmatrix} \alpha_{1,1} & \alpha_{1,2} & \dots & \alpha_{1,n} \\ \alpha_{2,1} & \alpha_{2,2} & \dots & \alpha_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{n,1} & \alpha_{n,2} & \dots & \alpha_{n,n} \end{bmatrix} }_{\text{Attention Weights } (n \times n)} \]
In this resulting matrix, every row sums to one: row \(i\) tells us how strongly token \(i\) attends to each of the \(n\) tokens.
\[ V = XW^V = \begin{bmatrix} v_{11} & v_{12} & \dots & v_{1d_k} \\ v_{21} & v_{22} & \dots & v_{2d_k} \\ \vdots & \vdots & \ddots & \vdots \\ v_{n1} & v_{n2} & \dots & v_{nd_k} \end{bmatrix}_{n \times d_k} \]
Now that we have a matrix of Attention Weights (\(\alpha\)), we use them to take a weighted average of the information stored in the Value matrix (\(V\)).
The Value represents the actual “content” of the word.
We take our \(n \times n\) attention weights and multiply them by our \(n \times 64\) Values.
\[\text{Output} = \underbrace{\text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)}_{n \times n} \times \underbrace{V}_{n \times 64}\]
To see what happens to Word 1, look at the first row of the multiplication:
\[\text{Output}_1 = \alpha_{1,1}v_1 + \alpha_{1,2}v_2 + \dots + \alpha_{1,n}v_n\]
In BERT-Base, this happens 12 times in parallel (12 heads). Each head produces an \(n \times 64\) output. We concatenate them all together:\[12 \times 64 = 768\]This brings us back to the original \(768\) dimensions, which is why the output can be passed directly into the next layer!
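Putting the pieces together, here is a minimal numpy sketch of scaled dot-product attention and the 12-head concatenation (random weights; the final output projection \(W^O\) of multi-head attention is omitted for brevity):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))       # (n, n) attention weights
    return weights @ V                              # (n, d_k) weighted average of the values

n, d, h, d_k = 5, 768, 12, 64                       # BERT-Base: 12 heads of dimension 64
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))

heads = []
for _ in range(h):
    W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) * 0.02 for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

output = np.concatenate(heads, axis=-1)             # 12 heads × 64 dims = 768
print(output.shape)                                 # (5, 768)
```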
Layer normalization (LN) greatly contributes to stabilizing training and improving performance in deep learning models, especially in transformers. It works by normalizing the activations at each layer, ensuring a more consistent gradient flow during training.
Two flavors:
The “Add” part is a Residual Connection (also called a Skip Connection).
Add
Instead of a layer outputting \(Sublayer(X)\), it outputs:
\[\text{Output} = X + Sublayer(X)\]
\[\text{Output}_{i,j} = x_{i,j} + \text{Sublayer}(X)_{i,j}\]
\[\frac{\partial(X + f(X))}{\partial X} = 1 + f'(X)\]
The constant term \(1\) guarantees that gradients can always flow back through the skip connection, even when \(f'(X)\) is very small.
Norm
The normalization step ensures that values do not explode.
\[ \text{LayerNorm}(Z) = \gamma \odot \left( \frac{Z - \mu}{\sqrt{\sigma^2 + \epsilon}} \right) + \beta\]
\[\text{Output}_j = (\gamma_j \cdot \hat{x}_j) + \beta_j\]
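A minimal numpy version of layer normalization (one \(\gamma\), \(\beta\) pair per feature, normalizing each token's 768 activations independently):

```python
import numpy as np

def layer_norm(Z, gamma, beta, eps=1e-5):
    """Normalize each row (token) over its features, then scale and shift."""
    mu = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    Z_hat = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_hat + beta

n, d = 5, 768
rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=10.0, size=(n, d))    # activations with a drifting scale
out = layer_norm(Z, gamma=np.ones(d), beta=np.zeros(d))
print(out.mean(axis=-1).round(3), out.std(axis=-1).round(3))   # ~0 and ~1 per token
```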
\[FFN(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2\]
Activation functions
After the matrix multiplication (linear), a non-linear activation function is applied. This makes the neural network a non-linear function, allowing it to achieve complex behavior. ReLU is the most widely used activation function (simple, fast to compute).
\[ \mathrm{ReLU}(x) = \max(0,x) \]
The first derivative of ReLU is
\[ \dfrac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ \text{undefined, but usually set to zero in code} & x = 0 \end{cases} \]
\[\text{GELU}(x) = x \Phi(x) \approx 0.5x \left( 1 + \tanh \left( \sqrt{\frac{2}{\pi}} (x + 0.044715x^3) \right) \right)\]
\[ \underbrace{1 \times 768}_{\text{Input}} \xrightarrow{W_1, b_1} \underbrace{1 \times 3072}_{\text{Linear Expansion}} \xrightarrow{\text{GELU}} \underbrace{1 \times 3072}_{\text{Non-linear Activation}} \xrightarrow{W_2, b_2} \underbrace{1 \times 768}_{\text{Output}} \]
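The whole feed-forward block fits in a few lines (BERT-Base sizes, random weights; the tanh approximation of GELU is the one given above):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand 768 -> 3072, apply GELU, project back to 768."""
    return gelu(x @ W1 + b1) @ W2 + b2

d, d_ff = 768, 3072
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.02, np.zeros(d)

x = rng.normal(size=(1, d))
print(ffn(x, W1, b1, W2, b2).shape)     # (1, 768)
```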
\[\text{Layer Output} = \text{LayerNorm} \left( \underbrace{X_{\text{post-attn}}}_{\text{The "Base" Knowledge}} + \underbrace{\text{FFN}(X_{\text{post-attn}})}_{\text{The "New" Insights}} \right)\]
Similar to the previous add & norm layer, this one has two purposes:

- The “Add” (residual): this ensures that even if the FFN did not learn anything useful for a particular word (like a common “the” or “a”), the original context from the attention layer is preserved perfectly.
- The “Norm” (LayerNorm): this rescales the 768 features so that the numbers do not grow progressively larger as they move from layer 1 to layer 12.
- The output of this final Norm has exactly the same shape as the input at the very beginning of the layer (\(n \times 768\)).
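As a preview of the practical sessions, the pieces of this lecture assemble into one encoder layer. The sketch below draws fresh random weights inside the function (so it is structural only, nothing is trained), drops the biases and the attention output projection, and fixes \(\gamma = 1\), \(\beta = 0\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, d_k, d_ff = 5, 768, 12, 64, 3072           # BERT-Base sizes

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(Z, eps=1e-5):
    mu, var = Z.mean(axis=-1, keepdims=True), Z.var(axis=-1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)            # gamma = 1, beta = 0 for brevity

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def init(*shape):
    return rng.normal(size=shape) * 0.02            # stand-in for learned weights

def encoder_layer(X):
    # 1. Multi-head self-attention (output projection omitted)
    heads = []
    for _ in range(h):
        Q, K, V = X @ init(d, d_k), X @ init(d, d_k), X @ init(d, d_k)
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    attn_out = np.concatenate(heads, axis=-1)       # back to (n, 768)

    # 2. Add & Norm
    X = layer_norm(X + attn_out)

    # 3. Position-wise feed-forward network
    ffn_out = gelu(X @ init(d, d_ff)) @ init(d_ff, d)

    # 4. Add & Norm
    return layer_norm(X + ffn_out)

X = rng.normal(size=(n, d))                         # embeddings + positional encodings
print(encoder_layer(X).shape)                       # (5, 768): same shape out as in
```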
::: {.callout-note icon=false title="Take home"}
This lecture introduced the Encoder architecture as it is implemented in BERT. In the next lecture, we will introduce the decoder architecture and discuss how these models are trained.
:::