Game plan
This lecture covers a review of artificial neural networks and backpropagation as a prerequisite for understanding transformers and LLMs.
Neuron
Each neuron receives the weighted sum of activations from the previous layer (or of the input values) and applies a non-linear activation function:
\[f(z) = \frac{1}{1 + e^{-z}}\]
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}\]
\[ f(x) = \max(0, x) \tag{2}\]
Approximation
A neural network with at least one hidden layer is a universal approximator: with enough hidden units it can approximate any continuous function to arbitrary accuracy
Figure 2: Weiss R, et al (2022) Applications of Neural Networks in Biomedical Data Analysis. Biomedicines. 10(7):1469.
Game plan
This lecture covers a review of artificial neural networks and backpropagation as a prerequisite for understanding transformers and LLMs.
Example ANN
The example uses the hand-written digits dataset from scikit-learn.

Figure 3: ANN Architecture: \(64 \to 128 \to 64 \to 32 \to 10\)
\[ \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,64} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,64} \\ & & & \\ \vdots & \vdots & \ddots & \vdots \\ & & & \\ w_{128,1} & w_{128,2} & \cdots & w_{128,64} \\ \end{bmatrix}^{128\times 64} \begin{bmatrix} x_1\\ x_2\\ \vdots \\ x_{64} \end{bmatrix}^{64\times 1} + \begin{bmatrix} b_1\\ b_2\\ \; \\ \vdots \\ \; \\ b_{128} \end{bmatrix}^{128\times 1} = \begin{bmatrix} z_1\\ z_2\\ \; \\ \vdots \\ \; \\ z_{128} \end{bmatrix}^{128\times 1} \] followed by activation \[ \begin{bmatrix} \sigma(z_1)\\ \sigma(z_2)\\ \; \\ \vdots \\ \; \\ \sigma(z_{128}) \end{bmatrix}^{128\times 1} = \begin{bmatrix} a_1\\ a_2\\ \; \\ \vdots \\ \; \\ a_{128} \end{bmatrix}^{128\times 1} \]
A single “layer” of computation follows a predictable two-step cycle (sketched in NumPy below):
The input vector \(\mathbf{x}\) (or activations from the previous layer \(\mathbf{a}_{i-1}\)) is multiplied by the weight matrix \(W\) and the bias \(b\) is added. \[\mathbf{z} = W \mathbf{x} + \mathbf{b}\]
A non-linear function \(\sigma\) (like Sigmoid or ReLU) is applied element-wise to the pre-activations. \[\mathbf{a} = \sigma(\mathbf{z})\]
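A minimal NumPy sketch of this cycle, using the \(64 \to 128\) layer from the example above; the random values are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))   # weights of a 64 -> 128 layer
b = rng.normal(size=(128, 1))    # bias, one entry per output neuron
x = rng.normal(size=(64, 1))     # input (or previous activations) as a column vector

z = W @ x + b     # step 1: linear transformation, shape (128, 1)
a = sigmoid(z)    # step 2: element-wise non-linearity, shape (128, 1)
print(z.shape, a.shape)
```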
Summary/Nomenclature
Note: This process is repeated for every layer in the network until the data reaches the final output node.
Each neuron in the hidden layers can be thought of as a function that takes the results of all neurons in the previous layer as input and returns a number between 0 and 1 (for a sigmoid activation).
The entire network can be thought of as a function that takes 64 numbers as input and returns 10 numbers as output.
Backpropagation
Backprop is the core algorithm behind deep learning.
A key idea of neural nets is to decompose computation into a series of layers, which form modular blocks that can be chained together into a computation graph to perform backprop.
\[ \sigma(x) = \dfrac{1}{1+e^{-x}} \]
\(z_1 = w_1x_1 + w_3x_2 + b_1\)
\(a_1 = \sigma(z_1)\)
\(z_2 = w_2x_1 + w_4x_2 + b_2\)
\(a_2 = \sigma(z_2)\)
\(a_3 = w_5a_1 + w_6a_2 + b_3\)
\(z_1\) and \(z_2\) are intermediate values that form the argument to the activation function
\(x_1\), \(x_2\), and \(y\) are constants as far as the loss function is concerned. However, the loss will change as a function of the parameters \[ \mathcal{L} = L(w_1, w_2,w_3, w_4, w_5, w_6, b_1, b_2, b_3; x_1, x_2, y) \]
Equivalently: \(\mathcal{L} = L(\Theta; \mathbf{x};y)\)
If \(f\) and \(g\) are differentiable functions, then \[ \left(f(g(x))\right)^{\prime} = f^{\prime}(g(x))\cdot g^{\prime}(x) \]
The chain rule can also be written as follows. If \(y=f(u)\) and \(u=g(x)\), then \[ \dfrac{dy}{dx} =\dfrac{dy}{du}\dfrac{du}{dx} \]
Examples
To calculate the gradient of the loss function, \(\nabla L(\Theta; \mathbf{x};y)\), we need all the partial derivatives – for all weights and biases. Let’s do this step by step.
derivative of the sigmoid activation function \[ \begin{eqnarray*} \dfrac{d\sigma(x)}{dx} =\dfrac{d}{dx} \dfrac{1}{1+e^{-x}} &=& \dfrac{-1}{(1+e^{-x})^2}\cdot (-e^{-x})&& \text{chain rule} \\ &=&\dfrac{e^{-x}}{(1+e^{-x})^2} && \text{simplify}\\ &=&\dfrac{1}{1+e^{-x}} \cdot \dfrac{e^{-x}}{1+e^{-x}}&& \text{rearrange}\\ &=&\sigma(x) \cdot \dfrac{e^{-x}}{1+e^{-x}}&& \text{definition of $\sigma(x)$}\\ &=&\sigma(x) \cdot \dfrac{1 + e^{-x} - 1}{1+e^{-x}}&& \text{add and subtract 1}\\ &=&\sigma(x) \cdot \left( \dfrac{1 + e^{-x} }{1+e^{-x}} -\dfrac{1}{1+e^{-x}} \right)&& \text{rearrange}\\ &=&\sigma(x) \cdot \left( 1 -\sigma(x) \right)&& \text{simplify using definition of $\sigma(x)$}\\ \end{eqnarray*} \]
The forward pass calculates \(\sigma(x)\). Using this derivation, the backward pass can calculate the partial derivative cheaply by reusing the stored value of \(\sigma(x)\).
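A quick numerical sanity check of this identity (the point \(x = 0.7\) is chosen arbitrarily):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)   # both approximately 0.2217
```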
\[ \mathcal{L}=\dfrac{1}{2}(y-a_3)^2 \]
Differentiating the loss with respect to the network output gives \(\dfrac{\partial \mathcal{L}}{\partial a_3} = -(y - a_3) = a_3 - y\). By the chain rule, \[ \dfrac{\partial \mathcal{L}}{\partial w_6} = \dfrac{\partial \mathcal{L}}{\partial a_3} \cdot \dfrac{\partial a_3}{\partial w_6} \]
\[ \dfrac{\partial a_3}{\partial w_6} = \dfrac{\partial (w_5a_1 + w_6a_2+b_3)}{\partial w_6} = a_2 \]
We have by the chain rule
\[ \dfrac{\partial \mathcal{L}}{\partial b_3} = \dfrac{\partial \mathcal{L}}{\partial a_3} \cdot \dfrac{\partial a_3}{\partial b_3} \]
\[ \dfrac{\partial a_3}{\partial b_3} = \dfrac{\partial (w_5a_1 + w_6a_2+b_3)}{\partial b_3} = 1 \]
We have by the chain rule
\[ \dfrac{\partial \mathcal{L}}{\partial b_2} = \dfrac{\partial \mathcal{L}}{\partial a_3} \cdot \dfrac{\partial a_3}{\partial a_2} \cdot \dfrac{\partial a_2}{\partial z_2} \cdot \dfrac{\partial z_2}{\partial b_2} \]
We had that \(\frac{\partial \mathcal{L}}{\partial a_3} = a_3 - y\)
\(a_3 = w_5a_1 + w_6a_2 + b_3\) and thus \(\frac{\partial a_3}{\partial a_2} = w_6\)
\(a_2 = \sigma(z_2)\), and so \[ \dfrac{\partial a_2}{\partial z_2} = \sigma^{\prime}(z_2) = \sigma(z_2)(1-\sigma(z_2)) = a_2(1-a_2) \]
\(z_2 = w_2x_1 + w_4x_2 + b_2\) and so \(\frac{\partial z_2}{\partial b_2} = 1\)
Putting everything together, we have \[ \dfrac{\partial \mathcal{L}}{\partial b_2} = ( a_3 - y)\cdot w_6 \cdot a_2(1-a_2) \cdot 1 \]
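We can check this result numerically for the small network above. The parameter values below are arbitrary; the analytic gradient from the chain rule should agree with a finite-difference approximation of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# arbitrary illustrative values for the toy network
x1, x2, y = 0.5, -1.2, 0.8
w1, w2, w3, w4, w5, w6 = 0.1, -0.3, 0.8, 0.2, -0.5, 0.7
b1, b2, b3 = 0.05, -0.1, 0.2

def loss(b2):
    a1 = sigmoid(w1*x1 + w3*x2 + b1)
    a2 = sigmoid(w2*x1 + w4*x2 + b2)
    a3 = w5*a1 + w6*a2 + b3
    return 0.5 * (y - a3)**2

# analytic gradient from the chain rule
a1 = sigmoid(w1*x1 + w3*x2 + b1)
a2 = sigmoid(w2*x1 + w4*x2 + b2)
a3 = w5*a1 + w6*a2 + b3
analytic = (a3 - y) * w6 * a2 * (1 - a2)

# finite-difference check
h = 1e-6
numeric = (loss(b2 + h) - loss(b2 - h)) / (2 * h)
print(analytic, numeric)   # the two values agree to several decimal places
```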
We will briefly review matrix calculus before proceeding.
Consider the function \[ \mathbf{f}(t) = \begin{bmatrix} t \cos(t) \\ t \sin (t) \\ t \end{bmatrix} \]
In general, we write such functions as
\[ \mathbf{f}(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_n(x) \end{bmatrix} \]
Tangent vector
The derivative of \(\mathbf{f}(t)\) is called the tangent vector. The derivative is defined as the derivative of each of the components of \(\mathbf{f}(t)\)
\[ \dfrac{\partial \mathbf{f}}{\partial x} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x} \\ \dfrac{\partial f_2}{\partial x} \\ \vdots \\ \dfrac{\partial f_n}{\partial x} \end{bmatrix}^{n\times 1} \tag{3}\]
\[ \mathbf{f} = \begin{bmatrix} 4x^2-7x +42\\ x^9 -21\end{bmatrix} \] then \[ \dfrac{\partial \mathbf{f}}{\partial x} = \begin{bmatrix} 8x-7 \\9x^8 \end{bmatrix} \]
Scalar field
A function taking a vector argument and returning a scalar is called a scalar field: \(f: \mathbb{R}^n \rightarrow \mathbb{R}\)
In matrix calculus notation, we write \[ \dfrac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix} \tag{4}\]
Another commonly encountered notation is the gradient, which by convention is represented as a column vector
\[ \nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{x}} \end{bmatrix}^T = \begin{bmatrix} \dfrac{\partial f}{\partial x_1}\\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} \tag{5}\]
Note
Here \(\mathbf{f}\) is a function that takes a vector input and returns a vector (the two vectors may or may not have the same dimensions), and we want to calculate its derivative \(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\).
The numerator layout convention yields a column vector for the derivative of \(\mathbf{f}(x)\) (a function that takes a scalar argument and returns a vector). Correspondingly, if \(\mathbf{f}\) is a vector and \(\mathbf{x}\) is a vector, the derivative is a matrix where each row corresponds to an element of \(\mathbf{f}\) and each column to an element of \(\mathbf{x}\).
\[ \dfrac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_n}{\partial x_1} & \dfrac{\partial f_n}{\partial x_2} & \cdots & \dfrac{\partial f_n}{\partial x_m} \\ \end{bmatrix}^{n\times m} \tag{6}\]
\[ \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_n}{\partial x_1} & \dfrac{\partial f_n}{\partial x_2} & \cdots & \dfrac{\partial f_n}{\partial x_m} \end{bmatrix}^{n\times m} = \begin{bmatrix} - \nabla f_1(\mathbf{x})^T - \\ - \nabla f_2(\mathbf{x})^T - \\ \\ \vdots \\ \\ - \nabla f_n(\mathbf{x})^T - \end{bmatrix}^{n\times m} \]
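A small numerical illustration of the Jacobian, using a hypothetical function \(\mathbf{f}: \mathbb{R}^2 \rightarrow \mathbb{R}^2\) and central differences:

```python
import numpy as np

def f(x):
    # a small vector-valued function f: R^2 -> R^2
    return np.array([x[0]**2 * x[1], 5*x[0] + np.sin(x[1])])

def numeric_jacobian(f, x, h=1e-6):
    n = len(f(x))
    m = len(x)
    J = np.zeros((n, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)   # column j: derivative w.r.t. x_j
    return J

x = np.array([1.0, 2.0])
print(numeric_jacobian(f, x))
# analytic Jacobian at (1, 2): [[2*x0*x1, x0**2], [5, cos(x1)]] = [[4, 1], [5, cos(2)]]
```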
\[ \dfrac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{21}} & \cdots & \frac{\partial f}{\partial x_{n1}} \\ \frac{\partial f}{\partial x_{12}} & \frac{\partial f}{\partial x_{22}} & \cdots & \frac{\partial f}{\partial x_{n2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{1m}} & \frac{\partial f}{\partial x_{2m}} & \cdots & \frac{\partial f}{\partial x_{nm}} \\ \end{bmatrix}^{m\times n} \tag{8}\]
\[ \dfrac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix}^{1\times n} \]
\(f: \mathbb{R}^n \rightarrow \mathbb{R}\) (function takes a vector as input and produces a scalar)
\[ \dfrac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{21}} & \cdots & \frac{\partial f}{\partial x_{n1}} \\ \frac{\partial f}{\partial x_{12}} & \frac{\partial f}{\partial x_{22}} & \cdots & \frac{\partial f}{\partial x_{n2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{1m}} & \frac{\partial f}{\partial x_{2m}} & \cdots & \frac{\partial f}{\partial x_{nm}} \\ \end{bmatrix}^{m\times n} \]
\[ \mathbf{y}^{m\times 1} = \mathbf{f}(\mathbf{x}^{n\times 1}) \]
The input to the layer is an \(n\)-dimensional vector \(\mathbf{x}^{n\times 1}\). This vector is either the actual input to the feed-forward network (FFN) or, if we are working with a hidden layer, the output of the previous layer.
The forward pass runs through each layer of the FFN. \(\mathbf{y}^{(i)}\) becomes \(\mathbf{x}^{(i+1)}\) (output of layer \(i\) becomes the input to layer \(i+1\))
The output of the final layer is used to calculate the loss. Calling this output \(\mathbf{h}\), we have \[ \mathcal{L} = L(\mathbf{h}, \mathbf{y}_{true}) \]
The loss \(\mathcal{L}\) is a scalar value.
We need to move this error back through the network to perform backpropagation.
For layers with an input vector coming from the previous layer, we have \[ \mathbf{y}^{m\times 1} = \mathbf{W}^{m\times n}\mathbf{x}^{n\times 1} + \mathbf{b}^{m\times 1} \]
For activation layers, we have \[ \mathbf{y}^{m\times 1} = \mathbf{\sigma}(\mathbf{x}^{m\times 1} ) \]
In this notation, \(\mathbf{\sigma}\) is a vector-valued function (and is shown in bold)
\[ \mathbf{\sigma}(\mathbf{x}^{m\times 1} ) = \begin{bmatrix} \sigma(x_1) \\ \sigma(x_2) \\ \vdots \\ \sigma(x_m) \\ \end{bmatrix} \]
During the backward pass, each layer receives the gradient of the loss (written here as \(E\)) with respect to its output \[ \dfrac{\partial E}{\partial \mathbf{y}} \] and must compute the gradient with respect to its input \[ \dfrac{\partial E}{\partial \mathbf{x}} = \dfrac{\partial E}{\partial \mathbf{y}} \, \dfrac{\partial \mathbf{y}}{\partial \mathbf{x}}, \] which is then passed back to the layer before it.
The entire backprop algorithm can be summarized as follows
Backprop with activation layers
In general, \(\frac{\partial E}{\partial \mathbf{y}}\) is simply the gradient handed back by the layer that was processed just before this one during backprop (i.e., the layer that follows it in the network). If the final activation layer produces the network output, then we need to calculate \(\frac{\partial E}{\partial \mathbf{y}}\) directly from the specific loss function.
For a mean-squared-error loss of the form \(E_{batch} = \frac{1}{2n}\lVert \mathbf{y} - \mathbf{t}\rVert^2\), where \(\mathbf{t}\) is the target, the derivative is: \[ \frac{\partial E_{batch}}{\partial \mathbf{y}} = \frac{1}{n} (\mathbf{y} - \mathbf{t}) \tag{10}\]
\[ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & 0 & \dots \\ 0 & \frac{\partial y_2}{\partial x_2} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix} = \begin{bmatrix} \sigma'(x_1) & 0 & \dots \\ 0 & \sigma'(x_2) & \dots \\ \vdots & \vdots & \ddots \end{bmatrix} \]
The gradient of the error \(E\) with respect to the input \(\mathbf{x}\) is defined by the matrix product of the upstream gradient and the Jacobian:
\[ \begin{aligned} \frac{\partial E}{\partial \mathbf{x}} &= \frac{\partial E}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \\ &= \begin{bmatrix} \frac{\partial E}{\partial y_1} & \dots & \frac{\partial E}{\partial y_m} \end{bmatrix} \begin{bmatrix} \sigma'(x_1) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma'(x_m) \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial E}{\partial y_1}\sigma'(x_1) & \dots & \frac{\partial E}{\partial y_m}\sigma'(x_m) \end{bmatrix} \end{aligned} \]
\[ \frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \odot \sigma'(\mathbf{x}) \tag{11}\]
Note
This identity allows us to skip expensive \(O(m^2)\) matrix multiplications in favor of \(O(m)\) element-wise operations.
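A short check of this shortcut: multiplying the upstream gradient by the explicit diagonal Jacobian gives the same result as the element-wise product (random values are used purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

rng = np.random.default_rng(1)
m = 5
x = rng.normal(size=m)          # pre-activations
dE_dy = rng.normal(size=m)      # upstream gradient (row vector)

full = dE_dy @ np.diag(sigmoid_deriv(x))   # O(m^2): explicit diagonal Jacobian
fast = dE_dy * sigmoid_deriv(x)            # O(m): element-wise product
print(np.allclose(full, fast))             # True
```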
\[ \begin{eqnarray*} \dfrac{\partial E}{\partial \mathbf{x}} &=& \dfrac{\partial E}{\partial \mathbf{y}} \dfrac{\partial \mathbf{y}}{\partial \mathbf{x}} \\ &=& \dfrac{\partial E}{\partial \mathbf{y}} \dfrac{\partial \mathbf{Wx+b}}{\partial \mathbf{x}}\\ &=& \dfrac{\partial E}{\partial \mathbf{y}} \dfrac{\partial \mathbf{Wx}}{\partial \mathbf{x}}\\ &=& \mathbf{W}^T\dfrac{\partial E}{\partial \mathbf{y}} \end{eqnarray*} \tag{12}\]
\[ \underbrace{\mathbf{W}^T}_{(n \times m)} \cdot \underbrace{\frac{\partial E}{\partial \mathbf{y}}}_{(m \times 1)} = \underbrace{\frac{\partial E}{\partial \mathbf{x}}}_{(n \times 1)} \]
If \(\mathbf{y} = \mathbf{Wx} + \mathbf{b}\), then: \[ y_1 = w_{11}x_1 + w_{12}x_2 + \dots + w_{1n}x_n + b_1 \]
Taking the derivative of the first output (\(y_1\)) with respect to the first input (\(x_1\)): \[ \frac{\partial y_1}{\partial x_1} = w_{11} \]
\[ \begin{eqnarray*} \dfrac{\partial E}{\partial \mathbf{b}} &=& \dfrac{\partial E}{\partial \mathbf{y}} \dfrac{\partial \mathbf{y}}{\partial \mathbf{b}} \\ &=& \dfrac{\partial E}{\partial \mathbf{y}} \dfrac{\partial \mathbf{Wx+b}}{\partial \mathbf{b}}\\ &=& \dfrac{\partial E}{\partial \mathbf{y}} (\mathbf{0} + \mathbf{I} )\\ &=& \dfrac{\partial E}{\partial \mathbf{y}} \end{eqnarray*} \tag{13}\]
Treating the gradient \(\frac{\partial E}{\partial \mathbf{y}}\) as a column vector (as in Equation 5): \[ \begin{eqnarray*} \dfrac{\partial E}{\partial \mathbf{W}} &=& \dfrac{\partial E}{\partial \mathbf{y}} \dfrac{\partial \mathbf{y}}{\partial \mathbf{W}} \\ &=& \dfrac{\partial E}{\partial \mathbf{y}} \dfrac{\partial \mathbf{Wx+b}}{\partial \mathbf{W}}\\ &=& \dfrac{\partial E}{\partial \mathbf{y}} (\mathbf{x}^T + \mathbf{0})\\ &=& \underbrace{\dfrac{\partial E}{\partial \mathbf{y}}}_{(m \times 1)} \underbrace{ \mathbf{x}^T }_{1\times n} \end{eqnarray*} \tag{14}\]
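The three gradients of the fully connected layer can be verified numerically. The sketch below uses an arbitrary quadratic error \(E = \frac{1}{2}\lVert\mathbf{y}-\mathbf{t}\rVert^2\) so that \(\frac{\partial E}{\partial \mathbf{y}} = \mathbf{y}-\mathbf{t}\); the shapes and target values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
W = rng.normal(size=(m, n))
b = rng.normal(size=(m, 1))
x = rng.normal(size=(n, 1))
t = rng.normal(size=(m, 1))            # an arbitrary target

def E(W, x, b):
    y = W @ x + b
    return 0.5 * np.sum((y - t)**2)    # so that dE/dy = y - t

y = W @ x + b
dE_dy = y - t                          # upstream gradient, shape (m, 1)

# analytic gradients, following Equations 12-14
dE_dx = W.T @ dE_dy                    # shape (n, 1)
dE_db = dE_dy                          # shape (m, 1)
dE_dW = dE_dy @ x.T                    # shape (m, n)

# finite-difference check of one entry of each gradient (values agree approximately)
h = 1e-6
x_p = x.copy(); x_p[0, 0] += h
W_p = W.copy(); W_p[1, 2] += h
b_p = b.copy(); b_p[0, 0] += h
print(dE_dx[0, 0], (E(W, x_p, b) - E(W, x, b)) / h)
print(dE_dW[1, 2], (E(W_p, x, b) - E(W, x, b)) / h)
print(dE_db[0, 0], (E(W, x, b_p) - E(W, x, b)) / h)
```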
Game plan
This lecture covers a review of artificial neural networks and backpropagation as a prerequisite for understanding transformers and LLMs.
Update the weights and biases of each layer \[ \begin{eqnarray*} \mathbf{W} &\leftarrow &\mathbf{W} - \eta\Delta \mathbf{W} \\ \mathbf{b} &\leftarrow &\mathbf{b} - \eta\Delta \mathbf{b} \end{eqnarray*} \]
Gradient descent is used in a variety of settings outside of backprop. Let’s do one simple example: finding the minimum of a function.
\[ f(x) = 7x^2 + 3x-9 \]
\[ \begin{eqnarray*} \dfrac{d}{dx} (7x^2 + 3x-9) &=& 14x +3 = 0 \\ x &=& - \dfrac{3}{14} \end{eqnarray*} \]
\[ x \leftarrow x-\eta (14x +3 ) \]
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 7*x**2 + 3*x - 9

def d(x):
    return 14*x + 3

eta = 0.03  # learning rate

# plot the function
x = np.linspace(-1, 1.5, 1000)
plt.figure(figsize=(3, 3))
plt.plot(x, f(x))

t = 1  # first guess
for i in range(15):
    plt.plot(t, f(t), marker='o', color='r')
    t = t - eta*d(t)

print(f"Solution: {t:.4f} (-3/14={-3/14:.4f})")
plt.axvline(x=t, color='blue', linestyle='--', linewidth=0.8)

Solution: -0.2139 (-3/14=-0.2143)
A second run with different settings stops short of the minimum: Final Solution: -0.0199 (Target: -0.2143)
\[ \mathcal{L} = L(\Theta; \mathbf{x},y) \]
If the loss landscape includes many local minima, vanilla GD may “get stuck” in a local minimum
Several procedures have been developed that add a form of momentum to the standard GD update rule - this adds inertia to the GD and potentially allows GD to move past local minima.
Recall that the standard update step is \[ x \leftarrow x - \eta\dfrac{\partial f}{\partial x} \]
With momentum, this becomes
\[ \begin{eqnarray*} v &\leftarrow & \mu v- \eta\dfrac{\partial f}{\partial x}\\ x &\leftarrow & x + v \end{eqnarray*} \]
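Applied to the quadratic example from the previous slides, the momentum update looks like this (the momentum coefficient \(\mu = 0.9\) is an illustrative choice):

```python
def d(x):
    return 14*x + 3          # derivative of f(x) = 7x^2 + 3x - 9

eta = 0.03   # learning rate
mu = 0.9     # momentum coefficient
x, v = 1.0, 0.0
for i in range(300):
    v = mu*v - eta*d(x)      # velocity accumulates past gradients
    x = x + v
print(f"Solution: {x:.4f}")  # converges to -3/14 = -0.2143
```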
Game plan
This lecture covers a review of artificial neural networks and backpropagation as a prerequisite for understanding transformers and LLMs.
Most FFN software frameworks arrange data in the transposed form.
In theoretical notation (as in the previous slides of this lecture), we generally use column vectors: \[ \underbrace{\mathbf{y}}_{m \times 1} = \underbrace{\mathbf{W}}_{m \times n} \underbrace{\mathbf{x}}_{n \times 1} + \underbrace{\mathbf{b}}_{m \times 1} \]
Frameworks such as scikit-learn and PyTorch use the transpose (row-major) form:
\[ \begin{aligned} (\mathbf{y})^T &= (\mathbf{W}\mathbf{x} + \mathbf{b})^T \\ \mathbf{y}^T &= (\mathbf{W}\mathbf{x})^T + \mathbf{b}^T \\ \underbrace{\mathbf{y}^T}_{1 \times m} &= \underbrace{\mathbf{x}^T}_{1 \times n} \underbrace{\mathbf{W}^T}_{n \times m} + \underbrace{\mathbf{b}^T}_{1 \times m} \end{aligned} \]
To transition between column-vector theory and row-major implementation (like in PyTorch or scikit-learn), we rely on these fundamental properties:
| Rule Name | Mathematical Definition | Logic / Description |
|---|---|---|
| Sum/Difference | \((A \pm B)^T = A^T \pm B^T\) | The transpose of a sum is the sum of the transposes. |
| Product Rule | \((AB)^T = B^T A^T\) | The order of multiplication reverses. This is why \(Wx\) becomes \(x^T W^T\). |
| Scalar Rule | \((\alpha A)^T = \alpha A^T\) | A scalar \(\alpha\) is unaffected by the transpose operation. |
| Involution | \((A^T)^T = A\) | Transposing a matrix twice returns it to its original form. |
| Dimensions | \((m \times n)^T = (n \times m)\) | Rows become columns; columns become rows. |
In practice, we rarely compute one sample at a time. We use batches to leverage GPU parallelism.
In NumPy/scikit-learn, a 1D array has the shape (n,). When it is used in a matrix product with dot() or @, NumPy treats it as a row vector by default:
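For example (the array shapes here are illustrative):

```python
import numpy as np

W = np.zeros((128, 64))
x = np.ones(64)                  # 1D array, shape (64,)

print((x @ W.T).shape)           # (128,) -- x behaves like a 1 x 64 row vector
print(np.dot(x, W.T).shape)      # (128,)
```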
When we move to a batch of \(k\) samples, we stack our row vectors \(\mathbf{x}^T\) into a Design Matrix \(\mathbf{X}\).
\[ \underbrace{\mathbf{Y}}_{k \times m} = \underbrace{\mathbf{X}}_{k \times n} \underbrace{\mathbf{W}^T}_{n \times m} + \underbrace{\mathbf{b}^T}_{1 \times m} \]
Broadcasting the Bias
Note that \(\mathbf{b}^T\) is \(1 \times m\), but \(\mathbf{Y}\) is \(k \times m\). Python uses broadcasting to automatically “stretch” the bias vector so it is added to every single row in the batch.
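A sketch of the batched, row-major forward pass, with made-up shapes; the last two lines confirm that it agrees with the column-vector form used earlier:

```python
import numpy as np

rng = np.random.default_rng(3)
k, n, m = 32, 64, 128                  # batch size, input dim, output dim
X = rng.normal(size=(k, n))            # design matrix: one row per sample
W = rng.normal(size=(m, n))            # weights in the "theory" orientation (m x n)
b = rng.normal(size=(1, m))            # bias stored as a row vector

Y = X @ W.T + b                        # b (1 x m) is broadcast across all k rows
print(Y.shape)                         # (32, 128)

# same result as the column-vector form, one sample at a time
y0 = W @ X[0].reshape(n, 1) + b.T      # column-vector form for the first sample
print(np.allclose(Y[0].reshape(m, 1), y0))   # True
```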
Toy FFN
To help illustrate the concepts in this lecture, we close with a demonstration of a toy Python implementation that we will use to predict the identity of hand-written digits.
The argument is a NumPy array (np.float64), and the function is vectorized: if the argument is a scalar (a single number), it returns a single number; if the argument is a vector, it returns a vector of the same dimension, applying the same function to each element.
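The sigmoid code itself is not reproduced on this slide; a minimal sketch consistent with the description is:

```python
import numpy as np
import numpy.typing as npt

def sigmoid(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    # np.exp is applied element-wise, so this works for scalars and arrays alike
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.float64(0.0)))                 # 0.5
print(sigmoid(np.array([-1.0, 0.0, 1.0])))      # array of the same shape
```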
HOMEWORK Write similar functions for sigmoid_prime (the first derivative of the sigmoid), mse (mean-squared error), and mse_prime (the first derivative of the MSE)
Network to keep track of everything

Layer = typing.Union[FullyConnectedLayer, ActivationLayer]
ArrayOfFloats = npt.NDArray[np.float64]
class Network:
    def __init__(self, verbose=True):
        self.verbose = verbose
        self.layers = []

    def add(self, layer: Layer) -> None:
        self.layers.append(layer)

    def predict(self, input_data: ArrayOfFloats) -> typing.List[ArrayOfFloats]:
        result = []
        for i in range(input_data.shape[0]):
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward(output)
            result.append(output)
        return result
    def fit(self, x_train, y_train, minibatches, learning_rate, batch_size) -> None:
        # see following slides!
        ...

The predict method takes an array of floats (in our case, a \(1\times 64\) row vector that represents one image) and returns an array of floats (a \(1\times 10\) vector that represents the predicted probabilities for the class of the image).

import numpy as np
import numpy.typing as npt
from sklearn import datasets
import typing
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
digits = datasets.load_digits()
first_digit = digits.data[0] # type: ignore
first_digit

array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,
15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,
12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,
0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
Shape of digits: '(1797, 64)'
Shape of an element: '(64,)'
Shape of X: '(1797, 1, 64)'
Shape of an element: '(1, 64)'
We reshape each element of X from (64,) into (1, 64), i.e., a row vector with 64 columns, to work with NumPy.

Shape of y: '(1797,)'
First ten values of y:
[0 1 2 3 4 5 6 7 8 9]
Training set size: 1437
Testing set size: 360
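The preprocessing code is not shown above; the following is a plausible sketch that yields the reported shapes (X as (1797, 1, 64), one-hot labels, and a 1437/360 train/test split). The exact call used in the lecture (in particular the random_state) may differ:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.data.reshape((-1, 1, 64))     # (1797, 1, 64): each image as a 1 x 64 row vector
y = digits.target                        # (1797,): integer labels 0-9
y_onehot = np.eye(10)[y]                 # (1797, 10): one-hot encoded labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot, test_size=0.2, random_state=42)
print(X_train.shape[0], X_test.shape[0])  # 1437 360
```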
network = Network()
network.add(FullyConnectedLayer(64, 128))
network.add(ActivationLayer())
network.add(FullyConnectedLayer(128, 64))
network.add(ActivationLayer())
network.add(FullyConnectedLayer(64, 10))
network.add(ActivationLayer())
network.fit(X_train, y_train, minibatches=40000, learning_rate=0.5)

If we run the predict method on a single example, we can follow the shapes through the network:
- input_data is an np.ndarray with shape (1, 64), i.e., a row vector
- output = input_data[i] means that output has shape (64,) (also a row vector)
- For FullyConnectedLayer(64, 128), self.weights is a matrix with shape (64, 128) and self.bias has shape (1, 128); the input_data argument to its forward method is the output from the calling code, i.e., (64,), a row vector
- The linear step (np.dot with the weights, plus the bias) returns a vector with shape (1, 128): \[ \mathbf{z}^{1\times 128} = \mathbf{x}^{1\times 64} W^{64\times 128} + \mathbf{b}^{1\times 128} \]
ActivationLayer(): the forward method of this class simply stores input_data in self.input and returns sigmoid(input_data) (the full class definition appears later).
- input_data (which is the output of the previous fully connected layer) has shape (1, 128).
- The sigmoid function is vectorized and returns an array of the same shape, (1, 128).
- The output of the final activation layer is a (1, 10) row vector representing the probabilities of the digits \(0,\ldots, 9\).

To evaluate the trained network on the test set, we build a confusion matrix:

out = network.predict(X_test)
cm = np.zeros((10,10), dtype="uint32")
for i in range(len(y_test)):
    # Convert the One-Hot ground truth back to an integer (0-9)
    true_label = np.argmax(y_test[i])
    # Convert the network output (probabilities) to an integer (0-9)
    predicted_label = np.argmax(out[i])
    cm[true_label, predicted_label] += 1

print()
print(np.array2string(cm))
print()
print("Accuracy = %0.7f" % (np.diag(cm).sum() / cm.sum()))
[[33 0 0 0 0 0 0 0 0 0]
[ 0 28 0 0 0 0 0 0 0 0]
[ 0 0 33 0 0 0 0 0 0 0]
[ 0 0 0 33 0 1 0 0 0 0]
[ 0 0 0 0 46 0 0 0 0 0]
[ 0 0 0 0 0 45 1 0 0 1]
[ 0 0 0 0 0 1 34 0 0 0]
[ 0 0 0 0 0 0 0 33 0 1]
[ 0 0 0 0 0 1 0 0 29 0]
[ 0 0 0 0 0 0 0 0 1 39]]
Accuracy = 0.9805556

- X_test has shape (360, 1, 64), i.e., there are 360 examples.
- In our first example, input_data.shape[0] was 1. For the combined data, it is 360.
- The remaining steps are performed 360 times (basically the same as before).
- out now becomes a list of 360 elements, each of which is a (1, 10) vector as before.
Next, let's look at the fit method, which trains the network using backpropagation.

Layer = typing.Union[FullyConnectedLayer, ActivationLayer]
ArrayOfFloats = npt.NDArray[np.float64]

def fit(self, x_train: ArrayOfFloats, y_train: npt.NDArray[np.float64],
        minibatches: int, learning_rate: float, batch_size: int=64) -> None:
    for i in range(minibatches):
        idx = np.argsort(np.random.random(x_train.shape[0]))[:batch_size]
        x_batch = x_train[idx]
        y_batch = y_train[idx]
        for j in range(batch_size):
            output = x_batch[j]
            for layer in self.layers:
                output = layer.forward(output)
            error = mse_prime(y_batch[j], output)
            for layer in reversed(self.layers):
                error = layer.backward(error)
        for layer in self.layers:
            layer.step(learning_rate)
- X_train has shape (1437, 1, 64), i.e., 1437 examples. Correspondingly, y_train holds the same number of expected outcomes, coded as one-hot vectors: (1437, 10).
- idx is a row vector of shape (64,) containing randomly chosen indices for the current minibatch.
- x_batch has shape (64, 1, 64) and y_batch has shape (64, 10).
- The inner loop performs a forward pass through all the layers, exactly as in the predict method.
- The error at the output is computed with the mse_prime function and is then propagated backwards by calling layer.backward on each layer in reverse order. To support this, we extend the FullyConnectedLayer class to be able to perform gradient descent.
- After the whole minibatch has been processed, layer.step(learning_rate) applies the accumulated updates to each layer (see below).

This is the full class for the Activation Layer:
class ActivationLayer:
    input: typing.Optional[npt.NDArray[np.float64]]

    def forward(self, input_data: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        self.input = input_data
        return sigmoid(input_data)

    def backward(self, output_error: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        # we know self.input will be npt.NDArray[np.float64]
        return sigmoid_prime(self.input) * output_error  # type: ignore

    def step(self, eta: float) -> None:
        return

Class definition of an Activation layer
The forward pass stores the input data vector and returns the sigmoid activation of that vector
The backward pass returns the first derivative of the sigmoid function on the stored input_data and multiplies it by the output_error that it receives as an argument
This is an implementation of Equation 11, where the output_error is \(\frac{\partial E}{\partial \mathbf{y}}\) \[
\frac{\partial E}{\partial \mathbf{x}} = \frac{\partial E}{\partial \mathbf{y}} \odot \sigma'(\mathbf{x})
\]
The step method does nothing, as there are no weights in the sigmoid function that could be updated by backprop.
The error returned from the final activation layer has dimension \(1\times 10\); this is passed to the layer.backward method of the preceding layer in the next iteration.

class FullyConnectedLayer:
    weights: npt.NDArray[np.float64]
    bias: npt.NDArray[np.float64]
    delta_w: npt.NDArray[np.float64]
    delta_b: npt.NDArray[np.float64]
    passes: int
    input: typing.Optional[npt.NDArray[np.float64]]

    def __init__(self, input_size: int, output_size: int):
        (... omitted ...)

    def forward(self, input_data: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        (... omitted ...)

    def backward(self, output_error: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        input_error = np.dot(output_error, self.weights.T)
        # we know self.input will be npt.NDArray[np.float64]
        weights_error = np.dot(self.input.T, output_error)  # type: ignore
        # accumulate the error over the minibatch
        self.delta_w += weights_error
        self.delta_b += output_error
        self.passes += 1
        return input_error

    def step(self, eta: float) -> None:
        (... omitted ...)

For the last FullyConnectedLayer (the layer whose output feeds the final ActivationLayer()):

- np.dot(output_error, self.weights.T) corresponds to \(\underbrace{\mathbf{W}^T}_{(n \times m)} \cdot \underbrace{\frac{\partial E}{\partial \mathbf{y}}}_{(m \times 1)}\); see Equation 12.
- output_error has shape (1, 10).
- self.weights has shape (64, 10) and self.weights.T has shape (10, 64).
- input_error has shape (1, 64).
- weights_error = np.dot(self.input.T, output_error) corresponds to \(\dfrac{\partial E}{\partial \mathbf{W}} = \underbrace{\frac{\partial E}{\partial \mathbf{y}}}_{(m \times 1)} \underbrace{ \mathbf{x}^T }_{1\times n}\); see Equation 14.
- \(\mathbf{x}^T\) corresponds to self.input.T and \(\dfrac{\partial E}{\partial \mathbf{y}}\) corresponds to output_error.
- weights_error therefore has shape (64, 10), which is exactly the size of self.weights; this allows us to use it for the update!
- The layer accumulates the gradients for the weights (self.delta_w) and the bias (self.delta_b) in each iteration.
Recall the self.delta_w += weights_error command in the previous slide: the gradients are accumulated over the examples of the minibatch, and fit applies them once per minibatch by calling layer.step(learning_rate). This is the layer.step method for the fully connected layer:

class FullyConnectedLayer:
    (..omitted...)

    def step(self, eta: float) -> None:
        self.weights -= eta * self.delta_w / self.passes
        self.bias -= eta * self.delta_b / self.passes
        # reset for the next minibatch
        self.delta_w = np.zeros(self.weights.shape)
        self.delta_b = np.zeros(self.bias.shape)
        self.passes = 0

self.delta_w += weights_error, self.delta_b += output_error, and self.passes += 1 were called once for each example in the minibatch, so step applies the average gradient over the minibatch and then resets the accumulators.

Game plan
This lecture provided a review of artificial neural networks and the matrix math behind forward and backward propagation.
| Further reading |
|---|
| Kneusel RT (2021) Math for Deep Learning. No Starch Press |
| Parr T and Howard J The Matrix Calculus You Need For Deep Learning: Web resource |
| Baydin AG et al (2018) Automatic differentiation in machine learning: a survey arXiv |