sigmoid functions, gradient descent, Python
Free University Berlin
2026-04-26
Game plan
Logistic regression shares key characteristics with neural networks, including the weighted sum and activation function (for a single neuron) and the gradient-descent training loop. Studying logistic regression therefore serves as a good preparation for understanding neural networks.
Logistic Regression (LogReg) is one of the most common machine learning algorithms. It can be used to predict the probability of an event occurring based on a given labeled data set.
We will go over the algorithm in detail, because it is an excellent preparation for understanding neural networks.
Sentiment analysis uses natural language processing (NLP) and machine learning to identify, extract, and quantify emotional tones—positive, negative, or neutral—within text data.
\[ \text{training set: } \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\} \]
Example
1. Feature Representation
For each input \(x\), we have a vector: \(\mathbf{x} = [x_1, x_2, \ldots, x_n]\)
2. Classification Function
Computes the estimated class using the Sigmoid function: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
3. Objective Function
A loss function (cross-entropy) to measure how well the model is performing.
4. Optimization
Gradient Descent: The method used to minimize the loss and find the best weights.
For each input \(x\), we create a vector of features: \(\mathbf{x} = [x_1, x_2, \ldots, x_n]\).
Example
There are many different ways of generating a feature vector.
Example
In the Python exercise for this lecture, we will develop a naive feature-extraction script for sentiment analysis of movie reviews based on keyword counts.
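The exercise script itself is not reproduced here; as a rough sketch, keyword-count features might look like this (the word lists below are illustrative placeholders, not the ones used in the exercise):

```python
# Minimal keyword-count feature extraction for movie reviews.
# POSITIVE/NEGATIVE are toy word lists, chosen for illustration only.
POSITIVE = {"great", "excellent", "wonderful", "enjoyable"}
NEGATIVE = {"boring", "terrible", "awful", "dull"}

def extract_features(review: str) -> list[float]:
    """Return [positive-word count, negative-word count] for a review."""
    tokens = review.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return [float(pos), float(neg)]

x = extract_features("A wonderful and enjoyable film , never boring")
print(x)  # [2.0, 1.0]
```

A real pipeline would tokenize more carefully (punctuation, negation), but the output is already a feature vector \(\mathbf{x} = [x_1, x_2]\) as defined above.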
Weights
LogReg learns a vector of weights and a bias term to solve the classification task
\[ z = \sum_{i=1}^n w_ix_i + b \]
This sum can be written identically in dot-product notation, where bold font indicates the vector of weights (\(\mathbf{w}\)) and inputs (\(\mathbf{x}\)):
\[ z = \mathbf{w}\cdot \mathbf{x} + b \]
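In NumPy the dot-product form is a one-liner; the weights, inputs, and bias below are toy numbers for illustration:

```python
import numpy as np

# z = w . x + b for a single example (toy numbers)
w = np.array([0.5, -0.25, 1.0])
x = np.array([2.0, 4.0, 1.0])
b = 0.1

z = np.dot(w, x) + b   # identical to sum(w_i * x_i) + b
print(z)               # 1.0 - 1.0 + 1.0 + 0.1 = 1.1
```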
Sigmoid function
To create a probability, we transform z through the sigmoid function, \(\sigma(z)\)
To map the real-valued \(z\) into a probability \(P \in (0,1)\), we use:
\[ \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+\exp(-z)} \]
\[ \begin{eqnarray*} 1-\sigma(x) &=& 1- \frac{1}{1+e^{-x}}\\ &=& \frac{1+e^{-x} - 1}{1+e^{-x}} \\ &=& \frac{e^{-x}}{1+e^{-x}} \cdot \frac{e^{x}}{e^{x}}\\ &=& \frac{1}{1+e^{x}} \\ &=& \sigma(-x) \\ \end{eqnarray*} \]
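Both the definition and the identity \(1-\sigma(x) = \sigma(-x)\) are easy to check numerically (a quick sketch, not part of the exercise):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                 # 0.5, the midpoint of the sigmoid
x = 2.0
print(1 - sigmoid(x), sigmoid(-x))  # equal, confirming the identity
```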
\[ \begin{eqnarray*} z &=& \ln \left( \frac{p}{1-p} \right)\quad\quad \text{Log odds}\\ \end{eqnarray*} \]
\[\begin{align*} \sigma(\text{logit}(p)) &= \frac{1}{1 + e^{-\ln \left( \frac{p}{1-p} \right)}} && \text{Substitute Logit into Sigmoid } z \\ &= \frac{1}{1 + e^{\ln \left( \frac{1-p}{p} \right)}} && \text{Apply } -\ln(A) = \ln(1/A) \\ &= \frac{1}{1 + \frac{1-p}{p}} && \text{Identity: } e^{\ln(x)} = x \\ &= \frac{1}{\frac{p + 1 - p}{p}} && \text{Common denominator} \\ &= \frac{1}{1/p} && \text{Simplify} \\ &= p \end{align*}\]
We then have the rule:
\[ \mathrm{classification}(x) = \begin{cases} 1 & \text{if }P(y = 1|x) > 0.5 \\ 0 & \text{otherwise} \end{cases} \]
\[ \hat{y} = \sigma(\mathbf{w}\cdot\mathbf{x} + b) \]
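Putting the pieces together, a minimal classifier might look as follows; the weights and inputs are toy values, and `classify` is a name chosen here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, w, b):
    """Return 1 if P(y=1|x) > 0.5, else 0."""
    p = sigmoid(np.dot(w, x) + b)
    return int(p > 0.5)

w = np.array([1.0, -1.0]); b = 0.0
print(classify(np.array([3.0, 1.0]), w, b))  # z = 2 > 0, so p > 0.5 -> 1
print(classify(np.array([1.0, 3.0]), w, b))  # z = -2 < 0, so p < 0.5 -> 0
```

Note that \(P(y=1|x) > 0.5\) is equivalent to \(z > 0\), since \(\sigma(0) = 0.5\).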
We use conditional maximum likelihood estimation: we choose the parameters \(\mathbf{w}, b\) that maximize the log probability of the true \(y\) labels in the training data given the observations \(x\).
The resulting loss function is the negative log likelihood loss, also known as the cross-entropy loss.
The weights should maximize the probability of the correct label \(p(y|x)\).
It is convenient to write this as \[ \log p(y|x) = \log[\hat{y}^y(1-\hat{y})^{1-y}] = y\log\hat{y} + (1-y)\log(1-\hat{y}) \]
Note this simplifies to either \(\log\hat{y}\) or \(\log(1-\hat{y})\) depending on whether \(y\) is 0 or 1.
Since it is more convenient to minimize a function, we multiply it by \(-1\). The result is known as the cross-entropy loss:
\[ \mathcal{L}_{CE}(\hat{y}, y) = - y\log\hat{y} - (1-y)\log(1-\hat{y}) \]
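As a quick numerical sketch, the loss is small for a confident correct prediction and large for a confident wrong one (toy values):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross-entropy loss for a single example."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(cross_entropy(0.9, 1))  # confident and correct: -ln(0.9), about 0.105
print(cross_entropy(0.9, 0))  # confident and wrong:  -ln(0.1), about 2.303
```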
\[ \mathcal{L}_{CE}(\hat{y}, y) = - y\log\sigma(\mathbf{w}\cdot \mathbf{x} + b) - (1-y)\log(1-\sigma(\mathbf{w}\cdot \mathbf{x} + b)) \]
\[ \mathcal{L}_{CE}(\hat{\mathbf{y}}, \mathbf{y}) = - \mathbf{y}\log\sigma(\mathbf{Xw} + b) - (1-\mathbf{y})\log(1-\sigma(\mathbf{Xw} + b)) \]
For a perfect prediction on a positive example (\(y = 1\), \(\hat{y} = 1\)), the loss vanishes:
\[ \mathcal{L}_{CE}(\hat{y}, y) = - 1\log 1 = 0 \]
Optimization Goal
To create a high-performing classifier, we minimize the Cost Function \(J(\theta)\), which is the average loss over the entire training set of \(m\) examples.
\[ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \]
\[ \hat{\theta} = \arg\min_{\theta} J(\theta) \]
Gradient
The gradient descent algorithm computes the gradient of the loss function at the current point and moves in the opposite direction.
Another commonly encountered object is the gradient, which by convention is represented as a column vector:
\[ \nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{x}} \end{bmatrix}^T = \begin{bmatrix} \dfrac{\partial f}{\partial x_1}\\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} \tag{1}\]
To calculate the gradient of the loss function, \(\nabla L(\Theta; \mathbf{x}, y)\), we need all the partial derivatives – for all weights and the bias. Let’s do this step by step.
\[ \frac{d\sigma(x)}{dx} = \sigma(x)(1-\sigma(x)) \]
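This derivative identity can be verified numerically with a central finite difference (a quick sketch; the point \(z = 0.7\) is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))               # the identity above
print(numeric, analytic)  # agree to many decimal places
```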
\[ \nabla J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial w_1} \\ \frac{\partial J}{\partial w_2} \\ \frac{\partial J}{\partial b} \end{bmatrix} \]
For Logistic Regression, the partial derivative for any specific weight \(w_j\) over \(m\) examples is:
\[ \frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)} \]

Remember that \(\hat{y}\) is the prediction, that \(\hat{y}=\sigma(z)\), and thus \(\sigma(z)(1-\sigma(z)) = \hat{y}(1-\hat{y})\).
If \(f\) and \(g\) are differentiable functions, then \[ \left(f(g(x))\right)^{\prime} = f^{\prime}(g(x))\cdot g^{\prime}(x) \]
The chain rule can also be written as follows. If \(y=f(u)\) and \(u=g(x)\), then \[ \dfrac{dy}{dx} =\dfrac{dy}{du}\dfrac{du}{dx} \]
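A quick numerical check of the chain rule, using the hypothetical composition \(f(u)=u^2\), \(g(x)=3x+1\) (chosen only for illustration):

```python
import numpy as np

# y = f(g(x)) with f(u) = u**2 and g(x) = 3x + 1,
# so the chain rule gives dy/dx = 2*(3x + 1) * 3.
def y(x):
    return (3*x + 1) ** 2

x, h = 0.5, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # finite-difference estimate
chain_rule = 2 * (3*x + 1) * 3             # analytic chain-rule result
print(numeric, chain_rule)                 # both 15.0 at x = 0.5
```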
Examples
We are differentiating the loss with respect to a weight \(w_j\).
The chain
Loss (\(\mathcal{L}\)) depends on Prediction (\(\hat{y}\)), which depends on Logit (\(z\)), which depends on Weight (\(w_j\)).
\[ \frac{\partial \mathcal{L}}{\partial w_j} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_j} \]
\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial \hat{y}} &= -\frac{\partial }{\partial \hat{y}} [y \ln \hat{y} + (1-y) \ln(1-\hat{y})] \\ &= -y \frac{\partial }{\partial \hat{y}}\ln \hat{y} - (1-y)\frac{\partial }{\partial \hat{y}}\ln(1-\hat{y}) \\ &= -\frac{y}{ \hat{y}} - (1-y) \frac{-1}{1-\hat{y}} && \text{recall } \frac{d \ln x}{dx} = \frac{1}{x} \\ &= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} && \\ &= \frac{-y(1-\hat{y}) + \hat{y}(1-y)}{\hat{y}(1-\hat{y})} && \text{common denominator} \\ &= \frac{-y + y\hat{y} + \hat{y} - y\hat{y}}{\hat{y}(1-\hat{y})} && \\ &= \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} && \text{final result} \end{align*} \]
Recalling this result \[ \frac{d\sigma(x)}{dx} = \sigma(x)(1-\sigma(x)) \]
and remembering that we define \[ \hat{y} = \sigma(\mathbf{w}\cdot\mathbf{x} + b) \]
We have that
\[ \frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}) \]
Since \(z = w_1x_1 + w_2x_2 + \dots + b\), the derivative with respect to \(w_j\) is simply the feature it is multiplied by: \[ \frac{\partial z}{\partial w_j} = x_j \]
\[ \frac{\partial \mathcal{L}}{\partial w_j} = \underbrace{\left[ \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \right]}_{\text{Part A}} \cdot \underbrace{\left[ \hat{y}(1-\hat{y}) \right]}_{\text{Part B}} \cdot \underbrace{x_j}_{\text{Part C}} \]
\[ \frac{\partial \mathcal{L}}{\partial w_j} = (\hat{y} - y)x_j \]
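The closed form \((\hat{y}-y)x_j\) can be checked against a finite-difference approximation of the loss; all the numbers below are toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(np.dot(w, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

w = np.array([0.2, -0.4]); b = 0.1
x = np.array([1.5, 2.0]);  y = 1

y_hat = sigmoid(np.dot(w, x) + b)
analytic = (y_hat - y) * x[0]          # closed form for dL/dw_0

h = 1e-6                               # central difference in w_0
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += h; w_minus[0] -= h
numeric = (loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * h)
print(analytic, numeric)               # the two values agree closely
```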
Intuition
The gradient is the product of the prediction error \((\hat{y} - y)\) and the input feature \(x_j\).
\[ \frac{\partial \mathcal{L}}{\partial b} = (\hat{y} - y) \]
Update the weights and biases of each layer \[ \begin{eqnarray*} \mathbf{W} &\leftarrow &\mathbf{W} - \eta\Delta \mathbf{W} \\ \mathbf{b} &\leftarrow &\mathbf{b} - \eta\Delta \mathbf{b} \end{eqnarray*} \]
Gradient descent is used in a variety of settings outside of backpropagation. Let’s do one simple example – finding the minimum of a function:
\[ f(x) = 7x^2 + 3x-9 \]
\[ \begin{eqnarray*} \dfrac{d}{dx} (7x^2 + 3x-9) &=& 14x + 3 = 0 \\ x &=& -\dfrac{3}{14} \end{eqnarray*} \]
\[ x \leftarrow x-\eta (14x +3 ) \]
```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 7*x**2 + 3*x - 9

def d(x):
    return 14*x + 3

eta = 0.03  # learning rate

# plot the function
x = np.linspace(-1, 1.5, 1000)
plt.figure(figsize=(3, 3))
plt.plot(x, f(x))

t = 1  # first guess
for i in range(15):
    plt.plot(t, f(t), marker='o', color='r')
    t = t - eta*d(t)

print(f"Solution: {t:.4f} (-3/14={-3/14:.4f})")
plt.axvline(x=t, color='blue', linestyle='--', linewidth=0.8)
```

Solution: -0.2139 (-3/14=-0.2143)
\[ \mathcal{L} = L(\Theta; \mathbf{x},y) \]
\[ \begin{array}{ll} \hline \mathbf{Algorithm:} & \text{Gradient Descent Update} \\ \hline 1: & \text{Initialize } \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R} \\ 2: & \textbf{for } i \leftarrow 1 \text{ to } \text{iterations } \textbf{do} \\ 3: & \quad \hat{\mathbf{y}} \leftarrow \sigma(\mathbf{Xw} + b) \\ 4: & \quad \mathbf{w} \leftarrow \mathbf{w} - \eta \frac{1}{m} \mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \\ 5: & \quad b \leftarrow b - \eta \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)}) \\ 6: & \textbf{end for} \\ \hline \end{array} \]
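The algorithm can be sketched in NumPy; the synthetic data, seed, and hyperparameters below are illustrative choices, not part of the course material:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, linearly separable data: y = 1 when x1 + x2 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2); b = 0.0
eta, m = 0.5, len(y)

for _ in range(500):
    y_hat = sigmoid(X @ w + b)                # forward pass on all examples
    w -= eta * (X.T @ (y_hat - y)) / m        # vectorized weight update
    b -= eta * np.sum(y_hat - y) / m          # bias update

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Because the data is separable by the line \(x_1 + x_2 = 0\), the learned weights align with \((1, 1)\) and training accuracy approaches 1.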
Vectorization
It is convenient to stack many examples into a matrix
\[ \mathbf{\hat{y}} = \sigma(\mathbf{Xw} + b) \]
\[ \underbrace{\hat{\mathbf{y}}}_{m \times 1} = \sigma \left( \underbrace{\mathbf{X}}_{m \times n} \cdot \underbrace{\mathbf{w}}_{n \times 1} + b \right) \]
\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} \]
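The shapes can be verified with a small toy design matrix (the numbers are arbitrary):

```python
import numpy as np

m, n = 4, 3                                      # 4 examples, 3 features
X = np.arange(m * n, dtype=float).reshape(m, n)  # (m, n) design matrix
w = np.ones(n)                                   # (n,) weight vector
b = 0.5                                          # scalar bias, broadcast over rows

z = X @ w + b                                    # shape (m,): one logit per example
print(z.shape)  # (4,)
```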
\[ \sigma(z) = \frac{1}{1+e^{-z}} \]
\[ \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \quad\quad 1\leq i \leq K \]
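A standard NumPy implementation is shown below; subtracting the maximum before exponentiating is a common numerical-stability trick, not specific to this course, and it leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability (result is unchanged)
    e = np.exp(z)
    return e / e.sum()

# probabilities are positive, ordered like the logits, and sum to 1
p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())
```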