Handout: Logistic Regression

Binary Logistic Regression Notes

Terminology and variables:

$n$: number of features

$m$: size of training set

$\boldsymbol{x}$: input vector, size $n+1$

$y$: true target/output variable, always 0 or 1

$\hat{y}$: predicted target/output variable, always between 0 and 1

$x^{(i)}, y^{(i)}$: the $i$'th training example's feature vector and target variable

$\boldsymbol{w}$: weights (parameters), vector of size $n+1$

For vectorization, we also have:

$X$: matrix of input feature values, size $m \times (n+1)$, including a column of all ones on the far left for the bias term/feature (see the sketch after this list).

$\boldsymbol{y}$: vector of true target labels, length $m$

$\boldsymbol{\hat y}$: vector of predicted labels, length $m$
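For concreteness, here is a minimal NumPy sketch of building $X$ with its bias column (the data values and the name `X_raw` are made up for illustration):

```python
import numpy as np

# Hypothetical raw data: m = 4 examples, n = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])

# Prepend a column of ones (the bias feature), giving X shape (m, n+1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (4, 3)
```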

Model:

Define $z = \boldsymbol{w} \cdot \boldsymbol{x}$, and

$f(\boldsymbol{x}) = g(z)$, where

$g(z)$ is the sigmoid or logistic function, defined as

$$g(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$$

We sometimes use the notation $\sigma(\cdot)$ to refer to the sigmoid or logistic function, rather than $g$ (for reasons that will become clear once we study neural networks).

Note that our model, $f(\boldsymbol{x})$, always returns a number between 0 and 1, which we interpret as the probability of $\boldsymbol{x}$ belonging to class 1 (the positive class).
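A minimal NumPy sketch of the model (the helper names `sigmoid` and `predict_proba` are my own choices, not notation from these notes):

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z)); works on scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """The model f(x) = g(w . x): estimated probability that x is in class 1."""
    return sigmoid(np.dot(w, x))
```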

Loss function:

We define the loss function, a function to calculate the loss on a single training example, as:

$$L(\hat y, y) = \begin{cases} -\log\left( \hat y \right) & \text{if } y=1 \\ -\log\left( 1-\hat y \right) & \text{if } y=0 \end{cases}$$

where, as always, $\hat y = f(\boldsymbol{x})$, and "log" is the natural logarithm.

This can be written as a single equation as

$$L(\hat y, y) = -y \log\left( \hat y \right) - (1-y) \log\left( 1-\hat y \right)$$

Note that these two expressions for the loss function are equivalent: because $y$ is always 1 or 0, either $y=0$ or $1-y=0$, so one of the two logarithms in the formula is multiplied by zero, effectively removing it from the formula.
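A direct translation of the single-equation form into NumPy (a sketch; it assumes $0 < \hat y < 1$, which the sigmoid guarantees in exact arithmetic):

```python
import numpy as np

def loss(y_hat, y):
    """Loss on one example: -y*log(y_hat) - (1-y)*log(1-y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# e.g. loss(0.9, 1) ~ 0.105 (confident and correct),
#      loss(0.9, 0) ~ 2.303 (confident and wrong)
```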

Cost function:

The cost function still calculates the average loss over all training examples:

$$\begin{aligned} J(\boldsymbol{w}) &= \frac{1}{m}\sum_{i=1}^m L(\hat y^{(i)}, y^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} \log( \hat y^{(i)}) + (1-y^{(i)}) \log( 1-\hat y^{(i)}) \right] \end{aligned}$$
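Computed as an explicit loop over the $m$ examples, this might look like the following sketch (reusing the `sigmoid` helper from the model section):

```python
import numpy as np

def cost(w, X, y):
    """J(w): average loss over all m training examples, one example at a time."""
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        y_hat = sigmoid(np.dot(w, X[i]))  # prediction for the i'th example
        total += -y[i] * np.log(y_hat) - (1 - y[i]) * np.log(1 - y_hat)
    return total / m
```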

Vectorized version:

Define $\boldsymbol{\hat y} = g(X\boldsymbol{w}) = \sigma(X\boldsymbol{w})$ to be a vector of predictions for the feature matrix $X$. Note how this computation takes the dot product of each row of $X$ (a feature vector) with the weight vector, and then applies the sigmoid function to it.

Then the cost function can be calculated as:

$$J(\boldsymbol{w}) = -\frac{1}{m} \left[ \boldsymbol{y} \cdot \log(\boldsymbol{\hat y}) + (1-\boldsymbol{y}) \cdot \log(1-\boldsymbol{\hat y}) \right]$$

Note that $\log(\boldsymbol{\hat y})$ is a vector where the log function is applied element-wise to $\boldsymbol{\hat y}$, and similarly for $1-\boldsymbol{y}$ and $\log(1-\boldsymbol{\hat y})$. This can be done easily in NumPy and other libraries with broadcasting.
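A sketch of the vectorized cost (again reusing the earlier `sigmoid` helper; `@` is NumPy's matrix/vector product):

```python
import numpy as np

def cost_vectorized(w, X, y):
    """J(w) = -(1/m) [ y . log(y_hat) + (1-y) . log(1-y_hat) ]."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w)  # all m predictions at once, shape (m,)
    return -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / m
```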

Gradient descent:

First we find the gradient of the cost function:

$$\frac{\partial}{\partial w_j} J(\boldsymbol{w}) = \frac{1}{m} \sum_{i=1}^m \left( \hat y^{(i)} - y^{(i)} \right) x^{(i)}_j \qquad \text{for } j=0, \ldots, n$$

and so the update equations for gradient descent become:

$$w_j \leftarrow w_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^m \left( \hat y^{(i)} - y^{(i)} \right) x^{(i)}_j \qquad \text{for } j=0, \ldots, n$$
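A sketch of one descent step in this element-by-element form; note that $\boldsymbol{\hat y}$ is computed once, before any weight changes, so all the $w_j$ are updated simultaneously (again reusing the earlier `sigmoid` helper):

```python
import numpy as np

def gradient_step(w, X, y, alpha):
    """One gradient-descent step, updating each weight w_j in turn."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w)  # predictions from the *current* weights
    w_new = w.copy()
    for j in range(len(w)):
        grad_j = np.sum((y_hat - y) * X[:, j]) / m  # dJ/dw_j
        w_new[j] = w[j] - alpha * grad_j
    return w_new
```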

Vectorized version:

First we find the gradient of the cost function:

$$\begin{aligned} \nabla J(\boldsymbol{w}) &= \frac{1}{m}X^T(\boldsymbol{\hat y} - \boldsymbol{y}) \\ &= \frac{1}{m}X^T(\sigma(X\boldsymbol{w}) - \boldsymbol{y}) \end{aligned}$$

and so the update equation for gradient descent becomes:

$$\begin{aligned} \boldsymbol{w} &\leftarrow \boldsymbol{w} - \alpha \cdot \frac{1}{m}X^T(\boldsymbol{\hat y} - \boldsymbol{y}) \qquad \text{or} \\ \boldsymbol{w} &\leftarrow \boldsymbol{w} - \alpha \cdot \frac{1}{m}X^T(\sigma(X\boldsymbol{w}) - \boldsymbol{y}) \end{aligned}$$

Note that $\nabla J(\boldsymbol{w})$ is a vector of length $n+1$, so the update equation updates the entire weight vector at once.
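Putting it all together, here is a self-contained sketch of the full vectorized training loop (the learning rate and iteration count are arbitrary placeholder choices):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Fit logistic-regression weights by batch gradient descent.
    X has shape (m, n+1) with a leading column of ones; y has shape (m,)."""
    m = X.shape[0]
    w = np.zeros(X.shape[1])          # initialize all n+1 weights to zero
    for _ in range(num_iters):
        y_hat = sigmoid(X @ w)        # predictions, length m
        grad = X.T @ (y_hat - y) / m  # gradient of J, length n+1
        w -= alpha * grad             # simultaneous update of the whole vector
    return w
```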