Handout: Logistic Regression

Binary Logistic Regression Notes

Terminology and variables:

$n$: number of features

$m$: size of training set

$\boldsymbol{x}$: input vector, size $n+1$

$y$: true target/output variable, always 0 or 1

$\hat{y}$: predicted target/output variable, always between 0 and 1

$x^{(i)}, y^{(i)}$: the $i$'th training example's feature vector and target variable

$\boldsymbol{w}$: weights (parameters), vector of size $n+1$

For vectorization, we also have:

$X$: matrix of input feature values, size $m \times (n+1)$, including a column of all ones on the far left for the bias term/feature (see the sketch after this list).

$\boldsymbol{y}$: vector of true target labels, length $m$

$\boldsymbol{\hat y}$: vector of predicted labels, length $m$
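For concreteness, here is a minimal NumPy sketch of building $X$ with its bias column (the data values and the name `X_raw` are made up for illustration):

```python
import numpy as np

# Hypothetical raw data: m = 4 examples, n = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])

# Prepend a column of ones (the bias feature), giving X shape (m, n+1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (4, 3)
```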

Model:

Define $z = \boldsymbol{w} \cdot \boldsymbol{x}$, and

$f(\boldsymbol{x}) = g(z)$, where

$g(z)$ is the sigmoid or logistic function, defined as

$$g(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$$

We sometimes use the notation $\sigma(\cdot)$ to refer to the sigmoid or logistic function, rather than $g$ (for reasons that will become clear once we study neural networks).

Note that our model, $f(\boldsymbol{x})$, always returns a number between 0 and 1, which we interpret as the probability of $\boldsymbol{x}$ belonging to class 1 (the positive class).
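A minimal NumPy sketch of the model (the helper names `sigmoid` and `predict_proba` are my own choices, not notation from these notes):

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z)); works on scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """The model f(x) = g(w . x): estimated probability that x is in class 1."""
    return sigmoid(np.dot(w, x))
```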

Loss function:

We define the loss function, a function to calculate the loss on a single training example, as:

$$L(\hat y, y) = \begin{cases} -\log\left( \hat y \right) & \text{if } y=1 \\ -\log\left( 1-\hat y \right) & \text{if } y=0 \end{cases}$$

where, as always, $\hat y = f(\boldsymbol{x})$, and "log" is the natural logarithm.

This can be written as a single equation as

$$L(\hat y, y) = -y \log\left( \hat y \right) - (1-y) \log\left( 1-\hat y \right)$$

Note that these two expressions for the loss function are equivalent: because $y$ is always 1 or 0, either $y=0$ or $1-y=0$, so one of the two logarithms in the formula is multiplied by zero, effectively removing it from the formula.
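A direct translation of the single-equation form into NumPy (a sketch; it assumes $0 < \hat y < 1$, which the sigmoid guarantees in exact arithmetic):

```python
import numpy as np

def loss(y_hat, y):
    """Loss on one example: -y*log(y_hat) - (1-y)*log(1-y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# e.g. loss(0.9, 1) ~ 0.105 (confident and correct),
#      loss(0.9, 0) ~ 2.303 (confident and wrong)
```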

Cost function:

The cost function still calculates the average loss over all training examples:

$$\begin{aligned} J(\boldsymbol{w}) &= \frac{1}{m}\sum_{i=1}^m L(\hat y^{(i)}, y^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} \log( \hat y^{(i)}) + (1-y^{(i)}) \log( 1-\hat y^{(i)}) \right] \end{aligned}$$
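Computed as an explicit loop over the $m$ examples, this might look like the following sketch (reusing the `sigmoid` helper from the model section):

```python
import numpy as np

def cost(w, X, y):
    """J(w): average loss over all m training examples, one example at a time."""
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        y_hat = sigmoid(np.dot(w, X[i]))  # prediction for the i'th example
        total += -y[i] * np.log(y_hat) - (1 - y[i]) * np.log(1 - y_hat)
    return total / m
```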

Vectorized version:

Define $\boldsymbol{\hat y} = g(X\boldsymbol{w}) = \sigma(X\boldsymbol{w})$ to be a vector of predictions for the feature matrix $X$. Note how this computation takes the dot product of each row of $X$ (a feature vector) with the weight vector, and then applies the sigmoid function to it.

Then the cost function can be calculated as:

$$J(\boldsymbol{w}) = -\frac{1}{m} \left[ \boldsymbol{y} \cdot \log(\boldsymbol{\hat y}) + (1-\boldsymbol{y}) \cdot \log(1-\boldsymbol{\hat y}) \right]$$

Note that $\log(\boldsymbol{\hat y})$ is a vector where the log function is applied element-wise to $\boldsymbol{\hat y}$, and similarly for $1-\boldsymbol{y}$ and $\log(1-\boldsymbol{\hat y})$. This can be done easily in NumPy and other libraries with broadcasting.
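A sketch of the vectorized cost (again reusing the earlier `sigmoid` helper; `@` is NumPy's matrix/vector product):

```python
import numpy as np

def cost_vectorized(w, X, y):
    """J(w) = -(1/m) [ y . log(y_hat) + (1-y) . log(1-y_hat) ]."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w)  # all m predictions at once, shape (m,)
    return -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / m
```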

Gradient descent:

First we find the gradient of the cost function:

$$\frac{\partial}{\partial w_j} J(\boldsymbol{w}) = \frac{1}{m} \sum_{i=1}^m \left( \hat y^{(i)} - y^{(i)} \right) x^{(i)}_j \qquad \text{for } j=0, \ldots, n$$

and so the update equations for gradient descent become:

$$w_j \leftarrow w_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^m \left( \hat y^{(i)} - y^{(i)} \right) x^{(i)}_j \qquad \text{for } j=0, \ldots, n$$
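A sketch of one descent step in this element-by-element form; note that $\boldsymbol{\hat y}$ is computed once, before any weight changes, so all the $w_j$ are updated simultaneously (again reusing the earlier `sigmoid` helper):

```python
import numpy as np

def gradient_step(w, X, y, alpha):
    """One gradient-descent step, updating each weight w_j in turn."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w)  # predictions from the *current* weights
    w_new = w.copy()
    for j in range(len(w)):
        grad_j = np.sum((y_hat - y) * X[:, j]) / m  # dJ/dw_j
        w_new[j] = w[j] - alpha * grad_j
    return w_new
```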

Vectorized version:

First we find the gradient of the cost function:

$$\begin{aligned} \nabla J(\boldsymbol{w}) &= \frac{1}{m}X^T(\boldsymbol{\hat y} - \boldsymbol{y}) \\ &= \frac{1}{m}X^T(\sigma(X\boldsymbol{w}) - \boldsymbol{y}) \end{aligned}$$

and so the update equation for gradient descent becomes:

$$\begin{aligned} \boldsymbol{w} &\leftarrow \boldsymbol{w} - \alpha \cdot \frac{1}{m}X^T(\boldsymbol{\hat y} - \boldsymbol{y}) \qquad \text{or} \\ \boldsymbol{w} &\leftarrow \boldsymbol{w} - \alpha \cdot \frac{1}{m}X^T(\sigma(X\boldsymbol{w}) - \boldsymbol{y}) \end{aligned}$$

Note that $\nabla J(\boldsymbol{w})$ is a vector of length $n+1$, so the update equation updates the entire weight vector at once.
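Putting it all together, here is a self-contained sketch of the full vectorized training loop (the learning rate and iteration count are arbitrary placeholder choices):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Fit logistic-regression weights by batch gradient descent.
    X has shape (m, n+1) with a leading column of ones; y has shape (m,)."""
    m = X.shape[0]
    w = np.zeros(X.shape[1])          # initialize all n+1 weights to zero
    for _ in range(num_iters):
        y_hat = sigmoid(X @ w)        # predictions, length m
        grad = X.T @ (y_hat - y) / m  # gradient of J, length n+1
        w -= alpha * grad             # simultaneous update of the whole vector
    return w
```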