K: number of classes (we think of the classes as being named 1,2,…,K).
x: input vector, size n+1
y: one-hot target vector - vector of all zeroes except for one 1 in the target class slot
y^: predicted target/output variable, always a probability vector; we interpret its k'th entry as the probability that the input belongs to class k.
x(i),y(i): feature vector and target vector of the i'th training example.
w1,…,wK: weight vectors, each of size n+1.
For vectorization, we also have:
X: matrix of input feature values, size (m×(n+1)), including a column of all ones on the far left, for the bias term/feature.
Y: matrix of true target labels as one-hot vectors; each vector is a row of this (m×K) matrix.
Y^: matrix of predicted labels, as probability vectors, size (m×K).
W: weight matrix of size ((n+1)×K)
Each column of this matrix corresponds to a “regular” weight vector in binary logistic regression, except that now there are K of these vectors, one in each column of the matrix.
$$W = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ w_1 & w_2 & \cdots & w_K \\ \downarrow & \downarrow & & \downarrow \end{bmatrix}$$
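As a quick sanity check on these shapes, here is a minimal NumPy sketch; the sizes m, n, K and the random data are made-up illustration values, not from the text:

```python
import numpy as np

m, n, K = 5, 2, 4                             # illustration sizes: 5 examples, 2 features, 4 classes

X = np.hstack([np.ones((m, 1)),               # column of ones on the far left, for the bias term
               np.random.randn(m, n)])        # X has shape (m, n+1)
Y = np.eye(K)[np.random.randint(K, size=m)]   # one-hot target rows, shape (m, K)
W = np.zeros((n + 1, K))                      # one weight vector per class, in the columns; shape (n+1, K)

print(X.shape, Y.shape, W.shape)              # (5, 3) (5, 4) (3, 4)
```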
Model:
Define zk=wk⋅x for k=1,…,K and z=[z1,…,zK]. (So this is a vector of dot products.)
The softmax function accepts a vector of numbers and does two things to this vector:
Each component in the vector is exponentiated, which always results in a positive number.
This vector is then normalized, by dividing each component by the sum of all the components. This preserves the relative sizes of the components in relation to each other, but forces the entries to sum to 1.
Thus, we can interpret the output of the softmax function as a probability vector.
The model's prediction is then
$$\hat{y} = f(x) = \operatorname{softmax}(z) = \left[\frac{e^{z_1}}{\sum_{j=1}^{K} e^{z_j}}, \ldots, \frac{e^{z_K}}{\sum_{j=1}^{K} e^{z_j}}\right]$$
where we can interpret y^k as the probability that the prediction is class k. Note that all the elements of f(x) sum to 1, because of the way the softmax function is defined.
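Here is a minimal NumPy sketch of this prediction for a single example; the function names are my own, and subtracting the max before exponentiating is a standard numerical-stability trick rather than part of the formula above:

```python
import numpy as np

def softmax(z):
    """Exponentiate each component, then normalize so the entries sum to 1."""
    e = np.exp(z - np.max(z))      # shifting by the max avoids overflow and leaves the result unchanged
    return e / np.sum(e)

def predict_one(x, W):
    """x: length-(n+1) vector with a leading 1; W: (n+1, K) weight matrix.
    Returns a length-K probability vector (hypothetical helper, for illustration)."""
    z = x @ W                      # z_k = w_k . x for each class k
    return softmax(z)
```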
Vectorized version:
We can alternatively write z=[z1,…,zK]=[w1⋅x,…,wK⋅x]=xW (treating x as a row vector), and therefore write y^=f(x)=softmax(xW). This is how to predict a single example. If we want to predict all the examples, we can define:
Z=XW
and
Y^=softmax(Z)
where the softmax function is applied across the rows of Z. Note how Z and Y^ are both (m×K) matrices, the same size as Y. Each row of Y^ is a probability vector, corresponding to the prediction function y^=f(x) above.
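A sketch of the vectorized prediction under the same assumptions; axis=1 with keepdims applies the exponentiate-and-normalize steps across each row of Z:

```python
def predict_all(X, W):
    """X: (m, n+1) matrix, W: (n+1, K) matrix.
    Returns Yhat of shape (m, K), each row a probability vector (hypothetical helper)."""
    Z = X @ W                                       # (m, K) matrix of dot products
    Z = Z - np.max(Z, axis=1, keepdims=True)        # row-wise stability shift (optional)
    E = np.exp(Z)
    return E / np.sum(E, axis=1, keepdims=True)     # normalize each row so it sums to 1
```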
Loss function:
We use what is called the cross-entropy loss function:
$$L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
Because y is a one-hot vector, only one of the yk terms in the formula is 1; all the rest are zeros.
Denote the yk that is 1 by yc (c standing for "correct," meaning the "correct class"). We can therefore simplify the formula to
$$L(\hat{y}, y) = -\log \hat{y}_c.$$
Equivalently, the original sum is the negative dot product of y and the vector obtained by taking the logarithm of each entry of y^, i.e. L(y^,y) = −y⋅log(y^). Note that because y is a one-hot vector, this dot product adds up a lot of zeroes, except for the entry of log y^ that corresponds to the “correct class.” (This formula will rarely be used on its own, but it is useful to look at for the vectorized cost function definition below.)
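A small sketch of the loss for a single example; both forms below give the same number, since the dot product keeps only the "correct class" entry:

```python
def cross_entropy_one(y_hat, y):
    """y_hat: predicted probability vector, y: one-hot target vector (both length K)."""
    return -np.dot(y, np.log(y_hat))   # every term is zero except -log(y_hat[c])

# Equivalent simplified form, indexing the correct class directly:
# c = np.argmax(y);  loss = -np.log(y_hat[c])
```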
Cost function:
The cost function still calculates the average loss over all training examples:
$$J(W) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(\hat{y}^{(i)}, y^{(i)}\right)$$
This can be vectorized in a number of different ways.
$$J(W) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} Y_{i,k} \log \hat{Y}_{i,k}$$
The formula above uses the Y and Y^ matrices defined earlier; the subscript i,k refers to the i'th row and the k'th column of the matrix. Note that the formula above is not a matrix multiplication (at least not in the traditional sense); it multiplies the elements of the two matrices element-by-element. This is sometimes denoted with the ⊙ symbol:
$$J(W) = -\frac{1}{m}\,\operatorname{sum}\!\left(Y \odot \log \hat{Y}\right)$$
where the “sum” notation just means add up all the entries in the resulting matrix (this can be done in one line with np.sum in Numpy). The element-by-element product can be obtained by just using * in Numpy rather than @.
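Putting this together, a sketch of the vectorized cost, reusing the hypothetical predict_all function from the earlier sketch:

```python
def cost(W, X, Y):
    """Average cross-entropy loss over all m training examples."""
    Y_hat = predict_all(X, W)                  # (m, K) matrix of predicted probabilities
    m = X.shape[0]
    return -np.sum(Y * np.log(Y_hat)) / m      # * is element-by-element; np.sum adds every entry
```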