K: number of classes (we think of the classes as being named 1,2,…,K).
x: input vector, size n+1
y: one-hot target vector - vector of all zeroes except for one 1 in the target class slot
y^: predicted target/output variable, always a probability vector; we interpret its k'th entry as the probability that the input belongs to class k.
x(i),y(i): feature vector and target vector of the i'th training example.
w1,…,wK: weight vectors, each of size n+1.
For vectorization, we also have:
X: matrix of input feature values, size (m×(n+1)), including a column of all ones on the far left, for the bias term/feature.
Y: matrix of true target labels as one-hot vectors; each vector is a row of this (m×K) matrix.
Y^: matrix of predicted labels, as probability vectors, size (m×K).
W: weight matrix of size ((n+1)×K)
Each column of this matrix corresponds to a “regular” weight vector in binary logistic regression, except that now there are K of these vectors, one in each column of the matrix.
$$W = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ w_1 & w_2 & \cdots & w_K \\ \downarrow & \downarrow & & \downarrow \end{bmatrix}$$
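As a quick sanity check on these shapes, here is a minimal NumPy sketch; the sizes m, n, K and the random data are made-up illustration values, not from the text:

```python
import numpy as np

m, n, K = 5, 2, 4                             # illustration sizes: 5 examples, 2 features, 4 classes

X = np.hstack([np.ones((m, 1)),               # column of ones on the far left, for the bias term
               np.random.randn(m, n)])        # X has shape (m, n+1)
Y = np.eye(K)[np.random.randint(K, size=m)]   # one-hot target rows, shape (m, K)
W = np.zeros((n + 1, K))                      # one weight vector per class, in the columns; shape (n+1, K)

print(X.shape, Y.shape, W.shape)              # (5, 3) (5, 4) (3, 4)
```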
Model:
Define zk=wk⋅x for k=1,…,K and z=[z1,…,zK]. (So this is a vector of dot products.)
The softmax function accepts a vector of numbers and does two things to this vector:
Each component in the vector is exponentiated, which always results in a positive number.
This vector is then normalized, by dividing each component by the sum of all the components. This preserves the relative sizes of the components in relation to each other, but forces the entries to sum to 1.
Thus, we can interpret the output of the softmax function as a probability vector.
The model's prediction is then
$$\hat{y} = f(x) = \operatorname{softmax}(z) = \left[\frac{e^{z_1}}{\sum_{j=1}^{K} e^{z_j}}, \ldots, \frac{e^{z_K}}{\sum_{j=1}^{K} e^{z_j}}\right]$$
where we can interpret y^k as the probability that the prediction is class k. Note that all the elements of f(x) sum to 1, because of the way the softmax function is defined.
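Here is a minimal NumPy sketch of this prediction for a single example; the function names are my own, and subtracting the max before exponentiating is a standard numerical-stability trick rather than part of the formula above:

```python
import numpy as np

def softmax(z):
    """Exponentiate each component, then normalize so the entries sum to 1."""
    e = np.exp(z - np.max(z))      # shifting by the max avoids overflow and leaves the result unchanged
    return e / np.sum(e)

def predict_one(x, W):
    """x: length-(n+1) vector with a leading 1; W: (n+1, K) weight matrix.
    Returns a length-K probability vector (hypothetical helper, for illustration)."""
    z = x @ W                      # z_k = w_k . x for each class k
    return softmax(z)
```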
Vectorized version:
We can alternatively write z=[z1,…,zK]=[w1⋅x,…,wK⋅x]=xW (treating x as a row vector), and therefore write y^=f(x)=softmax(xW). This is how to predict a single example. If we want to predict all the examples, we can define:
Z=XW
and
Y^=softmax(Z)
where the softmax function is applied across the rows of Z. Note how Z and Y^ are both (m×K) matrices, the same size as Y. Each row of Y^ is a probability vector, corresponding to the prediction function y^=f(x) above.
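A sketch of the vectorized prediction under the same assumptions; axis=1 with keepdims applies the exponentiate-and-normalize steps across each row of Z:

```python
def predict_all(X, W):
    """X: (m, n+1) matrix, W: (n+1, K) matrix.
    Returns Yhat of shape (m, K), each row a probability vector (hypothetical helper)."""
    Z = X @ W                                       # (m, K) matrix of dot products
    Z = Z - np.max(Z, axis=1, keepdims=True)        # row-wise stability shift (optional)
    E = np.exp(Z)
    return E / np.sum(E, axis=1, keepdims=True)     # normalize each row so it sums to 1
```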
Loss function:
We use what is called the cross-entropy loss function:
$$L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
Because y is a one-hot vector, only one of the yk terms in the formula is 1; all the rest are zeros.
Denote the yk that is 1 by yc (c standing for "correct," meaning the "correct class"). We can therefore simplify the formula to
$$L(\hat{y}, y) = -\log \hat{y}_c.$$
Equivalently, the original sum is the negative dot product of y and the vector obtained by taking the logarithm of each entry of y^, i.e. L(y^,y) = −y⋅log(y^). Note that because y is a one-hot vector, this dot product adds up a lot of zeroes, except for the entry of log y^ that corresponds to the “correct class.” (This formula will rarely be used on its own, but it is useful to look at for the vectorized cost function definition below.)
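A small sketch of the loss for a single example; both forms below give the same number, since the dot product keeps only the "correct class" entry:

```python
def cross_entropy_one(y_hat, y):
    """y_hat: predicted probability vector, y: one-hot target vector (both length K)."""
    return -np.dot(y, np.log(y_hat))   # every term is zero except -log(y_hat[c])

# Equivalent simplified form, indexing the correct class directly:
# c = np.argmax(y);  loss = -np.log(y_hat[c])
```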
Cost function:
The cost function still calculates the average loss over all training examples:
$$J(W) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(\hat{y}^{(i)}, y^{(i)}\right)$$
This can be vectorized in a number of different ways.
$$J(W) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} Y_{i,k} \log \hat{Y}_{i,k}$$
The formula above uses the Y and Y^ matrices defined earlier; the subscript i,k refers to the i'th row and the k'th column of the matrix. Note that the formula above is not a matrix multiplication (at least not in the traditional sense); it multiplies the elements of the two matrices element-by-element. This is sometimes denoted with the ⊙ symbol:
$$J(W) = -\frac{1}{m}\,\operatorname{sum}\!\left(Y \odot \log \hat{Y}\right)$$
where the “sum” notation just means add up all the entries in the resulting matrix (this can be done in one line with np.sum in Numpy). The element-by-element product can be obtained by just using * in Numpy rather than @.
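Putting this together, a sketch of the vectorized cost, reusing the hypothetical predict_all function from the earlier sketch:

```python
def cost(W, X, Y):
    """Average cross-entropy loss over all m training examples."""
    Y_hat = predict_all(X, W)                  # (m, K) matrix of predicted probabilities
    m = X.shape[0]
    return -np.sum(Y * np.log(Y_hat)) / m      # * is element-by-element; np.sum adds every entry
```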