Handout-NN-Final version
Notation
- $n$ = the number of features
- $m$ = the number of data samples
- $x^{(i)}$ is a training sample of length $n$ [row vector]
- For vectorization, $X$ is the $m \times n$ matrix of training data, one sample per row
- $L$ = the number of layers in the neural network, numbered 1 through $L$
- Layer 0 = the input layer, layer $L$ = the output layer, so technically there are $L + 1$ layers
- $n^{[l]}$ is the number of units in layer $l$, for $l$ from $0$ to $L$ (so $n^{[0]} = n$).
- $W^{[l]}$ are weight matrices, one for each layer, for $l$ from $1$ to $L$
  - Size is $n^{[l-1]} \times n^{[l]}$
  - $W^{[1]}$ maps the input layer (layer 0) to layer 1, $W^{[2]}$ maps layer 1 to layer 2, etc.
- $b^{[l]}$ are bias vectors, one for each layer, for $l$ from $1$ to $L$
  - Size is $1 \times n^{[l]}$ [row vector]
- $z^{[l]}$ are pre-activation vectors, one for each layer, for $l$ from $1$ to $L$
  - Size is $1 \times n^{[l]}$ [row vector]
  - For vectorization, we have $Z^{[l]}$, which is $m \times n^{[l]}$
- $g^{[l]}$ are the activation functions, one for each layer, for $l$ from $1$ to $L$
  - $g^{[L]}$ is most commonly the identity (for regression), sigmoid (for binary classification), or softmax (for multiclass classification).
  - $g^{[l]}$ for the other layers can be anything, but will usually be ReLU.
- $a^{[l]}$ are activation vectors, one for each layer, for $l$ from $1$ to $L$
  - Size is $1 \times n^{[l]}$ [row vector]
  - For vectorization, we have $A^{[l]}$, which is $m \times n^{[l]}$
- $y^{(i)}$ is a target value, of length $n^{[L]}$ (often $1$, but not necessarily)
  - For vectorization, $Y$ is the $m \times n^{[L]}$ matrix of ground-truth labels
  - For regression and binary classification, $n^{[L]} = 1$; for multiclass classification, $n^{[L]}$ could be any integer.
- $\hat{y}^{(i)}$, the output of the network for sample $i$, is one single predicted value, the same length as $y^{(i)}$
  - Vectorization: $\hat{Y}$ is the predictions matrix, the same dimensions as $Y$.
- $\mathcal{L}$ is the loss function, which will usually be mean squared error (MSE) for regression, and cross-entropy loss for classification.
  - We use a fancy curly $\mathcal{L}$ to distinguish it from the $L$ used for the number of layers in the neural network.
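To make the sizes above concrete, here is a minimal NumPy sketch (not required code from the handout) that allocates the $W^{[l]}$ and $b^{[l]}$ with the shapes listed above. The names layer_sizes, rng, W, and b are illustrative choices, and the small-random initialization is just one common option.

```python
import numpy as np

# layer_sizes[0] = n (number of features); layer_sizes[l] = n^[l] for l = 1..L.
# Example values only: n = 4 features, L = 3 layers, one output unit.
layer_sizes = [4, 5, 3, 1]
L = len(layer_sizes) - 1
rng = np.random.default_rng(0)

# W[l] has shape (n^[l-1], n^[l]); b[l] has shape (1, n^[l]).
# Index 0 is a placeholder so that W[l] and b[l] line up with layer number l.
W = [None] + [rng.standard_normal((layer_sizes[l - 1], layer_sizes[l])) * 0.01
              for l in range(1, L + 1)]
b = [None] + [np.zeros((1, layer_sizes[l])) for l in range(1, L + 1)]

for l in range(1, L + 1):
    print(f"layer {l}: W{l} is {W[l].shape}, b{l} is {b[l].shape}")
```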
Forward propagation
For each layer $l = 1, \ldots, L$:
- Calculate the dot product:
  - For one sample:
    - $z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]}$
  - For multiple samples:
    - $Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}$
- We use the shortcut notation that $a^{[0]} = x$ and $A^{[0]} = X$; in other words, we consider the input layer of the neural network (layer 0) to be the first “activation” layer, so we don’t have to use a separate formula for the first layer of the network.
- Calculate the activations (see the sketch after this list):
  - For one sample: $a^{[l]} = g^{[l]}(z^{[l]})$, and for multiple samples: $A^{[l]} = g^{[l]}(Z^{[l]})$
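Below is a minimal vectorized sketch of this forward pass in NumPy, assuming the list-storage convention from the Implementation Hints at the end (index 0 unused for W and b) and the binary-classification setup of the next section (sigmoid output layer, ReLU hidden layers). The function and variable names (relu, sigmoid, forward) are illustrative, not prescribed by the handout.

```python
import numpy as np

def relu(Z):
    # ReLU(z) = max(0, z), applied elementwise
    return np.maximum(0.0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, W, b):
    """Vectorized forward pass: A[0] = X, Z[l] = A[l-1] @ W[l] + b[l], A[l] = g(Z[l])."""
    L = len(W) - 1              # W[0] and b[0] are unused placeholders
    A = [X]                     # A[0] = X, shape (m, n)
    Z = [None]                  # Z[0] unused, so Z[l] lines up with layer l
    for l in range(1, L + 1):
        Z.append(A[l - 1] @ W[l] + b[l])     # shape (m, n^[l])
        g = sigmoid if l == L else relu      # sigmoid output layer, ReLU hidden layers
        A.append(g(Z[l]))
    return Z, A                  # A[L] is Y_hat, the matrix of predictions
```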
Backward propagation for binary classification
Assumptions:
- We are using binary classification, so the last layer of the network will have sigmoid activation. Every other layer will use ReLU.
- Since we are using binary classification, our loss function will be:
  - $\mathcal{L}(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$
- Note, this is sometimes tricky to compute in Python, because if $\hat{y} = 0$ or $\hat{y} = 1$, then the calculation will take the log of 0, which is $-\infty$, which will crash your code.
  - Instead, you can use scipy.special.xlogy, which computes $x \log y$ but returns 0 when $x = 0$, rather than multiplying 0 by the log of zero (see the cost-function sketch at the end of this section).
- Our cost function is:
  - $J\!\left(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}\right) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\!\left(y^{(i)}, \hat{y}^{(i)}\right)$ [this is a function of all of the $W$ and $b$ terms]
- To run gradient descent, we need $\frac{\partial J}{\partial W^{[l]}}$ and $\frac{\partial J}{\partial b^{[l]}}$ for each $W^{[l]}$ and $b^{[l]}$:
  - $\frac{\partial J}{\partial W^{[l]}}$ [this is a matrix of the same size as $W^{[l]}$]
  - $\frac{\partial J}{\partial b^{[l]}}$ [vector the same size as $b^{[l]}$]
- To compute these, we must already have computed the predictions $\hat{y}^{(i)}$ for each $x^{(i)}$; in this case, we do this with matrices/vectors.
- Given input data $X$, for each layer $l$ from 1 to $L$:
  - Pre-activation: $Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}$
  - Activation: $A^{[l]} = g^{[l]}\!\left(Z^{[l]}\right)$
  - where $g^{[l]}$ is ReLU for the hidden layers, and sigmoid for the output layer.
- And now our predictions are $\hat{Y} = A^{[L]}$.
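As a concrete illustration of the loss and cost above, here is a short NumPy/SciPy sketch that uses scipy.special.xlogy to avoid the log-of-zero problem just mentioned. The function name bce_cost and the example arrays are illustrative.

```python
import numpy as np
from scipy.special import xlogy

def bce_cost(Y, Y_hat):
    """Binary cross-entropy cost J = -(1/m) * sum over samples of
    y*log(y_hat) + (1 - y)*log(1 - y_hat).
    xlogy(p, q) computes p*log(q) but returns 0 where p == 0, so the
    0*log(0) cases do not turn into nan/-inf the way a plain np.log would."""
    m = Y.shape[0]
    return -np.sum(xlogy(Y, Y_hat) + xlogy(1.0 - Y, 1.0 - Y_hat)) / m

# Example: Y and Y_hat are both (m x 1) column vectors.
Y = np.array([[1.0], [0.0], [1.0]])
Y_hat = np.array([[0.9], [0.2], [1.0]])   # the exact 1.0 is handled safely
print(bce_cost(Y, Y_hat))
```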
Backpropagation Overview
- The key to understanding the backpropagation algorithm is realizing that each of the partial derivatives we are computing is formed from previous ones. Therefore, we will define the term $\delta^{[l]}$, which stores a specific piece of the gradient from each layer of the network and is passed backwards through the layers.
Backpropagation Step 1: Gradient at output layer
- We need to compute $\frac{\partial J}{\partial W^{[L]}}$ and $\frac{\partial J}{\partial b^{[L]}}$. We are going to do this whole thing with vectors/matrices.
- Our $\delta^{[l]}$ matrices will always be of dimension $m \times n^{[l]}$.
- We first set $\delta^{[L]} = A^{[L]} - Y$.
  - Note, the dimension of this is $(m \times 1)$.
- Next, we define:
  - $\frac{\partial J}{\partial W^{[L]}} = \frac{1}{m} \left(A^{[L-1]}\right)^T \delta^{[L]}$
    - i.e., multiplying an $(n^{[L-1]} \times m)$ matrix by an $(m \times 1)$ matrix, resulting in $(n^{[L-1]} \times 1)$.
  - $\frac{\partial J}{\partial b^{[L]}} = \frac{1}{m} \sum_{i=1}^{m} \delta^{[L]}_{i}$
    - The subscript $i$ means the $i$'th row of the matrix, but since $\delta^{[L]}$ is only one column, this whole calculation is basically “sum up all the terms in the matrix and divide by $m$.”
    - Note that $b^{[L]}$ is just one term here, so this makes sense.
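A minimal NumPy sketch of Step 1, assuming Z and A come from a forward pass like the earlier sketch and Y is the $(m \times 1)$ label matrix; output_layer_gradients and the local names are illustrative.

```python
import numpy as np

def output_layer_gradients(A, Y):
    """Step 1: delta^[L] = A^[L] - Y, then dJ/dW^[L] and dJ/db^[L]."""
    m = Y.shape[0]
    delta_L = A[-1] - Y                                   # (m x 1)
    dW_L = (A[-2].T @ delta_L) / m                        # (n^[L-1] x 1), same size as W[L]
    db_L = np.sum(delta_L, axis=0, keepdims=True) / m     # (1 x 1), same size as b[L]
    return delta_L, dW_L, db_L
```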
Backpropagation Step 2: $\delta^{[l]}$ calculations at each hidden layer
Note that this process proceeds backwards from the output layer to the input layer: we already calculated $\frac{\partial J}{\partial W^{[L]}}$ and $\frac{\partial J}{\partial b^{[L]}}$, so now we need to calculate the partial derivatives for the other $W$'s and $b$'s.
- For each layer $l$, going backwards from $L - 1$ to 1:
  - Define $\delta^{[l]} = \left(\delta^{[l+1]} \left(W^{[l+1]}\right)^T\right) \odot g'\!\left(Z^{[l]}\right)$
    - $\delta^{[l+1]}$ has size $m \times n^{[l+1]}$
    - $\left(W^{[l+1]}\right)^T$ has size $n^{[l+1]} \times n^{[l]}$
    - $g'\!\left(Z^{[l]}\right)$ is the derivative of ReLU, applied elementwise. Because the ReLU function is $\mathrm{ReLU}(z) = \max(0, z)$,
    - the derivative of this is $\mathrm{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$,
    - and so therefore $g'\!\left(Z^{[l]}\right)$ is just the $\mathrm{ReLU}'(z)$ function applied to each element of $Z^{[l]}$ independently.
    - Therefore, $g'\!\left(Z^{[l]}\right)$ has the same dimensions as $Z^{[l]}$, which is $m \times n^{[l]}$.
    - The “circle-dot” operator $\odot$ means multiply the two matrices on either side (which must have the same dimensions) element by element. In Python this can be done with the * operator, as opposed to @.
    - So all of these dimensions should make sense: $\delta^{[l+1]} \left(W^{[l+1]}\right)^T$ and $g'\!\left(Z^{[l]}\right)$ are both $m \times n^{[l]}$, and we multiply them elementwise to get $\delta^{[l]}$.
  - Then, define:
    - $\frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} \left(A^{[l-1]}\right)^T \delta^{[l]}$
      - i.e., multiplying an $(n^{[l-1]} \times m)$ matrix by an $(m \times n^{[l]})$ matrix, resulting in $(n^{[l-1]} \times n^{[l]})$.
    - $\frac{\partial J}{\partial b^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} \delta^{[l]}_{i}$
      - The subscript $i$ means the $i$'th row of the matrix, so because $\delta^{[l]}$ is $m \times n^{[l]}$, this calculation means take each row (of length $n^{[l]}$) and sum them all up, so you get one vector of size $n^{[l]}$, which matches the size of $b^{[l]}$: $1 \times n^{[l]}$.
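Here is a minimal NumPy sketch of one pass through Step 2 for a single hidden layer, assuming the W, Z, A lists from the earlier sketches and delta_next holding $\delta^{[l+1]}$; relu_prime and hidden_layer_gradients are illustrative names.

```python
import numpy as np

def relu_prime(Z):
    # ReLU'(z): 1 where z > 0, and 0 otherwise, applied elementwise
    return (Z > 0).astype(float)

def hidden_layer_gradients(W, Z, A, delta_next, l):
    """Step 2 for one hidden layer l, given delta^[l+1] in delta_next."""
    m = A[0].shape[0]
    delta = (delta_next @ W[l + 1].T) * relu_prime(Z[l])   # (m x n^[l]); * is elementwise
    dW = (A[l - 1].T @ delta) / m                          # (n^[l-1] x n^[l])
    db = np.sum(delta, axis=0, keepdims=True) / m          # (1 x n^[l])
    return delta, dW, db
```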
Backpropagation Step 3: Gradient descent
For each $l$, update $W^{[l]}$ and $b^{[l]}$ using gradient descent with learning rate $\alpha$:
- $W^{[l]} := W^{[l]} - \alpha \, \frac{\partial J}{\partial W^{[l]}}$
- $b^{[l]} := b^{[l]} - \alpha \, \frac{\partial J}{\partial b^{[l]}}$
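A small sketch of this update in NumPy, assuming dW[l] and db[l] hold $\frac{\partial J}{\partial W^{[l]}}$ and $\frac{\partial J}{\partial b^{[l]}}$; the function name gradient_descent_step and the in-place update style are illustrative.

```python
def gradient_descent_step(W, b, dW, db, alpha):
    """One gradient-descent update over all layers; index 0 of each list is unused."""
    L = len(W) - 1
    for l in range(1, L + 1):
        W[l] -= alpha * dW[l]   # W^[l] := W^[l] - alpha * dJ/dW^[l]
        b[l] -= alpha * db[l]   # b^[l] := b^[l] - alpha * dJ/db^[l]
    return W, b
```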
Implementation Hints
- Use Python lists (indexed 0 or 1 to $L$ as appropriate) to store the $W$ matrices, $b$ vectors, $\delta$ matrices, $Z$ matrices, and $A$ matrices.
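Putting the earlier sketches together, a full training loop could look something like the following. This assumes the forward, output_layer_gradients, hidden_layer_gradients, and gradient_descent_step functions sketched above, plus W and b initialized as in the Notation sketch; the name train and the epoch count are arbitrary.

```python
def train(X, Y, W, b, alpha=0.01, epochs=1000):
    """Repeatedly run forward propagation, backpropagation, and a gradient step."""
    L = len(W) - 1
    for _ in range(epochs):
        Z, A = forward(X, W, b)                      # forward propagation
        dW = [None] * (L + 1)                        # index 0 unused, as above
        db = [None] * (L + 1)
        delta, dW[L], db[L] = output_layer_gradients(A, Y)             # Step 1
        for l in range(L - 1, 0, -1):                                  # Step 2
            delta, dW[l], db[l] = hidden_layer_gradients(W, Z, A, delta, l)
        W, b = gradient_descent_step(W, b, dW, db, alpha)              # Step 3
    return W, b
```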