Handout-NN-Final version
Notation
- $n$ = the number of features
- $m$ = the number of data samples
- $x^{(i)}$ is a training sample of length $n$ [row vector]
- For vectorization, $X$ is the $m \times n$ matrix of training data, one sample per row
- $L$ = the number of layers in the neural network, numbered 1 through $L$
- Layer 0 = the input layer, layer $L$ = the output layer, so technically there are $L + 1$ layers
- $n^{[l]}$ is the number of units in layer $l$, for $l$ from $0$ to $L$ (so $n^{[0]} = n$).
- $W^{[l]}$ are weight matrices, one for each layer, for $l$ from $1$ to $L$
  - Size is $n^{[l-1]} \times n^{[l]}$
  - $W^{[1]}$ maps the input layer (layer 0) to layer 1, $W^{[2]}$ maps layer 1 to layer 2, etc.
- $b^{[l]}$ are bias vectors, one for each layer, for $l$ from $1$ to $L$
  - Size is $1 \times n^{[l]}$ [row vector]
- $z^{[l]}$ are pre-activation vectors, one for each layer, for $l$ from $1$ to $L$
  - Size is $1 \times n^{[l]}$ [row vector]
  - For vectorization, we have $Z^{[l]}$, which is $m \times n^{[l]}$
- $g^{[l]}$ are the activation functions, one for each layer, for $l$ from $1$ to $L$
  - $g^{[L]}$ is most commonly the identity (for regression), sigmoid (for binary classification), or softmax (for multiclass classification).
  - $g^{[l]}$ for the other layers can be anything, but will usually be ReLU.
- $a^{[l]}$ are activation vectors, one for each layer, for $l$ from $1$ to $L$
  - Size is $1 \times n^{[l]}$ [row vector]
  - For vectorization, we have $A^{[l]}$, which is $m \times n^{[l]}$
- $y^{(i)}$ is a target value, of length $n^{[L]}$ (often $1$, but not necessarily)
  - For vectorization, $Y$ is the $m \times n^{[L]}$ matrix of ground-truth labels
  - For regression and binary classification, $n^{[L]} = 1$; for multiclass classification, $n^{[L]}$ could be any integer.
- $\hat{y}^{(i)}$, the output of the network for sample $i$, is one single predicted value, the same length as $y^{(i)}$
  - Vectorization: $\hat{Y}$ is the predictions matrix, the same dimensions as $Y$.
- $\mathcal{L}$ is the loss function, which will usually be mean squared error (MSE) for regression, and cross-entropy loss for classification.
  - We use a fancy curly $\mathcal{L}$ to distinguish it from the $L$ used for the number of layers in the neural network.
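To make the sizes above concrete, here is a minimal NumPy sketch (not required code from the handout) that allocates the $W^{[l]}$ and $b^{[l]}$ with the shapes listed above. The names layer_sizes, rng, W, and b are illustrative choices, and the small-random initialization is just one common option.

```python
import numpy as np

# layer_sizes[0] = n (number of features); layer_sizes[l] = n^[l] for l = 1..L.
# Example values only: n = 4 features, L = 3 layers, one output unit.
layer_sizes = [4, 5, 3, 1]
L = len(layer_sizes) - 1
rng = np.random.default_rng(0)

# W[l] has shape (n^[l-1], n^[l]); b[l] has shape (1, n^[l]).
# Index 0 is a placeholder so that W[l] and b[l] line up with layer number l.
W = [None] + [rng.standard_normal((layer_sizes[l - 1], layer_sizes[l])) * 0.01
              for l in range(1, L + 1)]
b = [None] + [np.zeros((1, layer_sizes[l])) for l in range(1, L + 1)]

for l in range(1, L + 1):
    print(f"layer {l}: W{l} is {W[l].shape}, b{l} is {b[l].shape}")
```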
Forward propagation
For each layer $l = 1, \ldots, L$:
- Calculate the dot product:
  - For one sample:
    - $z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]}$
  - For multiple samples:
    - $Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}$
- We use the shortcut notation that $a^{[0]} = x$ and $A^{[0]} = X$; in other words, we consider the input layer of the neural network (layer 0) to be the first “activation” layer, so we don’t have to use a separate formula for the first layer of the network.
- Calculate the activations (see the sketch after this list):
  - For one sample: $a^{[l]} = g^{[l]}(z^{[l]})$, and for multiple samples: $A^{[l]} = g^{[l]}(Z^{[l]})$
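Below is a minimal vectorized sketch of this forward pass in NumPy, assuming the list-storage convention from the Implementation Hints at the end (index 0 unused for W and b) and the binary-classification setup of the next section (sigmoid output layer, ReLU hidden layers). The function and variable names (relu, sigmoid, forward) are illustrative, not prescribed by the handout.

```python
import numpy as np

def relu(Z):
    # ReLU(z) = max(0, z), applied elementwise
    return np.maximum(0.0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, W, b):
    """Vectorized forward pass: A[0] = X, Z[l] = A[l-1] @ W[l] + b[l], A[l] = g(Z[l])."""
    L = len(W) - 1              # W[0] and b[0] are unused placeholders
    A = [X]                     # A[0] = X, shape (m, n)
    Z = [None]                  # Z[0] unused, so Z[l] lines up with layer l
    for l in range(1, L + 1):
        Z.append(A[l - 1] @ W[l] + b[l])     # shape (m, n^[l])
        g = sigmoid if l == L else relu      # sigmoid output layer, ReLU hidden layers
        A.append(g(Z[l]))
    return Z, A                  # A[L] is Y_hat, the matrix of predictions
```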
Backward propagation for binary classification
Assumptions:
- We are using binary classification, so the last layer of the network will have sigmoid activation. Every other layer will use ReLU.
- Since we are using binary classification, our loss function will be:
  - $\mathcal{L}(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$
- Note, this is sometimes tricky to compute in Python, because if $\hat{y} = 0$ or $\hat{y} = 1$, then the calculation will take the log of 0, which is $-\infty$, which will crash your code.
  - Instead, you can use scipy.special.xlogy, which computes $x \log y$ but returns 0 when $x = 0$, rather than multiplying 0 by the log of zero (see the cost-function sketch at the end of this section).
- Our cost function is:
  - $J\!\left(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}\right) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\!\left(y^{(i)}, \hat{y}^{(i)}\right)$ [this is a function of all of the $W$ and $b$ terms]
- To run gradient descent, we need $\frac{\partial J}{\partial W^{[l]}}$ and $\frac{\partial J}{\partial b^{[l]}}$ for each $W^{[l]}$ and $b^{[l]}$:
  - $\frac{\partial J}{\partial W^{[l]}}$ [this is a matrix of the same size as $W^{[l]}$]
  - $\frac{\partial J}{\partial b^{[l]}}$ [vector the same size as $b^{[l]}$]
- To compute these, we must already have computed the predictions $\hat{y}^{(i)}$ for each $x^{(i)}$; in this case, we do this with matrices/vectors.
- Given input data $X$, for each layer $l$ from 1 to $L$:
  - Pre-activation: $Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}$
  - Activation: $A^{[l]} = g^{[l]}\!\left(Z^{[l]}\right)$
  - where $g^{[l]}$ is ReLU for the hidden layers, and sigmoid for the output layer.
- And now our predictions are $\hat{Y} = A^{[L]}$.
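As a concrete illustration of the loss and cost above, here is a short NumPy/SciPy sketch that uses scipy.special.xlogy to avoid the log-of-zero problem just mentioned. The function name bce_cost and the example arrays are illustrative.

```python
import numpy as np
from scipy.special import xlogy

def bce_cost(Y, Y_hat):
    """Binary cross-entropy cost J = -(1/m) * sum over samples of
    y*log(y_hat) + (1 - y)*log(1 - y_hat).
    xlogy(p, q) computes p*log(q) but returns 0 where p == 0, so the
    0*log(0) cases do not turn into nan/-inf the way a plain np.log would."""
    m = Y.shape[0]
    return -np.sum(xlogy(Y, Y_hat) + xlogy(1.0 - Y, 1.0 - Y_hat)) / m

# Example: Y and Y_hat are both (m x 1) column vectors.
Y = np.array([[1.0], [0.0], [1.0]])
Y_hat = np.array([[0.9], [0.2], [1.0]])   # the exact 1.0 is handled safely
print(bce_cost(Y, Y_hat))
```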
Backpropagation Overview
- The key to understanding the backpropagation algorithm is realizing that each of the partial derivatives we are computing is formed from previous ones. Therefore, we will define the term $\delta^{[l]}$, which stores a specific piece of the gradient from each layer of the network and is passed backwards through the layers.
Backpropagation Step 1: Gradient at output layer
- We need to compute $\frac{\partial J}{\partial W^{[L]}}$ and $\frac{\partial J}{\partial b^{[L]}}$. We are going to do this whole thing with vectors/matrices.
- Our $\delta^{[l]}$ matrices will always be of dimension $m \times n^{[l]}$.
- We first set $\delta^{[L]} = A^{[L]} - Y$.
  - Note, the dimension of this is $(m \times 1)$.
- Next, we define:
  - $\frac{\partial J}{\partial W^{[L]}} = \frac{1}{m} \left(A^{[L-1]}\right)^T \delta^{[L]}$
    - i.e., multiplying an $(n^{[L-1]} \times m)$ matrix by an $(m \times 1)$ matrix, resulting in $(n^{[L-1]} \times 1)$.
  - $\frac{\partial J}{\partial b^{[L]}} = \frac{1}{m} \sum_{i=1}^{m} \delta^{[L]}_{i}$
    - The subscript $i$ means the $i$'th row of the matrix, but since $\delta^{[L]}$ is only one column, this whole calculation is basically “sum up all the terms in the matrix and divide by $m$.”
    - Note that $b^{[L]}$ is just one term here, so this makes sense.
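A minimal NumPy sketch of Step 1, assuming Z and A come from a forward pass like the earlier sketch and Y is the $(m \times 1)$ label matrix; output_layer_gradients and the local names are illustrative.

```python
import numpy as np

def output_layer_gradients(A, Y):
    """Step 1: delta^[L] = A^[L] - Y, then dJ/dW^[L] and dJ/db^[L]."""
    m = Y.shape[0]
    delta_L = A[-1] - Y                                   # (m x 1)
    dW_L = (A[-2].T @ delta_L) / m                        # (n^[L-1] x 1), same size as W[L]
    db_L = np.sum(delta_L, axis=0, keepdims=True) / m     # (1 x 1), same size as b[L]
    return delta_L, dW_L, db_L
```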
Backpropagation Step 2: $\delta^{[l]}$ calculations at each hidden layer
Note that this process proceeds backwards from the output layer to the input layer: we already calculated $\frac{\partial J}{\partial W^{[L]}}$ and $\frac{\partial J}{\partial b^{[L]}}$, so now we need to calculate the partial derivatives for the other $W$'s and $b$'s.
- For each layer $l$, going backwards from $L - 1$ to 1:
  - Define $\delta^{[l]} = \left(\delta^{[l+1]} \left(W^{[l+1]}\right)^T\right) \odot g'\!\left(Z^{[l]}\right)$
    - $\delta^{[l+1]}$ has size $m \times n^{[l+1]}$
    - $\left(W^{[l+1]}\right)^T$ has size $n^{[l+1]} \times n^{[l]}$
    - $g'\!\left(Z^{[l]}\right)$ is the derivative of ReLU, applied elementwise. Because the ReLU function is $\mathrm{ReLU}(z) = \max(0, z)$,
    - the derivative of this is $\mathrm{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$,
    - and so therefore $g'\!\left(Z^{[l]}\right)$ is just the $\mathrm{ReLU}'(z)$ function applied to each element of $Z^{[l]}$ independently.
    - Therefore, $g'\!\left(Z^{[l]}\right)$ has the same dimensions as $Z^{[l]}$, which is $m \times n^{[l]}$.
    - The “circle-dot” operator $\odot$ means multiply the two matrices on either side (which must have the same dimensions) element by element. In Python this can be done with the * operator, as opposed to @.
    - So all of these dimensions should make sense: $\delta^{[l+1]} \left(W^{[l+1]}\right)^T$ and $g'\!\left(Z^{[l]}\right)$ are both $m \times n^{[l]}$, and we multiply them elementwise to get $\delta^{[l]}$.
  - Then, define:
    - $\frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} \left(A^{[l-1]}\right)^T \delta^{[l]}$
      - i.e., multiplying an $(n^{[l-1]} \times m)$ matrix by an $(m \times n^{[l]})$ matrix, resulting in $(n^{[l-1]} \times n^{[l]})$.
    - $\frac{\partial J}{\partial b^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} \delta^{[l]}_{i}$
      - The subscript $i$ means the $i$'th row of the matrix, so because $\delta^{[l]}$ is $m \times n^{[l]}$, this calculation means take each row (of length $n^{[l]}$) and sum them all up, so you get one vector of size $n^{[l]}$, which matches the size of $b^{[l]}$: $1 \times n^{[l]}$.
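Here is a minimal NumPy sketch of one pass through Step 2 for a single hidden layer, assuming the W, Z, A lists from the earlier sketches and delta_next holding $\delta^{[l+1]}$; relu_prime and hidden_layer_gradients are illustrative names.

```python
import numpy as np

def relu_prime(Z):
    # ReLU'(z): 1 where z > 0, and 0 otherwise, applied elementwise
    return (Z > 0).astype(float)

def hidden_layer_gradients(W, Z, A, delta_next, l):
    """Step 2 for one hidden layer l, given delta^[l+1] in delta_next."""
    m = A[0].shape[0]
    delta = (delta_next @ W[l + 1].T) * relu_prime(Z[l])   # (m x n^[l]); * is elementwise
    dW = (A[l - 1].T @ delta) / m                          # (n^[l-1] x n^[l])
    db = np.sum(delta, axis=0, keepdims=True) / m          # (1 x n^[l])
    return delta, dW, db
```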
Backpropagation Step 3: Gradient descent
For each $l$, update $W^{[l]}$ and $b^{[l]}$ using gradient descent with learning rate $\alpha$:
- $W^{[l]} := W^{[l]} - \alpha \, \frac{\partial J}{\partial W^{[l]}}$
- $b^{[l]} := b^{[l]} - \alpha \, \frac{\partial J}{\partial b^{[l]}}$
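A small sketch of this update in NumPy, assuming dW[l] and db[l] hold $\frac{\partial J}{\partial W^{[l]}}$ and $\frac{\partial J}{\partial b^{[l]}}$; the function name gradient_descent_step and the in-place update style are illustrative.

```python
def gradient_descent_step(W, b, dW, db, alpha):
    """One gradient-descent update over all layers; index 0 of each list is unused."""
    L = len(W) - 1
    for l in range(1, L + 1):
        W[l] -= alpha * dW[l]   # W^[l] := W^[l] - alpha * dJ/dW^[l]
        b[l] -= alpha * db[l]   # b^[l] := b^[l] - alpha * dJ/db^[l]
    return W, b
```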
Implementation Hints
- Use Python lists (indexed 0 or 1 to $L$ as appropriate) to store the $W$ matrices, $b$ vectors, $\delta$ matrices, $Z$ matrices, and $A$ matrices.
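Putting the earlier sketches together, a full training loop could look something like the following. This assumes the forward, output_layer_gradients, hidden_layer_gradients, and gradient_descent_step functions sketched above, plus W and b initialized as in the Notation sketch; the name train and the epoch count are arbitrary.

```python
def train(X, Y, W, b, alpha=0.01, epochs=1000):
    """Repeatedly run forward propagation, backpropagation, and a gradient step."""
    L = len(W) - 1
    for _ in range(epochs):
        Z, A = forward(X, W, b)                      # forward propagation
        dW = [None] * (L + 1)                        # index 0 unused, as above
        db = [None] * (L + 1)
        delta, dW[L], db[L] = output_layer_gradients(A, Y)             # Step 1
        for l in range(L - 1, 0, -1):                                  # Step 2
            delta, dW[l], db[l] = hidden_layer_gradients(W, Z, A, delta, l)
        W, b = gradient_descent_step(W, b, dW, db, alpha)              # Step 3
    return W, b
```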