Linear Regression Notes

Terminology and variables

$n$: number of features

$m$: size of the training set

$x$: feature/input variable

$y$: true target/output variable

$\hat{y}$: predicted target/output variable

$x^{(i)}, y^{(i)}$: $i$'th training feature variable and target variable

$w$: weight(s) (parameter)

$b$: bias (parameter)
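
To make the notation concrete, here is a small NumPy sketch (the toy numbers are made up, not from any particular data set) showing how $m$, $n$, $x^{(i)}$, and $y^{(i)}$ might be laid out in code:

```python
import numpy as np

# Toy data set (made up): m = 4 training examples, n = 2 features each.
X = np.array([[1.0, 3.0],
              [2.0, 1.5],
              [3.0, 4.0],
              [4.0, 2.5]])           # row i is x^(i), shape (m, n)
y = np.array([6.0, 5.0, 10.0, 9.0])  # entry i is y^(i), shape (m,)

m, n = X.shape
print(m, n)            # 4 2
print(X[0], y[0])      # x^(1) and y^(1) (0-indexed in code)
```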


Simple linear regression (one feature)

Model: $f(x) = w \cdot x + b$, where $x$ is the feature variable.

Cost function: $\displaystyle J(w, b) = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right)^2$

Remember that $\hat{y} = f(x)$ and $\hat{y}^{(i)} = f(x^{(i)}) = w x^{(i)} + b$.
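
As a sketch of how this cost might be computed, assuming NumPy and a made-up toy data set where the true relationship is exactly $y = 2x + 1$ (the helper names are just for illustration):

```python
import numpy as np

def predict(x, w, b):
    """f(x) = w*x + b for a scalar or an array of feature values."""
    return w * x + b

def cost(x, y, w, b):
    """J(w, b) = (1 / 2m) * sum((y_hat - y)^2)."""
    m = len(x)
    y_hat = predict(x, w, b)
    return np.sum((y_hat - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0, 4.0])   # one feature per example
y = np.array([3.0, 5.0, 7.0, 9.0])   # targets: exactly y = 2x + 1
print(cost(x, y, w=2.0, b=1.0))      # 0.0 at the true parameters
print(cost(x, y, w=0.0, b=0.0))      # larger for a bad guess
```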

Gradient (partial derivatives) of cost function:

$\displaystyle \frac{\partial}{\partial w} J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} = \frac{1}{m} \sum_{i=1}^m \left( f(x^{(i)}) - y^{(i)} \right) x^{(i)}$

$\displaystyle \frac{\partial}{\partial b} J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) = \frac{1}{m} \sum_{i=1}^m \left( f(x^{(i)}) - y^{(i)} \right)$
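
A small sketch of these two partial derivatives in NumPy (the function name `gradients` and the toy arrays are illustrative, not from the notes):

```python
import numpy as np

def gradients(x, y, w, b):
    """Return (dJ/dw, dJ/db) for the one-feature cost J(w, b)."""
    m = len(x)
    err = (w * x + b) - y            # y_hat - y for every example
    dj_dw = np.sum(err * x) / m      # (1/m) * sum(err * x)
    dj_db = np.sum(err) / m          # (1/m) * sum(err)
    return dj_dw, dj_db

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradients(x, y, w=2.0, b=1.0))   # (0.0, 0.0) at the minimum
```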

Gradient descent algorithm:

Initialize $w$ and $b$ randomly.

Run until convergence:

$\displaystyle w = w - \alpha \cdot \frac{\partial}{\partial w} J(w, b)$

$\displaystyle b = b - \alpha \cdot \frac{\partial}{\partial b} J(w, b)$
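
Putting the pieces together, a minimal NumPy sketch of the whole loop (toy data again; the learning rate $\alpha$ and the fixed iteration count stand in for a real convergence check):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=10_000):
    """Fit f(x) = w*x + b by gradient descent on J(w, b)."""
    m = len(x)
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=2)            # random initialization
    for _ in range(iters):
        err = (w * x + b) - y            # y_hat - y
        dj_dw = np.sum(err * x) / m
        dj_db = np.sum(err) / m
        # Update w and b simultaneously: both gradients are computed
        # from the old parameter values before either is changed.
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])       # generated from y = 2x + 1
print(gradient_descent(x, y))            # approximately (2.0, 1.0)
```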


Multiple linear regression (more than one feature)

Model: $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x} + b$, where $\boldsymbol{x}$ is a vector of features of length $n$, and $\boldsymbol{w}$ is a vector of weights of length $n$.

Note: $\boldsymbol{w} \cdot \boldsymbol{x}$ represents the dot product of $\boldsymbol{w}$ and $\boldsymbol{x}$.

Cost function: $\displaystyle J(\boldsymbol{w}, b) = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right)^2$

Remember that $\hat{y} = f(\boldsymbol{x})$ and $\hat{y}^{(i)} = f(\boldsymbol{x}^{(i)}) = \boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b$.

Gradient (partial derivatives) of cost function:

$\displaystyle \frac{\partial}{\partial w_j} J(\boldsymbol{w}, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \sum_{i=1}^m \left( f(\boldsymbol{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$

$\displaystyle \frac{\partial}{\partial b} J(\boldsymbol{w}, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) = \frac{1}{m} \sum_{i=1}^m \left( f(\boldsymbol{x}^{(i)}) - y^{(i)} \right)$
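
A vectorized sketch of these partial derivatives, assuming the training examples are stored row-wise in an $(m, n)$ NumPy array `X` (names and toy values are illustrative):

```python
import numpy as np

def gradients(X, y, w, b):
    """dJ/dw_j for every j at once, plus dJ/db; X has shape (m, n)."""
    m = X.shape[0]
    err = X @ w + b - y      # y_hat - y, shape (m,)
    dj_dw = X.T @ err / m    # entry j is (1/m) * sum_i err_i * x_j^(i)
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])  # toy features
y = np.array([8.0, 6.0, 8.0])                       # toy targets
print(gradients(X, y, w=np.zeros(2), b=0.0))
```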

Gradient descent algorithm:

Initialize $\boldsymbol{w}$ and $b$ randomly.

Run until convergence:

$\displaystyle w_1 = w_1 - \alpha \cdot \frac{\partial}{\partial w_1} J(\boldsymbol{w}, b)$

etc., for each additional component of $\boldsymbol{w}$

$\displaystyle b = b - \alpha \cdot \frac{\partial}{\partial b} J(\boldsymbol{w}, b)$
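
A sketch of the full loop for the multi-feature case (toy data generated from known parameters so the result can be eyeballed; all names are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=10_000):
    """Fit f(x) = w . x + b by gradient descent; X has shape (m, n)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    w = rng.normal(size=n)
    b = rng.normal()
    for _ in range(iters):
        err = X @ w + b - y               # y_hat - y, using old w and b
        w -= alpha * (X.T @ err) / m      # update every w_j at once
        b -= alpha * np.sum(err) / m
    return w, b

# Toy data generated from w = [1, 2], b = 3 so the answer is known.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 2.5]])
y = X @ np.array([1.0, 2.0]) + 3.0
print(gradient_descent(X, y))             # approximately [1. 2.], 3.0
```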


Multiple linear regression with matrices

We have $m$ training examples and $n$ features, so each training example $\boldsymbol{x}$ is a vector of length $n$.

Create an $m$-by-$n$ matrix called $X$, where each row of $X$ is a training example:

$X = \underset{m \times n}{\begin{bmatrix} \longleftarrow & (\boldsymbol{x}^{(1)})^T & \longrightarrow \\ \longleftarrow & (\boldsymbol{x}^{(2)})^T & \longrightarrow \\ & \vdots & \\ \longleftarrow & (\boldsymbol{x}^{(m)})^T & \longrightarrow \end{bmatrix}} = \underset{m \times n}{\begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_n \\ x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_n \\ \vdots & \vdots & & \vdots \\ x^{(m)}_1 & x^{(m)}_2 & \ldots & x^{(m)}_n \end{bmatrix}}$

Now augment $X$ with a column of 1's at the far-left edge:

$X = \underset{m \times (n+1)}{\begin{bmatrix} 1 & \longleftarrow & (\boldsymbol{x}^{(1)})^T & \longrightarrow \\ 1 & \longleftarrow & (\boldsymbol{x}^{(2)})^T & \longrightarrow \\ \vdots & & \vdots & \\ 1 & \longleftarrow & (\boldsymbol{x}^{(m)})^T & \longrightarrow \end{bmatrix}} = \underset{m \times (n+1)}{\begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x^{(m)}_1 & x^{(m)}_2 & \ldots & x^{(m)}_n \end{bmatrix}}$

So now $X$ is an $m$-by-$(n+1)$ matrix ($m$ rows and $n+1$ columns).
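
A quick NumPy sketch of the augmentation step (the numbers are arbitrary):

```python
import numpy as np

# Hypothetical feature matrix: m = 3 examples, n = 2 features.
X = np.array([[2.0, 5.0],
              [1.0, 3.0],
              [4.0, 6.0]])

# Prepend a column of 1's, giving an m-by-(n+1) design matrix.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_aug.shape)   # (3, 3), i.e. m rows and n + 1 columns
print(X_aug)
```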

Similarly, place the target values $y^{(1)}, y^{(2)}, \ldots, y^{(m)}$ into a column vector $\boldsymbol{y}$ of length $m$.

Model: $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$, where $\boldsymbol{x}$ is a vector of features of length $n+1$ in which $x_0$ is always 1, and $\boldsymbol{w}$ is a vector of weights of length $n+1$.

Note: $w_0$ takes over the role of what the parameter $b$ used to be.

Cost function: $\displaystyle J(\boldsymbol{w}) = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^m \left( f(\boldsymbol{x}^{(i)}) - y^{(i)} \right)^2$, but we can rewrite this:

First note that $X\boldsymbol{w} = \boldsymbol{\hat{y}}$ (in other words, we can predict all the target values at once).

So $\boldsymbol{\hat{y}} - \boldsymbol{y} = X\boldsymbol{w} - \boldsymbol{y}$

So $\displaystyle J(\boldsymbol{w}) = \frac{1}{2m}(\boldsymbol{\hat{y}} - \boldsymbol{y})^T(\boldsymbol{\hat{y}} - \boldsymbol{y}) = \frac{1}{2m}(X\boldsymbol{w} - \boldsymbol{y})^T(X\boldsymbol{w} - \boldsymbol{y})$
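
A small sketch checking that the matrix form and the summation form of $J(\boldsymbol{w})$ agree, using random made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # stand-in augmented design matrix, m=5, n+1=3
y = rng.normal(size=5)        # stand-in targets
w = rng.normal(size=3)        # stand-in weights
m = X.shape[0]

r = X @ w - y                                 # residual vector y_hat - y
J_matrix = (r @ r) / (2 * m)                  # (1/2m) (Xw - y)^T (Xw - y)
J_sum = np.sum((X @ w - y) ** 2) / (2 * m)    # original summation form
print(np.isclose(J_matrix, J_sum))            # True
```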

Gradient of cost function (in matrix form):

$\displaystyle \nabla J(\boldsymbol{w}) = \frac{1}{m}\left(X^T X \boldsymbol{w} - X^T \boldsymbol{y}\right) = \frac{1}{m} X^T (X\boldsymbol{w} - \boldsymbol{y})$

Note: the computation above results in an $(n+1)$-by-1 matrix, or a column vector of length $n+1$, which is the same size as $\boldsymbol{w}$.
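
A one-line NumPy version of this gradient, with a shape check to confirm it matches the size of $\boldsymbol{w}$ (random toy data again):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # m = 5 examples, n + 1 = 3 columns
y = rng.normal(size=5)
w = rng.normal(size=3)

grad = X.T @ (X @ w - y) / X.shape[0]   # (1/m) * X^T (Xw - y)
print(grad.shape)                       # (3,), the same size as w
```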

Gradient descent algorithm:

Initialize $\boldsymbol{w}$ randomly.

Run until convergence:

$\displaystyle \boldsymbol{w} = \boldsymbol{w} - \alpha \cdot \nabla J(\boldsymbol{w})$
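
The whole algorithm collapses to a few lines once $X$ carries the bias column. A sketch, with toy data generated from known parameters so the result can be checked:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=10_000):
    """Minimize J(w) by gradient descent; X is the augmented (m, n+1) matrix."""
    m, cols = X.shape
    rng = np.random.default_rng(0)
    w = rng.normal(size=cols)                  # random init, length n+1
    for _ in range(iters):
        w -= alpha * (X.T @ (X @ w - y)) / m   # w = w - alpha * grad J(w)
    return w

# Toy data generated from w = [3, 1, 2] (bias 3, then two feature weights).
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 2.5]])])
y = X @ np.array([3.0, 1.0, 2.0])
print(gradient_descent(X, y))                  # approximately [3. 1. 2.]
```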


Normal equation

For linear regression it is possible to obtain the exact solution that minimizes the cost function $J$ directly. We can use matrices to do it:

$\boldsymbol{w} = (X^T X)^{-1} X^T \boldsymbol{y}$
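
This assumes $X^T X$ is invertible. Below is a NumPy sketch of the direct translation, alongside the numerically preferable alternatives of solving the linear system or calling a least-squares routine rather than forming the inverse explicitly (the toy data is made up):

```python
import numpy as np

# Augmented design matrix (first column of 1's) and targets; toy numbers
# generated from w = [3, 1, 2] so the recovered solution is known.
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 2.5]])])
y = X @ np.array([3.0, 1.0, 2.0])

# Direct translation of w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice, solving (X^T X) w = X^T y, or using a least-squares routine
# on X itself, is preferred over forming the inverse explicitly.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_solve, w_lstsq)   # each approximately [3. 1. 2.]
```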