Linear Regression Notes

Terminology and variables

$n$: number of features

$m$: size of the training set

$x$: feature/input variable

$y$: true target/output variable

$\hat{y}$: predicted target/output variable

$x^{(i)}, y^{(i)}$: $i$'th training feature variable and target variable

$w$: weight(s) (parameter)

$b$: bias (parameter)
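
To make the notation concrete, here is a small NumPy sketch (the toy numbers are made up, not from any particular data set) showing how $m$, $n$, $x^{(i)}$, and $y^{(i)}$ might be laid out in code:

```python
import numpy as np

# Toy data set (made up): m = 4 training examples, n = 2 features each.
X = np.array([[1.0, 3.0],
              [2.0, 1.5],
              [3.0, 4.0],
              [4.0, 2.5]])           # row i is x^(i), shape (m, n)
y = np.array([6.0, 5.0, 10.0, 9.0])  # entry i is y^(i), shape (m,)

m, n = X.shape
print(m, n)            # 4 2
print(X[0], y[0])      # x^(1) and y^(1) (0-indexed in code)
```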


Simple linear regression (one feature)

Model: $f(x) = w \cdot x + b$, where $x$ is the feature variable.

Cost function: $\displaystyle J(w, b) = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right)^2$

Remember that $\hat{y} = f(x)$ and $\hat{y}^{(i)} = f(x^{(i)}) = w x^{(i)} + b$.
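
As a sketch of how this cost might be computed, assuming NumPy and a made-up toy data set where the true relationship is exactly $y = 2x + 1$ (the helper names are just for illustration):

```python
import numpy as np

def predict(x, w, b):
    """f(x) = w*x + b for a scalar or an array of feature values."""
    return w * x + b

def cost(x, y, w, b):
    """J(w, b) = (1 / 2m) * sum((y_hat - y)^2)."""
    m = len(x)
    y_hat = predict(x, w, b)
    return np.sum((y_hat - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0, 4.0])   # one feature per example
y = np.array([3.0, 5.0, 7.0, 9.0])   # targets: exactly y = 2x + 1
print(cost(x, y, w=2.0, b=1.0))      # 0.0 at the true parameters
print(cost(x, y, w=0.0, b=0.0))      # larger for a bad guess
```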

Gradient (partial derivatives) of cost function:

$\displaystyle \frac{\partial}{\partial w} J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} = \frac{1}{m} \sum_{i=1}^m \left( f(x^{(i)}) - y^{(i)} \right) x^{(i)}$

$\displaystyle \frac{\partial}{\partial b} J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) = \frac{1}{m} \sum_{i=1}^m \left( f(x^{(i)}) - y^{(i)} \right)$
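
A small sketch of these two partial derivatives in NumPy (the function name `gradients` and the toy arrays are illustrative, not from the notes):

```python
import numpy as np

def gradients(x, y, w, b):
    """Return (dJ/dw, dJ/db) for the one-feature cost J(w, b)."""
    m = len(x)
    err = (w * x + b) - y            # y_hat - y for every example
    dj_dw = np.sum(err * x) / m      # (1/m) * sum(err * x)
    dj_db = np.sum(err) / m          # (1/m) * sum(err)
    return dj_dw, dj_db

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradients(x, y, w=2.0, b=1.0))   # (0.0, 0.0) at the minimum
```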

Gradient descent algorithm:

Initialize $w$ and $b$ randomly.

Run until convergence:

$\displaystyle w = w - \alpha \cdot \frac{\partial}{\partial w} J(w, b)$

$\displaystyle b = b - \alpha \cdot \frac{\partial}{\partial b} J(w, b)$
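
Putting the pieces together, a minimal NumPy sketch of the whole loop (toy data again; the learning rate $\alpha$ and the fixed iteration count stand in for a real convergence check):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=10_000):
    """Fit f(x) = w*x + b by gradient descent on J(w, b)."""
    m = len(x)
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=2)            # random initialization
    for _ in range(iters):
        err = (w * x + b) - y            # y_hat - y
        dj_dw = np.sum(err * x) / m
        dj_db = np.sum(err) / m
        # Update w and b simultaneously: both gradients are computed
        # from the old parameter values before either is changed.
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])       # generated from y = 2x + 1
print(gradient_descent(x, y))            # approximately (2.0, 1.0)
```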


Multiple linear regression (more than one feature)

Model: $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x} + b$, where $\boldsymbol{x}$ is a vector of features of length $n$, and $\boldsymbol{w}$ is a vector of weights of length $n$.

Note: $\boldsymbol{w} \cdot \boldsymbol{x}$ represents the dot product of $\boldsymbol{w}$ and $\boldsymbol{x}$.

Cost function: $\displaystyle J(\boldsymbol{w}, b) = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right)^2$

Remember that $\hat{y} = f(\boldsymbol{x})$ and $\hat{y}^{(i)} = f(\boldsymbol{x}^{(i)}) = \boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b$.

Gradient (partial derivatives) of cost function:

$\displaystyle \frac{\partial}{\partial w_j} J(\boldsymbol{w}, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \sum_{i=1}^m \left( f(\boldsymbol{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$

$\displaystyle \frac{\partial}{\partial b} J(\boldsymbol{w}, b) = \frac{1}{m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right) = \frac{1}{m} \sum_{i=1}^m \left( f(\boldsymbol{x}^{(i)}) - y^{(i)} \right)$
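
A vectorized sketch of these partial derivatives, assuming the training examples are stored row-wise in an $(m, n)$ NumPy array `X` (names and toy values are illustrative):

```python
import numpy as np

def gradients(X, y, w, b):
    """dJ/dw_j for every j at once, plus dJ/db; X has shape (m, n)."""
    m = X.shape[0]
    err = X @ w + b - y      # y_hat - y, shape (m,)
    dj_dw = X.T @ err / m    # entry j is (1/m) * sum_i err_i * x_j^(i)
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])  # toy features
y = np.array([8.0, 6.0, 8.0])                       # toy targets
print(gradients(X, y, w=np.zeros(2), b=0.0))
```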

Gradient descent algorithm:

Initialize $\boldsymbol{w}$ and $b$ randomly.

Run until convergence:

$\displaystyle w_1 = w_1 - \alpha \cdot \frac{\partial}{\partial w_1} J(\boldsymbol{w}, b)$

etc., for each additional component of $\boldsymbol{w}$

$\displaystyle b = b - \alpha \cdot \frac{\partial}{\partial b} J(\boldsymbol{w}, b)$
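
A sketch of the full loop for the multi-feature case (toy data generated from known parameters so the result can be eyeballed; all names are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=10_000):
    """Fit f(x) = w . x + b by gradient descent; X has shape (m, n)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    w = rng.normal(size=n)
    b = rng.normal()
    for _ in range(iters):
        err = X @ w + b - y               # y_hat - y, using old w and b
        w -= alpha * (X.T @ err) / m      # update every w_j at once
        b -= alpha * np.sum(err) / m
    return w, b

# Toy data generated from w = [1, 2], b = 3 so the answer is known.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 2.5]])
y = X @ np.array([1.0, 2.0]) + 3.0
print(gradient_descent(X, y))             # approximately [1. 2.], 3.0
```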


Multiple linear regression with matrices

We have $m$ training examples and $n$ features, so each training example $\boldsymbol{x}$ is a vector of length $n$.

Create an $m$-by-$n$ matrix called $X$, where each row of $X$ is a training example:

$X = \underset{m \times n}{\begin{bmatrix} \longleftarrow & (\boldsymbol{x}^{(1)})^T & \longrightarrow \\ \longleftarrow & (\boldsymbol{x}^{(2)})^T & \longrightarrow \\ & \vdots & \\ \longleftarrow & (\boldsymbol{x}^{(m)})^T & \longrightarrow \end{bmatrix}} = \underset{m \times n}{\begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_n \\ x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_n \\ \vdots & \vdots & & \vdots \\ x^{(m)}_1 & x^{(m)}_2 & \ldots & x^{(m)}_n \end{bmatrix}}$

Now augment $X$ with a column of 1's at the far-left edge:

$X = \underset{m \times (n+1)}{\begin{bmatrix} 1 & \longleftarrow & (\boldsymbol{x}^{(1)})^T & \longrightarrow \\ 1 & \longleftarrow & (\boldsymbol{x}^{(2)})^T & \longrightarrow \\ \vdots & & \vdots & \\ 1 & \longleftarrow & (\boldsymbol{x}^{(m)})^T & \longrightarrow \end{bmatrix}} = \underset{m \times (n+1)}{\begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x^{(m)}_1 & x^{(m)}_2 & \ldots & x^{(m)}_n \end{bmatrix}}$

So now $X$ is an $m$-by-$(n+1)$ matrix ($m$ rows and $n+1$ columns).
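
A quick NumPy sketch of the augmentation step (the numbers are arbitrary):

```python
import numpy as np

# Hypothetical feature matrix: m = 3 examples, n = 2 features.
X = np.array([[2.0, 5.0],
              [1.0, 3.0],
              [4.0, 6.0]])

# Prepend a column of 1's, giving an m-by-(n+1) design matrix.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_aug.shape)   # (3, 3), i.e. m rows and n + 1 columns
print(X_aug)
```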

Similarly, place the target values $y^{(1)}, y^{(2)}, \ldots, y^{(m)}$ into a column vector $\boldsymbol{y}$ of length $m$.

Model: $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$, where $\boldsymbol{x}$ is a vector of features of length $n+1$ in which $x_0$ is always 1, and $\boldsymbol{w}$ is a vector of weights of length $n+1$.

Note: $w_0$ takes over the role of what the parameter $b$ used to be.

Cost function: $\displaystyle J(\boldsymbol{w}) = \frac{1}{2m} \sum_{i=1}^m \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^m \left( f(\boldsymbol{x}^{(i)}) - y^{(i)} \right)^2$, but we can rewrite this:

First note that $X\boldsymbol{w} = \boldsymbol{\hat{y}}$ (in other words, we can predict all the target values at once).

So $\boldsymbol{\hat{y}} - \boldsymbol{y} = X\boldsymbol{w} - \boldsymbol{y}$

So $\displaystyle J(\boldsymbol{w}) = \frac{1}{2m}(\boldsymbol{\hat{y}} - \boldsymbol{y})^T(\boldsymbol{\hat{y}} - \boldsymbol{y}) = \frac{1}{2m}(X\boldsymbol{w} - \boldsymbol{y})^T(X\boldsymbol{w} - \boldsymbol{y})$
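
A small sketch checking that the matrix form and the summation form of $J(\boldsymbol{w})$ agree, using random made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # stand-in augmented design matrix, m=5, n+1=3
y = rng.normal(size=5)        # stand-in targets
w = rng.normal(size=3)        # stand-in weights
m = X.shape[0]

r = X @ w - y                                 # residual vector y_hat - y
J_matrix = (r @ r) / (2 * m)                  # (1/2m) (Xw - y)^T (Xw - y)
J_sum = np.sum((X @ w - y) ** 2) / (2 * m)    # original summation form
print(np.isclose(J_matrix, J_sum))            # True
```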

Gradient of cost function (in matrix form):

$\displaystyle \nabla J(\boldsymbol{w}) = \frac{1}{m}\left(X^T X \boldsymbol{w} - X^T \boldsymbol{y}\right) = \frac{1}{m} X^T (X\boldsymbol{w} - \boldsymbol{y})$

Note: the computation above results in an $(n+1)$-by-1 matrix, or a column vector of length $n+1$, which is the same size as $\boldsymbol{w}$.
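
A one-line NumPy version of this gradient, with a shape check to confirm it matches the size of $\boldsymbol{w}$ (random toy data again):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # m = 5 examples, n + 1 = 3 columns
y = rng.normal(size=5)
w = rng.normal(size=3)

grad = X.T @ (X @ w - y) / X.shape[0]   # (1/m) * X^T (Xw - y)
print(grad.shape)                       # (3,), the same size as w
```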

Gradient descent algorithm:

Initialize $\boldsymbol{w}$ randomly.

Run until convergence:

$\displaystyle \boldsymbol{w} = \boldsymbol{w} - \alpha \cdot \nabla J(\boldsymbol{w})$
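
The whole algorithm collapses to a few lines once $X$ carries the bias column. A sketch, with toy data generated from known parameters so the result can be checked:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=10_000):
    """Minimize J(w) by gradient descent; X is the augmented (m, n+1) matrix."""
    m, cols = X.shape
    rng = np.random.default_rng(0)
    w = rng.normal(size=cols)                  # random init, length n+1
    for _ in range(iters):
        w -= alpha * (X.T @ (X @ w - y)) / m   # w = w - alpha * grad J(w)
    return w

# Toy data generated from w = [3, 1, 2] (bias 3, then two feature weights).
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 2.5]])])
y = X @ np.array([3.0, 1.0, 2.0])
print(gradient_descent(X, y))                  # approximately [3. 1. 2.]
```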


Normal equation

For linear regression it is possible to obtain the exact solution that minimizes the cost function $J$ directly. We can use matrices to do it:

$\boldsymbol{w} = (X^T X)^{-1} X^T \boldsymbol{y}$
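
This assumes $X^T X$ is invertible. Below is a NumPy sketch of the direct translation, alongside the numerically preferable alternatives of solving the linear system or calling a least-squares routine rather than forming the inverse explicitly (the toy data is made up):

```python
import numpy as np

# Augmented design matrix (first column of 1's) and targets; toy numbers
# generated from w = [3, 1, 2] so the recovered solution is known.
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 2.5]])])
y = X @ np.array([3.0, 1.0, 2.0])

# Direct translation of w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice, solving (X^T X) w = X^T y, or using a least-squares routine
# on X itself, is preferred over forming the inverse explicitly.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_solve, w_lstsq)   # each approximately [3. 1. 2.]
```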