Linear Regression Notes
Terminology, variables
n: number of features
m: size of training set
x: feature/input variable
y: true target/output variable
\hat{y}: predicted target/output variable
x^{(i)}, y^{(i)}: features and target value of the i-th training example
w: weight[s] (parameter)
b: bias (parameter)
Simple linear regression (one feature)
Model: f(x)=w⋅x+b, where x is the feature variable.
Cost function: J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
Remember that \hat{y} = f(x) and \hat{y}^{(i)} = f(x^{(i)}) = w x^{(i)} + b.
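For example, with the (made-up) training set (x^{(1)}, y^{(1)}) = (1, 1), (x^{(2)}, y^{(2)}) = (2, 2) and parameters w = 0.5, b = 0, the predictions are \hat{y}^{(1)} = 0.5 and \hat{y}^{(2)} = 1, so J(0.5, 0) = \frac{1}{2 \cdot 2}\left((0.5 - 1)^2 + (1 - 2)^2\right) = \frac{1.25}{4} = 0.3125.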
Gradient (partial derivatives) of cost function:
\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x^{(i)}
\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)
Gradient descent algorithm:
Initialize w and b randomly.
Run until convergence (updating w and b simultaneously):
w = w - \alpha \cdot \frac{\partial J(w,b)}{\partial w}
b = b - \alpha \cdot \frac{\partial J(w,b)}{\partial b}
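A minimal NumPy sketch of this procedure for the one-feature case. The dataset, learning rate, and iteration count below are made up for illustration, and in practice you would monitor J(w,b) to decide when to stop:

```python
import numpy as np

def gradient_descent_simple(x, y, alpha=0.01, num_iters=10_000):
    """Fit f(x) = w*x + b by batch gradient descent (one feature)."""
    m = len(x)
    w, b = 0.0, 0.0                             # initialization (zeros work fine here)
    for _ in range(num_iters):
        y_hat = w * x + b                       # predictions for all m examples
        dw = (1 / m) * np.sum((y_hat - y) * x)  # dJ/dw
        db = (1 / m) * np.sum(y_hat - y)        # dJ/db
        w, b = w - alpha * dw, b - alpha * db   # simultaneous update
    return w, b

# Illustrative data: y is roughly 2x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
w, b = gradient_descent_simple(x, y)
print(w, b)   # should land near the least-squares fit, roughly w ≈ 2, b ≈ 1
```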
Multiple linear regression (more than one feature)
Model: f(x)=w⋅x+b, where x is a vector of features of length n, and w is a vector of weights of length n.
Note: w⋅x represents the dot product of w and x.
Cost function: J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
Remember that \hat{y} = f(x) and \hat{y}^{(i)} = f(x^{(i)}) = w \cdot x^{(i)} + b.
Gradient (partial derivatives) of cost function:
\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)
Gradient descent algorithm:
Initialize w and b randomly.
Run until convergence (updating all parameters simultaneously):
w_1 = w_1 - \alpha \cdot \frac{\partial J(w,b)}{\partial w_1}
... and likewise for each remaining component w_2, \dots, w_n of w
b = b - \alpha \cdot \frac{\partial J(w,b)}{\partial b}
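A sketch of the same procedure with several features, written with an explicit loop over the components of w so it mirrors the per-parameter updates above (data layout and hyperparameters are again illustrative); the matrix form in the next section computes the same thing more compactly:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Fit f(x) = w . x + b; X has shape (m, n) and its rows are training examples."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        y_hat = X @ w + b                 # predictions for all m examples
        err = y_hat - y                   # prediction errors
        # One partial derivative per weight, exactly as in the formulas above.
        dw = np.array([(1 / m) * np.sum(err * X[:, j]) for j in range(n)])
        db = (1 / m) * np.sum(err)
        w, b = w - alpha * dw, b - alpha * db   # simultaneous update
    return w, b

# Tip: when features have very different scales, rescaling them first makes it
# much easier to pick a learning rate alpha that converges.
```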
Multiple linear regression with matrices
We have m training examples and n features, so each training example x is a vector of length n.
Create an m-by-n matrix called X, where each row of X is a training example:
X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \quad (m \times n)
Now augment X with a column of 1's at the far-left edge:
X = \begin{bmatrix} 1 & (x^{(1)})^T \\ 1 & (x^{(2)})^T \\ \vdots & \vdots \\ 1 & (x^{(m)})^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \quad (m \times (n+1))
So now X is an m-by-(n+1) matrix (m rows and n+1 columns).
Similarly, place the target values y^{(1)}, y^{(2)}, \dots, y^{(m)} into a column vector y of length m.
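In NumPy, building X (with the extra column of 1's) and y might look like this; the feature and target values are placeholders:

```python
import numpy as np

# Raw feature matrix: m = 3 training examples (rows), n = 2 features (columns).
X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 2.0],
                  [2400.0, 4.0]])      # placeholder values
y = np.array([400.0, 330.0, 369.0])   # vector of m target values

m = X_raw.shape[0]
X = np.c_[np.ones(m), X_raw]          # prepend a column of 1's -> shape (m, n + 1)
```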
Model: f(x) = w \cdot x, where x is a vector of features of length n+1 with x_0 always equal to 1, and w is a vector of weights of length n+1.
Note: w_0 takes over the role of the parameter b.
Cost function: J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2, but we can rewrite this in matrix form:
First note that X w = \hat{y} (in other words, we can predict all the target values at once).
So \hat{y} - y = X w - y
So J(w) = \frac{1}{2m} (\hat{y} - y)^T (\hat{y} - y) = \frac{1}{2m} (X w - y)^T (X w - y)
Gradient of cost function (in matrix form):
\nabla J(w) = \frac{1}{m} \left( X^T X w - X^T y \right) = \frac{1}{m} X^T (X w - y)
Note: the computation above results in an (n+1)-by-1 matrix, i.e. a column vector of length n+1, which is the same size as w.
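A sketch of the matrix-form cost and gradient in NumPy, assuming X already includes the leading column of 1's as constructed above:

```python
import numpy as np

def cost(X, y, w):
    """J(w) = (1 / 2m) * (Xw - y)^T (Xw - y)."""
    m = X.shape[0]
    err = X @ w - y
    return (err @ err) / (2 * m)

def gradient(X, y, w):
    """grad J(w) = (1 / m) * X^T (Xw - y), a vector of length n + 1 like w."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m
```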
Gradient descent algorithm:
Initialize w randomly.
Run until convergence:
w = w - \alpha \cdot \nabla J(w)
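Putting it together, a vectorized gradient-descent sketch (learning rate and iteration count are illustrative; a real run would track J(w) to check convergence):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Vectorized batch gradient descent; X is m x (n+1) with a leading column of 1's."""
    m = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y) / m   # (1/m) * X^T (Xw - y)
        w = w - alpha * grad
    return w
```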
Normal equation
It is possible to minimize the cost function J for linear regression exactly, in closed form, using matrices:
w = (X^T X)^{-1} X^T y, provided X^T X is invertible.
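A minimal NumPy sketch. Solving the linear system X^T X w = X^T y with np.linalg.solve avoids forming the inverse explicitly, which is generally better numerically:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w; X is m x (n+1) with a leading column of 1's."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# If X^T X is singular (e.g. redundant or linearly dependent features),
# np.linalg.pinv or np.linalg.lstsq gives a least-squares solution instead:
# w = np.linalg.pinv(X) @ y
```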