y^: predicted target/output variable, always between 0 and 1
x(i), y(i): the i’th training example’s feature vector and target value
w: weights (parameters), a vector of size n+1
For vectorization, we also have:
X: matrix of input feature values, size m×(n+1), including a column of all ones on the far left for the bias term/feature (see the sketch after this list).
y: vector of true target labels, length m
y^: vector of predicted labels, length m
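As a concrete illustration of these shapes, here is a minimal NumPy sketch (the variable names X_raw and m are ours, and the values are arbitrary) of building X with the leading column of ones:

import numpy as np

# Raw features: m = 4 examples, n = 2 features each (arbitrary values).
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])
m = X_raw.shape[0]

# Prepend a column of ones for the bias term/feature, giving X of shape (m, n + 1).
X = np.hstack([np.ones((m, 1)), X_raw])
print(X.shape)  # (4, 3)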
Model:
Define z=w⋅x, and
f(x)=g(z), where
g(z) is the sigmoid or logistic function, defined as
g(z) = 1/(1 + e^(−z)) = 1/(1 + e^(−w⋅x))
We sometimes use the notation σ(⋅) to refer to the sigmoid or logistic function, rather than g (for reasons that will become clear once we study neural networks).
Note that our model, f(x), always returns a number between 0 and 1, which we interpret as the probability of x belonging to class 1 (the positive class).
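A minimal sketch of this model for a single example, assuming NumPy (the helper names sigmoid and f, and the example values, are ours):

import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1/(1 + e^(-z)): maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def f(x, w):
    # f(x) = g(w . x): the predicted probability that x belongs to class 1.
    return sigmoid(np.dot(w, x))

# Example: x has a leading 1 for the bias feature (values are arbitrary).
x = np.array([1.0, 2.0, 0.5])
w = np.array([-1.0, 0.8, 0.3])
print(f(x, w))  # ~0.68, always a number strictly between 0 and 1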
Loss function:
We define the loss function (a function that calculates the loss on a single training example) as:
L(y^,y) = −log(y^)      if y=1
L(y^,y) = −log(1−y^)    if y=0
where, as always, y^=f(x), and “log” is the natural logarithm.
This can be written as a single equation as
L(y^,y)=−ylog(y^)−(1−y)log(1−y^)
Note that these two expressions for the loss function are equivalent: because y is always 1 or 0, either y=0 or 1−y=0, so one of the two logarithms in the formula is multiplied by zero, effectively removing it from the formula.
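As a sketch, the single-equation form translates directly into NumPy (the function name loss is ours; "log" is the natural logarithm, np.log):

import numpy as np

def loss(y_hat, y):
    # L(y_hat, y) = -y*log(y_hat) - (1 - y)*log(1 - y_hat)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(loss(0.9, 1))  # ~0.105: confident and correct, small loss
print(loss(0.9, 0))  # ~2.303: confident but wrong, large loss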
Cost function:
The cost function still calculates the average loss over all training examples:
Define y^=g(Xw)=σ(Xw) to be a vector of predictions for the feature matrix X. Note how this computation takes the dot product of each row of X (a feature vector) with the weight vector, and then applies the sigmoid function to it.
Then the cost function can be calculated as:
J(w) = −(1/m)[y⋅log(y^) + (1−y)⋅log(1−y^)]
Note that log(y^) is a vector where the log function is applied element-wise to y^; similarly for 1−y and log(1−y^). This can be done easily in NumPy and other libraries with broadcasting.
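Putting this together, a minimal NumPy sketch of the vectorized predictions and cost (the function and variable names are ours; sigmoid is the same helper as in the model sketch above, repeated so this block stands alone):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w):
    m = X.shape[0]
    # X @ w dots each row of X with w; sigmoid is then applied element-wise.
    y_hat = sigmoid(X @ w)
    # J(w) = -(1/m)[ y . log(y_hat) + (1 - y) . log(1 - y_hat) ]
    # np.log applies element-wise; 1 - y and 1 - y_hat are computed by broadcasting.
    return -(np.dot(y, np.log(y_hat)) + np.dot(1 - y, np.log(1 - y_hat))) / m

# Example: m = 4 examples, n = 2 features plus the leading ones column (arbitrary values).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0],
              [1.0, 5.0, 6.0],
              [1.0, 7.0, 8.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([-1.0, 0.8, 0.3])
print(cost(X, y, w))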