ML Test Bank — Answer Key

Intro / General Concepts

1. What is the difference between supervised and unsupervised learning? Give an example of each.

In supervised learning, the training data includes both input features and the correct output (target/label) for each example. The algorithm learns to map inputs to outputs. Example: predicting house prices from square footage (we have the actual sale prices to learn from).

In unsupervised learning, the training data has only input features with no correct output labels. The algorithm tries to find structure, patterns, or groupings in the data on its own. Example: clustering customers into groups based on purchasing behavior (no one tells the algorithm what the groups should be).


2. What is the difference between a regression problem and a classification problem?

Regression predicts a continuous numerical output (e.g., a price, a temperature). Classification predicts one of a finite number of discrete categories (e.g., spam/not-spam, cat/dog).

  • (a) Regression — temperature is a continuous number.
  • (b) Classification — yes/no is a binary category.
  • (c) Classification — the letter is one of 26 discrete categories.
  • (d) Regression — minutes is a continuous number.

3. Define the following terms: training set, feature, target, training example.

  • Training set: The collection of data used to train the model.
  • Feature: An input variable used by the model to make predictions (e.g., square footage). Denoted by $x$ (or $x_j$ for individual features in the multiple-feature case).
  • Target: The output variable the model is trying to predict, denoted by $y$.
  • Training example: A single (input, output) pair from the training set, denoted $(x^{(i)}, y^{(i)})$.

4. Using standard ML notation, what do each of the following symbols represent?

  • $m$: The number of training examples in the training set.
  • $x^{(i)}$: The input features of the $i$-th training example.
  • $y^{(i)}$: The target (correct output) for the $i$-th training example.
  • $\hat{y}^{(i)}$: The model’s predicted output for the $i$-th training example.
  • $n$: The number of features.

5. Explain the difference between a model’s parameters and its hyperparameters. Give an example of each from linear regression.

Parameters are the values that the learning algorithm adjusts during training to fit the model to the data. In linear regression, $w$ (weight) and $b$ (bias) are parameters.

Hyperparameters are values set by the human before training begins; they control how the learning algorithm behaves but are not learned from the data. In linear regression, the learning rate $\alpha$ is a hyperparameter. The number of iterations of gradient descent is another example.


Linear Regression

6. Write the model equation for univariate linear regression. What are $w$ and $b$, and what is the goal of the learning algorithm with respect to them?

\[f_{w,b}(x) = wx + b\]

$w$ is the weight (slope) and $b$ is the bias (y-intercept). The goal of the learning algorithm is to find the values of $w$ and $b$ that minimize the cost function $J(w, b)$, which means finding the straight line that best fits the training data.


7. Write the mean squared error cost function $J(w, b)$ for univariate linear regression. Why do we include the $\frac{1}{2}$ factor?

\[J(w, b) = \frac{1}{2m} \sum_{i=1}^m \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2\]

The $\frac{1}{2}$ is a mathematical convenience: when we take the derivative of $J$ (needed for gradient descent), the exponent 2 from the squared term comes down and cancels with the $\frac{1}{2}$, making the derivative cleaner.


8. In your own words, explain what the cost function $J$ measures. Why do we want to minimize it?

The cost function $J$ measures how poorly the model’s predictions match the actual target values across the entire training set. Specifically, it computes the average of the squared differences between each predicted value $\hat{y}^{(i)}$ and the true value $y^{(i)}$. We want to minimize $J$ because a smaller cost means the model’s predictions are closer to the true values — i.e., the model fits the data better.


9. Suppose you have the following three training examples: $(1, 2)$, $(2, 4)$, $(3, 5)$. If your current model is $f(x) = 1.5x + 0.5$, compute the cost $J(w, b)$ by hand.

First compute predictions:

  • $f(1) = 1.5(1) + 0.5 = 2.0$
  • $f(2) = 1.5(2) + 0.5 = 3.5$
  • $f(3) = 1.5(3) + 0.5 = 5.0$

Then compute squared errors:

  • $(2.0 - 2)^2 = 0$
  • $(3.5 - 4)^2 = 0.25$
  • $(5.0 - 5)^2 = 0$

Sum of squared errors = $0 + 0.25 + 0 = 0.25$

\[J = \frac{1}{2(3)}(0.25) = \frac{0.25}{6} \approx 0.0417\]
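This hand computation can be checked with a short NumPy sketch (the function and variable names here are illustrative, not part of the question):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Mean squared error cost J(w, b) = (1/2m) * sum of squared errors."""
    m = len(x)
    errors = (w * x + b) - y              # prediction minus actual, per example
    return np.sum(errors ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
print(compute_cost(x, y, w=1.5, b=0.5))   # 0.25 / 6 ≈ 0.0417
```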

10. Describe the gradient descent algorithm in plain English.

Imagine you are standing on a hilly landscape and want to get to the lowest valley. You look around in all directions, determine which direction is the steepest downhill slope, and take a small step in that direction. From your new position, you repeat: look around, find the steepest downhill direction, take a step. You keep doing this until you reach a point where no step would take you any lower — you’ve reached a (local) minimum. In gradient descent, the “landscape” is the cost function $J$, your “position” is determined by the current parameter values, and each “step” updates the parameters using the partial derivatives of $J$.


11. Write the gradient descent update equations for univariate linear regression. What does each piece represent?

\[w = w - \alpha \cdot \frac{1}{m} \sum_{i=1}^m \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}\]

\[b = b - \alpha \cdot \frac{1}{m} \sum_{i=1}^m \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)\]

  • $\alpha$: the learning rate, controlling step size.
  • $\frac{1}{m} \sum \ldots$: the partial derivative of $J$ with respect to $w$ (or $b$), which tells us the direction and magnitude of the steepest ascent. We subtract it to go downhill.
  • $f_{w,b}(x^{(i)}) - y^{(i)}$: the error (prediction minus actual) for the $i$-th example.
  • The $x^{(i)}$ at the end of the $w$ equation comes from the chain rule when differentiating $J$ with respect to $w$.
  • Both updates happen simultaneously (using the old values of $w$ and $b$).
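A minimal sketch of one simultaneous update (names are illustrative): both partial derivatives are computed from the old parameter values before either parameter changes.

```python
import numpy as np

def gradient_descent_step(x, y, w, b, alpha):
    """One simultaneous gradient descent update for univariate linear regression."""
    m = len(x)
    errors = (w * x + b) - y                  # f_{w,b}(x^(i)) - y^(i)
    dj_dw = np.sum(errors * x) / m            # partial derivative of J wrt w
    dj_db = np.sum(errors) / m                # partial derivative of J wrt b
    # Both updates use the old w and b; neither sees the other's new value.
    return w - alpha * dj_dw, b - alpha * dj_db

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
w, b = gradient_descent_step(x, y, w=0.0, b=0.0, alpha=0.1)
```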

12. What is the learning rate $\alpha$? What happens if it is too large or too small?

The learning rate $\alpha$ is a small positive number that controls the size of each gradient descent step.

  • Too large: Gradient descent may overshoot the minimum and diverge — the cost $J$ may increase instead of decrease, and the algorithm fails to converge.
  • Too small: Gradient descent will work correctly but will take many more iterations to converge, making training unnecessarily slow.

13. What is a learning curve (in the context of monitoring gradient descent)?

A learning curve is a plot with the number of gradient descent iterations on the x-axis and the cost $J$ on the y-axis. It shows how the cost changes over time.

  • Good learning curve: Starts high and decreases smoothly, eventually leveling off (flattening) at a low value, indicating convergence.
  • Bad learning curve ($\alpha$ too large): The cost oscillates or increases over iterations, indicating divergence.
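Both behaviors can be reproduced with a small sketch; the learning rates are chosen purely for illustration (on this tiny dataset $\alpha = 0.1$ converges while $\alpha = 1.5$ diverges):

```python
import numpy as np

def cost_history(x, y, alpha, iters):
    """Run gradient descent from w = b = 0, recording J after each iteration."""
    w, b, m = 0.0, 0.0, len(x)
    history = []
    for _ in range(iters):
        errors = (w * x + b) - y
        w, b = w - alpha * np.sum(errors * x) / m, b - alpha * np.sum(errors) / m
        history.append(np.sum(((w * x + b) - y) ** 2) / (2 * m))
    return history

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
good = cost_history(x, y, alpha=0.1, iters=50)   # decreases and levels off
bad = cost_history(x, y, alpha=1.5, iters=10)    # overshoots: J grows
```

Plotting `history` against the iteration number gives exactly the learning curve described above.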

14. Explain why gradient descent can converge even with a fixed learning rate $\alpha$.

As gradient descent approaches a minimum, the partial derivatives (gradients) become smaller and smaller because the slope of $J$ becomes flatter near the minimum. Since the step size is $\alpha \times \text{gradient}$, and the gradient is shrinking, the actual steps taken become smaller and smaller even though $\alpha$ stays the same. This naturally causes the algorithm to take smaller steps as it gets close to the minimum and eventually converge.


15. What is the difference between univariate and multiple linear regression?

Univariate linear regression has a single input feature $x$: $f(x) = wx + b$.

Multiple linear regression has $n$ input features: $f(\boldsymbol{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$, which can be written in vector notation as: \(f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x} + b\) where $\boldsymbol{w}$ and $\boldsymbol{x}$ are vectors of length $n$.


16. Explain why we use vectorization rather than explicit Python for-loops. Give at least two reasons.

  1. Computational speed: Vectorized operations (via NumPy) call optimized, compiled C code behind the scenes and can exploit parallel hardware (multiple CPU cores, GPUs), making them dramatically faster than Python loops.
  2. Code simplicity: Vectorized code is more concise and easier to read. For example, np.dot(w, x) + b replaces a multi-line loop.
  3. Mathematical clarity: Vectorized code maps directly onto the mathematical notation, making it easier to verify correctness.
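A small sketch of the contrast, assuming NumPy: both versions compute the same prediction, but the vectorized one is a single call into compiled code.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)          # weight vector, n = 1000 features
x = rng.standard_normal(1000)          # one example's features
b = 0.5

# Explicit Python loop: one multiply-add at a time, interpreted.
f_loop = sum(w[j] * x[j] for j in range(len(w))) + b

# Vectorized: one line that maps directly onto f(x) = w . x + b.
f_vec = np.dot(w, x) + b

assert np.isclose(f_loop, f_vec)
```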

17. In multiple linear regression, we sometimes add a column of 1s to the left of the feature matrix $X$. Why?

Adding a column of 1s creates a “fake” feature $x_0 = 1$ for every training example. This allows us to absorb the bias parameter $b$ into the weight vector $\boldsymbol{w}$ as $w_0$. Since $w_0 \cdot x_0 = w_0 \cdot 1 = w_0$, this plays the role of $b$. The benefit is that the model simplifies to $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$ (just a dot product, no separate $+b$ term), which makes both the math and the code cleaner.
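A sketch of the trick with made-up housing-style numbers:

```python
import numpy as np

X = np.array([[2104.0, 3.0],                    # m = 2 examples, n = 2 features
              [1416.0, 2.0]])
X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the fake feature x_0 = 1

w = np.array([80.0, 0.1, 5.0])                  # w[0] plays the role of b
predictions = X1 @ w                            # f(x) = w . x, no separate + b
print(predictions)                              # [305.4, 231.6]
```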


18. What is the normal equation? Write it.

The normal equation gives the exact closed-form solution for the parameters that minimize $J$: \(\boldsymbol{w} = (X^T X)^{-1} X^T \boldsymbol{y}\)

Prefer the normal equation when the number of features $n$ is not too large, because it gives the exact answer in one step with no need to choose $\alpha$ or iterate.

Prefer gradient descent when $n$ is very large, because the normal equation requires computing $(X^T X)^{-1}$, which involves inverting an $(n+1) \times (n+1)$ matrix, and this becomes computationally expensive for large $n$. Gradient descent also generalizes to other models (logistic regression, neural networks) where no closed-form solution exists.
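A sketch with a toy dataset where $y = 2x + 1$ exactly; in code, `np.linalg.solve` is generally preferred over forming the inverse explicitly, since it is faster and more numerically stable:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])             # first column of 1s absorbs the bias
y = np.array([3.0, 5.0, 7.0])          # y = 2x + 1

# Normal equation: w = (X^T X)^{-1} X^T y, solved as a linear system.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                               # [1.0, 2.0]  (bias, slope)
```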


19. Suppose $x_1$ ranges from 0 to 1000 and $x_2$ ranges from 0 to 5. Why might this cause problems?

Features with very different scales can cause gradient descent to “bounce around” inefficiently because the cost function surface becomes elongated (like a narrow valley). Gradient descent takes a long time to navigate this, and may require a very small learning rate.

Two methods to address this:

  1. Mean normalization: Replace each feature with $\frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$, centering features around 0 so they fall approximately in the range $[-1, 1]$.
  2. Z-score normalization: Replace each feature with $\frac{x_j - \mu_j}{\sigma_j}$, where $\mu_j$ is the mean and $\sigma_j$ is the standard deviation. This accounts for the distribution/spread of the feature values.
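A sketch of z-score normalization on the two features above (three made-up rows):

```python
import numpy as np

X = np.array([[   0.0, 0.0],
              [ 500.0, 2.5],
              [1000.0, 5.0]])          # x1 spans 0-1000, x2 spans 0-5

mu = X.mean(axis=0)                    # per-feature mean
sigma = X.std(axis=0)                  # per-feature standard deviation
X_norm = (X - mu) / sigma              # z-score normalization

# After scaling, every column has mean 0 and standard deviation 1.
```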

20. Explain what feature engineering is. Give an example.

Feature engineering is the process of creating new features from existing ones using domain knowledge or intuition about the problem.

Example: If a housing dataset has “frontage” (width of the plot) and “depth” (length of the plot), you might create a new feature “area” = frontage $\times$ depth. This is useful because the total area of the lot likely has a stronger relationship with price than either dimension alone, and linear regression cannot discover multiplicative relationships between features on its own (since it can only compute linear combinations).


21. How can you use linear regression to fit a polynomial model?

You can add higher powers of existing features as new features. For example, if you have feature $x$, you can add $x^2$, $x^3$, etc. as additional features. Then linear regression fits a model of the form $f(x) = w_1 x + w_2 x^2 + w_3 x^3 + \cdots + b$, which is a polynomial curve even though the algorithm is still computing a linear combination of features.

Feature scaling becomes especially important because squaring and cubing features drastically changes their ranges. If $x$ ranges from 0 to 1000, then $x^2$ ranges from 0 to 1,000,000 and $x^3$ from 0 to 1,000,000,000. Without scaling, gradient descent will struggle with these wildly different magnitudes.
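A sketch combining both steps, with illustrative values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Engineer polynomial features as new columns, then z-score scale each column
# so x, x^2, and x^3 end up with comparable ranges for gradient descent.
X_poly = np.column_stack([x, x ** 2, x ** 3])
X_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)
```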


22. Your learning curve shows cost $J$ increasing over iterations. What is the most likely cause?

The most likely cause is that the learning rate $\alpha$ is too large. Gradient descent is overshooting the minimum on each step, causing the cost to increase. The fix is to decrease $\alpha$ (e.g., by a factor of 10) and try again.


23. True or false: “For linear regression with the squared error cost function, gradient descent can get stuck in a local minimum that is not the global minimum.”

False. The squared error cost function for linear regression is convex (bowl-shaped), meaning it has exactly one minimum, which is the global minimum. There are no local minima to get stuck in. Any run of gradient descent (with a suitable learning rate) will converge to this single global minimum.


24. Code interpretation.

  • X @ w computes the matrix-vector product $X\boldsymbol{w}$. This produces a vector of length $m$ where each entry is $\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}$ — the model’s prediction $\hat{y}^{(i)}$ for each training example.

  • errors.T @ errors computes the dot product of the errors vector with itself: $\boldsymbol{e}^T \boldsymbol{e} = \sum_{i=1}^m e_i^2$. This gives the sum of squared errors. Taking the dot product of a vector with itself is a compact way to compute the sum of squares of its entries, which is exactly what the MSE cost function needs.


25. After 1000 iterations cost is 45.2, after 1001 it’s 45.19. Has gradient descent converged?

It is close to converging but has likely not fully converged yet. The cost is still decreasing (by 0.01 per iteration), which means gradient descent is still making progress. However, the change is small, which suggests it is approaching convergence. You should continue running gradient descent and monitor the learning curve. Convergence is when the change in $J$ becomes negligibly small (e.g., below some small threshold $\epsilon$, like $0.0000001$), or when the learning curve has clearly leveled off.


Logistic Regression

26. Why can’t we simply use linear regression for classification?

Linear regression outputs any real number, but classification requires predicting discrete categories (e.g., 0 or 1). A linear model could predict values less than 0 or greater than 1, which don’t make sense as class labels or probabilities. Additionally, if we use the squared error cost function with the logistic model, the resulting cost surface is non-convex (has multiple local minima), which means gradient descent may not find the global minimum.


27. Write the sigmoid function. What is its output range?

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
  • Output range: $(0, 1)$ — always strictly between 0 and 1.
  • $\sigma(0) = 0.5$
  • $\sigma(10) \approx 0.9999546 \approx 1$ (very close to 1)
  • $\sigma(-10) \approx 0.0000454 \approx 0$ (very close to 0)
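The three values above can be verified directly:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5
print(sigmoid(10))    # ≈ 0.9999546
print(sigmoid(-10))   # ≈ 0.0000454
```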

28. Write the complete model equation for logistic regression. How does it relate to linear regression?

\[f_{\boldsymbol{w}}(\boldsymbol{x}) = \sigma(\boldsymbol{w} \cdot \boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}\]

Logistic regression takes the linear regression model ($\boldsymbol{w} \cdot \boldsymbol{x}$, which can output any real number) and passes it through the sigmoid function, which squashes the output to be between 0 and 1. In other words, logistic regression wraps the linear model inside a non-linear activation function.


29. How do we interpret the output of a logistic regression model?

The output is interpreted as the probability that the input belongs to the positive class (class 1). If the model outputs 0.73 for a particular input, that means the model estimates a 73% probability that this example belongs to class 1 (and a 27% probability it belongs to class 0).


30. Decision boundary for $w_0 = -4$, $w_1 = 1$, $w_2 = 1$.

The decision boundary is the set of points where $z = \boldsymbol{w} \cdot \boldsymbol{x} = 0$, which is where the model is exactly 50/50 between predicting class 0 and class 1.

With $w_0 = -4$, $w_1 = 1$, $w_2 = 1$ (and $x_0 = 1$):

\[z = w_0(1) + w_1 x_1 + w_2 x_2 = -4 + x_1 + x_2 = 0 \quad\Longrightarrow\quad x_1 + x_2 = 4\]

This is a straight line in the $x_1$-$x_2$ plane. On one side of the line (where $x_1 + x_2 > 4$), the model predicts class 1. On the other side (where $x_1 + x_2 < 4$), it predicts class 0.


31. What shape is the decision boundary without polynomial features? How to get a non-linear one?

Without polynomial features, the decision boundary is always a straight line (or hyperplane in higher dimensions), because $\boldsymbol{w} \cdot \boldsymbol{x} = 0$ is a linear equation.

To get a non-linear decision boundary (e.g., circular), you can add polynomial features like $x_1^2$ and $x_2^2$. For example, with features $x_1^2$ and $x_2^2$ and weights $w_0 = -1$, $w_1 = 1$, $w_2 = 1$, the decision boundary becomes $x_1^2 + x_2^2 = 1$, which is a circle.


32. Why not use MSE for logistic regression?

If we plug the logistic regression model $f(\boldsymbol{x}) = \sigma(\boldsymbol{w} \cdot \boldsymbol{x})$ into the squared error cost function, the resulting cost surface $J(\boldsymbol{w})$ is non-convex — it has many local minima. This means gradient descent could get stuck in a local minimum and fail to find the best parameters. We need a different cost function (the cross-entropy loss) that, when combined with the logistic model, produces a convex cost surface with a single global minimum.


33. Write the piecewise loss function for logistic regression.

\[L(\hat{y}, y) = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases}\]

When $y = 1$: We use $-\log(\hat{y})$. If the model correctly predicts $\hat{y} \approx 1$, then $-\log(1) = 0$ (no penalty). If the model incorrectly predicts $\hat{y} \approx 0$, then $-\log(0) \to \infty$ (huge penalty).

When $y = 0$: We use $-\log(1 - \hat{y})$. If the model correctly predicts $\hat{y} \approx 0$, then $-\log(1) = 0$ (no penalty). If the model incorrectly predicts $\hat{y} \approx 1$, then $-\log(0) \to \infty$ (huge penalty).

In both cases, the loss is 0 for a perfect prediction and increases without bound as the prediction moves further from the correct answer.


34. Write the simplified cross-entropy loss and verify equivalence.

\[L(\hat{y}, y) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})\]

Verification with $y = 1$: $L = -(1)\log(\hat{y}) - (1-1)\log(1-\hat{y}) = -\log(\hat{y})$. This matches the top piece of the piecewise definition.

Verification with $y = 0$: $L = -(0)\log(\hat{y}) - (1-0)\log(1-\hat{y}) = -\log(1-\hat{y})$. This matches the bottom piece.
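The equivalence can also be checked numerically (a minimal sketch):

```python
import numpy as np

def loss_piecewise(y_hat, y):
    """Piecewise cross-entropy loss from the previous question."""
    return -np.log(y_hat) if y == 1 else -np.log(1 - y_hat)

def loss_simplified(y_hat, y):
    """Single-formula version: -y log(y_hat) - (1 - y) log(1 - y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

for y in (0, 1):
    for y_hat in (0.1, 0.5, 0.9):
        assert np.isclose(loss_piecewise(y_hat, y), loss_simplified(y_hat, y))
```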


35. Why are gradient descent for linear and logistic regression different despite the same-looking update equation?

The update equation has the same algebraic form, but the function $f$ inside is different:

  • In linear regression: $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$ (a linear function).
  • In logistic regression: $f(\boldsymbol{x}) = \sigma(\boldsymbol{w} \cdot \boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$ (the sigmoid of a linear function).

Because $f$ is computed differently, the error term $f(\boldsymbol{x}^{(i)}) - y^{(i)}$ evaluates to different numbers, producing different gradients and different parameter updates. The identical algebraic form falls out of the calculus when each model is paired with its cost function; it is not an indication that the algorithms are the same.


36. Logistic regression with $w_0 = -2$, $w_1 = 3$, $w_2 = -1$; input $x_1 = 1$, $x_2 = 0$.

(a) $z = w_0(1) + w_1(1) + w_2(0) = -2 + 3 + 0 = 1$

(b) $\sigma(1) = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 1/e} = \frac{e}{e + 1} \approx \frac{2.718}{3.718} \approx 0.731$

(c) Since $\sigma(z) \approx 0.731 \geq 0.5$, the model predicts class 1.
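The three steps, verified in code:

```python
import numpy as np

w = np.array([-2.0, 3.0, -1.0])        # [w0, w1, w2]; w0 is the bias
x = np.array([1.0, 1.0, 0.0])          # [x0 = 1, x1 = 1, x2 = 0]

z = np.dot(w, x)                       # (a) z = 1.0
p = 1.0 / (1.0 + np.exp(-z))           # (b) sigma(1) ≈ 0.731
prediction = int(p >= 0.5)             # (c) threshold at 0.5 -> class 1
```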


37. What is multinomial logistic regression? When would you use it?

Multinomial logistic regression (also called softmax regression) extends binary logistic regression to handle multi-class classification problems — problems with more than two classes. Instead of using the sigmoid function (which outputs a single probability), it uses the softmax function, which outputs a probability distribution over all $K$ classes. You would use it when your target variable has more than two categories (e.g., predicting a grade of A, B, or C; classifying images into dog, cat, or bird).


38. What is a one-hot vector? Give an example with 4 classes, correct class is 3.

A one-hot vector is a vector of 0s and 1s where exactly one entry is 1 (indicating the correct class) and all other entries are 0.

For 4 classes where the correct class is class 3: $[0, 0, 1, 0]$

(The 1 is in the 3rd position, corresponding to class 3.)


39. Write the softmax function. Describe what it does to $z = [1, 2, 3]$.

\[\text{softmax}(\boldsymbol{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\]

Applied to $z = [1, 2, 3]$: The softmax exponentiates each element ($e^1, e^2, e^3$), then divides each by the sum of all three exponentials. The result is a vector of three positive numbers that sum to 1, forming a probability distribution. Because 3 is the largest input, its corresponding output will have the highest probability. The output would be approximately $[0.09, 0.24, 0.67]$. The softmax function preserves the relative ordering of the inputs but converts them into a valid probability distribution.
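A sketch of softmax; subtracting $\max(\boldsymbol{z})$ before exponentiating is a standard trick to avoid numerical overflow and does not change the result:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(z - np.max(z))          # shift by max(z) for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p)                               # ≈ [0.09, 0.24, 0.67]
print(p.sum())                         # 1.0
```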


40. How many weight vectors for $n$ features and $K$ classes? Dimensions of $W$?

We have $K$ weight vectors, one per class. Each weight vector has length $n + 1$ (one weight per feature, plus $w_0$ for the bias).

The weight matrix $W$ has dimensions $(n+1) \times K$ — rows correspond to features (including the bias feature) and columns correspond to classes.


41. Explain the cross-entropy loss for multinomial logistic regression.

\[L(\boldsymbol{\hat{y}}, \boldsymbol{y}) = -\sum_{k=1}^K y_k \log \hat{y}_k\]

Most terms equal zero because $\boldsymbol{y}$ is a one-hot vector: only the entry $y_c$ corresponding to the correct class $c$ is 1; all other $y_k = 0$. When $y_k = 0$, the term $y_k \log \hat{y}_k = 0$, so the entire sum collapses to a single term:

\[L = -\log \hat{y}_c\]

This means the loss is determined solely by how much probability the model assigns to the correct class. If $\hat{y}_c \approx 1$, then $-\log(1) \approx 0$ (low loss). If $\hat{y}_c \approx 0$, then $-\log(0) \to \infty$ (very high loss).
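A numerical sketch of this collapse (the prediction vector is made up):

```python
import numpy as np

y = np.array([0.0, 0.0, 1.0, 0.0])        # one-hot: correct class is class 3
y_hat = np.array([0.1, 0.1, 0.7, 0.1])    # model's predicted distribution

loss_full = -np.sum(y * np.log(y_hat))    # full cross-entropy sum over K classes
loss_single = -np.log(y_hat[2])           # the one surviving term, -log(y_hat_c)

assert np.isclose(loss_full, loss_single)
```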


42. True or false: “Logistic regression outputs are always exactly 0 or 1.”

False. The logistic regression model outputs a number between 0 and 1 (exclusive), which is interpreted as a probability. The output is never exactly 0 or 1 (the sigmoid function approaches but never reaches these values). We apply a threshold (typically 0.5) to convert this probability into a class prediction of 0 or 1, but the raw model output is a continuous value.


43. Code interpretation: shape of output and meaning.

  • X @ w multiplies a $(100, 5)$ matrix by a $(5, 1)$ vector, producing a $(100, 1)$ vector.
  • sigmoid(z) applies the sigmoid function element-wise, so the output shape is (100, 1).
  • Each element of the output represents the model’s predicted probability that the corresponding training example belongs to class 1 (the positive class).

Neural Networks (Conceptual)

44. What is a “neuron” in a neural network?

A neuron (or unit/node) is the basic computational element of a neural network. It receives a vector of inputs $\boldsymbol{x}$, computes a weighted sum plus bias $z = \boldsymbol{w} \cdot \boldsymbol{x} + b$, and then passes the result through an activation function $g$ to produce an output: $a = g(z) = g(\boldsymbol{w} \cdot \boldsymbol{x} + b)$. Each neuron has its own set of weights $\boldsymbol{w}$ and bias $b$.
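A single neuron as code (a minimal sketch with illustrative weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, g):
    """One unit: weighted sum plus bias, passed through activation g."""
    return g(np.dot(w, x) + b)

# z = 0.5*1 + (-0.25)*2 + 0 = 0, so the sigmoid output is exactly 0.5.
a = neuron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), b=0.0, g=sigmoid)
```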


45. What are the three types of layers? Briefly describe each.

  1. Input layer (layer 0): Contains the input features $\boldsymbol{x}$. No computation happens here; it just passes the data to the next layer.
  2. Hidden layer(s): The intermediate layer(s) between input and output. Each neuron receives inputs from the previous layer, performs its computation ($z$ then activation), and passes its output to the next layer. There can be one or many hidden layers.
  3. Output layer: The final layer that produces the network’s prediction. Its size and activation function depend on the task (e.g., 1 neuron with sigmoid for binary classification, $K$ neurons with softmax for $K$-class classification, 1 neuron with identity for regression).

46. Why are the middle layers called “hidden” layers?

They are called “hidden” because we can directly observe the values at the input layer (the training features $\boldsymbol{x}$) and the correct values at the output layer (the training targets $\boldsymbol{y}$), but the training data never specifies what the “correct” values at the middle layers should be. The network determines these intermediate values on its own during training; in that sense they are hidden from the training data.


47. How is a single neuron related to logistic regression?

A single neuron with the sigmoid activation function computes exactly the same function as logistic regression: $a = \sigma(\boldsymbol{w} \cdot \boldsymbol{x} + b)$. A neural network can be thought of as many interconnected logistic regression units (when using sigmoid activations), arranged in layers so that the outputs of one set of units become the inputs to the next.


48. Name three common activation functions.

  1. Identity (linear): $g(z) = z$. Typically used in the output layer for regression tasks, where we want the network to output any real number.

  2. Sigmoid: $g(z) = \frac{1}{1+e^{-z}}$. Outputs between 0 and 1. Used in the output layer for binary classification. Was historically used in hidden layers but has been largely replaced by ReLU.

  3. ReLU (Rectified Linear Unit): $g(z) = \max(0, z)$. Outputs 0 for negative inputs and $z$ for positive inputs. The most commonly used activation function for hidden layers in modern neural networks because it is simple to compute and helps with training efficiency.


49. What activation function for the output layer for each task?

  • (a) Regression: Identity function $g(z) = z$ (no activation / linear), because the output should be any real number.
  • (b) Binary classification: Sigmoid $g(z) = \frac{1}{1+e^{-z}}$, because we want a probability between 0 and 1.
  • (c) Multi-class classification: Softmax, because we want a probability distribution over $K$ classes (all outputs between 0 and 1, summing to 1).

50. Explain forward propagation step by step for a network with one hidden layer.

  1. Start with the input features: set $\boldsymbol{a}^{[0]} = \boldsymbol{x}$.
  2. Compute hidden layer: For each neuron $j$ in the hidden layer, compute $z_j^{[1]} = \boldsymbol{w}_j^{[1]} \cdot \boldsymbol{a}^{[0]} + b_j^{[1]}$, then apply the activation function: $a_j^{[1]} = g(z_j^{[1]})$. Collect all these into vector $\boldsymbol{a}^{[1]}$.
  3. Compute output layer: For each neuron $j$ in the output layer, compute $z_j^{[2]} = \boldsymbol{w}_j^{[2]} \cdot \boldsymbol{a}^{[1]} + b_j^{[2]}$, then apply the (possibly different) output activation function: $a_j^{[2]} = g(z_j^{[2]})$. Collect into $\boldsymbol{a}^{[2]}$.
  4. The output of the network is $f(\boldsymbol{x}) = \boldsymbol{a}^{[2]}$.

The key idea is that information flows forward through the network, layer by layer, with each layer’s output becoming the next layer’s input.
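A sketch of forward propagation for a 3-4-1 network (ReLU hidden layer, sigmoid output; the weights here are random placeholders, not trained values):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward propagation: each layer's output is the next layer's input."""
    a1 = relu(x @ W1 + b1)          # hidden layer: z, then activation
    a2 = sigmoid(a1 @ W2 + b2)      # output layer: z, then output activation
    return a2

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(4)  # 3 inputs -> 4 hidden units
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)  # 4 hidden -> 1 output unit
out = forward(np.array([1.0, 2.0, 3.0]), W1, b1, W2, b2)
```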


51. Write the general formula for neuron $j$ in layer $\ell$.

\[a_j^{[\ell]} = g\left(\boldsymbol{a}^{[\ell-1]} \cdot \boldsymbol{w}_j^{[\ell]} + b_j^{[\ell]}\right)\]
  • $a_j^{[\ell]}$: the activation (output) of neuron $j$ in layer $\ell$.
  • $\boldsymbol{a}^{[\ell-1]}$: the vector of activations from the previous layer (the inputs to this layer).
  • $\boldsymbol{w}_j^{[\ell]}$: the weight vector for neuron $j$ in layer $\ell$.
  • $b_j^{[\ell]}$: the bias for neuron $j$ in layer $\ell$.
  • $g$: the activation function.
  • The dot product plus bias gives $z_j^{[\ell]}$, and $g$ is applied to get the final activation $a_j^{[\ell]}$.

52. Why can a neural network learn more complex functions than logistic regression?

Each individual neuron can only learn a linear combination of its inputs (passed through a non-linear activation). However, by layering multiple neurons, the output of one layer of linear-plus-nonlinear computations becomes the input to the next layer. This composition of functions allows the network to learn increasingly complex, non-linear relationships. Additionally, each layer can be thought of as automatically constructing new, more sophisticated features from the previous layer’s features — the network does its own feature engineering. A single logistic regression unit is limited to a linear decision boundary, but a multi-layer network can learn arbitrarily complex decision boundaries.


53. Do we need to decide what each hidden neuron “means” ahead of time?

No. One of the main ideas of neural networks is that we do not need to figure out what the hidden neurons should represent. The network learns the most useful intermediate representations automatically during training. In the T-shirt example, we assigned interpretable meanings (affordability, awareness, perceived quality) for illustration, but in practice, the hidden neurons may learn features that are not easily interpretable by humans — and that’s okay, as long as they help the network make accurate predictions.


54. Dimensions of weight matrices and bias vectors for a 3-4-1 network.

  • (a) $W^{[1]}$: The hidden layer receives 3 inputs and has 4 neurons, so $W^{[1]}$ has dimensions $3 \times 4$.
  • (b) $b^{[1]}$: One bias per neuron in the hidden layer, so $b^{[1]}$ has dimensions $1 \times 4$ (or length 4).
  • (c) $W^{[2]}$: The output layer receives 4 inputs (from the hidden layer) and has 1 neuron, so $W^{[2]}$ has dimensions $4 \times 1$.
  • (d) $b^{[2]}$: One bias per output neuron, so $b^{[2]}$ has dimensions $1 \times 1$ (or length 1, a scalar).

55. Why were GPUs important for deep learning?

GPUs (Graphics Processing Units) contain many simple cores designed to perform large numbers of computations in parallel. They were originally designed for computer graphics, which relies heavily on matrix and vector operations — the same operations at the heart of neural network training (matrix multiplications, dot products, element-wise operations). Because training a neural network involves massive amounts of matrix math applied to large datasets, GPUs can perform this work orders of magnitude faster than CPUs. This speedup (from days/weeks down to hours/minutes) made it practical to train the large, deep networks that drive modern AI.


Regularization

56. Define overfitting and underfitting with concrete examples.

Underfitting is when the model is too simple to capture the patterns in the training data. Example: fitting a straight line (degree-1 polynomial) to data that clearly follows a quadratic curve. The model will have high error on both the training data and new data.

Overfitting is when the model is too complex and captures noise or random fluctuations in the training data rather than the true underlying pattern. Example: fitting a high-degree polynomial (e.g., degree 15) to a small dataset that follows a quadratic curve. The polynomial will pass through or near every training point (very low training error) but will produce wild predictions on new data (high test error) because it’s fitting the noise.


57. List three strategies for reducing overfitting.

  1. Collect more training data: With more data, the model can’t “memorize” individual examples and must learn the true underlying pattern.
  2. Feature selection: Reduce the number of features to only those most relevant to the prediction, removing noisy or irrelevant features that the model might latch onto.
  3. Regularization: Add a penalty term to the cost function that discourages large parameter values, which smooths out the model and reduces its ability to overfit.

58. What is regularization?

Regularization is a technique that modifies the cost function by adding a penalty term based on the magnitude of the model’s parameters (weights). This discourages the learning algorithm from assigning overly large values to the weights. By keeping the weights small, the model becomes simpler/smoother and is less likely to overfit the training data. It allows us to keep all features while preventing any single feature from having an outsized influence on predictions.


59. Write the regularized cost function for linear regression (L2). Identify each component.

\[J(\boldsymbol{w}) = \underbrace{\frac{1}{2m}\sum_{i=1}^m \left(f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)}\right)^2}_{\text{mean squared error (MSE)}} + \underbrace{\frac{\lambda}{2m}\sum_{j=1}^n w_j^2}_{\text{regularization term}}\]
  • Mean squared error (MSE): Measures how well the model fits the training data.
  • Regularization term: Penalizes large weight values. The sum is over $j = 1$ to $n$ (we do not regularize $w_0$).
  • $\lambda$: The regularization parameter that controls the strength of the penalty.
  • The $\frac{1}{2m}$ scaling on the regularization term matches the scaling of the MSE term.
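
The cost function above can be computed directly. This is a minimal pure-Python sketch (the toy data and helper names are mine, not from the text); note that `w[0]` plays the role of $w_0$, paired with a constant feature $x_0 = 1$, and is excluded from the penalty.

```python
# Regularized linear-regression cost: MSE term + L2 penalty over w_1..w_n.

def predict(w, x):
    # f_w(x) = w . x, where x[0] == 1 supplies the bias
    return sum(wj * xj for wj, xj in zip(w, x))

def regularized_cost(w, X, y, lam):
    m = len(X)
    mse = sum((predict(w, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)
    penalty = lam / (2 * m) * sum(wj ** 2 for wj in w[1:])  # skip w_0
    return mse + penalty

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # x[0] = 1 is the bias feature
y = [2.0, 4.0, 6.0]                        # exactly y = 2x

print(regularized_cost([0.0, 2.0], X, y, lam=0.0))  # 0.0: perfect fit, no penalty
print(regularized_cost([0.0, 2.0], X, y, lam=1.0))  # penalty only: (1/6) * 2^2
```

With a perfect fit, the MSE term is zero, so any remaining cost comes entirely from the regularization term.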

60. The two terms have opposing goals. Explain.

  • The MSE term wants gradient descent to find parameters that make the model’s predictions as close as possible to the training data — it wants to minimize prediction errors, which can lead to large weights if that fits the data better.
  • The regularization term wants gradient descent to keep the weights as small as possible (close to zero), regardless of how well the model fits the data.

The parameter $\lambda$ controls the balance: a small $\lambda$ prioritizes fitting the data (more risk of overfitting); a large $\lambda$ prioritizes keeping weights small (more risk of underfitting). The goal is to find a $\lambda$ that strikes the right balance.


61. What happens if $\lambda = 0$? What if $\lambda$ is extremely large?

  • $\lambda = 0$: The regularization term disappears entirely, and we are back to standard (unregularized) linear regression. The model may overfit if there are many features or the data is noisy.

  • $\lambda$ extremely large: The regularization term dominates the cost function, and gradient descent will drive all weights $w_1, w_2, \ldots, w_n$ toward zero. With all weights near zero, the model is essentially $f(\boldsymbol{x}) \approx w_0$ (a constant/horizontal line), which drastically underfits the data because the model ignores all features.
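
Both extremes can be seen numerically. In this sketch (toy data and hyperparameters of my choosing), the same gradient-descent loop is run with $\lambda = 0$ and with a very large $\lambda$, and the learned feature weight $w_1$ is compared.

```python
# One regularized gradient-descent step for linear regression.
def grad_step(w, X, y, alpha, lam):
    m = len(X)
    preds = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
    new_w = []
    for j in range(len(w)):
        grad = sum((p - yi) * x[j] for p, yi, x in zip(preds, y, X)) / m
        if j > 0:                       # w_0 is not regularized
            grad += lam / m * w[j]
        new_w.append(w[j] - alpha * grad)
    return new_w

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # x[0] = 1 is the bias feature
y = [2.0, 4.0, 6.0]                        # exactly y = 2x

w_zero, w_huge = [0.0, 0.0], [0.0, 0.0]
for _ in range(2000):
    w_zero = grad_step(w_zero, X, y, alpha=0.05, lam=0.0)
    w_huge = grad_step(w_huge, X, y, alpha=0.05, lam=100.0)

print(w_zero)  # w_1 close to 2: unregularized fit recovers y = 2x
print(w_huge)  # w_1 close to 0: the penalty dominates; the model flattens
```

With the huge $\lambda$, $w_1$ collapses toward zero and $w_0$ settles near the mean of $y$, i.e., the model degenerates into roughly a horizontal line.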


62. Difference between L1 and L2 regularization.

  • L2 regularization (ridge) adds $\frac{\lambda}{2m}\sum_{j=1}^n w_j^2$ to the cost function. It penalizes the squared magnitude of the weights, which shrinks all weights toward zero but rarely makes them exactly zero.

  • L1 regularization (lasso) adds $\frac{\lambda}{2m}\sum_{j=1}^n |w_j|$ to the cost function. It penalizes the absolute value of the weights, which tends to drive some weights all the way to exactly zero.

L1’s ability to produce exact zeros is useful because it effectively performs automatic feature selection — features whose weights become zero are effectively removed from the model. This is valuable when you suspect that only a few features are truly relevant.
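
One standard way to see why L1 produces exact zeros while L2 only scales is to compare the one-dimensional shrinkage each penalty induces (the proximal operators of $t|w|$ and $\frac{t}{2}w^2$). This sketch and its parameter $t$ are illustrative, not from the text.

```python
def l2_shrink(w, t):
    # argmin_v 0.5*(v - w)^2 + (t/2)*v^2  ->  proportional shrinkage
    return w / (1.0 + t)

def l1_shrink(w, t):
    # argmin_v 0.5*(v - w)^2 + t*|v|  ->  soft-thresholding
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

weights = [3.0, 0.4, -0.2, -2.5]
print([round(l2_shrink(w, 0.5), 3) for w in weights])  # all nonzero, just smaller
print([round(l1_shrink(w, 0.5), 3) for w in weights])  # small weights become exactly 0.0
```

L2 divides every weight by the same factor, so a nonzero weight stays nonzero; L1 subtracts a fixed amount, so any weight smaller than the threshold lands exactly at zero. This is the mechanism behind L1's automatic feature selection.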


63. What is elastic net regularization?

Elastic net combines both L1 and L2 regularization by adding both penalty terms to the cost function. It offers the benefits of both: it can zero out irrelevant features (like L1) while still handling groups of correlated features gracefully (like L2). It is often a good default choice when you’re unsure which type of regularization to use.
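
A common parameterization (used here as an illustrative sketch; the names `lam` and mixing ratio `r` are mine) blends the two penalties with a single ratio: `r = 1` recovers pure L1, `r = 0` pure L2.

```python
def elastic_net_penalty(w, lam, r):
    # Weighted mix of the L1 and L2 penalties over w_1..w_n (w_0 excluded).
    l1 = sum(abs(wj) for wj in w[1:])
    l2 = sum(wj ** 2 for wj in w[1:])
    return lam * (r * l1 + (1 - r) / 2 * l2)

w = [0.5, 2.0, -1.0]
print(elastic_net_penalty(w, lam=0.1, r=0.5))  # half L1, half L2
print(elastic_net_penalty(w, lam=0.1, r=1.0))  # pure L1
print(elastic_net_penalty(w, lam=0.1, r=0.0))  # pure L2
```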


64. Why do we typically not regularize $w_0$?

$w_0$ corresponds to the bias term $b$ from the original formulation. It represents the baseline prediction (like the y-intercept of a line) and is not associated with any particular feature. Regularizing it would penalize the model for having a non-zero baseline, which doesn’t help prevent overfitting — overfitting comes from the model assigning overly large importance to features, not from the baseline value. That said, regularizing $w_0$ often doesn’t hurt much in practice, so some implementations do it anyway for simplicity.


65. Write the regularized gradient descent update and show the “weight decay” form.

The update equation for $w_j$ (where $j > 0$): \(w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^m \left(f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}w_j\right]\)

Rearranging: \(w_j = w_j - \alpha\frac{\lambda}{m}w_j - \alpha\frac{1}{m}\sum_{i=1}^m \left(f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}\)

\[w_j = w_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^m \left(f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}\]

The term $\left(1 - \alpha\frac{\lambda}{m}\right)$ is a number slightly less than 1 (e.g., 0.999). On every update, $w_j$ is first multiplied by this factor, which slightly shrinks (decays) it. This is why L2 regularization is sometimes called “weight decay” — on each iteration, the weights are decayed (shrunk) slightly before the gradient update is applied.
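
The algebraic rearrangement can be checked numerically. In this sketch the gradient of the MSE term is replaced by a made-up constant, since the equivalence holds for any gradient value.

```python
# Verify: the combined regularized update equals "decay the weight first,
# then apply the plain (unregularized) gradient step".

m = 4
alpha, lam = 0.1, 0.5
w_j = 1.5
plain_grad = 0.3  # stands in for (1/m) * sum((f(x) - y) * x_j)

combined = w_j - alpha * (plain_grad + (lam / m) * w_j)
decay_then_step = w_j * (1 - alpha * lam / m) - alpha * plain_grad

print(combined, decay_then_step)  # the two forms agree
```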


66. True or false: “Regularization is only useful when you have a large number of features.”

False. While regularization is especially important when you have many features (high-dimensional data is more prone to overfitting), it can also help in other situations. For example, even with a moderate number of features, if you have a small training set, overfitting can occur. Regularization can also help when using polynomial features (even derived from a single original feature), where the higher-order terms can cause overfitting. Regularization is useful whenever there is a risk of overfitting, regardless of the specific number of features.


67. Model performs well on training data but poorly on new data.

(a) This is called overfitting (also sometimes described as the model having high variance).

(b) Two things you could do:

  1. Collect more training data to force the model to generalize rather than memorize.
  2. Apply regularization (e.g., L2) to penalize large weights and smooth out the model.

(Other valid answers: reduce the number of features / feature selection, use a simpler model.)


Cross-Cutting / Synthesis Questions

68. Compare and contrast the cost functions for linear, logistic, and multinomial logistic regression.

  • Linear regression: Uses the mean squared error (MSE): $J = \frac{1}{2m}\sum(f(\boldsymbol{x}^{(i)}) - y^{(i)})^2$. Measures average squared distance between predictions and actual values. Works well because the resulting cost surface is convex.

  • Logistic regression: Uses the binary cross-entropy loss: $J = -\frac{1}{m}\sum[y\log(f(\boldsymbol{x})) + (1-y)\log(1 - f(\boldsymbol{x}))]$. Uses logarithms to heavily penalize confident wrong predictions. Produces a convex cost surface when used with the sigmoid model.

  • Multinomial logistic regression: Uses the categorical cross-entropy loss: $J = \frac{1}{m}\sum(-\sum_k y_k \log \hat{y}_k)$. Extends binary cross-entropy to multiple classes using one-hot encoded targets and softmax outputs.

We can’t use the same cost function for all three because: (1) MSE with the sigmoid model produces a non-convex cost surface, making gradient descent unreliable for classification; (2) the cross-entropy loss is specifically designed for probability outputs and penalizes wrong predictions in a way that’s mathematically compatible with the sigmoid/softmax functions.
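
The claim that cross-entropy "heavily penalizes confident wrong predictions" is easy to see numerically. This sketch evaluates a single binary cross-entropy term for a positive example ($y = 1$) at several predicted probabilities.

```python
import math

# One term of the binary cross-entropy loss.
def bce_term(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# For y = 1 the loss is -log(p): it grows without bound as p -> 0.
for p in [0.9, 0.5, 0.1, 0.01]:
    print(p, round(bce_term(1.0, p), 3))
```

A prediction of 0.9 for a true positive costs about 0.105, while a confident wrong prediction of 0.01 costs about 4.6, roughly 44 times more.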


69. Same-looking update equation, different algorithms. What makes them different?

The definition of $f$ is different:

  • Linear regression: $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$ (a linear function, output is any real number).
  • Logistic regression: $f(\boldsymbol{x}) = \sigma(\boldsymbol{w} \cdot \boldsymbol{x})$ (the linear function wrapped in a sigmoid, output is between 0 and 1).

Because $f$ computes different values, the error term $f(\boldsymbol{x}^{(i)}) - y^{(i)}$ evaluates to different numbers in each algorithm, producing different gradients and therefore different parameter updates. The identical algebraic form is a mathematical coincidence, not an indication that the algorithms are the same.
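
The point is concrete with numbers. This sketch (toy weights and data of my choosing) evaluates both models on the same example and shows that the error terms, and hence the updates, differ.

```python
import math

# Same data point, same weights, but different f -> different error terms.
w = [0.5, -1.0]
x = [1.0, 2.0]   # x[0] = 1 is the bias feature
y = 1.0
z = sum(wj * xj for wj, xj in zip(w, x))   # w . x = -1.5

f_linear = z                               # linear regression: any real number
f_logistic = 1.0 / (1.0 + math.exp(-z))    # logistic regression: in (0, 1)

print(f_linear - y)    # error term for the linear-regression update
print(f_logistic - y)  # error term for the logistic-regression update
```

Even though both algorithms would plug their error term into the identical update formula, the numbers fed into that formula differ, so the parameter trajectories diverge from the very first step.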


70. Trace: model → cost function → gradient descent → predictions.

  1. Model: We define a mathematical function $f$ (with parameters $\boldsymbol{w}$) that maps input features to predicted outputs. For example, $f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$ for linear regression.

  2. Cost function: We define $J(\boldsymbol{w})$, which measures how poorly the model (with its current parameters) fits the training data, by averaging the errors across all training examples.

  3. Gradient descent: We use the cost function’s gradients (partial derivatives) to iteratively update the parameters $\boldsymbol{w}$, taking small steps in the direction that reduces $J$ the most. Over many iterations, $\boldsymbol{w}$ converges to values that (approximately) minimize $J$.

  4. Predictions: With the optimized parameters, we use the model $f$ to make predictions on new, unseen data: given new features $\boldsymbol{x}_{\text{new}}$, we compute $\hat{y} = f(\boldsymbol{x}_{\text{new}})$.
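
The four steps can be traced end-to-end in a short script. This is a minimal sketch on toy data of my choosing (points on the line $y = 2x + 1$), with made-up hyperparameters.

```python
# 1. Model: f(x) = w_0 + w_1 * x
def f(w, x):
    return w[0] + w[1] * x

# 2. Cost function: mean squared error over the training set
def cost(w, X, y):
    m = len(X)
    return sum((f(w, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)

# 3. Gradient descent: iteratively update w to reduce the cost
X = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1
w = [0.0, 0.0]
alpha = 0.1
for _ in range(5000):
    m = len(X)
    errs = [f(w, xi) - yi for xi, yi in zip(X, y)]
    g0 = sum(errs) / m
    g1 = sum(e * xi for e, xi in zip(errs, X)) / m
    w = [w[0] - alpha * g0, w[1] - alpha * g1]

# 4. Predictions: apply the trained model to new input
x_new = 10.0
print(w)            # close to [1.0, 2.0]
print(f(w, x_new))  # close to 21.0
```

Each numbered comment corresponds to one stage of the pipeline: the model defines the hypothesis space, the cost function scores a candidate $\boldsymbol{w}$, gradient descent searches for the minimizing $\boldsymbol{w}$, and the fitted model then makes predictions.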


71. Write the regularized cost function for logistic regression. How does gradient descent change?

\[J(\boldsymbol{w}) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\log(f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)})) + (1 - y^{(i)})\log(1 - f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n w_j^2\]

The gradient descent update equation changes (for $j > 0$) by adding the regularization term to the gradient: \(w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^m \left(f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}w_j\right]\)

This is the same modification as for linear regression — the only difference is that $f$ is the logistic regression model ($\sigma(\boldsymbol{w} \cdot \boldsymbol{x})$) rather than the linear regression model.
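
One update of that equation can be sketched directly. The toy data and hyperparameters here are illustrative; the structure mirrors the formula above, including the exclusion of $w_0$ from the penalty.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One regularized gradient-descent step for logistic regression.
def step(w, X, y, alpha, lam):
    m = len(X)
    preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
    new_w = []
    for j in range(len(w)):
        grad = sum((p - yi) * x[j] for p, yi, x in zip(preds, y, X)) / m
        if j > 0:                       # do not regularize w_0
            grad += lam / m * w[j]
        new_w.append(w[j] - alpha * grad)
    return new_w

X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]  # x[0] = 1 is the bias feature
y = [1.0, 1.0, 0.0]
w = step([0.0, 0.0], X, y, alpha=0.1, lam=1.0)
print(w)  # both weights move toward separating the classes
```

Compared with the linear-regression version, the only change inside the loop is that predictions pass through `sigmoid`; the regularization term $\frac{\lambda}{m} w_j$ is added identically.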


72. Dataset with 1000 examples and 500 features. Concerned about overfitting?

Yes, this is a situation where overfitting is very likely. Having 500 features relative to 1000 training examples means the model has many degrees of freedom relative to the amount of data, making it easy for the model to find spurious patterns in the training data.

Steps to take:

  1. Apply regularization (L1, L2, or elastic net) to constrain the weights and prevent overfitting.
  2. Feature scaling — with 500 features, they likely have different ranges, which will hurt gradient descent performance.
  3. Feature selection — use domain knowledge or L1 regularization to identify and keep only the most relevant features.
  4. Collect more data if possible — more examples relative to features reduces overfitting risk.
  5. Evaluate the model on held-out data (not used for training) to check for overfitting.