ML Test Bank — Questions
Intro / General Concepts
1. What is the difference between supervised and unsupervised learning? Give an example of each.
2. What is the difference between a regression problem and a classification problem? For each of the following, state whether it is regression or classification:
- (a) Predicting the temperature tomorrow in degrees Fahrenheit.
- (b) Predicting whether a customer will cancel their subscription (yes/no).
- (c) Predicting what letter (A–Z) a handwritten character represents.
- (d) Predicting how many minutes a student will spend on homework.
3. Define the following terms: training set, feature, target, training example.
4. Using standard ML notation, what do each of the following symbols represent: $m$, $x^{(i)}$, $y^{(i)}$, $\hat{y}^{(i)}$, $n$?
5. Explain the difference between a model’s parameters and its hyperparameters. Give an example of each from linear regression.
Linear Regression
6. Write the model equation for univariate linear regression. What are $w$ and $b$, and what is the goal of the learning algorithm with respect to them?
7. Write the mean squared error cost function $J(w, b)$ for univariate linear regression. Why do we include the $\frac{1}{2}$ factor?
8. In your own words, explain what the cost function $J$ measures. Why do we want to minimize it?
9. Suppose you have the following three training examples: $(1, 2)$, $(2, 4)$, $(3, 5)$. If your current model is $f(x) = 1.5x + 0.5$, compute the cost $J(w, b)$ by hand.
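If you want to check your hand computation for this question, here is a short NumPy sketch (assuming the standard cost $J(w,b) = \frac{1}{2m}\sum_i (f(x^{(i)}) - y^{(i)})^2$):

```python
import numpy as np

# Training examples (x, y) and the candidate model f(x) = 1.5x + 0.5
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
w, b = 1.5, 0.5

predictions = w * x + b          # [2.0, 3.5, 5.0]
errors = predictions - y         # [0.0, -0.5, 0.0]
m = len(x)
J = (1 / (2 * m)) * np.sum(errors ** 2)
print(J)  # 0.25 / 6 ≈ 0.0417
```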
10. Describe the gradient descent algorithm in plain English (the “hiking downhill” analogy is fine). What does each iteration of gradient descent do?
11. Write the gradient descent update equations for univariate linear regression (both $w$ and $b$). What does each piece of the equation represent?
12. What is the learning rate $\alpha$? What happens if $\alpha$ is set too large? What happens if it is set too small?
13. What is a learning curve (in the context of monitoring gradient descent)? Sketch what a good learning curve looks like. Sketch what a learning curve looks like when $\alpha$ is too large.
14. Explain why gradient descent can converge even with a fixed (constant) learning rate $\alpha$.
15. What is the difference between univariate and multiple linear regression? Write the model equation for multiple linear regression using vector notation (the dot product form).
16. Explain why we use vectorization (e.g., np.dot(w, x) + b) rather than explicit Python for-loops when implementing linear regression. Give at least two reasons.
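As a concrete illustration of what the question is asking about, here is a sketch comparing the loop and vectorized forms (made-up weights and features):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
x = rng.standard_normal(5)
b = 0.1

# Explicit loop version
f_loop = b
for j in range(len(w)):
    f_loop += w[j] * x[j]

# Vectorized version: one call into optimized, parallelizable code
f_vec = np.dot(w, x) + b

print(np.isclose(f_loop, f_vec))  # True — same result, different mechanics
```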
17. In multiple linear regression, we sometimes add a column of 1s to the left of the feature matrix $X$. Why do we do this? What happens to the bias parameter $b$?
18. What is the normal equation? Write it. Under what circumstances might you prefer using the normal equation over gradient descent (or vice versa)?
19. Suppose you have two features: $x_1$ ranges from 0 to 1000 and $x_2$ ranges from 0 to 5. Why might this cause problems for gradient descent? Name and briefly describe two methods for addressing this issue.
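For reference while answering, one common rescaling method looks like this (a z-score standardization sketch, assuming features are columns of a NumPy matrix):

```python
import numpy as np

# Two features on very different scales
X = np.array([[100.0, 1.0],
              [500.0, 3.0],
              [900.0, 5.0]])

# Z-score standardization: subtract each column's mean, divide by its std dev
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```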
20. Explain what feature engineering is. Give an example of creating a new feature from existing features and explain why it might be useful.
21. Explain how you can use linear regression to fit a polynomial (curved) model to data. Why does feature scaling become especially important when using polynomial features?
22. Suppose your learning curve shows that the cost $J$ is increasing over iterations. What is the most likely cause, and how would you fix it?
23. True or false (explain your answer): “For linear regression with the squared error cost function, gradient descent can get stuck in a local minimum that is not the global minimum.”
24. Consider the following Python code:
    def compute_cost(X, y, w):
        m = X.shape[0]
        predictions = X @ w
        errors = predictions - y
        cost = (1 / (2 * m)) * (errors.T @ errors)
        return cost
What does X @ w compute? What does errors.T @ errors compute and why is this useful for the cost function?
25. Suppose you run gradient descent for linear regression and after 1000 iterations the cost is 45.2, and after 1001 iterations the cost is 45.19. Would you say gradient descent has likely converged? Explain your reasoning.
Logistic Regression
26. Why can’t we simply use linear regression for classification problems? What would go wrong?
27. Write the sigmoid function $\sigma(z)$. What is its output range? What is $\sigma(0)$? Roughly what is $\sigma(10)$? $\sigma(-10)$?
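A quick numeric check of these values (sketch):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ≈ 0.99995 (very close to 1)
print(sigmoid(-10))  # ≈ 0.0000454 (very close to 0)
```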
28. Write the complete model equation for logistic regression. How does it relate to the linear regression model?
29. How do we interpret the output of a logistic regression model? If the model outputs 0.73 for a particular input, what does that mean?
30. Explain what the decision boundary is in logistic regression. For a model with two features ($x_1$ and $x_2$) and weights $w_0 = -4$, $w_1 = 1$, $w_2 = 1$, derive the equation of the decision boundary and describe what it looks like geometrically.
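To check your derivation, you can verify numerically that points satisfying the boundary equation give a model output of exactly 0.5 (a sketch; $w_0$ here plays the role of the bias):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w0, w1, w2 = -4.0, 1.0, 1.0

# The boundary is where z = w0 + w1*x1 + w2*x2 = 0.
# Every point on it should give a model output of exactly 0.5.
probs = [sigmoid(w0 + w1 * x1 + w2 * x2) for x1, x2 in [(0, 4), (2, 2), (4, 0)]]
print(probs)  # [0.5, 0.5, 0.5]
```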
31. Without higher-order polynomial features, what shape will the decision boundary of logistic regression always be? How can we get a non-linear (e.g., circular) decision boundary using logistic regression?
32. Explain why we do not use the squared error cost function (MSE) for logistic regression. What problem arises?
33. Write the loss function $L(\hat{y}, y)$ for logistic regression (the piecewise version). Explain intuitively why each piece makes sense: what happens to the loss when the prediction is correct? When it is very wrong?
34. Write the simplified (single-equation) cross-entropy loss function. Verify that it is equivalent to the piecewise version by substituting $y = 0$ and $y = 1$.
35. The gradient descent update equation for logistic regression looks identical in form to the one for linear regression. Explain why the two are nonetheless different algorithms.
36. Suppose a logistic regression model has learned weights $w_0 = -2$, $w_1 = 3$, and $w_2 = -1$. For an input $x_1 = 1, x_2 = 0$:
- (a) Compute $z = w \cdot x$.
- (b) Compute $\sigma(z)$ (you may leave your answer in terms of $e$ or approximate).
- (c) What class does the model predict?
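A sketch for checking parts (a)–(c), assuming $w_0$ is the bias term and the usual 0.5 decision threshold:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w0, w1, w2 = -2.0, 3.0, -1.0
x1, x2 = 1.0, 0.0

z = w0 + w1 * x1 + w2 * x2    # -2 + 3 - 0 = 1
p = sigmoid(z)                # ≈ 0.731
prediction = 1 if p >= 0.5 else 0
print(z, p, prediction)
```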
37. What is multinomial logistic regression (softmax regression)? When would you use it instead of regular (binary) logistic regression?
38. What is a one-hot vector? Give an example for a classification problem with 4 classes where the correct class is class 3.
39. Write the softmax function. Given the input vector $z = [1, 2, 3]$, explain what the softmax function does to it (you don’t need to compute exact numbers, but describe the output qualitatively).
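If you want to see concrete numbers after describing the output qualitatively, a minimal softmax sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p)        # ≈ [0.090, 0.245, 0.665] — larger inputs get larger probabilities
print(p.sum())  # 1.0 — the outputs form a probability distribution
```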
40. In multinomial logistic regression, instead of a single weight vector $w$, we have multiple weight vectors (one per class). How many weight vectors do we have if we have $n$ features and $K$ classes? What are the dimensions of the weight matrix $W$?
41. Explain the cross-entropy loss function for multinomial logistic regression: $L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$. Why do most of the terms in this sum equal zero?
42. True or false (explain): “Logistic regression outputs are always exactly 0 or 1.”
43. Consider the following code:
    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def predict(X, w):
        z = X @ w
        return sigmoid(z)
If X has shape $(100, 5)$ and w has shape $(5, 1)$, what is the shape of the output of predict(X, w)? What does each element of the output represent?
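You can confirm your shape reasoning by running the code on dummy arrays (a sketch):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(X, w):
    z = X @ w
    return sigmoid(z)

X = np.zeros((100, 5))   # 100 examples, 5 features
w = np.zeros((5, 1))
out = predict(X, w)
print(out.shape)  # (100, 1)
```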
Neural Networks (Conceptual)
44. In the context of neural networks, what is a “neuron” (or “unit”)? What computation does a single neuron perform?
45. What are the three types of layers in a neural network? Briefly describe each.
46. Why are the middle layers of a neural network called “hidden” layers?
47. Explain how a single neuron in a neural network is related to logistic regression.
48. Name three common activation functions. For each one, write the formula and describe when/where it is typically used in a neural network.
49. What activation function is typically used for the output layer of a neural network performing:
- (a) Regression?
- (b) Binary classification?
- (c) Multi-class classification?
50. Explain what forward propagation is. Describe the process step by step for a network with one hidden layer.
51. Write the general formula for computing the activation of neuron $j$ in layer $\ell$ of a neural network. Define all notation used.
52. Explain in your own words the “key idea” of neural networks: why can a neural network learn more complex functions than logistic regression alone, even though each neuron is essentially doing the same thing as logistic regression?
53. In the T-shirt demand prediction example from class, four input features (price, shipping cost, marketing, material) were connected to three hidden neurons (affordability, awareness, perceived quality). In practice, do we need to decide what each hidden neuron “means” ahead of time? Why or why not?
54. Suppose a neural network has an input layer with 3 features, one hidden layer with 4 neurons, and an output layer with 1 neuron.
- (a) What are the dimensions of the weight matrix $W^{[1]}$?
- (b) What are the dimensions of the bias vector $b^{[1]}$?
- (c) What are the dimensions of the weight matrix $W^{[2]}$?
- (d) What are the dimensions of the bias vector $b^{[2]}$?
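One way to check your dimension answers is to run a forward pass on dummy arrays. Note the shapes below assume one common convention ($W^{[\ell]}$ has one row per neuron in layer $\ell$ and one column per input to that layer); check your course notes, since conventions differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed convention: W[l] is (units in layer l, units in layer l-1),
# b[l] is (units in layer l, 1)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))            # one input example with 3 features
a1 = np.tanh(W1 @ x + b1)                  # hidden activations: shape (4, 1)
a2 = 1 / (1 + np.exp(-(W2 @ a1 + b2)))     # output: shape (1, 1)
print(a1.shape, a2.shape)
```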
55. Why was the development of GPUs important for the deep learning revolution? What mathematical operations do GPUs excel at?
Regularization
56. Define overfitting and underfitting. Give a concrete example of each in the context of linear regression.
57. List three strategies for reducing overfitting. Briefly describe each.
58. What is regularization? In general terms, what does it do to the model’s parameters?
59. Write the regularized cost function for linear regression (L2 regularization). Identify and name each component of this cost function.
60. The two terms in the regularized cost function have somewhat opposing goals. Explain what each term “wants” gradient descent to do, and how the parameter $\lambda$ balances them.
61. What happens if $\lambda = 0$? What happens if $\lambda$ is extremely large? Explain each in terms of overfitting/underfitting.
62. Explain the difference between L1 (lasso) regularization and L2 (ridge) regularization. Which one can drive weights exactly to zero, and why is that useful?
63. What is elastic net regularization?
64. By convention, we typically do not regularize $w_0$ (the bias term). Why not?
65. Write the gradient descent update equation for $w_j$ (where $j > 0$) in regularized linear regression. Show how it can be rewritten so that the effect of regularization appears as multiplying $w_j$ by a factor slightly less than 1 on each update. Why is this sometimes called “weight decay”?
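You can sanity-check the "weight decay" rewriting numerically: the regularized update and the factored form give identical results (a sketch with made-up numbers):

```python
# Made-up values; grad_unreg stands in for (1/m) * sum((f(x) - y) * x_j)
alpha, lam, m = 0.1, 2.0, 50
w_j = 0.8
grad_unreg = 0.3

# Regularized update written two ways
update_a = w_j - alpha * (grad_unreg + (lam / m) * w_j)
update_b = w_j * (1 - alpha * lam / m) - alpha * grad_unreg
print(update_a, update_b)  # identical: each step shrinks w_j by the factor (1 - alpha*lam/m)
```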
66. True or false (explain): “Regularization is only useful when you have a large number of features.”
67. Suppose you train a linear regression model and it performs very well on the training data but poorly on new, unseen data.
- (a) What is this problem called?
- (b) Name two things you could do to address it.
Cross-Cutting / Synthesis Questions
68. Compare and contrast the cost functions used in linear regression, logistic regression, and multinomial logistic regression. Why can’t we just use the same cost function for all three?
69. The gradient descent update equation for both linear regression and logistic regression can be written as: $w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^m (f_w(x^{(i)}) - y^{(i)}) x_j^{(i)}$. Despite looking the same, these are different algorithms. What makes them different?
70. Trace through how the following concepts connect: model $\rightarrow$ cost function $\rightarrow$ gradient descent $\rightarrow$ predictions. Explain each step and how one leads to the next.
71. Regularization can be applied to logistic regression and neural networks, not just linear regression. Write the regularized cost function for logistic regression. How does the gradient descent update equation change?
72. Consider a dataset with 1000 training examples and 500 features. Would you be concerned about overfitting? What steps might you take before training a model?