Linear regression — gradient descent
Background
Linear regression fits a model by minimising the mean squared error between predictions and targets. Where the normal equation solves this in closed form, gradient descent reaches the same optimum iteratively — the approach that scales to millions of features and is the literal inner loop of training every neural network. The squared-error surface is convex, so with a sensible step size gradient descent is guaranteed to converge to the global minimum.
Problem statement
Implement linear_regression_gradient_descent(X, y, alpha, iterations) that fits the coefficients by batch gradient descent on the mean squared error. Initialise and repeat the update iterations times:
where is the number of training examples and is the learning rate.
Input
X—np.ndarrayof shape(m, n): the design matrix, with a leading column of ones for the intercept already included.y—np.ndarrayof shape(m,): the target values.alpha—float: the learning rate.iterations—int: the number of full-batch gradient-descent steps.
Output
Returns an np.ndarray of shape (n,): the fitted coefficients , rounded to 4 decimal places (theta[0] is the intercept).
Examples
Example 1 — clean data converges to the exact line
Input: X = [[1, 1], [1, 2], [1, 3]], y = [1, 2, 3], alpha = 0.1, iterations = 1000
Output: [0.0, 1.0]
Explanation: the points lie exactly on , so gradient descent drives toward an intercept of and a slope of .
Example 2 — noisy data converges to the least-squares solution
Input: X = [[1, 1], [1, 2], [1, 3], [1, 4]], y = [6, 5, 7, 10], alpha = 0.05, iterations = 5000
Output: [3.5, 1.4]
Explanation: no line fits all four points, so gradient descent converges to the same minimiser the normal equation gives, .
Constraints
Xalready contains the intercept column; initialise to zeros.- Use the mean gradient (averaged over the examples), scaled by
alpha. - Return an
np.ndarrayof shape(n,)rounded withnp.round(theta, 4). - Choose
alphasmall enough to converge; tests compare withatol=1e-3.
Notes
- The MSE surface is convex, so the only fixed point is the global optimum — given enough iterations and a small enough , gradient descent reaches exactly what the normal equation computes in one step.
- The trade-off: gradient descent costs per step but avoids the matrix inverse, which is why it (and its stochastic/mini-batch variants) wins on large, high-dimensional data.
This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Clean data y = x converges to intercept 0, slope 1
- •Noisy data converges to the normal-equation (OLS) solution [3.5, 1.4]
- •Returns a 1-D array with one coefficient per column of X
- •More iterations reduce the training MSE