Elastic-Net regression (gradient descent)

Difficulty: Medium

Background

Elastic Net is linear regression with both an L1 (Lasso) and an L2 (Ridge) penalty on the weights. L1 drives some coefficients exactly to zero (feature selection); L2 shrinks them smoothly and stabilises the fit when features are correlated. Elastic Net blends the two, which makes it a common choice when you have many, possibly correlated features. Because the L1 term is not differentiable at zero, there is no closed-form solution, so the model is trained iteratively, here by gradient descent using a subgradient of the L1 term.
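
For reference, one objective whose subgradient matches the update rule used in this problem can be sketched as follows. The $\frac{1}{2n}$ scaling of the data term is an assumption inferred from the $\frac{1}{n} X^\top e$ gradient term; the problem text itself does not write the objective out. Here $\alpha_1$ and $\alpha_2$ are the L1 and L2 strengths, and $\operatorname{sign}(0) = 0$ picks one valid element of the L1 subdifferential at zero.

```latex
J(w, b) = \frac{1}{2n} \lVert Xw + b - y \rVert_2^2
        + \alpha_1 \lVert w \rVert_1
        + \alpha_2 \lVert w \rVert_2^2,
\qquad
\frac{1}{n} X^\top (Xw + b - y) + \alpha_1 \operatorname{sign}(w) + 2\alpha_2 w
\;\in\; \partial_w J(w, b)
```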

Problem statement

Implement elastic_net_gradient_descent(X, y, alpha1, alpha2, learning_rate, max_iter, tol) that fits weights and bias by gradient descent on the elastic-net objective. Initialise $w = 0$, $b = 0$ and repeat:

$$e = (Xw + b) - y$$

$$\nabla_w = \frac{1}{n} X^\top e + \alpha_1 \operatorname{sign}(w) + 2\alpha_2 w, \qquad \nabla_b = \frac{1}{n}\sum_i e_i$$

$$w \leftarrow w - \eta\,\nabla_w, \qquad b \leftarrow b - \eta\,\nabla_b$$

Stop early once $\lVert \nabla_w \rVert_1 < \text{tol}$.
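
The loop above can be sketched in NumPy roughly as follows. This is a minimal sketch, not the reference solution: it assumes the tolerance check runs on the freshly computed gradient before the step is applied, which the problem text leaves open.

```python
import numpy as np


def elastic_net_gradient_descent(X, y, alpha1, alpha2, learning_rate, max_iter, tol):
    """Fit weights and bias by (sub)gradient descent on the elastic-net objective."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape

    weights = np.zeros(n_features)
    bias = 0.0

    for _ in range(max_iter):
        # Residuals e = (Xw + b) - y
        error = X @ weights + bias - y

        # Data term + L1 subgradient + L2 gradient
        grad_w = (X.T @ error) / n_samples \
            + alpha1 * np.sign(weights) \
            + 2.0 * alpha2 * weights
        grad_b = error.mean()

        # Assumption: check the stopping rule before taking the step.
        if np.linalg.norm(grad_w, ord=1) < tol:
            break

        weights -= learning_rate * grad_w
        bias -= learning_rate * grad_b

    return weights, float(bias)
```

Called with the Example 1 inputs, this sketch should land within the stated atol=1e-2 of the documented output; the exact trailing digits depend on where the tolerance check sits relative to the update.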

Input

  • X (np.ndarray of shape (n_samples, n_features)): feature matrix (bias is a separate scalar, not a column).
  • y (np.ndarray of shape (n_samples,)): targets.
  • alpha1 (float): L1 (Lasso) strength.
  • alpha2 (float): L2 (Ridge) strength.
  • learning_rate (float): step size $\eta$.
  • max_iter (int): maximum number of gradient steps.
  • tol (float): stop when the L1 norm of the weight gradient falls below this.

Output

Returns a tuple (weights, bias):

  • weights (np.ndarray of shape (n_features,)).
  • bias (float).

Examples

Example 1

Input:  X = [[0, 0], [1, 1], [2, 2]], y = [0, 1, 2]
        alpha1 = 0.1, alpha2 = 0.1, learning_rate = 0.01, max_iter = 1000, tol = 1e-4
Output: weights = [0.3732, 0.3732], bias = 0.2479

Explanation: the data follows $y = x$ exactly, and because the two feature columns are identical, any pair of weights summing to $1$ would fit it perfectly. The L1 + L2 penalties shrink that sum to about $0.75$ and push the leftover signal into the bias, trading a little fit for smaller, more stable coefficients.
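
As a sanity check, the gradient formulas can be evaluated at the rounded output values. The sketch below shows the weight gradient is close to zero while the bias gradient is not, which is consistent with a stopping rule that only watches $\nabla_w$. Reading this as "the early stop fired before the bias settled" is an inference from the rounded numbers, not something the problem states.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
alpha1, alpha2 = 0.1, 0.1

# Rounded values copied from the Example 1 output.
weights = np.array([0.3732, 0.3732])
bias = 0.2479

error = X @ weights + bias - y
grad_w = (X.T @ error) / len(y) + alpha1 * np.sign(weights) + 2.0 * alpha2 * weights
grad_b = error.mean()

print(grad_w)  # each entry is on the order of 1e-4 in magnitude
print(grad_b)  # roughly -5.7e-3, well above the weight-gradient scale
```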

Constraints

  • Initialise weights to zeros and bias to 0.
  • Gradient = data term $\frac{1}{n}X^\top e$ + L1 subgradient $\alpha_1\operatorname{sign}(w)$ + L2 term $2\alpha_2 w$.
  • Stop early if $\lVert\nabla_w\rVert_1 < \text{tol}$; otherwise run max_iter steps.
  • Tests compare with atol=1e-2.

Notes

  • The L1 subgradient np.sign(w) is $0$ at exactly $w = 0$, so a weight that reaches zero feels no further L1 push and can stay sparse; this is what gives Lasso/Elastic-Net their feature-selection behaviour.
  • Pure Lasso is $\alpha_2 = 0$; pure Ridge is $\alpha_1 = 0$; Elastic Net interpolates. The factor $2$ on the L2 term is just the derivative of $\alpha_2\lVert w\rVert_2^2$.
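
To make the first note concrete, here is a small illustration of the L1 subgradient term on a hand-picked weight vector. The weight values and alpha1 are illustrative assumptions, not taken from the problem.

```python
import numpy as np

# Illustrative weights: one negative, one exactly zero, one positive.
weights = np.array([-0.5, 0.0, 2.0])
alpha1 = 0.1  # illustrative L1 strength

# np.sign returns -1, 0, or +1 elementwise, so the zero weight gets no L1 push.
l1_subgradient = alpha1 * np.sign(weights)
print(l1_subgradient)  # entries: -0.1, 0.0, 0.1
```

The magnitude of the L1 push is the same for every nonzero weight regardless of its size, which is what distinguishes it from the L2 term $2\alpha_2 w$ that scales with the weight.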

This problem ships 4 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.

  • Reproduces the reference fit on y = x data
  • Stronger L1 penalty shrinks the weights
  • Returns a weights array of shape (n_features,) and a scalar bias
  • Without regularization, more iterations reduce training MSE