K-means: one iterationMedium

K-means: one iteration

Background

K-means clustering is Lloyd's algorithm, repeated to convergence: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This problem is one full iteration, returning both the updated centroids and the assignments so a caller can loop and track convergence. K-means shows up in vector quantisation, codebook learning for compression, k-NN preprocessing, and as the initialiser for richer clustering methods.

Problem statement

Implement kmeans_step(X, centroids), one iteration of:

ai=argminkxick2,ck1{i:ai=k}i:ai=kxia_i = \arg\min_k \lVert x_i - c_k \rVert^2, \qquad c_k \leftarrow \frac{1}{|\{i : a_i = k\}|}\sum_{i:\,a_i=k} x_i

First assign every point to the index of its nearest centroid (Euclidean), then update each centroid to the mean of its assigned points. If a cluster has no assigned points, leave that centroid unchanged (don't produce NaN). Return (new_centroids, assignments).

Input

  • X(N, D) np.ndarray: the data points.
  • centroids(K, D) np.ndarray: the current centroid positions.

Output

Returns (new_centroids, assignments):

  • new_centroids(K, D) after the mean update.
  • assignments(N,) integer array, each in [0,K)[0, K).

Examples

Example 1 — two well-separated clusters

Input:  X = [[0,0],[0.1,0.1],[-0.1,0],  [10,10],[10.2,10],[10,9.8]]
        centroids = [[1,1], [9,9]]
Output: assignments  = [0, 0, 0, 1, 1, 1]
        new_centroids = [mean of first 3 points, mean of last 3 points]

Explanation: the first three points are nearest centroid 0 and the last three nearest centroid 1; each centroid then jumps to the mean of its assigned cluster.

Example 2 — an empty cluster stays put

Input:  X = [[0.1,0],[-0.1,0],[0,0.05]], centroids = [[0,0],[10,10],[-10,-10]]
Output: assignments  = [0, 0, 0]
        new_centroids = [mean of the 3 points, [10,10], [-10,-10]]

Explanation: all points are nearest centroid 0; centroids 1 and 2 get no points, so they are left exactly where they were rather than becoming NaN from a mean over an empty set.

Constraints

  • Assignment is argmin of squared Euclidean distance; vectorise it (e.g. X[:,None,:] - centroids[None,:,:]) rather than looping over points.
  • A cluster with no assigned points keeps its centroid unchanged (guard with if mask.any()); never emit NaN.
  • Tie-breaking is deterministic: np.argmin returns the lowest-index centroid when two are equidistant.
  • assignments is an integer array; with K=1K=1 all points go to centroid 0 and it becomes X.mean(axis=0).

Notes

  • Loss is non-increasing. Lloyd's algorithm guarantees the within-cluster sum of squares does not increase per iteration — a good sanity check that your update moves centroids toward their cluster means.
  • Empty-cluster convention. Production k-means often re-seeds an empty centroid from a random data point; here we leave it unchanged to keep the step deterministic and testable.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Returns the right shapes
  • Diagnostic: clear-cluster case converges to point-cloud means
  • Empty cluster: centroid stays put (no NaN)
  • Tie-breaking is deterministic (when two centroids are equidistant)
  • Single centroid (k=1): all points assigned to it; new centroid = X.mean(axis=0)
  • After one step from a chosen init, repeated calls converge (loss decreases)