Clustering & Finding Structure

Learning without an answer key

Every classical method you have met so far in this rung — regression, trees, nearest neighbors, SVMs — was a form of supervised learning: each example came with a label, and the model learned to map inputs to those known answers. But most data in the wild has no labels. You have a million customers, a pile of sensor readings, a folder of documents — and nobody has told you what the 'right' groupings are. This is the world of unsupervised learning: finding structure when there is no answer key.

The two big jobs in this chapter are clustering — grouping similar points together — and dimensionality reduction — squeezing many features down to a few while keeping what matters. Both rest on a single idea: that the data is not random noise, but lives on some simpler hidden shape, and our job is to expose it. The catch, which we will return to, is that 'similar' and 'structure' are things you define, not things the data announces on its own.

k-means: the workhorse that wants round blobs

k-means is the first clustering algorithm almost everyone learns, and for good reason: it is fast, simple, and often good enough. You tell it how many groups k you want; it plants k centers, assigns every point to its nearest center, then moves each center to the average of its assigned points, and repeats. Each round can only lower the total squared distance, so it always settles down — usually within a handful of passes.

Choose k and drop k centers, usually at random points from the data.
Assign step: label each point with its nearest center.
Update step: move each center to the mean of the points assigned to it.
Repeat assign and update until nothing moves — the clusters have converged.

That elegance comes with strings attached. k-means draws straight-line boundaries and implicitly assumes clusters are round, similarly sized blobs — give it two crescent moons or one fat group and one thin one, and it will carve them wrongly. It is also sensitive to where the centers start, so good implementations run it several times and keep the best. And because distance drives everything, you must apply feature scaling first, or the feature with the largest raw range will silently dominate.

Hierarchical clustering: a family tree of your data

Hierarchical clustering answers a different question. Instead of forcing you to pick k up front, it builds a whole nested tree of groupings. The common bottom-up version starts with every point as its own cluster, then repeatedly merges the two closest clusters, until everything is one big cluster. The record of those merges is a tree called a dendrogram — and you can 'cut' it at any height to read off as few or as many clusters as you like.

The tree itself is the payoff: it shows not just which points group, but how groups nest inside larger groups — species into genera into families, or topics into themes. The cost is speed. A naive implementation compares every pair of clusters at every step, which scales roughly with the cube of the number of points, so plain hierarchical clustering is impractical for very large datasets. You also must choose a *linkage* rule — whether the distance between two clusters means their closest points, their farthest, or their averages — and that choice quietly reshapes the answer.

So a rough rule of thumb: reach for k-means when you have many points and a sense of how many groups you want; reach for hierarchical clustering when the data is smaller and the nested *relationships* between groups are themselves the interesting story.

PCA: finding the directions that matter

The other half of finding structure is reducing dimensions. Real datasets often have hundreds of features, but many of them move together — height and arm-span, or a dozen correlated survey questions. Principal Component Analysis (PCA) finds a new set of axes, called principal components, ordered so the first one points along the direction in which the data spreads out the most, the second along the next-most (at right angles to the first), and so on. Keep just the first few and you have compressed the data while losing very little.

Under the hood, PCA is the eigenvectors and eigenvalues you met in the linear-algebra rung, applied to the data's covariance matrix: the eigenvectors are the principal-component directions, and each eigenvalue tells you how much variance lives along its direction. That is the whole trick — no labels, no training loop, just the geometry of how the data is spread.

# from many features down to 2, for plotting
X = standardize(X)            # center & scale first!
components = top_eigenvectors(cov(X), keep=2)
X2d = X @ components          # each row is now just (x, y)
plot(X2d)                     # eyeball the structure

PCA in four lines: standardize, take the top directions, project, then look.

PCA's two big uses are visualization — squashing high-dimensional data into 2D so you can literally see clusters before running any clustering algorithm — and pushing back against the curse of dimensionality, where distances grow meaningless when features are too many. Two honest caveats: PCA only captures linear structure, so it will miss a curved or twisted shape that a nonlinear method would catch; and its components are blends of your original features, which can make them hard to name or explain. Like k-means, it needs standardized inputs, or the largest-scale feature hijacks the first component.

Using it well — and the honest limits

In practice these tools chain together. A common move is PCA first to cut noise and dimensions, then k-means or hierarchical clustering on the compressed data, and a 2D PCA plot to sanity-check the result by eye. Clustering also powers everyday jobs you have probably used: customer segmentation, grouping documents by topic, compressing an image's colors, and anomaly detection — points that sit far from every cluster are the suspicious ones.

Now the hard truth this field tends to oversell. Clustering will always return clusters — even on pure random noise. The algorithm has no notion of whether the groups it found are real; it just minimizes a number you chose. Different choices of k, of distance metric, of scaling, or of linkage can produce wildly different 'discoveries' from the very same data. So the structure you see is partly a property of the data and partly a property of your assumptions, and an honest analyst keeps the two separate.

The fix is not a better algorithm but better discipline: validate. Does the same grouping hold on a fresh slice of data? Do the clusters mean something to a domain expert? Does the choice of k survive a small change in the random seed? Treat unsupervised results as hypotheses to be checked, not facts to be reported — and you will get the genuine value these methods offer, which is a great deal, without falling for the structure that was never really there.