From SVD to PCA: finding the directions data cares about

Picking up where Volume I left off

In Volume I you proved that every real symmetric matrix has an orthonormal basis of eigenvectors — the spectral theorem — and you met the singular value decomposition A = U Sigma V^T for any matrix. Volume II is one long answer to the question: so what? This first track shows the same two facts solving real problems in data, on graphs, and over time.

Start with a cloud of data points: each row of a matrix X is one observation, each column one feature. The geometry of that cloud — which way it stretches, where it is thin — is exactly an eigenvalue problem in disguise.

The covariance matrix carries the shape

Center the data so each feature has mean zero, then form the covariance matrix C = (1/(n-1)) X^T X. C is symmetric and positive semidefinite, so the spectral theorem applies: C = Q Lambda Q^T with orthonormal eigenvectors and nonnegative eigenvalues. Each eigenvalue is the variance of the data along its eigenvector direction.

PCA, step by step

Center each column of X (subtract its mean).
Compute the SVD X = U Sigma V^T (or eigendecompose C = X^T X / (n-1)).
The columns of V are the principal directions, ordered by decreasing variance.
Project: keep the first k directions, Z = X V_k, to compress n features down to k.

# 2-D toy: data stretched along the line y = x
X = [[ 2.1,  1.9],
     [-1.0, -1.2],
     [ 0.3,  0.1],
     [-1.4, -0.8]]
# after centering, X^T X is roughly
#   C = [ 1.9 , 1.7 ;
#         1.7 , 1.6 ]
# eigenvalues: lambda_1 ~ 3.4  (big),  lambda_2 ~ 0.1  (tiny)
# top eigenvector v1 ~ (0.72, 0.69)  -> the y = x direction
# 96% of the variance lives along v1, so 1 number per point suffices

A cloud hugging y = x collapses to one meaningful axis.

Why this is the same as Volume I's least squares

The first k principal directions give the best rank-k low-rank approximation of X in the least-squares sense — this is the Eckart–Young theorem, and it is why PCA, recommender systems, and word embeddings are secretly the same trick: approximate a big matrix by a low-rank one whose factors are interpretable directions.