Picking up where Volume I left off
In Volume I you proved that every real symmetric matrix has an orthonormal basis of eigenvectors — the spectral theorem — and you met the singular value decomposition A = U Sigma V^T for any matrix. Volume II is one long answer to the question: so what? This first track shows the same two facts solving real problems in data, on graphs, and over time.
Start with a cloud of data points: each row of a matrix X is one observation, each column one feature. The geometry of that cloud — which way it stretches, where it is thin — is exactly an eigenvalue problem in disguise.
The covariance matrix carries the shape
Center the data so each feature has mean zero, then form the covariance matrix C = (1/(n-1)) X^T X. C is symmetric and positive semidefinite, so the spectral theorem applies: C = Q Lambda Q^T with orthonormal eigenvectors and nonnegative eigenvalues. Each eigenvalue is the variance of the data along its eigenvector direction.
PCA, step by step
- Center each column of X (subtract its mean).
- Compute the SVD X = U Sigma V^T (or eigendecompose C = X^T X / (n-1)).
- The columns of V are the principal directions, ordered by decreasing variance.
- Project: keep the first k directions, Z = X V_k, to compress n features down to k.
# 2-D toy: data stretched along the line y = x
X = [[ 2.1, 1.9],
[-1.0, -1.2],
[ 0.3, 0.1],
[-1.4, -0.8]]
# after centering, X^T X is roughly
# C = [ 1.9 , 1.7 ;
# 1.7 , 1.6 ]
# eigenvalues: lambda_1 ~ 3.4 (big), lambda_2 ~ 0.1 (tiny)
# top eigenvector v1 ~ (0.72, 0.69) -> the y = x direction
# 96% of the variance lives along v1, so 1 number per point sufficesWhy this is the same as Volume I's least squares
The first k principal directions give the best rank-k low-rank approximation of X in the least-squares sense — this is the Eckart–Young theorem, and it is why PCA, recommender systems, and word embeddings are secretly the same trick: approximate a big matrix by a low-rank one whose factors are interpretable directions.