From Diagonalization to a Decomposition for Every Matrix

What Vol I could and could not do

In a first course you learned diagonalization: a square matrix A is A = P D P^-1 when it has enough eigenvectors. The spectral theorem made this beautiful for symmetric matrices — orthogonal P and real eigenvalues. But most matrices you meet are not square, and many square ones are not diagonalizable at all. We need a factorization that never fails.

The singular value decomposition is that factorization. Every real m-by-n matrix A factors as A = U Sigma V^T, where U and V are orthogonal and Sigma is a nonnegative diagonal matrix. No assumptions, no exceptions.

Two bases, not one

Diagonalization uses one basis: the same eigenbasis on the input and output. That is exactly why it is so restrictive. The SVD's key move is to relax this: use an orthonormal basis v_1, ..., v_n of the input space and a possibly different orthonormal basis u_1, ..., u_m of the output space. The columns of V are the right [[singular-vectors|singular vectors]], the columns of U the left singular vectors.

The defining relation is wonderfully simple: A v_i = sigma_i u_i. Feed in a right singular vector, get out a scaled left singular vector. The scale factor sigma_i >= 0 is a singular value. Compare this to the eigenvalue relation A x = lambda x, where input and output directions coincide — see singular values vs eigenvalues.

A picture of stretching

Geometrically, A takes the unit sphere to an ellipsoid — this is the geometry of the SVD. V^T rotates the input so the singular directions line up with the axes, Sigma stretches each axis by sigma_i, and U rotates the result into place. Rotate, stretch, rotate: that is every linear map there is.

A 2x2 example: A = [3, 0; 4, 5]

The unit circle  { x : ||x|| = 1 }  maps to an ellipse.
The longest radius of that ellipse  = sigma_1  (largest stretch)
The shortest radius                 = sigma_2  (smallest stretch)

Direction v_1 (input) is sent to direction u_1 (output), scaled by sigma_1.
Direction v_2 (input) is sent to direction u_2 (output), scaled by sigma_2.

So A = U * diag(sigma_1, sigma_2) * V^T  with
   columns of V = v_1, v_2   (orthonormal input axes)
   columns of U = u_1, u_2   (orthonormal output axes)

A maps the unit circle to an ellipse; the semi-axis lengths are the singular values.