What Vol I could and could not do
In a first course you learned diagonalization: a square matrix A is A = P D P^-1 when it has enough eigenvectors. The spectral theorem made this beautiful for symmetric matrices — orthogonal P and real eigenvalues. But most matrices you meet are not square, and many square ones are not diagonalizable at all. We need a factorization that never fails.
The singular value decomposition is that factorization. Every real m-by-n matrix A factors as A = U Sigma V^T, where U and V are orthogonal and Sigma is a nonnegative diagonal matrix. No assumptions, no exceptions.
Two bases, not one
Diagonalization uses one basis: the same eigenbasis on the input and output. That is exactly why it is so restrictive. The SVD's key move is to relax this: use an orthonormal basis v_1, ..., v_n of the input space and a possibly different orthonormal basis u_1, ..., u_m of the output space. The columns of V are the right [[singular-vectors|singular vectors]], the columns of U the left singular vectors.
The defining relation is wonderfully simple: A v_i = sigma_i u_i. Feed in a right singular vector, get out a scaled left singular vector. The scale factor sigma_i >= 0 is a singular value. Compare this to the eigenvalue relation A x = lambda x, where input and output directions coincide — see singular values vs eigenvalues.
A picture of stretching
Geometrically, A takes the unit sphere to an ellipsoid — this is the geometry of the SVD. V^T rotates the input so the singular directions line up with the axes, Sigma stretches each axis by sigma_i, and U rotates the result into place. Rotate, stretch, rotate: that is every linear map there is.
A 2x2 example: A = [3, 0; 4, 5]
The unit circle { x : ||x|| = 1 } maps to an ellipse.
The longest radius of that ellipse = sigma_1 (largest stretch)
The shortest radius = sigma_2 (smallest stretch)
Direction v_1 (input) is sent to direction u_1 (output), scaled by sigma_1.
Direction v_2 (input) is sent to direction u_2 (output), scaled by sigma_2.
So A = U * diag(sigma_1, sigma_2) * V^T with
columns of V = v_1, v_2 (orthonormal input axes)
columns of U = u_1, u_2 (orthonormal output axes)