The iterative bargain
Direct methods transform A until the answer falls out; iterative methods start from a guess x_0 and refine it, producing x_1, x_2, ... that converge to the true x. The bargain is appealing for huge sparse systems: each iteration's main cost is one matrix-vector product A v, which for a sparse A costs only O(nnz) — proportional to the number of nonzeros, not n^2. You never form L or U, never suffer fill-in. You only ever touch A through multiplication.
Stationary iterations: Jacobi and Gauss-Seidel
The oldest idea is a matrix splitting: write A = M - N where M is easy to invert, turn A x = b into M x = N x + b, and iterate x_{k+1} = M^-1 (N x_k + b). In Jacobi iteration M is just the diagonal of A, so each new component is solved using only old values. Gauss-Seidel takes M to be the lower triangle, immediately reusing freshly updated components within the same sweep — usually converging in about half as many steps.
Jacobi sweep for A x = b (a_ii on the diagonal):
for i = 1..n:
x_new[i] = ( b[i] - sum_{j != i} a[i][j] * x_old[j] ) / a[i][i]
x_old <- x_new # all components use OLD values
Gauss-Seidel sweep (in place):
for i = 1..n:
x[i] = ( b[i] - sum_{j < i} a[i][j]*x[j] # already updated
- sum_{j > i} a[i][j]*x[j] # still old
) / a[i][i]
Convergence guaranteed if A is strictly diagonally dominant
(|a_ii| > sum_{j != i} |a_ij| for every row).The leap: Krylov subspaces
Stationary methods waste information: at step k they remember only x_k, throwing away every earlier vector. The modern insight is to keep them all. Starting from the residual r_0 = b - A x_0, repeatedly multiplying by A generates r_0, A r_0, A^2 r_0, ..., and their span is the Krylov subspace K_m = span{r_0, A r_0, ..., A^{m-1} r_0}.
Why is this the right space? Because A^-1 b, the answer you want, can be written as a polynomial in A applied to b (Cayley-Hamilton guarantees it), and that polynomial's powers of A live exactly in the Krylov subspace. So K_m contains an increasingly good approximation to the true solution, and it is built using nothing but the sparse matrix-vector products you can already afford. A Krylov method picks the best approximation inside K_m by some criterion — and the choice of criterion is precisely what distinguishes conjugate gradient from GMRES in the next guide.