Gradient Descent & Least Squares

The gradient points downhill fastest

Across this rung we have used the gradient nabla f as a detector: set it to zero and you find the flat spots, the candidate peaks, pits, and saddles. But a nonzero gradient is not a failure — it is a compass. Recall the central fact of this whole ladder: nabla f points in the direction of steepest increase of f, and its length is exactly how steep that climb is. This is the meaning of the [[gradient-steepest-ascent|gradient as steepest ascent]]. Flip it around and minus nabla f points the opposite way — the direction of steepest decrease, the single compass heading along which f drops fastest from where you stand.

Why is that the steepest direction, and not merely one downhill direction among many? Picture standing on a hillside in fog. The rate at which f changes as you step in some unit direction u is the directional derivative, and a short calculation shows it equals nabla f dotted with u — the projection of the gradient onto your chosen heading. A dot product is largest when the two vectors are aligned, so the climb is steepest when u points exactly along nabla f, and the descent is steepest when u points exactly against it. Every other direction gives a gentler slope; the directions perpendicular to the gradient give no change at all, which is why they trace the level sets.

Walking downhill, step by step

Now turn the compass into a journey. You want the minimum of f but cannot see the whole landscape — you can only feel the slope under your feet. So adopt the most patient strategy imaginable: from wherever you are, take a small step in the steepest downhill direction, then look again, then step again. This is [[steepest-descent|steepest descent]], and the update is a single honest line: x_new = x_old - t times nabla f(x_old). The minus sign sends you downhill, nabla f gives the direction and steepness, and the positive number t — the step length — says how far to commit before re-checking.

The single hardest choice is that step length. Make t too small and you inch forward, taking thousands of tiny steps to cross a gentle plain. Make t too large and you leap clean over the valley floor and land higher on the far slope, perhaps bouncing wider each time until the whole process flies apart. Classical steepest descent solves this by a line search: along the chosen ray it temporarily becomes a one-variable problem, and you pick the t that minimizes f on that ray — the single-variable minimization you already know from Volume I. Each such step lands you where the path levels out, which forces the very next direction to be perpendicular to this one.

That forced perpendicularity is steepest descent's famous flaw. In a long, narrow valley — a ravine far steeper across than along — each step shoots across the floor, overshoots, and the next step shoots back, so the path zigzags down the valley in tiny cross-steps instead of running straight along its length. Convergence crawls. This is not a bug to be patched away; it is intrinsic to following only first-order slope information, and it is precisely what motivates the smarter cousins — conjugate gradients, momentum, and the Newton-type methods that fold in curvature. The plain method is the honest conceptual ancestor of all of them.

Gradient descent: how machines actually learn

Here is the honest link to machine learning, with no magic spared. Training a model means choosing its parameters — the weights of a neural network, say — to make some measure of error as small as possible. That error, viewed as a function of all the weights at once, is the loss: a single number sitting on top of a landscape with possibly millions of dimensions, one per weight. Learning is nothing more or less than minimizing that loss, and the workhorse that does it is [[calc-gradient-descent|gradient descent]] — the same downhill step you just met, scaled up to enormous dimension.

Practical gradient descent makes two pragmatic departures from the textbook ideal. First, it abandons the exact line search: computing the gradient even once over millions of weights is already costly, so instead of solving for the best t each step, you fix a small step size called the learning rate, written eta. The update reads x_new = x_old - eta times nabla f(x_old). Choosing eta is an art — too small and training drags for days, too large and the loss diverges — and much of practical machine learning is, frankly, learning to tune it. Second, the gradient itself is computed by backpropagation, which is just the chain rule applied carefully backward through the layers of the network.

Convexity: when downhill is good enough

Gradient descent has one deep and honest limitation: it only ever walks downhill, so it stops at whatever low spot the path happens to reach — a local minimum, not necessarily the lowest point anywhere. This is the heart of [[local-versus-global-optimum|local versus global optimum]]. On a bumpy landscape with many dimples, where you start determines which dimple you fall into, and the algorithm has no way to know whether a deeper valley lies just over the next ridge. It cannot see the whole map; it only feels the slope beneath it.

There is one beautiful situation where this worry vanishes completely: when f is a [[convex-function|convex function]]. A convex function is one whose graph holds water — it bows upward like a bowl, and the straight chord between any two points on the graph always lies on or above the graph itself, never below. The decisive consequence is a theorem worth memorizing: for a convex function, every local minimum is automatically a global minimum. There are no false valleys to be trapped in, because there is essentially one valley. Any honest downhill method — gradient descent included — is then guaranteed to find the true best answer.

Convexity has a clean test that ties back to the earlier guides in this rung. For a smooth function, f is convex exactly when its Hessian is positive semidefinite everywhere — the multivariable way of saying the surface curves upward (or at least never downward) in every direction at every point. So convexity is just the second-derivative test made global and permanent: not merely a bowl at one stationary point, but a bowl everywhere at once. Be honest, though, about the cost of this guarantee: most genuinely interesting problems, neural-network training above all, are sharply non-convex. There the clean theorem lapses, every guarantee softens to a merely local statement, and that gradient descent still works so well is part hard-won engineering and part empirical good luck.

Least squares: fitting as optimization

Now point all of this machinery at the most common task in applied science: fitting a model to data. You have a cloud of measured points and want the straight line that passes 'closest' to them, but no single line can hit them all — there are more data points than there are knobs (slope and intercept) to turn. The honest move is to define what 'closest' means and then minimize it. [[calc-least-squares|Least squares]] chooses the line that makes the sum of the squared vertical gaps between line and data as small as possible. Squaring the gaps punishes large misses harder, keeps every error positive so they cannot cancel, and — crucially — yields a smooth function you can differentiate.

Here is where the whole rung clicks into place. Call the slope m and intercept b; the total squared error E(m, b) is a function of just those two variables — a surface over the (m, b) plane. And it is a sum of squares, so it is a perfect upward-opening bowl: convex, with a single global minimum and no saddles or false dips anywhere. To find the best line we do exactly what guide one taught: set the partial derivatives to zero, dE/dm = 0 and dE/db = 0, and solve. Because the bowl is convex, that one stationary point is guaranteed to be the global best fit. Fitting a line is not a separate art — it is multivariable optimization wearing everyday clothes.

Setting those two derivatives to zero produces a small linear system — and in matrix form it is the famous [[normal-equations|normal equations]]. Write the model as A x = b, where A holds the inputs, x the unknown parameters, and b the data; minimizing the squared length of the residual A x - b and setting the gradient to zero gives A-transpose-A x = A-transpose-b. The word 'normal' is geometric: at the best fit the residual is made perpendicular — normal — to the space the model can reach, so the least-squares solution is precisely the orthogonal projection of the data onto that space, with the leftover error sticking out at a right angle. Squaring the error turned a vague wish for 'closeness' into a clean projection you can solve in one shot.

Fit a line  y = m x + b  to data points (x_i, y_i),  i = 1..N

  residual at point i :   r_i = (m x_i + b) - y_i
  squared error       :   E(m, b) = sum_i  r_i^2          (a convex bowl)

set both partials to zero:
  dE/dm = 2 sum_i x_i (m x_i + b - y_i) = 0
  dE/db = 2 sum_i    (m x_i + b - y_i) = 0

=> the normal equations (2 x 2 linear system in m and b):
  [ sum x_i^2 , sum x_i ;  sum x_i , N ] * [ m ; b ] = [ sum x_i y_i ; sum y_i ]

TWO routes to the SAME answer:
  closed form  ->  solve the 2x2 system directly       (one shot)
  iterative    ->  gradient descent on E(m,b)           (step downhill)

Fitting a line is multivariable optimization: a convex squared-error bowl, solved either in closed form by the normal equations or iteratively by gradient descent — both land at the same minimum.

Two roads to the same valley

Least squares now reveals a small marvel that ties this guide's two halves together. The exact same best-fit line can be reached two utterly different ways. The normal equations give it in one shot — a closed-form answer, solve the linear system and you are done. Or you can ignore the formula entirely and just run gradient descent on the squared-error bowl: start with any slope and intercept, repeatedly nudge them against the gradient of E, and watch the line creep toward the data. After enough iterations the parameters settle at the very same place the normal equations name directly. One road is algebra, the other is a walk downhill, and because the bowl is convex they cannot disagree.

Why ever choose the slow walk when a formula exists? Honesty about scale. The normal equations require forming and solving a system whose size grows with the number of parameters, and forming A-transpose-A also quietly squares the problem's sensitivity to rounding error, so careful software prefers a QR or singular-value factorization for stability rather than the raw formula. More decisively, when the model has millions of parameters — every modern one does — the closed form becomes impossibly large to assemble, while a single gradient step stays cheap. That is the real reason iterative descent rules machine learning: not that the exact answer is unknown, but that walking to it, one cheap step at a time, is the only thing that scales.

Step back and see the whole arc of this rung. You began by finding the flat spots where the gradient vanishes and classifying them with the Hessian; you learned to optimize under constraints with Lagrange multipliers; and now you have turned the gradient from a detector into a vehicle, riding it downhill to train models and to fit data. Critical points, the second-derivative test, constraints, and descent are not four separate tricks — they are one calculus of finding the best, told from four angles. That single idea, scaled up, is the engine inside modern engineering and machine learning alike.