How Machines Actually "Learn"

Two ways to make a computer do something

Earlier in this rung you met AI as a field and saw how it differs from ordinary software. Now we zoom in on the single move that powers almost all of today's AI — and the honest place to begin is by contrasting it with the older way. For most of computing history, to make a machine do a task you wrote rules: explicit, step-by-step instructions that a programmer thought through in advance. "If the email contains the word 'lottery', mark it as spam." The intelligence lives entirely in the human who wrote the rules; the computer just obeys.
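To make the rule-writing approach concrete, here is a minimal sketch of the "lottery" rule as code. The function name `is_spam` and the word list `SPAM_WORDS` are made up for illustration; the point is only that every decision was put there by a person.

```python
# Hand-written rule: the programmer decided in advance what counts as spam.
SPAM_WORDS = {"lottery"}  # hypothetical word list, chosen by a human


def is_spam(email_text: str) -> bool:
    """Return True if any listed word appears in the email."""
    words = email_text.lower().split()
    # Strip common punctuation so "lottery!" still matches "lottery".
    return any(word.strip(".,!?'\"") in SPAM_WORDS for word in words)


print(is_spam("You won the lottery!"))      # True
print(is_spam("Lunch at noon tomorrow?"))   # False
```

Nothing here was learned from data. To catch a new kind of spam, a human has to edit the list.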

This rule-writing approach is the heart of classic symbolic AI and the expert systems that handcrafted thousands of if-then rules from human knowledge. It works beautifully when the task is clean and the rules are knowable. But think about recognising a cat in a photo, or telling a friendly message from a sarcastic one. Could you write down every rule? Suppose you try "has whiskers": plenty of photos show a cat from behind, or half hidden by a couch, with no whiskers in view. The truth is you recognise a cat effortlessly but cannot fully explain how, so you cannot hand those rules to a computer.

Machine learning flips the arrow. Instead of writing the rules yourself, you show the machine many examples — thousands of photos already labelled "cat" or "not cat" — and let a procedure find the rules for you. The programmer no longer specifies *what* makes a cat a cat; they build a system that discovers that pattern from the data. Same goal, opposite direction: rules out of examples, rather than answers out of rules.

Examples, features, and labels

To learn from examples, a machine needs the examples in a form it can chew on: numbers. Take predicting a house's price. Each house becomes a row of measurable inputs — size in square metres, number of bedrooms, age, distance to the city centre. Each such input is a feature. The thing you want to predict — the price — is the label (also called the target). One house, with its features and its known price, is a single training example.
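As a rough picture of what "a row of measurable inputs" looks like in practice, the sketch below stores three houses as lists of numbers. The feature order, the variable names, and every value are invented for illustration.

```python
# Feature order (assumed): size in m^2, bedrooms, age in years, km to centre.
FEATURE_NAMES = ["size_m2", "bedrooms", "age_years", "km_to_centre"]

# Each row is one house; the numbers are invented for illustration.
features = [
    [120.0, 3, 15, 4.0],
    [65.0, 2, 40, 1.5],
    [200.0, 5, 5, 12.0],
]
labels = [450_000.0, 310_000.0, 620_000.0]  # known prices (invented)

# One training example = one row of features paired with its label.
training_examples = list(zip(features, labels))
first_features, first_label = training_examples[0]
print(dict(zip(FEATURE_NAMES, first_features)), "->", first_label)
```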

When every example comes with a known label — every house already has a price tag, every photo already says cat or not — the machine can compare its guesses against the right answers and correct itself. This is supervised learning, by far the most common and best-understood flavour, and the one we will follow for the rest of this guide. (There are other flavours — learning patterns from unlabelled data, or learning from trial-and-error rewards — but you'll meet those in the next guide.)

A model is a machine with adjustable knobs

At the centre of all this sits a model. Don't picture anything mysterious: a model is just a mathematical function with some adjustable numbers inside. It takes the features in and produces a prediction out. Those adjustable numbers are the parameters. Learning is nothing more — and nothing less — than finding good values for them.

The simplest honest example: predict price as a weighted sum of features. price = w1 × size + w2 × bedrooms + ... + b. Each w is a parameter (a knob saying how much that feature matters) and b is an offset, itself one more learned number. Turn the knobs and the same model becomes a different price-predictor. A modern language model is the very same idea scaled up grotesquely: instead of a handful of knobs it can have hundreds of billions, but each one is still just a number being tuned.
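The weighted sum can be written as a few lines of code. This is a sketch, not a real price model: `predict_price` is a name chosen here, and both sets of knob values are made up to show that the same function with different numbers gives a different predictor.

```python
def predict_price(features, weights, offset):
    """Weighted sum of the features plus an offset."""
    return sum(w * x for w, x in zip(weights, features)) + offset


house = [120.0, 3, 15, 4.0]  # size, bedrooms, age, km to centre (invented)

# Two different knob settings, both made up, give two different predictors.
knobs_a = ([3000.0, 10000.0, -500.0, -2000.0], 50000.0)
knobs_b = ([2500.0, 20000.0, -1000.0, -5000.0], 80000.0)

print(predict_price(house, *knobs_a))   # 424500.0
print(predict_price(house, *knobs_b))   # 405000.0
```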

It helps to separate two kinds of numbers. Parameters are what the machine sets *during* learning — the w's and b above. A hyperparameter is a choice *you* make before learning starts — how many knobs to allow, how big a step to take when adjusting them. Parameters are learned; hyperparameters are dialled in by the human running the show. Mixing the two up is one of the most common beginner confusions.
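One way to keep the two apart is to notice where each number comes from in code. In this sketch (all names and values are assumptions for illustration), the hyperparameters are typed in by hand, while the parameters are only given random starting values that training would later overwrite.

```python
import random

# Hyperparameters: chosen by the person before training starts.
step_size = 0.01      # how big a nudge each adjustment makes
num_features = 4      # how many weights ("knobs") the model is allowed

# Parameters: start random, and training is what will change them.
random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(num_features)]
offset = random.uniform(-1, 1)

print("hyperparameters:", {"step_size": step_size, "num_features": num_features})
print("parameters:", weights, offset)
```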

Training: tuning the knobs to fit the data

So how do the knobs get set? You start with random values — the model's first guesses are nonsense. You feed it a training example, it predicts, and you measure how wrong it is: the gap between its prediction and the true label. Average that wrongness across the data and you have a single number called the loss — lower is better. Training is the search for parameter values that make the loss small. This is what people mean by "fitting a model to the data."
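Here is one common way to turn "how wrong" into a single number: square each gap and average them (mean squared error). The text above does not commit to a particular loss, so treat this choice, and the invented numbers, as one reasonable assumption among several.

```python
def mean_squared_error(predictions, labels):
    """Average of the squared gaps between predictions and true labels."""
    gaps = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(gaps) / len(gaps)


labels = [300.0, 450.0, 500.0]          # true prices in thousands (invented)
wild_guesses = [100.0, 900.0, 200.0]    # what an untrained model might output
close_guesses = [310.0, 440.0, 505.0]   # what a better-tuned model might output

print(mean_squared_error(wild_guesses, labels))    # large: very wrong
print(mean_squared_error(close_guesses, labels))   # 75.0: much less wrong
```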

The search itself is wonderfully mechanical. For each knob you ask: if I nudge this one up a little, does the loss go up or down? Then you nudge every knob the way that lowers the loss, just a touch, and repeat — thousands or millions of times. That patient downhill walk is gradient descent, and some version of it powers nearly all modern training. No insight, no understanding — just "which tiny adjustment makes me less wrong," done relentlessly.
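The "nudge it and see" question can be asked literally. The sketch below uses a one-feature model and invented data, bumps each knob by a tiny amount, and checks how the loss changes. Real training systems get the same slopes with calculus rather than trial nudges, so read this as an illustration of the idea, not of how it is done at scale.

```python
def loss(weight, offset, sizes, prices):
    """Mean squared error of the one-feature model price = weight * size + offset."""
    errors = [(weight * s + offset - p) ** 2 for s, p in zip(sizes, prices)]
    return sum(errors) / len(errors)


sizes = [1.0, 2.0, 3.0]      # invented data that follows price = 2 * size + 1
prices = [3.0, 5.0, 7.0]
weight, offset = 0.0, 0.0    # current knob settings
nudge = 1e-4

# Ask the question for each knob: does a small nudge up raise or lower the loss?
base = loss(weight, offset, sizes, prices)
slope_weight = (loss(weight + nudge, offset, sizes, prices) - base) / nudge
slope_offset = (loss(weight, offset + nudge, sizes, prices) - base) / nudge

# A negative slope means nudging up lowers the loss, so "up" is downhill here.
print(slope_weight, slope_offset)
```

Both slopes come out negative, which says both knobs should be turned up a little. That matches the data, where the true weight and offset are above zero.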

1. Start the parameters at random values.
2. Run examples through the model to get predictions.
3. Measure the loss: how far the predictions are from the true labels.
4. Nudge every parameter slightly in the direction that lowers the loss.
5. Repeat steps 2-4 until the loss stops improving much.

params = random()
repeat many times:
    pred  = model(features, params)
    loss  = how_wrong(pred, labels)
    grad  = slope_of(loss, params)       # direction in which the loss rises
    params = params - step_size * grad   # step the opposite way: downhill
# params now "fit" the training data

The whole training loop in spirit: predict, measure error, nudge the parameters downhill, repeat.
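For readers who want to run the loop, here is a small self-contained version. It is a sketch under several assumptions that are not in the text above: a one-feature model, mean squared error as the loss, invented data that follows price = 2 × size + 1 exactly, and hand-picked values for `step_size` and `steps`.

```python
import random

# Invented data that follows price = 2 * size + 1 exactly.
sizes = [0.0, 1.0, 2.0, 3.0, 4.0]
prices = [1.0, 3.0, 5.0, 7.0, 9.0]


def train(sizes, prices, step_size=0.05, steps=2000, seed=0):
    """Fit price = w * size + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)   # start at random values
    n = len(sizes)
    for _ in range(steps):                           # repeat many times
        preds = [w * s + b for s in sizes]           # predict
        errors = [p - y for p, y in zip(preds, prices)]
        # Slopes of the loss with respect to each parameter.
        grad_w = 2 * sum(e * s for e, s in zip(errors, sizes)) / n
        grad_b = 2 * sum(errors) / n
        w -= step_size * grad_w                      # nudge downhill
        b -= step_size * grad_b
    return w, b


w, b = train(sizes, prices)
print(round(w, 3), round(b, 3))   # should land very close to 2 and 1
```

The loop never "knows" the rule behind the data. It only keeps making itself a little less wrong until the knobs settle near the values that generated the prices.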

Training versus using — and the trap of memorising

Once the knobs are set, training is over and the model can be *used*: hand it a brand-new house it never saw, and it produces a price in a single forward pass — no more knob-turning. This split is training versus inference. Training is the slow, expensive, one-time (or occasional) process of finding the parameters; inference is the fast, cheap, repeated act of running the finished model on new inputs. The chatbot answering you is doing inference; its months of training already happened, and its knobs are frozen.
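In code, the difference shows up as what is allowed to change. In this sketch the parameters are stored as constants, as if training had already finished (the values are invented), and inference is a single calculation that reads them without touching them.

```python
# Frozen parameters, as if training had already finished (values invented).
TRAINED_WEIGHTS = (3000.0, 10000.0, -500.0, -2000.0)
TRAINED_OFFSET = 50000.0


def infer_price(features):
    """One forward pass: read the parameters, never change them."""
    return sum(w * x for w, x in zip(TRAINED_WEIGHTS, features)) + TRAINED_OFFSET


new_house = [90.0, 2, 10, 3.0]   # a house the model never saw (invented)
print(infer_price(new_house))    # 329000.0
```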

This is also where honesty matters most. The goal was never to do well on the houses we already know the prices of — we wanted to predict *new* ones. A model that simply memorises the training data, like a student who memorises the answer key without learning the subject, scores perfectly on what it saw and flops on anything new. That failure is overfitting, and fighting it is one of the central struggles of the whole field.
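A deliberately exaggerated sketch makes the trap visible. The "memoriser" below is just a lookup table of the training answers, and the "simple line" stands in for a model that picked up the underlying pattern. All data and both models are invented, and real overfitting is usually subtler than this caricature.

```python
# Invented data: price is roughly 2 * size + 1, with a little noise.
train_sizes = [1.0, 2.0, 3.0, 4.0]
train_prices = [3.1, 4.9, 7.2, 8.8]
new_sizes = [2.5, 5.0]
new_prices = [6.0, 11.0]

# Memoriser: a lookup table of the training answers, with a fixed fallback guess.
answer_key = dict(zip(train_sizes, train_prices))


def memoriser(size):
    return answer_key.get(size, 0.0)   # nothing useful to say about unseen sizes


def simple_line(size):
    return 2.0 * size + 1.0            # the general pattern (assumed already learned)


def mean_abs_error(model, sizes, prices):
    """Average absolute gap between a model's predictions and the true prices."""
    return sum(abs(model(s) - p) for s, p in zip(sizes, prices)) / len(sizes)


for name, model in [("memoriser", memoriser), ("simple line", simple_line)]:
    print(name,
          "| train error:", round(mean_abs_error(model, train_sizes, train_prices), 2),
          "| new-data error:", round(mean_abs_error(model, new_sizes, new_prices), 2))
```

The memoriser has zero error on the houses it saw and a large error on the new ones. The simple line is slightly off on the training data but holds up on houses it never saw, which is the behaviour we actually wanted.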

So when someone says a machine "learned," you can now translate it precisely: a model with adjustable parameters was shown labelled examples, and a downhill search tuned those parameters until its predictions on the examples grew accurate — ideally in a way that carries over to new cases. No magic, no understanding, no inner life. Just data, a function, and a great many small corrections. Hold onto that picture; everything heavier in this ladder is built on it.