Fitting a line: the whole idea in one picture
You already know the setup from earlier in this ladder: supervised learning hands you a dataset of examples, each with input features and a known answer, the label. Linear regression is the simplest honest answer to one question — given the features, predict a *number*. Picture a scatter of dots: house size on one axis, price on the other. Linear regression draws the single straight line that threads through the cloud as snugly as possible, and that line is your model.
A line is just `y = w*x + b`. The slope `w` is a weight — how many dollars per extra square meter — and `b` is the intercept, the baseline price when the feature is zero. With many features you simply stack more terms: `y = w1*x1 + w2*x2 + ... + b`. That is the entire model. It has only a handful of parameters, and learning *is* finding the numbers `w` and `b` that make the line fit.
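To make the stacked-terms formula concrete, here is a minimal sketch of the prediction step. The function name `predict` and every number below are made up for illustration; they are not learned from any real data.

```python
def predict(features, weights, bias):
    """Return w1*x1 + w2*x2 + ... + b for one example."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical parameters: 3000 dollars per square meter,
# 18000 dollars per bedroom, and a 50000 dollar baseline.
weights = [3000.0, 18000.0]
bias = 50000.0

house = [80.0, 2.0]  # 80 square meters, 2 bedrooms
price = predict(house, weights, bias)
print(price)  # 326000.0
```

Learning never changes this function; it only changes the numbers stored in `weights` and `bias`.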
What 'best fit' means, and how the line learns
Out of the infinitely many lines you could draw, which is *best*? We need a number that scores how wrong a line is: a loss function. For linear regression the classic choice is mean squared error: for each point, take the gap between the line's guess and the true value, square it (so over- and under-shooting both hurt, and big misses hurt a lot), then average. The best line is the one that makes this average squared error as small as possible.
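The scoring recipe above can be sketched in a few lines. The data points are invented so that they lie exactly on `y = 2x + 1`, which makes the arithmetic easy to check by hand.

```python
def mean_squared_error(xs, ys, w, b):
    """Average of the squared gaps between the line's guesses and the truth."""
    squared_gaps = [(w * x + b - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_gaps) / len(squared_gaps)


# Toy data, chosen to sit exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

perfect = mean_squared_error(xs, ys, w=2.0, b=1.0)  # 0.0: the line hits every point
flat = mean_squared_error(xs, ys, w=0.0, b=6.0)     # 5.0: gaps of 3, 1, -1, -3
print(perfect, flat)
```

The flat line at the average height misses by 3, 1, 1 and 3; squaring gives 9, 1, 1 and 9, so the two large misses account for most of the loss.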
How do we find it? Two ways. Because squared error over a linear model forms a smooth bowl-shaped (convex) surface, there is a tidy closed-form formula (the normal equation) that lands on the exact bottom in one shot. But the method that scales, and the one every later model in this field leans on, is gradient descent from the previous rung: start with a random line, nudge `w` and `b` a little downhill on the error surface, repeat. With a small enough step size, either way you reach the same valley, because there is only one.
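Both routes can be compared on a toy problem with one feature. This is a sketch under assumptions: the data is invented, the learning rate and step count were picked by hand for this data, and the descent starts from zero rather than a random line so the run is reproducible.

```python
def closed_form(xs, ys):
    """Exact least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimise mean squared error by repeated small downhill nudges."""
    n = len(xs)
    w, b = 0.0, 0.0  # fixed start instead of a random one, for reproducibility
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Invented, slightly noisy data near y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w_exact, b_exact = closed_form(xs, ys)
w_gd, b_gd = gradient_descent(xs, ys)
print(w_exact, b_exact)
print(w_gd, b_gd)
```

On this data the two answers agree to many decimal places. A learning rate that is too large for the data would make the descent diverge instead, which is why the step size is treated as something to tune.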
From numbers to yes-or-no: logistic regression
Now flip the question. Instead of *how much*, ask *which class*: is this email spam or not, will this patient relapse or not? That is classification. We would love to reuse the line, but a line runs from minus-infinity to plus-infinity, while a yes/no answer should be a probability living between 0 and 1. Logistic regression solves this with one elegant move: compute the usual weighted sum `w*x + b`, then squash it through the sigmoid function.
    z = w1*x1 + w2*x2 + ... + b   # same weighted sum as linear regression
    p = 1 / (1 + exp(-z))         # sigmoid: squashes z into (0, 1)
    # p is now a probability, e.g. P(spam)
    prediction = "yes" if p >= 0.5 else "no"
The sigmoid is an S-curve: very negative scores flatten toward 0, very positive ones toward 1, and the middle passes smoothly through 0.5. We no longer use squared error here; a probability calls for a different ruler, the log loss (cross-entropy), which punishes a model harshly for being confidently wrong. Despite the name, logistic regression *classifies*. The word 'regression' has stuck because under the hood the model still fits a linear score, `z`, which is the log-odds of the positive class.
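A small sketch shows how steeply the log loss rises for a confidently wrong answer. The two probabilities below are made-up model outputs for an example whose true label is 1.

```python
import math


def sigmoid(z):
    """Squash any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_true, p):
    """Cross-entropy for one example; y_true is 0 or 1, p is P(class 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# The true label is 1 in both cases.
hesitant = log_loss(1, 0.6)          # mildly right: small penalty
confident_wrong = log_loss(1, 0.01)  # almost certain of the wrong class
print(round(hesitant, 3), round(confident_wrong, 3))
print(sigmoid(0.0))  # 0.5: a score of zero sits exactly in the middle
```

The second penalty is roughly nine times the first, and it grows without bound as the predicted probability of the true class approaches zero.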
The decision boundary
Where exactly does 'yes' turn into 'no'? At the threshold `p = 0.5`, which happens precisely when the inner score `z = w*x + b` equals zero. That equation, `w*x + b = 0`, marks the decision boundary: the model says 'yes' on one side and 'no' on the other. With a single feature the boundary is just a cutoff point; with two features it is a line; with more it is a flat plane (a hyperplane). Logistic regression can only ever draw a *straight* boundary; this is its defining strength and its honest limit.
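For a model with two features, the boundary can be traced by setting the score to zero and solving for the second feature. The weights below are hypothetical, chosen only to keep the arithmetic simple.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical learned parameters for a two-feature model.
w1, w2, b = 1.0, -2.0, 0.5


def score(x1, x2):
    """The inner linear score z = w1*x1 + w2*x2 + b."""
    return w1 * x1 + w2 * x2 + b


def boundary_x2(x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2


x2_on_boundary = boundary_x2(3.0)
print(x2_on_boundary)                       # 1.75
print(sigmoid(score(3.0, x2_on_boundary)))  # 0.5: exactly on the fence
print(sigmoid(score(3.0, 0.0)))             # above 0.5: the 'yes' side
print(sigmoid(score(3.0, 5.0)))             # below 0.5: the 'no' side
```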
The 0.5 cutoff is a choice, not a law. If missing a fraud is far costlier than a false alarm, slide the threshold down to 0.3 and you catch more positives at the price of more false flags — the trade-off you will later read off a ROC curve. Notice too that the distance from the boundary maps to confidence: points hugging the line get probabilities near 0.5 (a shrug), while points far out push toward 0 or 1 (a firm answer).
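The effect of sliding the threshold can be seen on a handful of made-up predictions. The probabilities and labels here are invented for illustration, and the helper names are not from any library.

```python
def classify(probabilities, threshold):
    """Turn probabilities into 0/1 predictions at the given cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


def true_and_false_positives(predictions, labels):
    """Count caught positives and false alarms."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    return tp, fp


# Invented model outputs and true labels (1 = fraud).
probabilities = [0.05, 0.35, 0.45, 0.55, 0.90]
labels = [0, 1, 0, 1, 1]

at_half = true_and_false_positives(classify(probabilities, 0.5), labels)
at_low = true_and_false_positives(classify(probabilities, 0.3), labels)
print(at_half)  # (2, 0): two frauds caught, no false alarms
print(at_low)   # (3, 1): all three frauds caught, one false alarm
```

Lowering the cutoff catches the fraud that scored 0.35, but it also flags the legitimate case that scored 0.45.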
A straight boundary cannot carve out a circle or a spiral. If your two classes are tangled in a way no line can separate, logistic regression will underfit no matter how long you train — a textbook case of inductive bias, the assumption baked into the model about what answers are even allowed. The fix is not always a neural network: hand-craft a clever feature, or hand the job to a support vector machine or decision tree coming up later in this rung.
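As one illustration of a hand-crafted feature, consider a class that sits inside a circle around the origin. No straight line in `(x1, x2)` separates it, but a single cutoff on the squared radius does. The points and the weights below are hand-picked assumptions, not the result of training.

```python
def radius_squared(x1, x2):
    """Hand-crafted feature: squared distance from the origin."""
    return x1 * x1 + x2 * x2


# Invented points: one class inside the unit circle, the other outside.
inner = [(0.1, 0.2), (-0.3, 0.1), (0.0, -0.4)]
outer = [(1.5, 0.0), (-1.2, 1.1), (0.2, -1.8)]

# Hand-picked linear model on the new feature: score = w * r2 + b.
# The score is zero at r2 = 1, which is a circle in the original space.
w, b = -1.0, 1.0


def is_inner(point):
    return w * radius_squared(*point) + b >= 0


print([is_inner(p) for p in inner])  # all True
print([is_inner(p) for p in outer])  # all False
```

The model is still linear in the feature it is given; the curvature lives entirely in the feature.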
Why these old workhorses still win
The quiet superpower of both models is interpretability. Each learned weight is a sentence you can say out loud: 'holding everything else fixed, one more bedroom adds about $18,000', or 'this word raises the log-odds of spam by 0.4'. You can audit it, defend it to a regulator, and spot when it has latched onto something it shouldn't. Most large models cannot offer that; here it falls out for free, which is why this sits at the heart of explainable AI.
Read a coefficient carefully, though. A big weight can simply be an artifact of units: record the same feature on a smaller numeric scale and its weight grows to compensate, which is why you usually apply feature scaling first so the numbers are comparable. And a weight reports *association*, not cause: 'ice cream sales predict drownings' is a true regression, a false story. Conflating the two is one of the most common abuses of these models.
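The unit effect is easy to reproduce with a one-feature fit. In this sketch the house sizes and prices are invented, and standardization (subtract the mean, divide by the standard deviation) stands in for feature scaling in general.

```python
import statistics


def slope(xs, ys):
    """Least-squares slope for a single feature."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def standardize(xs):
    """Rescale to zero mean and unit standard deviation."""
    mean = statistics.fmean(xs)
    spread = statistics.pstdev(xs)
    return [(x - mean) / spread for x in xs]


# Invented data: the same houses measured in two units.
sizes_m2 = [50.0, 80.0, 120.0, 150.0]
prices = [160.0, 250.0, 350.0, 460.0]  # thousands of dollars
sizes_ft2 = [s * 10.7639 for s in sizes_m2]

slope_m2 = slope(sizes_m2, prices)
slope_ft2 = slope(sizes_ft2, prices)  # smaller by the conversion factor
slope_std_m2 = slope(standardize(sizes_m2), prices)
slope_std_ft2 = slope(standardize(sizes_ft2), prices)
print(slope_m2, slope_ft2)
print(slope_std_m2, slope_std_ft2)
```

The raw weights differ by the unit conversion factor even though they describe the same houses; after standardization the two weights coincide, so they can be compared with the weights of other standardized features.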
With few parameters relative to the amount of data, these models are less prone to overfitting than more flexible ones, and a light touch of ridge or lasso regularization keeps them honest on small, wide, or noisy data, exactly the tabular problems that fill real businesses. They train in seconds, run on a phone, and give a strong baseline that any fancier model must beat to justify itself. Start here. Often, you will not need to go further.
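To show what ridge regularization does to a weight, here is a sketch for the single-feature case. It assumes the objective is the sum of squared errors plus `penalty * w**2`, with the intercept left unpenalized; under that assumption the slope has the closed form used below. The data and penalty values are invented.

```python
def ridge_slope(xs, ys, penalty):
    """Slope minimising sum of squared errors + penalty * w**2 (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + penalty)


# Invented, slightly noisy data near y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

plain = ridge_slope(xs, ys, penalty=0.0)    # ordinary least squares
shrunk = ridge_slope(xs, ys, penalty=10.0)  # pulled toward zero
print(plain, shrunk)
```

With a penalty of zero this is ordinary least squares; a larger penalty shrinks the weight toward zero, trading a little fit on the training data for less sensitivity to noise.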