k-Nearest Neighbors & Naive Bayes

Two algorithms that barely deserve the name

You've already met models that *learn* — linear regression fits a line, a decision tree grows splits, a random forest votes across many trees. This guide introduces two methods that feel almost like cheating, because one of them does no real training at all, and the other does only a little counting. They are k-nearest neighbors and naive Bayes — and despite their simplicity, both are still reached for daily in real systems.

Both are forms of supervised learning: you hand them labeled examples — each a feature vector with a known label — and they learn to label new points. What makes them special is *how little machinery* they need. They are perfect for building a baseline: the first thing you try before you let anyone wheel in a neural network.

k-NN: just ask your neighbors

The idea behind k-nearest neighbors is something you already do. To guess whether a new restaurant is good, you look at the most similar restaurants you know and go with the majority verdict. k-NN does exactly that: to classify a new point, it finds the *k* training examples closest to it and lets them vote. Want to predict a number instead of a class? Average the neighbors' values. That's the whole algorithm.

"Closest" means a distance in feature space — usually plain Euclidean distance, the straight-line gap you'd measure with a ruler if each feature were an axis. The choice of *k* is a hyperparameter you tune: small *k* (say 1) hugs every wiggle of the data and tends to overfit noise; large *k* smooths things out but can blur real boundaries — a clean illustration of the bias–variance tradeoff you met earlier.

predict(x, k):
    dists = [ distance(x, xi) for (xi, yi) in training_set ]
    nearest = the k examples with smallest dists
    return majority_vote(label of each in nearest)   # or average, for regression

k-NN in four lines. Notice there is no training loop — the work happens entirely at prediction time.

Where k-NN quietly breaks

k-NN has two honest weaknesses worth knowing before you trust it. First, *scale matters*: if one feature is measured in dollars (0–100,000) and another in years (0–80), the dollar axis dominates the distance and the year feature is ignored. Always apply feature scaling first — this is not optional polish, it changes the answer.

Second, and more subtly, k-NN suffers badly from the curse of dimensionality. As you add features, the space inflates so fast that *every* point becomes roughly equidistant from every other — "nearest" loses its meaning. In hundreds or thousands of dimensions, the comforting intuition of "look at who's nearby" simply stops working, which is one reason k-NN shines on low-dimensional, tidy problems and fades on raw high-dimensional data like images.

None of this makes k-NN obsolete — far from it. Its core trick (find similar items, copy their answer) is exactly how modern retrieval works: turn things into embeddings and search a vector database for nearest neighbors. So the humble idea you're learning here scales up, in disguise, to the heart of today's largest systems.

Naive Bayes: counting your way to a guess

Naive Bayes comes from a different angle: probability. It rests on Bayes' theorem, which tells you how to flip a conditional probability. We want P(class | features) — the chance an email is spam given the words it contains. Bayes' theorem rewrites this in terms of things we *can* count from training data: how often spam emails contain each word, and how common spam is overall.

The catch is computing the probability of a whole *combination* of features at once — that needs impossibly much data. So naive Bayes makes one bold, deliberately wrong assumption: that all features are independent given the class. In the spam example, it pretends the word "free" tells you nothing about whether "money" also appears, once you already know the email is spam. That lets it multiply each feature's probability separately — turning a hopeless joint estimate into a handful of easy counts.

"Naive" is the algorithm admitting this assumption is false — words in real text are obviously correlated. The genuinely surprising fact is that it often *doesn't matter*. Even when the independence assumption is badly violated, the class with the highest score is frequently still the right one, because the errors tend to inflate scores without flipping which class wins. This is why naive Bayes remains a famously strong text-classification baseline.

When simple wins

It's tempting to assume bigger models always win, but the no free lunch theorem says no single method dominates across all problems. What decides the winner is the match between a method's built-in assumptions — its inductive bias — and the actual shape of your data. Naive Bayes assumes feature independence; k-NN assumes nearby points share labels. When those assumptions roughly hold, these tiny models can match or beat far heavier ones.

Simple methods win in concrete situations: when you have little labeled data (a deep network would just overfit); when you need an answer in milliseconds with no GPU; when someone must *explain* the decision to a regulator or a customer; or when you simply need a baseline to prove a fancier model earns its keep. A naive Bayes spam filter or a k-NN recommender can ship in an afternoon and run on a phone.

The honest framing is not "simple vs. powerful" but "matched vs. mismatched." Start with the cheapest thing that could work, measure it properly, and let the data tell you whether complexity is worth it. More often than newcomers expect, the answer is no — and that restraint is itself a mark of skill.