Decision Trees & Random Forests

A flowchart the machine writes for itself

You already met linear regression and logistic regression, which draw a single straight boundary through your data. A decision tree takes a completely different, wonderfully human approach: it asks a chain of yes/no questions. "Is the passenger's age under 10? If yes, did they travel in third class? If no, ..." — each answer sends you down a branch, until you reach a leaf that says the final guess. It is exactly the flowchart you might draw by hand to make a decision, except the machine figures out the questions on its own.

How does it pick the questions? Greedily, one at a time. At each node it scans every feature and every possible cut point, and chooses the split that makes the resulting groups as pure as possible, meaning each group leans heavily toward one label. A common measure of impurity is entropy: a node where everyone shares the same label has zero entropy, and an evenly mixed node has the most. The tree picks the split that lowers the size-weighted entropy of the resulting groups the most, then repeats inside each child, deeper and deeper.
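The greedy search can be sketched in a few lines. This is a minimal illustration and not any library's implementation: the helper names `entropy` and `best_split`, the midpoint cut points, and the toy passenger rows are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; 0.0 when every label agrees."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_split(rows, labels):
    """Scan every feature and every midpoint cut.

    Returns (feature_index, threshold, weighted_entropy) for the cut whose
    two groups have the lowest size-weighted entropy.
    """
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= cut]
            right = [y for row, y in zip(rows, labels) if row[f] > cut]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or score < best[2]:
                best = (f, cut, score)
    return best

# Hypothetical toy data: (age, ticket class) -> survived?
rows = [(4, 3), (8, 2), (30, 3), (45, 1), (52, 3), (61, 2)]
labels = ["yes", "yes", "no", "no", "no", "no"]

print(entropy(labels))           # impurity before any split
print(best_split(rows, labels))  # the single best question to ask first
```

On this toy data the winning question is an age cut that separates the two children from the adults, leaving both groups perfectly pure. A real tree would call the same search again inside each group.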

The tree that memorized the answer key

Here is the trouble. If you let a tree keep splitting, it will not stop until every leaf is perfectly pure — often a single training example sitting alone in its own leaf. At that point the tree has not *learned* the pattern; it has memorized the training data, carving out a tiny custom box around each point. This is the overfitting you read about earlier, in its purest form: flawless on data it has seen, embarrassingly wrong on anything new.

A single deep tree sits at one extreme of the bias–variance tradeoff: it has tiny bias but enormous variance. Change a handful of training rows and the whole tree can reshape itself, growing entirely different questions. That instability is the symptom that should make you nervous — a model whose mind changes wildly with the data is a model that has latched onto noise, not signal.

The first defense is pruning: stop the tree early (limit its depth, or require each leaf to hold at least, say, 20 examples), or grow it fully and then snip back branches that do not earn their keep on held-out data. Pruning trades a little training accuracy for much better generalization. It helps — but a lone tree, however carefully trimmed, rarely competes with what comes next.
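Here is a small sketch of the "stop early" flavor of pruning, assuming a toy one-feature tree written from scratch. The names `grow`, `max_depth`, and `min_leaf` are illustrative choices, the data is made up, and the grow-then-snip variant that uses held-out data is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; 0.0 when every label agrees."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def grow(points, depth=0, max_depth=None, min_leaf=1):
    """points: (x, label) pairs sorted by x. Returns a label (leaf) or (cut, left, right)."""
    labels = [y for _, y in points]
    too_deep = max_depth is not None and depth >= max_depth
    if len(set(labels)) == 1 or too_deep:
        return majority(labels)
    best = None
    # Only consider cuts that leave at least min_leaf points on each side.
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # no cut fits between two identical x values
        score = i * entropy(labels[:i]) + (len(labels) - i) * entropy(labels[i:])
        if best is None or score < best[0]:
            best = (score, i)
    if best is None:
        return majority(labels)
    i = best[1]
    cut = (points[i - 1][0] + points[i][0]) / 2
    return (cut,
            grow(points[:i], depth + 1, max_depth, min_leaf),
            grow(points[i:], depth + 1, max_depth, min_leaf))

def predict(tree, x):
    while isinstance(tree, tuple):
        cut, left, right = tree
        tree = left if x <= cut else right
    return tree

def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[1]) + count_leaves(tree[2])

def accuracy(tree, points):
    return sum(predict(tree, x) == y for x, y in points) / len(points)

# Hypothetical data: "low" below 20, "high" from 20 up, with four labels flipped as noise.
flipped = {5, 17, 28, 33}
points = []
for x in range(40):
    is_low = (x < 20) != (x in flipped)
    points.append((x, "low" if is_low else "high"))

full = grow(points)                              # no limits: memorizes the noise
pruned = grow(points, max_depth=3, min_leaf=5)   # stopped early
print(count_leaves(full), accuracy(full, points))
print(count_leaves(pruned), accuracy(pruned, points))
```

The unrestricted tree needs a leaf for every flipped label and scores perfectly on the training points. The restricted tree has at most eight leaves and gives up some training accuracy, which is the trade the paragraph above describes. Whether that trade pays off has to be checked on data the tree has not seen.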

Wisdom of the crowd: bagging

There is an old observation: ask one expert and you might get a wildly wrong answer; ask a thousand and average them, and the errors cancel out. A high-variance tree is exactly such a jittery expert. So instead of fighting to stabilize one tree, we embrace its wobble and average many of them. This is the heart of ensemble learning — combining many imperfect models into one stronger whole.

But if we trained every tree on the same data, we would just get the same tree a thousand times — no diversity, no cancellation. Bagging (short for *bootstrap aggregating*) creates the diversity. For each tree, it draws a random sample of the training rows *with replacement* — so some rows appear twice, others not at all. Each tree thus sees a slightly different world, learns slightly different questions, and makes slightly different mistakes.
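Sampling with replacement is easy to try for yourself. The sketch below uses only Python's standard library; the helper name `bootstrap_indices`, the seed, and the row count are arbitrary choices for illustration.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_rows = 10_000
sample = bootstrap_indices(n_rows, rng)

unique_fraction = len(set(sample)) / n_rows
print(f"{unique_fraction:.3f} of the rows appear at least once")
```

For a large dataset the fraction settles near 1 - 1/e, about 63%, so each tree trains on roughly two thirds of the distinct rows and never sees the rest.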

Then you aggregate: for classification, the trees vote and the majority wins; for regression, you average their numbers. To the extent that the trees' errors point in different directions, they tend to cancel, while their shared, genuine signal reinforces. The math is gentle but powerful: averaging many independent estimates shrinks variance without raising bias, and the less correlated the trees are, the closer bagging gets to that ideal. The crowd is steadier than any one voice in it.
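A quick simulation makes the variance claim concrete. It is a sketch under an idealized assumption: each "expert" is the true value plus independent Gaussian noise, which real trees only approximate. The names and numbers are invented for the illustration.

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 10.0

def noisy_estimate():
    """One unbiased but jittery expert: the truth plus Gaussian noise (std 2)."""
    return TRUE_VALUE + rng.gauss(0.0, 2.0)

def crowd_estimate(n_experts):
    """Average the answers of n_experts independent experts."""
    return statistics.fmean(noisy_estimate() for _ in range(n_experts))

singles = [noisy_estimate() for _ in range(2000)]
crowds = [crowd_estimate(25) for _ in range(2000)]

print("variance of one expert:   ", statistics.variance(singles))
print("variance of a crowd of 25:", statistics.variance(crowds))
print("mean of the crowd answers:", statistics.fmean(crowds))
```

With independent experts the variance of the average falls roughly in proportion to the crowd size, while the mean stays centered on the true value, so nothing is paid in bias.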

The random forest's extra twist

A random forest is bagging applied to decision trees, plus one clever extra ingredient. Bagging alone has a hidden weakness: if one feature is very strong, almost every tree will pick it for its first split, so the trees end up looking alike — and similar trees make *correlated* mistakes, which average poorly. The forest breaks this herd behavior.

The twist: at *every split*, each tree is allowed to consider only a random subset of the features (a common choice is the square root of the total). A tree might be forbidden from using the dominant feature at a given node, forcing it to discover the second- and third-best questions instead. The result is a forest of genuinely *different* trees whose errors are far less correlated — and that decorrelation is precisely what makes the averaging work so well.
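Two small sketches show the idea. The first picks the random handful of features for one split; the second uses the standard formula for the variance of an average of equally correlated estimates to show why decorrelation matters. The function names and the floor-of-square-root rule are assumptions for illustration, not a particular library's behavior.

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick about sqrt(n_features) feature indices to consider at one split."""
    k = max(1, math.isqrt(n_features))
    return sorted(rng.sample(range(n_features), k))

def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees estimates that each have variance
    sigma2 and share a pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

rng = random.Random(0)
print(candidate_features(16, rng))   # four of the 16 feature indices

for rho in (0.9, 0.5, 0.1):
    print(rho, round(ensemble_variance(1.0, rho, 500), 4))
```

The first term of the formula does not shrink no matter how many trees you add. With highly correlated trees the average stays almost as noisy as a single tree, and lowering the correlation is the only way to lower that floor.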

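import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # stands in for grow_tree

# X: 2-D NumPy array of features, y: 1-D NumPy array of labels
def fit_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for t in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))  # bagging: rows drawn with replacement
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)  # sqrt(F) random features per split
        forest.append(tree.fit(X[rows], y[rows]))
    return forest

def predict(forest, x):
    votes = [tree.predict(x.reshape(1, -1))[0] for tree in forest]
    return Counter(votes).most_common(1)[0][0]  # majority vote; use the mean for regression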

A random forest in a nutshell: many trees, each on a bootstrap sample, each split picking from a random handful of features — then they vote.

A lovely free bonus: the rows left out of each tree's bootstrap sample (the *out-of-bag* examples, roughly a third of the data for any given tree) form a built-in test set for that tree. Predict each row using only the trees that never saw it, score those predictions, and you get an honest error estimate almost for free, without setting aside a separate validation split.
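The bookkeeping behind an out-of-bag estimate can be sketched with a deliberately tiny base learner. To keep the example short it bags one-question stumps on a single feature instead of full trees, so it is bagging, not a complete random forest. The data, the 10% label noise, and all helper names are made up for the illustration.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the cut on x that misclassifies the fewest training points.

    Returns (cut, label_left, label_right).
    """
    best = None
    xs = sorted({x for x, _ in points})
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = Counter(y for x, y in points if x <= cut)
        right = Counter(y for x, y in points if x > cut)
        left_label, left_hits = left.most_common(1)[0]
        right_label, right_hits = right.most_common(1)[0]
        errors = len(points) - left_hits - right_hits
        if best is None or errors < best[0]:
            best = (errors, cut, left_label, right_label)
    return best[1:]

def predict_stump(stump, x):
    cut, left_label, right_label = stump
    return left_label if x <= cut else right_label

rng = random.Random(0)
# Hypothetical data: label is 1 when x > 0.5, with about 10% of labels flipped.
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

n = len(data)
oob_votes = [Counter() for _ in range(n)]
for _ in range(50):
    picked = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of row indices
    stump = fit_stump([data[i] for i in picked])
    for i in set(range(n)) - set(picked):           # rows this stump never saw
        oob_votes[i][predict_stump(stump, data[i][0])] += 1

# Score each row using only the votes from learners that did not train on it.
scored = [(votes.most_common(1)[0][0], data[i][1])
          for i, votes in enumerate(oob_votes) if votes]
oob_accuracy = sum(pred == truth for pred, truth in scored) / len(scored)
print(f"OOB accuracy over {len(scored)} rows: {oob_accuracy:.2f}")
```

Because about one label in ten is pure noise here, no model can do much better than 90% on this data, and the out-of-bag score should land in that neighborhood without any row having been held back.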

When forests win — and an honest caveat

Random forests remain one of the most reliable workhorses in supervised learning, and they are still many practitioners' first move on messy tabular data: spreadsheets of mixed numbers and categories. They need no feature scaling, since a split only cares about the order of a feature's values. They shrug off outliers and irrelevant columns, do not overfit more as you add trees (extra trees mostly cost compute), and work well out of the box with little tuning. On the structured data that fills most businesses, a forest often matches or beats a neural network while needing a fraction of the data and effort.

Be honest about the limits, though. A forest sacrifices the single tree's prized readability: you traded one explainable flowchart for a committee of hundreds. It also cannot extrapolate. In regression, every prediction is an average of the training targets that landed in the same leaves, so a forest can never output a value beyond the range of targets it saw in training. That makes it poor at continuing a trend past the edge of the data, which is what genuine forecasting demands. And for perceptual data (images, audio, raw text), where the signal lives in spatial or sequential structure, the deep learning methods later in this ladder pull decisively ahead.
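The extrapolation limit is easy to see with a single regression tree, sketched here from scratch on one feature. A forest averages many such trees, so it inherits the same ceiling. The helper names, the `min_leaf` setting, and the perfectly linear toy data are assumptions for the illustration.

```python
from statistics import fmean

def sse(ys):
    """Sum of squared errors around the mean."""
    m = fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(points, min_leaf=2):
    """points: (x, y) pairs sorted by x. Returns a float (leaf mean) or (cut, left, right)."""
    ys = [y for _, y in points]
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # no cut fits between two identical x values
        score = sse(ys[:i]) + sse(ys[i:])
        if best is None or score < best[0]:
            best = (score, i)
    if best is None:
        return fmean(ys)  # too few points to split: predict the leaf's mean target
    i = best[1]
    cut = (points[i - 1][0] + points[i][0]) / 2
    return (cut, grow(points[:i], min_leaf), grow(points[i:], min_leaf))

def predict(tree, x):
    while isinstance(tree, tuple):
        cut, left, right = tree
        tree = left if x <= cut else right
    return tree

# Hypothetical training data: a perfectly linear trend y = 2x for x = 0..9.
train = [(float(x), 2.0 * x) for x in range(10)]
tree = grow(train)

print(predict(tree, 9.0))     # inside the training range
print(predict(tree, 100.0))   # far outside it; the true trend would give 200
```

Both queries fall into the same right-most leaf, so the tree answers x = 100 exactly as it answers x = 9, and that answer can never exceed the largest target it saw in training.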