Activation Functions

The flat-world problem

You already know what a single [[artificial-neuron|artificial neuron]] does: it takes its inputs, multiplies each by a [[weight|weight]], adds them up, adds a [[bias-term|bias]], and passes that one number along. Every operation in that list is a straight-line operation: scaling and adding, nothing more. That is what a [[linear-map|linear map]] does (strictly, an affine map once the bias is added), and it has a deceptively dull property: do one after another, and you never get anything new.

Here is the punchline that surprises most people the first time. Imagine you stack ten layers of these neurons, each layer feeding the next, no activations in between: a deep, impressive-looking network. Mathematically, that whole tower collapses. Multiply ten weight matrices together and you just get one matrix, and the ten biases fold into a single bias; the ten layers behave *identically* to a single layer. All that depth, and the most it can ever draw is a flat plane through your data. It cannot bend.
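The collapse is easy to check by hand. Here is a minimal sketch in plain Python, assuming two tiny 2-by-2 layers with arbitrary made-up weights and biases (the helper names `matvec`, `matmul`, and `layer` are illustrative, not from any library). Two stacked layers with no activation between them give the same output as one merged layer.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def layer(W, b, x):
    """One layer with no activation: W x + b."""
    return [s + bi for s, bi in zip(matvec(W, x), b)]

# Illustrative weights and biases, chosen arbitrarily.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]

# The two layers folded into one: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = [s + bi for s, bi in zip(matvec(W2, b1), b2)]

x = [3.0, -2.0]
stacked = layer(W2, b2, layer(W1, b1, x))  # two layers, one after the other
merged = layer(W, b, x)                    # the single equivalent layer

print(stacked, merged)  # both are [-2.0, 8.5]
```

The same folding works for ten layers as for two, which is why depth alone buys nothing here.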

The fix is almost embarrassingly small. After each neuron computes its weighted sum, we pass that number through one more step — a simple, *non-straight* function — before handing it on. That step is the [[activation-function|activation function]]. It is the kink in the hose, the little bend that, repeated across a layer, lets the next layer build a fold, and the layer after that build a fold-of-folds. Bend by bend, the network gains the freedom to trace any curve at all.
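As a rough sketch of that extra step, assuming a hypothetical helper called `neuron` and made-up inputs and weights, the same weighted sum can be passed on unchanged or bent by an activation (tanh is used here only as an example of a non-straight function):

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum plus bias, then one extra step: the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs, weights, bias = [1.0, 2.0], [0.5, -1.0], 0.5

# Without a bend, the neuron just reports its weighted sum.
out_linear = neuron(inputs, weights, bias, lambda z: z)

# With a bend, the same sum is reshaped before being handed on.
out_bent = neuron(inputs, weights, bias, math.tanh)

print(out_linear, out_bent)
```

Everything before the call to `activation` is the straight-line part; the activation is the only place a bend can enter.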

Why a bend is enough

It feels like cheating that one humble bend per neuron should buy so much. But there is a real theorem behind it. The [[universal-approximation-theorem|universal approximation theorem]] says that a network with just one [[hidden-layer|hidden layer]] of these bent neurons, given enough of them, can approximate *any* continuous function over a bounded range of inputs, to any accuracy you ask for. The theorem promises that such a network exists; it does not promise that training will find it, or that "enough" is a small number. Still, non-linearity is the single ingredient that unlocks this. It is the difference between a tool that can only draw straight lines and one that can draw anything.
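To make the idea concrete, here is a hand-built sketch rather than a trained network. It assumes a one-input target (x squared on the interval 0 to 1) and wires up one hidden layer of ReLU-style bent neurons so that their weighted sum traces the curve with straight segments. The names `build_relu_layer` and `max_error` are hypothetical helpers. The point is only that more hidden units give a closer fit.

```python
def relu(z):
    return z if z > 0 else 0.0

def build_relu_layer(f, lo, hi, n_units):
    """Hand-wire n_units bent neurons whose weighted sum interpolates f on [lo, hi]."""
    step = (hi - lo) / n_units
    knots = [lo + i * step for i in range(n_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(n_units)]
    # Each hidden unit computes relu(x - knot). Its output weight is the
    # change in slope that the curve needs at that knot.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    offset = f(lo)  # output bias

    def network(x):
        return offset + sum(c * relu(x - k) for c, k in zip(out_weights, knots))

    return network

def max_error(f, net, lo, hi, samples=1000):
    """Largest gap between f and the network on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

target = lambda x: x * x
err_4 = max_error(target, build_relu_layer(target, 0.0, 1.0, 4), 0.0, 1.0)
err_32 = max_error(target, build_relu_layer(target, 0.0, 1.0, 32), 0.0, 1.0)

print(err_4, err_32)  # the 32-unit layer fits far more closely
```

This construction only illustrates the "given enough of them" clause for one easy function; the theorem itself covers far more general cases.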

One more thing the bend must do, quietly: it has to be *differentiable* (or close enough). Remember that a network learns by nudging weights in the direction that lowers error, and to know that direction we need the slope of every step. So an activation can't just be any wiggle — it must be one whose slope we can compute and pass back. Keep that in mind; it explains why the four classic activations look the way they do.

The squashers: sigmoid and tanh

The oldest bend is the [[sigmoid-function|sigmoid]], a smooth S-shaped curve. Hand it any number — minus a million or plus a million — and it gently squashes the answer into the range 0 to 1. Huge negatives flatten toward 0, huge positives flatten toward 1, and right in the middle it rises steeply through 0.5. Its appeal is intuitive: the output reads like a soft "how on is this neuron?", or a probability. For decades it was the default, and it is still the natural choice for the *final* neuron of a yes/no classifier.

Its close cousin is [[tanh-activation|tanh]], the same S-shape but squashing into −1 to 1 and centered on zero. That zero-centering is a genuine improvement: when a layer's outputs are balanced around zero, the next layer's gradients behave better and learning tends to be quicker. If you reach for a squashing activation in a hidden layer at all, tanh almost always beats sigmoid.
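Both squashers and their slopes fit in a few lines. This is a minimal sketch using only Python's `math` module; the function names are illustrative. The sigmoid is written in a branch-on-sign form so that very large negative inputs do not overflow `math.exp`.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe for large negative x
    return e / (1.0 + e)

def sigmoid_slope(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_slope(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_slope(x), math.tanh(x), tanh_slope(x))
```

At zero the sigmoid outputs 0.5 with slope 0.25, while tanh outputs 0 with slope 1; far from zero, both slopes are close to nothing.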

But both squashers share a quiet flaw that nearly stalled deep learning for years. Look at the flat tails of the S: out there, the curve is almost horizontal, so its slope is almost zero. In a deep stack, the learning signal travels backward by *multiplying* these slopes together, layer after layer. Multiply many near-zero numbers and the signal shrinks to nothing before it reaches the early layers — they stop learning. This is the famous [[vanishing-gradient|vanishing gradient]] problem, and it is the squashers' Achilles heel.
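The arithmetic is stark even in the best case. The sketch below assumes every sigmoid neuron sits at its steepest point, where the slope is 0.25, and it ignores the weights that real backward passes also multiply in. The helper `best_case_signal` is hypothetical, written only to show how fast the product shrinks with depth.

```python
def best_case_signal(depth, max_slope=0.25):
    """Product of `depth` activation slopes, each at sigmoid's maximum."""
    return max_slope ** depth

signals = {depth: best_case_signal(depth) for depth in (1, 5, 10, 20)}

for depth, signal in signals.items():
    print(depth, signal)
```

After ten layers less than a millionth of the signal is left, and that is before any neuron drifts into the flat tails.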

ReLU: the lazy genius

The activation that broke the logjam is almost insultingly simple. The [[relu|ReLU]] — rectified linear unit — does one thing: if the input is positive, pass it through unchanged; if it's negative, output zero. That's the entire function. It looks like a flat floor that suddenly tilts upward at zero. No exponentials, no curves to compute — just "keep the good news, drop the bad."

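def relu(x):
    # Pass positive inputs through unchanged; clamp everything else to zero.
    return x if x > 0 else 0.0

# slope (what backprop sends back):
#   x > 0  ->  1   (full signal passes, undimmed)
#   x < 0  ->  0   (this neuron is 'off', no signal)
#   x == 0 ->  undefined at the kink; implementations conventionally use 0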
#
# sigmoid(x) slope peaks at just 0.25 and fades to ~0 in the tails;
# relu's slope is a clean 1 wherever it's active -> gradients survive depth.

ReLU in one line, and why it dodges vanishing gradients: on the active side its slope is exactly 1, so the learning signal passes back undimmed instead of being squashed toward zero layer by layer.

That clean slope of 1 is the magic. The learning signal flows back through active ReLU neurons without being dimmed, so even very deep stacks keep training. Add that it is dirt-cheap to compute, and you understand why ReLU became the default activation for hidden layers across vision, language, and almost everything else. The bulk of the [[deep-learning|deep learning]] boom sits on this one-line trick.

ReLU is not flawless, and it's worth being honest about its quirk. Because its output and slope are both zero for any negative input, a neuron whose weighted sum ends up negative for every training example gets *stuck* there: its gradient is zero on all of them, so its weights stop updating. People call these "dying ReLUs." The cure is a small tweak (leaky ReLU, GELU, and friends let a trickle through on the negative side), but the headline holds: plain ReLU is still the sensible first thing to try in a hidden layer.
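A leaky ReLU is the smallest version of that tweak. The sketch below assumes a negative-side slope of 0.01, a commonly used default but still a tunable choice; the function names are illustrative.

```python
def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs are scaled down instead of zeroed."""
    return x if x > 0 else negative_slope * x

def leaky_relu_slope(x, negative_slope=0.01):
    """Slope sent backward: 1 on the active side, a small trickle otherwise."""
    return 1.0 if x > 0 else negative_slope

print(leaky_relu(3.0), leaky_relu(-2.0))
print(leaky_relu_slope(3.0), leaky_relu_slope(-2.0))
```

Because the slope on the negative side is small but not zero, a neuron that wanders there still receives some learning signal and can recover.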

Softmax: the committee vote

The other three activations act on one neuron at a time. [[softmax|Softmax]] is the odd one out: it looks at a *whole row* of final neurons at once and turns their raw scores into a set of probabilities that add up to exactly 1. Picture a classifier whose last layer has one neuron per class — cat, dog, bird. Softmax takes their three scores and answers, in effect, "70% cat, 25% dog, 5% bird." It exaggerates the leader and suppresses the rest, but always keeps the total at 100%.
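In code, softmax exponentiates each score and divides by the total. The sketch below is a plain-Python version with made-up scores for the three classes; subtracting the largest score first is a standard trick that avoids overflow without changing the result.

```python
import math

def softmax(scores):
    """Turn a row of raw scores into probabilities that sum to 1."""
    m = max(scores)                             # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw scores for cat, dog, bird.
probs = softmax([2.0, 1.0, -0.5])
print(probs)  # roughly 0.69, 0.25, 0.06
```

The exponential is what exaggerates the leader: a gap of one unit between two raw scores becomes a ratio of about 2.7 between their probabilities.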

That is why softmax lives almost exclusively at the *output* of a multi-class classifier rather than as the activation of ordinary hidden layers. It is the layer that translates a network's internal opinions into a clean, comparable answer. And there's an honest caveat worth carrying: a confident-looking 99% from softmax is *not* a calibrated truth, because a model can be loudly, fluently wrong. Softmax gives you a tidy distribution, not a guarantee that the distribution is right.

Where this fits in the build

Step back and see the shape of what you now have. A neuron does its weighted sum; an activation bends the result; a layer is a row of those; a stack of layers with bends in between is a [[multilayer-perceptron|multilayer perceptron]] — a real, expressive network. The activation is the one component standing between a glorified spreadsheet and a thing that can learn the curve of a face or the grammar of a sentence.

1. An input arrives and flows through the layers: each neuron sums, then bends with its activation. This is the forward pass you met in the last guide.
2. At the end, sigmoid or softmax shapes the raw scores into a usable answer or probability.
3. Error is measured, then sent backward, and crucially, each activation's *slope* decides how much learning signal flows back through it.
4. Weights nudge a little; repeat across the data until the bends settle into a shape that fits.
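The whole loop can be sketched for the smallest possible case: one sigmoid neuron with one input. Everything here is an assumption made for illustration, including the toy data, the squared-error measure, the learning rate, and the hand-written slope formulas. It is not the general mechanism, only a taste of how the activation's slope gates each update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (illustrative): the label is 1 when the input is positive.
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

def mean_squared_error(w, b):
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in data) / len(data)

w, b, learning_rate = 0.0, 0.0, 0.5
loss_before = mean_squared_error(w, b)

for _ in range(200):
    grad_w = grad_b = 0.0
    for x, y in data:
        out = sigmoid(w * x + b)         # forward: sum, then bend
        error_slope = 2.0 * (out - y)    # how the squared error changes with the output
        act_slope = out * (1.0 - out)    # the sigmoid's slope gates the signal
        grad_w += error_slope * act_slope * x
        grad_b += error_slope * act_slope
    w -= learning_rate * grad_w / len(data)  # nudge the weight a little
    b -= learning_rate * grad_b / len(data)  # nudge the bias a little

loss_after = mean_squared_error(w, b)
print(loss_before, loss_after)  # the error drops as the bend settles
```

Notice that `act_slope` multiplies every update: if it were near zero, as in a sigmoid's flat tails, the nudges would shrink with it.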

That third step is the bridge to what comes next. The reason activations had to be differentiable, the reason vanishing gradients mattered, the reason ReLU's clean slope of 1 was a breakthrough — all of it points at one mechanism we have only hinted at: how a network sends its mistakes backward and teaches itself. That mechanism is [[backpropagation|backpropagation]], and it is the subject of the next guide. You now have every piece it acts upon.