Stacking Layers: the Multilayer Perceptron

From one neuron to a team

In the last guide you built a single artificial neuron: it takes some inputs, multiplies each by a weight, adds a bias, and pushes the sum through an activation function. That little machine is powerful, but it has one stubborn limit: on its own it can only separate data with a single straight cut. The classic example is XOR: four points arranged so that no single straight line puts the two classes on opposite sides. One neuron simply cannot do it.

The fix is almost insultingly simple: use more than one neuron, and let some neurons listen to other neurons instead of to the raw input. A whole row of neurons working in parallel is a layer. When you stack layers so that the outputs of one feed the inputs of the next, you have built a multilayer perceptron, or MLP — the original, foundational neural network. Everything fancier you have heard of, from image recognizers to chatbots, is a descendant of this idea.

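Why does stacking help? Because the first layer can carve the input into several straight pieces at once, and the next layer can combine those pieces into a bent, kinked shape that no single line could ever make. Two neurons in the first layer plus one to combine them is enough to crack XOR. The activation function is what makes this work: without a nonlinear bend between layers, the whole stack collapses into a single affine map, which is no more capable than one layer. Depth with nonlinearity, in other words, buys you the ability to draw curves out of straight lines.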
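To make the XOR claim concrete, here is a minimal hand-wired sketch. The weights are chosen by hand rather than learned, and the step activation and the names `step` and `xor_net` are illustrative assumptions. One hidden neuron behaves like OR, the other like AND, and the output neuron fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    # Hard threshold activation: 1.0 where z > 0, else 0.0.
    return (z > 0).astype(float)

# Hidden layer: two neurons, each looking at both inputs.
W1 = np.array([[1.0, 1.0],    # neuron 1: fires if x1 OR x2
               [1.0, 1.0]])   # neuron 2: fires if x1 AND x2
b1 = np.array([-0.5, -1.5])

# Output neuron: fires when OR is on and AND is off.
W2 = np.array([[1.0, -1.0]])
b2 = np.array([-0.5])

def xor_net(x):
    h = step(W1 @ x + b1)         # two straight cuts through the input
    return step(W2 @ h + b2)[0]   # one neuron combines them

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    out = xor_net(np.array([x1, x2], dtype=float))
    print(x1, x2, "->", int(out))
```

Each hidden neuron is still just a straight cut; it is the combination in the second layer that produces the XOR pattern.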

Three kinds of layer: input, hidden, output

An MLP is organized into three roles. The input layer is not really neurons at all — it is just the slots where your numbers enter: one slot per feature. If you feed it a 28×28 grayscale image, the input layer has 784 slots, one per pixel. It does no computation; it just holds the values.
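As a small sketch of what "one slot per pixel" means, the snippet below flattens a placeholder image into the vector the input layer holds. The all-zeros array is a stand-in, not real data.

```python
import numpy as np

image = np.zeros((28, 28))   # placeholder 28x28 grayscale image
x = image.reshape(-1)        # flatten row by row into a single vector
print(x.shape)               # (784,)
```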

The output layer is the last row, and its job is to deliver the answer in the shape you need. For a yes/no question you might use one neuron with a sigmoid; for a ten-way choice (which digit is this?) you use ten neurons and a softmax so the outputs read as probabilities that sum to one. The output layer's size is dictated entirely by the problem, not by your taste.
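The two output activations mentioned above can be sketched as follows. The three raw scores are made up for illustration; a ten-way digit classifier would have ten of them.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # made-up raw outputs for three classes
probs = softmax(scores)
print(probs, probs.sum())            # probabilities that sum to one
print(sigmoid(0.0))                  # 0.5: a raw score of zero is a coin flip
```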

Everything in between is a hidden layer — hidden because you never see its values directly; they are internal scratch work. Here is the key shift in thinking: a hidden layer's neurons do not detect things you named in advance. Instead the network *invents* its own intermediate features — maybe one neuron fires for round shapes, another for vertical strokes — and it discovers these on its own during training. The number and size of hidden layers are choices you make: they are hyperparameters, knobs you set before training rather than values the data hands you.

Fully connected: every wire to every neuron

In an MLP each layer is fully connected (also called *dense*): every neuron in a layer receives a wire from *every* value in the layer before it. If the previous layer has 100 outputs and this layer has 50 neurons, that is 100 × 50 = 5,000 weights, plus one bias per neuron. Each of those weights is a number the network will learn.
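The same arithmetic extends to a whole network. The helper below is a small sketch (the name `count_parameters` and the second example's layer sizes are assumptions made for illustration) that totals weights and biases across consecutive fully connected layers.

```python
def count_parameters(layer_sizes):
    """Count weights and biases for fully connected layers.

    layer_sizes lists the width of every layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per wire
        total += n_out          # one bias per neuron
    return total

print(count_parameters([100, 50]))        # 5000 weights + 50 biases = 5050
print(count_parameters([784, 128, 10]))   # hypothetical digit classifier
```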

This is why we love matrices. Instead of tracking 5,000 multiplications one by one, we pack the weights into a grid — a weight matrix — and the whole layer's computation becomes a single matrix multiply followed by adding the bias vector. The linear-algebra rung you climbed earlier was not busywork: a layer literally *is* an affine map (a linear map plus a shift), and the matrix is its compact written form.
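To see that the matrix form really is the same computation, this sketch compares a neuron-by-neuron loop with a single matrix multiply. The random values stand in for learned weights, and the layout (one row of `W` per neuron) is a convention assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)          # outputs of the previous layer
W = rng.normal(size=(50, 100))    # one row of weights per neuron
b = rng.normal(size=50)           # one bias per neuron

# Neuron by neuron: weighted sum of the inputs, plus that neuron's bias.
z_loop = np.array([np.dot(W[i], x) + b[i] for i in range(50)])

# Whole layer at once: a single affine map.
z_matrix = W @ x + b

print(np.allclose(z_loop, z_matrix))
```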

The forward pass: data flowing uphill

Running an input through the network to get an answer is called the forward pass. It is wonderfully mechanical: take the input vector, do the layer's matrix multiply and bias add, apply the activation, and you have the next layer's values. Hand those to the next layer and repeat. The numbers flow strictly in one direction, from input toward output, which is why an MLP is a *feedforward* network with no loops.

1. Start with the input vector x (your raw features) and call it a.
2. For each layer, compute z = W·a + b: multiply the layer's input by the weight matrix, then add the bias.
3. Apply the activation: a = f(z), bending the result (e.g. a ReLU or sigmoid).
4. Feed a forward as the input to the next layer; repeat steps 2 and 3 to the end.
5. Read the final layer's output as your prediction.

def forward(x, layers):
    a = x
    for (W, b, activation) in layers:
        z = W @ a + b        # matrix multiply, then add bias
        a = activation(z)    # the nonlinear bend
    return a                 # final-layer output = prediction

The entire forward pass of an MLP: a loop over layers, with two lines of real work inside.
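Here is one way the function above might be called. The layer sizes, the small random weights, and the random stand-in input are all assumptions for illustration; with untrained weights the prediction carries no meaning yet.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

def forward(x, layers):
    # Same loop as above, repeated so this snippet runs on its own.
    a = x
    for W, b, activation in layers:
        a = activation(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [784, 32, 10]           # hypothetical: 784 inputs, 32 hidden, 10 outputs
activations = [relu, softmax]   # ReLU in the hidden layer, softmax at the output
layers = [
    (rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out), act)
    for n_in, n_out, act in zip(sizes[:-1], sizes[1:], activations)
]

x = rng.random(784)             # stand-in for a flattened 28x28 image
prediction = forward(x, layers)
print(prediction.shape, prediction.sum())
```

Each weight matrix has shape (outputs, inputs), so `W @ a` maps a vector of the previous layer's width to one of the current layer's width.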

That is genuinely all there is to making a prediction. The mystery is not in the forward pass — it is in *where the good weights come from*. At first they are random, so the output is nonsense. Learning means nudging every weight a tiny bit to make the answer less wrong, over and over. The machinery that figures out which way to nudge each weight is backpropagation paired with gradient descent — and that is exactly the subject of the next guide.

Universal approximation — and what it does not promise

There is a famous result that sounds almost magical: the universal approximation theorem. It says that an MLP with even a single hidden layer and a suitable nonlinear activation can approximate *any* continuous function on a bounded region of input space to any accuracy you like, provided the hidden layer is allowed to be wide enough. In plain words: this simple architecture is, in principle, expressive enough to represent essentially any input-to-output mapping you could want.
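The effect of width can be glimpsed with a rough numerical sketch. Note the shortcut, which is an assumption of this example and not how MLPs are normally trained: the hidden weights are random and frozen, and only the output weights are fitted, by least squares. The target function, the widths, and the name `fit_one_hidden_layer` are likewise illustrative. The error is measured on the same points used for fitting, so this shows expressiveness only.

```python
import numpy as np

def fit_one_hidden_layer(x, y, width, rng):
    # Hidden weights and biases are random and then frozen (a simplification;
    # real training would adjust them too).
    w = rng.normal(scale=2.0, size=width)
    b = rng.uniform(-np.pi, np.pi, size=width)
    hidden = np.tanh(np.outer(x, w) + b)                    # shape (len(x), width)
    features = np.column_stack([hidden, np.ones_like(x)])   # last column = output bias
    out_weights, *_ = np.linalg.lstsq(features, y, rcond=None)
    return features @ out_weights                           # the network's outputs

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(x)                                               # function to approximate

narrow_err = np.max(np.abs(fit_one_hidden_layer(x, y, 2, rng) - y))
wide_err = np.max(np.abs(fit_one_hidden_layer(x, y, 200, rng) - y))
print(f"2 hidden neurons:   max error {narrow_err:.4f}")
print(f"200 hidden neurons: max error {wide_err:.6f}")
```

The wide layer should track the curve far more closely than the two-neuron one, which is the theorem's claim in miniature.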

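What the theorem does not promise is just as important. It is an existence result: it says good weights exist, but not how wide the hidden layer must be (the required width can be impractically large), and not that training will actually find those weights. This is also why, in practice, people reach for *deep* networks rather than one impossibly wide hidden layer. A theorem may permit a shallow solution, but deeper nets tend to reach the same accuracy with far fewer neurons, and they encourage hierarchical features: simple parts in early layers, richer combinations later. The fully-connected MLP is the honest starting point; specialized cousins like the convolutional network add structure that makes learning on images or text far more efficient.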

So here is the picture to hold: layers of fully-connected neurons, an activation between each, data flowing forward from input to output. That is the body of the network. What it still lacks is a way to improve itself — to turn the random starting weights into good ones. Give it that, and the inert machine becomes one that learns. That is where we go next.