Autoencoders & GANs

From classifying to creating

Until now in this rung, the networks you met have all been *discriminators* in spirit: feed in an image, get out a label. A convolutional net sees a photo and says "cat". That is enormously useful, but it is only half of intelligence. The other half is *generation* — being handed nothing in particular and producing something plausible: a face that never existed, a sentence that was never written. This guide is about two of the earliest and most instructive ways a deep network learned to do that.

Both share a quiet but radical idea you already met as [[representation-learning|representation learning]]: instead of being told what features matter, the network discovers a compact internal description of the data on its own. Generation then becomes the art of running that description *backwards* — turning a short code back into a full, rich example. Autoencoders learn that code by reconstruction; GANs learn it by competition.

The autoencoder: learning by squeezing

An [[autoencoder|autoencoder]] is a network trained to copy its input to its output — which sounds pointless until you see the catch. In the middle sits a narrow layer, the *bottleneck*, far smaller than the input. The first half, the [[encoder-decoder|encoder]], compresses the input down into that tight code; the second half, the *decoder*, expands the code back out into a reconstruction. Because the bottleneck is too small to memorize everything, the network is forced to keep only what matters and throw away the rest.

The training signal is wonderfully simple: a reconstruction [[loss-function|loss]] measuring how far the output drifts from the original input, minimized by ordinary gradient descent. No labels are needed — the input *is* the target. That makes the autoencoder a clean example of self-supervised learning: it manufactures its own homework out of raw, unlabeled data.

x       --> [ encoder ] --> z   (the code: small)
z       --> [ decoder ] --> x'  (reconstruction)
loss = distance(x, x')          (push x' toward x)

The whole autoencoder in three lines: encode to a small code z, decode back, and penalize the gap.
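To make the schematic concrete, here is a minimal sketch of a linear autoencoder in plain Python. Everything specific in it is an assumption chosen for illustration rather than something the guide prescribes: the toy 4-number inputs that secretly depend on only two underlying values, the 2-number code, the squared-error loss, the learning rate, and the helper names `make_data`, `train_autoencoder`, and `mean_loss`. Real autoencoders add nonlinear layers and use a deep learning library, but the encode, decode, penalize loop is the same.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def make_data(count=50, seed=1):
    """Toy inputs: 4 numbers that depend on only 2 hidden values a and b."""
    rng = random.Random(seed)
    data = []
    for _ in range(count):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append([a, b, a + b, a - b])
    return data

def train_autoencoder(data, code_size=2, lr=0.01, epochs=200, seed=0):
    """Fit encoder and decoder matrices by per-example gradient descent."""
    rng = random.Random(seed)
    n = len(data[0])
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(code_size)]
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(code_size)] for _ in range(n)]
    for _ in range(epochs):
        for x in data:
            z = matvec(W_enc, x)                      # encode: x -> z
            x_rec = matvec(W_dec, z)                  # decode: z -> x'
            err = [r - t for r, t in zip(x_rec, x)]   # x' - x
            # Gradient of the squared error with respect to the code z,
            # computed before the decoder weights are changed.
            grad_z = [sum(2 * err[i] * W_dec[i][k] for i in range(n))
                      for k in range(code_size)]
            for i in range(n):
                for k in range(code_size):
                    W_dec[i][k] -= lr * 2 * err[i] * z[k]
            for k in range(code_size):
                for j in range(n):
                    W_enc[k][j] -= lr * grad_z[k] * x[j]
    return W_enc, W_dec

def mean_loss(data, W_enc, W_dec):
    """Average squared distance between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_rec = matvec(W_dec, matvec(W_enc, x))
        total += sum((r - t) ** 2 for r, t in zip(x_rec, x))
    return total / len(data)

data = make_data()
W_enc, W_dec = train_autoencoder(data)
print("mean reconstruction loss:", mean_loss(data, W_enc, W_dec))
print("code for first example:", matvec(W_enc, data[0]))
```

Because the toy inputs really do have only two degrees of freedom, a 2-number code is enough to rebuild them. Shrinking `code_size` to 1 would force the network to discard information, which is the bottleneck pressure described above.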

That code in the middle has a name you will keep meeting: an [[embedding|embedding]]. It is a short vector — maybe 32 numbers standing in for a 784-pixel image — where *distance means similarity*. Two similar inputs land near each other; dissimilar ones land far apart. This is the same trick that lets word embeddings place "king" near "queen", and it is why the encoder, once trained, is valuable all on its own as a feature extractor for downstream tasks.
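A small sketch of what "distance means similarity" looks like in practice. The three codes below are invented by hand purely for illustration (a trained encoder would produce them), and the labels and the helper names `euclidean` and `nearest` are hypothetical.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two codes of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(query, codes):
    """Return the label whose stored code lies closest to the query code."""
    return min(codes, key=lambda label: euclidean(query, codes[label]))

# Hypothetical 3-number codes, made up for this example only.
codes = {
    "digit_3_a": [0.9, 0.1, 0.2],
    "digit_3_b": [0.8, 0.2, 0.1],
    "digit_7_a": [0.1, 0.9, 0.7],
}

print(euclidean(codes["digit_3_a"], codes["digit_3_b"]))  # two similar inputs
print(euclidean(codes["digit_3_a"], codes["digit_7_a"]))  # two dissimilar inputs
print(nearest([0.85, 0.15, 0.15], codes))
```

Nearest-neighbor lookup in code space is one simple way a trained encoder gets reused as a feature extractor.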

Autoencoders come in many flavors that share this skeleton. A *denoising* autoencoder is fed a corrupted input and asked to reconstruct the clean one, which forces it to learn structure rather than copy pixels. And when the encoder and decoder are mirror-image convolutional stacks joined by skip connections that carry fine detail around the bottleneck, you get the U-Net shape that later became the workhorse inside many modern diffusion image generators: proof that this humble idea did not retire, it got promoted.
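The only thing a denoising autoencoder changes is the training pair: the network sees a corrupted input but is scored against the clean original. A minimal sketch of building such pairs follows. The corruption recipe (Gaussian noise plus randomly zeroed entries), its default strengths, and the names `corrupt` and `denoising_pairs` are assumptions for illustration; other corruption schemes are equally valid.

```python
import random

def corrupt(x, noise_std=0.3, drop_prob=0.2, rng=None):
    """Return a damaged copy of x: some entries zeroed, the rest jittered."""
    rng = rng or random.Random()
    noisy = []
    for value in x:
        if rng.random() < drop_prob:
            noisy.append(0.0)                              # drop this entry
        else:
            noisy.append(value + rng.gauss(0.0, noise_std))  # add noise
    return noisy

def denoising_pairs(data, rng):
    """Yield (network input, training target) pairs for denoising training."""
    for clean in data:
        yield corrupt(clean, rng=rng), clean

rng = random.Random(0)
for noisy, clean in denoising_pairs([[1.0, 2.0, 3.0]], rng):
    print("input :", noisy)
    print("target:", clean)
```

The loss is still a reconstruction distance, but it is measured against `clean`, so copying the input straight through no longer scores well.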

The GAN: a forger and a detective

The [[generative-adversarial-network|generative adversarial network]], or GAN, takes a completely different route to generation — not reconstruction, but a contest. Picture two networks locked in a duel. The *generator* is a forger: it takes a random noise vector and tries to paint a convincing fake image. The *discriminator* is a detective: it is shown a mix of real images and the forger's fakes, and must call each one real or fake.

They train against each other. Every time the detective catches a fake, that failure is backpropagated into the forger, teaching it to fool the detective better next time. Every time the forger slips one past, the detective sharpens its eye. Neither has a fixed target to copy — the generator never sees a single real image directly; its *only* learning signal is the discriminator's verdict. The two improve in lockstep, an arms race that, when it works, drives the fakes toward photorealism.
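In symbols, this contest is usually written as the minimax objective from the original GAN formulation. The notation is introduced here and does not appear elsewhere in this guide: D(x) is the detective's estimated probability that x is real, G(z) is the forger's output for noise vector z, p_data is the distribution of real examples, and p_z is the noise distribution.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator pushes this value up by scoring reals near 1 and fakes near 0; the generator pushes it down by making D(G(z)) large. Real examples enter only through the first term, which does not involve G, which is why the generator's sole learning signal is the discriminator's verdict.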

1. Draw a batch of real examples from your dataset, and have the generator produce a batch of fakes from random noise.
2. Train the discriminator to label reals as real and fakes as fake, a normal classifier step.
3. Now freeze the discriminator and train the generator to make fakes the discriminator labels as real, flipping the goal.
4. Repeat. As each side adapts to the other, the generated samples slowly march from noise toward realism.
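The steps above can be sketched on a deliberately tiny problem: one-dimensional "images" that are just numbers. Everything concrete here is an assumption for illustration: the real data is drawn from a Gaussian centered at 3, the generator is the line G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), the generator uses the loss -log D(G(z)), and gradients are written out by hand. The function name `train_toy_gan` is hypothetical. Like any GAN, this toy is not guaranteed to settle; the parameters may circle the solution rather than land on it.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(steps=500, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator:      G(z) = a*z + b
    w, c = 0.0, 0.0   # discriminator:  D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # Step 1: a batch of reals, and a batch of fakes made from noise.
        reals = [rng.gauss(3.0, 0.5) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fakes = [a * z + b for z in noise]

        # Step 2: discriminator update (reals -> 1, fakes -> 0).
        grad_w = grad_c = 0.0
        for x in reals:                 # loss term: -log D(x)
            d = sigmoid(w * x + c)
            grad_w += -(1.0 - d) * x
            grad_c += -(1.0 - d)
        for x in fakes:                 # loss term: -log(1 - D(x))
            d = sigmoid(w * x + c)
            grad_w += d * x
            grad_c += d
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Step 3: generator update with w and c held fixed.
        grad_a = grad_b = 0.0
        for z in noise:                 # loss term: -log D(G(z))
            d = sigmoid(w * (a * z + b) + c)
            grad_a += -(1.0 - d) * w * z
            grad_b += -(1.0 - d) * w
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch
        # Step 4: repeat.
    return a, b, w, c

a, b, w, c = train_toy_gan()
print("generator G(z) = a*z + b with a, b =", a, b)
print("discriminator weights w, c =", w, c)
```

Note that `reals` is used only in the discriminator update. The generator's gradients flow entirely through `w` and the discriminator's score `d`, matching the point that the forger never sees real data directly.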

Notice what is shared with the autoencoder and what is flipped. Both build a *decoder*-like network that turns a small code into a full sample. But the autoencoder learns its code by demanding faithful reconstruction, while the GAN never reconstructs anything — it learns purely from the pressure of a critic that gets smarter as the artist does. That adversarial pressure is what gives GAN samples their famous crispness, where autoencoder reconstructions tend to come out blurry.

Why GANs are so hard to train

Here is the honesty this field too often skips. A GAN is not minimizing one loss toward one bottom; it is two networks chasing a *moving target* the other one controls. That makes the training a delicate balancing act rather than a smooth descent. If the discriminator gets too good too fast, it rejects every fake with total confidence, the generator receives no useful gradient, and learning stalls. If the generator gets ahead, it can exploit the detective's blind spots.
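The stalled-gradient case can be checked with a little arithmetic. The sketch below assumes the discriminator ends in a sigmoid, so its verdict on a fake is D = sigmoid(logit), and compares two generator losses: log(1 - D), the form that follows directly from the minimax game, and -log(D), a common alternative. Neither loss is specified by this guide, and the function names are hypothetical.

```python
import math

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def saturating_grad(logit):
    """Derivative of log(1 - sigmoid(logit)) with respect to the logit."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """Derivative of -log(sigmoid(logit)) with respect to the logit."""
    return -(1.0 - sigmoid(logit))

# A very negative logit means the detective is sure the sample is fake.
for logit in (0.0, -4.0, -10.0):
    print(f"logit={logit:6.1f}  D={sigmoid(logit):.5f}  "
          f"saturating={saturating_grad(logit):.5f}  "
          f"non-saturating={non_saturating_grad(logit):.5f}")
```

As the detective's confidence grows, the first gradient shrinks toward zero, so the forger learns least exactly when it is failing most. The second stays close to -1 in the same situation, which is one reason alternative loss formulations are used in practice.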

The most infamous failure is mode collapse: the generator discovers one image (or a tiny handful) that reliably fools the detective, and just produces that over and over. Its fakes are convincing but lack all variety — it found a cheap win instead of learning the whole data distribution. A pile of engineering tricks (gradient penalties, careful learning rates, alternative loss formulations) tames these problems, but a GAN that trains cleanly the first time is the exception, not the rule.
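On a toy problem, mode collapse is easy to spot by counting how many of the data's clusters the generator ever lands in. The sketch below is a hypothetical diagnostic for one-dimensional data with known cluster centers; the sample values, the `radius` threshold, and the function name `modes_covered` are all invented for illustration. Real image data has no labeled list of modes, so practical diversity checks are considerably more involved.

```python
def modes_covered(samples, mode_centers, radius=0.5):
    """Count how many mode centers have at least one sample within radius."""
    covered = set()
    for s in samples:
        for index, center in enumerate(mode_centers):
            if abs(s - center) <= radius:
                covered.add(index)
    return len(covered)

# Hypothetical data with three clusters, and two made-up generators.
centers = [-2.0, 0.0, 2.0]
healthy_samples = [-2.1, -1.9, 0.1, 2.2, 1.8]
collapsed_samples = [0.05, -0.02, 0.01, 0.03]

print("healthy generator covers:", modes_covered(healthy_samples, centers))
print("collapsed generator covers:", modes_covered(collapsed_samples, centers))
```

Every sample from the collapsed generator is individually plausible, since each sits on a real cluster. The failure shows up only when you look at the set as a whole.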

Where this fits, and what came after

From roughly 2014 to 2020, GANs were the undisputed kings of image generation, producing the eerily realistic faces that filled "this person does not exist" demos. But it is worth being clear-eyed: that crown has largely passed. Diffusion models — which generate by learning to reverse a gradual noising process — now produce more diverse, more controllable images and train far more stably, and they power most of the text-to-image systems you have heard of. GANs are still used where speed matters (they generate in a single forward pass), but the field moved on.

So why learn them? Because the *ideas* outlived the architecture. The bottleneck-and-decode skeleton of the autoencoder lives on in latent diffusion models, which run their noising and denoising inside an autoencoder's compressed code rather than on raw pixels, and the broader habit of encoding data into a compact learned representation runs through modern language models too. The adversarial idea of training one network against a learned critic reappears across modern AI, from image and audio models that add an adversarial loss to sharpen their outputs, to preference-based fine-tuning, which optimizes a model against a learned reward model. Understanding the forger-and-detective game once makes a dozen later systems click.

Hold onto one through-line as you finish this rung. Every architecture you have met — the convolutional net, the recurrent net, the autoencoder, the GAN — is really a different answer to the same question: *what shape should a network have so it learns the right internal representation on its own?* The next rung leaves images and sequences behind to meet the architecture that answered that question so powerfully it swallowed the whole field — but you now have the two halves, recognition and generation, that everything after it builds upon.