Flipping the arrow: from seeing to making
Everything in this rung so far has run in one direction: an image goes in, a label or a box or a mask comes out. A conv net sees a cat; a vision transformer sees the parts and how they relate. Generation reverses that arrow. Instead of *image → meaning*, we want *meaning → image*: hand the model the idea "a fox reading a newspaper in the rain" and have it produce pixels that match. The hard part is that there is no single right answer — a billion different pictures could honestly fit those words.
So a generator is not a function that maps one input to one output. It is a way to sample from a vast space of plausible images, much as rolling a die samples from the numbers 1 to 6. Earlier approaches tried this with a GAN (generative adversarial network), pitting a forger against a detector, and with a variational autoencoder that squeezed images through a tight bottleneck. Both worked, but each had a cost: GANs were finicky to train, and variational autoencoders tended to produce blurry images. The method that took over the field is gentler and, once you see it, almost obvious.
Diffusion: learning to undo noise
Here is the trick behind a diffusion model. Take a real photo and gradually sprinkle in random noise — a little, then more, then more — until after hundreds of tiny steps it is indistinguishable from TV static. That destruction is easy; it is just adding noise. The clever move is to train a network to run the process backwards: given a noisy image and a number telling it *how* noisy, predict the noise that was added so it can be subtracted away. Do that, and you can peel noise off one layer at a time.
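The noising half of that recipe can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the linear schedule, its start and end values, the 1000-step count, and the helper names `make_alpha_bar` and `add_noise` are all assumptions chosen for the example. It relies on a convenient property of Gaussian noise: you can jump straight to any noise level in one shot instead of looping through every tiny step.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal that survives up to each step (assumed linear schedule)."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta          # each step keeps a little less of the image
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Jump to noise level t: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * e for v, e in zip(x0, eps)]
    return x_t, eps                    # eps is the target the network learns to predict

alpha_bar = make_alpha_bar()
rng = random.Random(0)
image = [0.8, -0.3, 0.5, 0.1]          # stand-in for pixel values scaled to [-1, 1]
slightly_noisy, _ = add_noise(image, 10, alpha_bar, rng)
pure_static, eps = add_noise(image, 999, alpha_bar, rng)
print(alpha_bar[10], alpha_bar[999])   # signal kept: nearly all of it, then almost none
```

Training then amounts to showing the network `x_t` and `t` and penalizing the gap between its guess and the true `eps`.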
Now the payoff. To *generate*, you do not start from a photo at all. You start from pure noise, fresh random static, and ask the trained network the same question over and over: "what noise do you see here?" Subtract a sip of it, repeat. After many steps the static resolves into a clean, coherent image. The network is not looking up a stored picture; it learned the general shape of "what real images look like" and uses it to sculpt order out of randomness. The workhorse network doing the noise prediction is usually a U-Net: an encoder-decoder that shrinks the image down to capture the big picture, then expands it back to full resolution, with skip connections that carry fine detail straight across from the shrinking half to the expanding half.
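Here is a toy version of that generation loop, small enough to run. Two loud assumptions: the "dataset" contains a single four-number image, `TARGET`, so the ideal noise prediction has a closed form and `oracle_noise` can stand in for a trained network; and the update rule is a deterministic, DDIM-style step, which is one sampler among several rather than the only option. All names and schedule values are hypothetical.

```python
import math
import random

TARGET = [0.8, -0.3, 0.5, 0.1]   # the only "image" in this toy dataset

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal surviving at each noise level (assumed linear schedule)."""
    out, running = [], 1.0
    for t in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * t / (num_steps - 1))
        out.append(running)
    return out

def oracle_noise(x_t, ab):
    """Stand-in for the trained network: the exact noise when all data equals TARGET."""
    return [(x - math.sqrt(ab) * m) / math.sqrt(1.0 - ab) for x, m in zip(x_t, TARGET)]

def sample(num_sampling_steps=50, seed=0):
    ab = alpha_bar_schedule()
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]             # start from pure static
    stride = len(ab) // num_sampling_steps
    timesteps = list(range(len(ab) - 1, -1, -stride))     # high noise down to low noise
    for i, t in enumerate(timesteps):
        eps = oracle_noise(x, ab[t])                      # "what noise do you see here?"
        # Current best guess at the clean image, with the predicted noise removed.
        x0_guess = [(xi - math.sqrt(1.0 - ab[t]) * e) / math.sqrt(ab[t])
                    for xi, e in zip(x, eps)]
        ab_prev = ab[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        # Move to the next, lower noise level instead of jumping straight to the guess.
        x = [math.sqrt(ab_prev) * g + math.sqrt(1.0 - ab_prev) * e
             for g, e in zip(x0_guess, eps)]
    return x

result = sample()
print(result)   # lands on TARGET, up to floating-point error
```

With a perfect oracle the clean guess is right at every step, so the loop looks like overkill. A real network's guesses are rough at high noise and improve as the noise falls, which is why sampling takes many small steps rather than one big one.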
Latent diffusion: doing it in a smaller world
Running hundreds of denoising steps on a full-size image — millions of pixels each pass — is brutally expensive. The breakthrough that made image generation cheap enough for everyone is latent diffusion. The idea: don't diffuse the pixels at all. First train an autoencoder that compresses an image into a small grid of numbers — a latent — that captures its essence, maybe 48 times smaller, and can be decoded back into a faithful picture. Then run the whole noisy-to-clean diffusion dance inside that compact latent space.
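To see where a figure like 48 can come from, here is the arithmetic for one common setup. The shapes are an assumption taken from a widely used open latent diffusion model (a 512×512 RGB image mapped to a 64×64 latent with 4 channels); other models use other shapes, and the ratio changes with them.

```python
def num_values(height, width, channels):
    """Count the numbers needed to store a grid of this shape."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)    # full-size RGB image
latent_values = num_values(64, 64, 4)     # 8x smaller per side, 4 channels
ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Every denoising step works on the smaller grid, so the saving is paid back once per step.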
The flow becomes: noise in the latent space → denoise step by step → a clean latent → decode it once into full-resolution pixels. Because the latent is so much smaller, every denoising step does a fraction of the work, and the model can focus on *meaning* rather than wasting effort reproducing exact pixel textures (the decoder handles those). This is the architecture behind the open models you may have run on your own laptop. The price is a subtle one: the decoder can smear the very finest detail, which is one reason generated text on signs and the structure of small faces often come out garbled.
Steering with words: text-to-image
So far the model can dream up *some* coherent image, but not the one you asked for. Text-to-image adds a steering wheel. Your prompt is turned into embeddings, lists of numbers capturing the meaning of its words, typically by the text encoder of a model like CLIP, which was trained to place matching image-text pairs near each other in a shared space. At every denoising step, those text embeddings are fed into the U-Net through cross-attention, so the noise prediction is nudged toward content that matches your words. The picture is sculpted *conditioned on* the prompt.
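The attention step that lets text steer the image can be sketched with tiny hand-made vectors. This is an illustration under assumptions: the two-dimensional vectors and token labels are invented, and the learned projections that a real model applies to produce queries, keys, and values are left out. Each image position scores every text token, turns the scores into weights, and pulls in a weighted mix of what those tokens carry.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(image_queries, text_keys, text_values):
    """Each image position takes a weighted mix of the text-token values."""
    scale = 1.0 / math.sqrt(len(text_keys[0]))
    mixed, all_weights = [], []
    for q in image_queries:
        weights = softmax([dot(q, k) * scale for k in text_keys])
        mixed.append([sum(w * v[d] for w, v in zip(weights, text_values))
                      for d in range(len(text_values[0]))])
        all_weights.append(weights)
    return mixed, all_weights

# Hypothetical vectors; a real model learns these.
text_keys = [[1.0, 0.0], [0.0, 1.0]]       # tokens: "fox", "rain"
text_values = [[5.0, 0.0], [0.0, 5.0]]     # what each token contributes
image_queries = [[4.0, 0.0], [0.0, 4.0]]   # two positions in the latent grid
mixed, weights = cross_attention(image_queries, text_keys, text_values)
print(weights)   # first position attends mostly to "fox", second mostly to "rain"
```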
In pseudocode, the whole sampling loop looks like this:

    text = encode(prompt)                # meaning of your words (computed once)
    latent = random_noise()              # start from static
    for t in reversed(timesteps):        # e.g. 50 steps with a fast sampler
        eps = unet(latent, t, text)      # predicted noise, nudged by text
        latent = step(latent, eps, t)    # subtract a sip of noise
    image = decoder(latent)              # latent -> full pixels (once)

One more knob worth knowing: guidance strength. The model can be asked how hard to lean on the prompt versus its own sense of what looks natural. Crank it up and you get images that hew tightly to your words but can turn garish and over-saturated; ease it off and you get softer, more natural pictures that may wander from the request. There is no free lunch here: it is a genuine trade-off you tune by taste, not a setting with one correct value.
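A common way to implement that knob is classifier-free guidance, sketched below on the assumption that this is the mechanism in play; the text above does not name one. The network is run twice per step, once with the prompt and once with an empty prompt, and the final prediction is pushed along the difference between the two. The function name and the numbers are hypothetical.

```python
def guided_noise(eps_uncond, eps_text, guidance_scale):
    """Move the prediction away from 'no prompt' and toward 'with prompt'."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_text)]

eps_uncond = [0.0, 1.0]   # noise predicted with an empty prompt
eps_text = [0.5, 0.5]     # noise predicted with the real prompt

ignore_prompt = guided_noise(eps_uncond, eps_text, 0.0)   # prompt has no effect
plain = guided_noise(eps_uncond, eps_text, 1.0)           # just the prompted prediction
pushed = guided_noise(eps_uncond, eps_text, 3.0)          # overshoots past it
print(ignore_prompt, plain, pushed)
```

A scale above 1 extrapolates beyond what the prompted prediction alone would say, which is one way to read the garish look at high settings: the sampler is being driven past the region the model considers natural.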
Editing and super-resolution: the same idea, conditioned differently
Once you grasp "denoise, conditioned on something," a whole toolbox opens up: you just change *what* the model is conditioned on. Hand it a real photo with part painted out and ask it to fill that region (inpainting). Start the denoising from a partly noised version of an existing image instead of pure static, and it nudges the picture toward your prompt while keeping the original composition (image-to-image, often written img2img). Condition on an edge map or a pose skeleton, and you control layout much more precisely. The denoising engine stays essentially the same; what changes is the conditioning.
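Two of those tricks reduce to very little code around the sampling loop. The sketch below assumes a `strength` setting between 0 and 1 for image-to-image and a simple mask blend for inpainting; both are common conventions rather than the only way to do it, and the function names are made up for illustration. In a real sampler the `known` values would first be noised to match the current step.

```python
def steps_to_run(num_steps, strength):
    """Image-to-image: skip the earliest steps so the original composition survives."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_steps * strength), num_steps)

def blend_inpaint(generated, known, mask):
    """Inpainting: keep original values where mask is 0, use generated ones where it is 1."""
    return [m * g + (1 - m) * k for g, k, m in zip(generated, known, mask)]

print(steps_to_run(50, 0.0))   # 0: the input image comes back unchanged
print(steps_to_run(50, 0.5))   # 25: start halfway, from a half-noised image
print(steps_to_run(50, 1.0))   # 50: same as generating from pure static

known = [1.0, 2.0, 3.0]        # values from the original photo
generated = [9.0, 9.0, 9.0]    # what the denoiser proposed this step
mask = [0, 1, 0]               # only the middle value was painted out
print(blend_inpaint(generated, known, mask))   # [1.0, 9.0, 3.0]
```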
Super-resolution is the same move again: condition the generator on a small, blurry image and ask it to produce a larger, sharp one. But here honesty matters a lot. The model is not *recovering* detail that was lost — that information is simply gone. It is inventing plausible detail consistent with the low-res input. The crisp brick texture or the sharpened licence plate may look convincing and be entirely fabricated. That is fine for making a photo prettier; it is dangerous if anyone treats an upscaled image as evidence of what was really there.
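A four-number example makes the "information is gone" point concrete. Assume, purely for illustration, that shrinking an image means averaging neighbouring pairs of values; the signal names are invented. Two quite different high-resolution signals then collapse to the same low-resolution one, so nothing in the small version can tell an upscaler which of them was real.

```python
def downsample(signal, factor=2):
    """Average each group of `factor` neighbouring values into one."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]

smooth_wall = [4.0, 4.0, 6.0, 6.0]     # flat surface
brick_texture = [1.0, 7.0, 3.0, 9.0]   # strong fine detail

print(downsample(smooth_wall))     # [4.0, 6.0]
print(downsample(brick_texture))   # [4.0, 6.0]
```

An upscaler handed `[4.0, 6.0]` has to pick one of the many signals that would shrink to it, and it picks whichever looks most like its training data.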
Honest limits, artifacts, and what the hype skips
These models are astonishing, and they are not magic. They have no model of physics or anatomy — only statistics of pixels. So they produce hands with six fingers, reflections that disobey geometry, shadows falling the wrong way, and text that dissolves into letter-like scribbles. These are not bugs to be patched; they flow from the method. The model fills each region with what *locally* looks plausible and has no global ledger checking that the fingers add up to five. Newer models get better at this, but "better" is not "solved."
There is also a deeper limit. A generator can only sample from the world its training data showed it, so it inherits that data's biases — ask for "a doctor" or "a beautiful person" and watch which faces it defaults to. It can blend and recombine, but it does not invent genuinely novel styles from nothing the way the marketing implies. And because the same machinery makes convincing fakes, the honest framing for image generation includes its shadow: deepfakes and the live, unsettled question of whose images were used to train it.
Step back and the whole rung connects. You learned how pixels become tensors, how a conv net reads them, how a transformer relates the parts — and now how the same building blocks, run in reverse, can conjure images from noise. Seeing and making turn out to be two directions of one skill: a model that has truly learned the statistics of the visual world can both recognize it and, carefully, dream it back.