JOVANA

Diffusion & Multimodal Models

The same era that gave us ChatGPT also taught machines to paint, listen, and see. Here we open the hood on diffusion — the denoising trick behind image generators — and on multimodal models that fuse text, pixels, and sound into one shared space.

Generation by destroying, then undoing the damage

You already know how a language model generates: it predicts the next token, over and over, turning a probability distribution into prose. Image generation took a stranger and, at first, almost backwards-sounding route. The dominant approach today is the [[diffusion-model|diffusion model]], and its central idea is this: instead of learning to *draw* a picture, learn to *clean up* a picture. Train a network to remove a little noise from a slightly corrupted image, and you can chain that skill many times in a row (anywhere from a few dozen steps to about a thousand, depending on the sampler) to turn pure static into a photograph.

The training setup is delightfully self-supervised, so it needs no human labels at all. Take a real image, add a measured pinch of random Gaussian noise, and ask the network: *what noise did I just add?* Because you added it, you know the exact answer — the target is free. Do this at every level of corruption, from barely-speckled all the way to total static, and the network gradually learns the shape of "what real images look like" by learning what doesn't belong. This noising direction is called the forward process; the cleanup direction the model actually learns is the reverse process.
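The noising step and its free training target can be sketched in a few lines. This is a minimal illustration, assuming one common DDPM-style recipe: a linear noise schedule and a mix that scales the image down as the noise goes up. The function names, the schedule constants, and the four-value stand-in "image" are all hypothetical choices for the example, not something this guide prescribes.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left at each step (assumed linear schedule)."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(image, alpha_bar, rng):
    """Corrupt `image` (a flat list of pixel values); return (noisy image, noise used)."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(image, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars()
image = [0.2, -0.5, 0.9, 0.0]        # stand-in for a real image's pixels
t = rng.randrange(len(alpha_bars))   # pick a random corruption level
noisy, target_noise = add_noise(image, alpha_bars[t], rng)
# A network would be trained so that its guess for (noisy, t) is close to
# target_noise, for example by minimizing the mean squared error.
```

Because `target_noise` is generated by the same code that corrupts the image, every training example comes with its answer attached.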

To make a brand-new picture, you start from pure noise, a screen of random static that resembles no image at all, and run the learned cleanup step many times in sequence. Each step nudges the static a hair closer to something plausible, and after enough steps a coherent image precipitates out, like a Polaroid developing. Crucially the result is *new*: it isn't retrieved from a database and, in the typical case, it isn't a collage of training pictures (though models have been shown to reproduce near-copies of images that were heavily duplicated in their training data). It is a fresh sample from the distribution the model absorbed, which is exactly what makes diffusion a generative method rather than a lookup.
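To watch static turn into samples without training anything, here is a toy sketch in one dimension. It assumes the "dataset" is just numbers drawn from a bell curve centred on 3 with spread 0.5, because for that case the ideal noise prediction has a closed form and can stand in for a trained network. The schedule, the DDPM-style update rule, and every name below are assumptions made for the illustration.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 3.0, 0.5   # the toy "dataset": numbers from N(3, 0.5^2)

def predict_noise(x, t):
    """Closed-form ideal noise estimate for Gaussian data; stands in for a trained network."""
    ab = alpha_bars[t]
    mean_t = math.sqrt(ab) * DATA_MEAN
    var_t = ab * DATA_STD ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x - mean_t) / var_t

def sample(rng):
    x = rng.gauss(0.0, 1.0)                      # start from pure noise
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)
        # Subtract this step's share of the predicted noise, then rescale.
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(1.0 - betas[t])
        if t > 0:
            x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)   # re-inject a little fresh noise
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(500)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"mean={mean:.2f} std={std:.2f}")
```

The printed mean and spread should land close to 3 and 0.5, even though every sample began as static centred on 0. None of the 500 numbers was stored anywhere beforehand; each is a fresh draw.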

Steering the noise: text-to-image and latent diffusion

Cleaning up noise into *some* plausible image is impressive, but the magic of a tool like Stable Diffusion or DALL·E is that you can say what you want. That is [[text-to-image|text-to-image]] generation, and the mechanism is conditioning: at every denoising step the network is fed not just the noisy image but also an encoding of your prompt. The denoiser then doesn't just push toward "any real image"; it pushes toward "a real image that matches *these words*." The text encoding often comes from a contrastively trained vision-language model, which is the bridge we'll meet in the next section.

There's a practical snag: a high-resolution image is millions of pixel values, and running hundreds of denoising passes directly on all of them is brutally expensive. The fix that made image generation cheap enough for consumer hardware, laptops included, is [[latent-diffusion|latent diffusion]]. Instead of diffusing in raw pixel space, you first squeeze the image through an autoencoder into a much smaller latent grid: a compressed code that keeps the meaningful structure and throws away pixel-level minutiae. You run the whole noising-and-denoising dance in that small latent space, then decode the final latent back into full-resolution pixels at the very end.
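A quick back-of-the-envelope calculation shows how much the squeeze saves. The numbers here are assumptions for illustration: a 512×512 RGB image, and an autoencoder that shrinks each side by a factor of 8 and keeps 4 latent channels, which matches the commonly cited configuration of early Stable Diffusion releases.

```python
def num_values(height, width, channels):
    """Count the numbers the denoiser has to process at each step."""
    return height * width * channels

side = 512            # assumed image size in pixels
downsample = 8        # assumed autoencoder shrink factor per side
latent_channels = 4   # assumed number of latent channels

pixel_values = num_values(side, side, 3)
latent_values = num_values(side // downsample, side // downsample, latent_channels)
compression = pixel_values / latent_values

print(pixel_values, latent_values, compression)   # prints: 786432 16384 48.0
```

Under those assumptions every denoising step touches 48 times fewer values than it would in pixel space, and the saving repeats at every step.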

The workhorse that actually does the denoising is usually a [[u-net|U-Net]] — the same encoder–decoder-with-shortcuts shape you saw used for segmentation in the vision rung. Its symmetric down-then-up path lets it reason about both the coarse layout ("a dog, centred, facing left") and the fine texture ("individual hairs") in one pass, and the skip connections keep sharp detail from getting lost in the squeeze. Newer systems swap the U-Net for a transformer backbone, but the conditioning-plus-denoising recipe stays the same.
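The role of the shortcuts is easiest to see on a tiny one-dimensional signal. This is a structural sketch only: averaging and repeating stand in for the learned layers of a real U-Net, and all names are hypothetical.

```python
def downsample(signal):
    """Halve resolution by averaging neighbouring pairs (stand-in for a strided layer)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double resolution by repeating each value (stand-in for a learned upsampler)."""
    return [value for value in signal for _ in range(2)]

signal = [1.0, 0.0, 1.0, 0.0, 5.0, 4.0, 5.0, 4.0]   # coarse step plus fine alternation

# Encoder path: keep the activations at each resolution for the shortcuts.
skip_full = signal
skip_half = downsample(skip_full)
bottleneck = downsample(skip_half)

# Decoder path without shortcuts: only the coarse layout survives.
blurry = upsample(upsample(bottleneck))

# Decoder path with shortcuts: each upsampled value is paired with the
# encoder value at the same position, so the fine detail is available again.
decoded_half = list(zip(upsample(bottleneck), skip_half))
decoded_full = list(zip(blurry, skip_full))

print(blurry)            # [0.5, 0.5, 0.5, 0.5, 4.5, 4.5, 4.5, 4.5]
print(decoded_full[:2])  # [(0.5, 1.0), (0.5, 0.0)]
```

The bottleneck knows there is a low region followed by a high one, but the alternating texture is gone. The paired values hand that texture back to the decoder; a real U-Net would then mix the two channels with learned convolutions rather than just storing them side by side.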

```python
noisy = pure_noise
for t in reversed(range(T)):          # many denoising steps
    pred_noise = unet(noisy, t, text_embedding)
    noisy = step_back(noisy, pred_noise, t)   # remove a little
image = decode(noisy)                  # latent -> pixels
```

The reverse process in pseudocode: start from static, predict-and-subtract noise over many steps, all conditioned on the prompt's embedding, then decode the latent to pixels.

One shared space for words and pixels

How does a model trained on pixels even understand the phrase "a corgi astronaut"? The connective tissue is a [[multimodal-model|multimodal model]]: one that maps different kinds of input (text, images, audio) into a single shared representation, so that related things land near each other regardless of which sense they came from. The landmark example is [[clip-vision-language|CLIP]]. It was trained on roughly 400 million image-caption pairs scraped from the web with a simple objective: make the embedding of an image sit close to the embedding of its true caption, and far from the captions of other images.
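That objective can be written down compactly. The sketch below is a plain-Python illustration of a symmetric contrastive loss over one small batch, assuming image `i` belongs with caption `i`. The two-dimensional embeddings are made up, and the fixed temperature of 0.07 is an assumed constant (systems like CLIP learn it during training).

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss over a batch in which image i matches caption i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]
    n = len(images)
    # Each image should pick out its own caption...
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # ...and each caption should pick out its own image.
    text_to_image = sum(cross_entropy([sims[j][i] for j in range(n)], i)
                        for i in range(n)) / n
    return (image_to_text + text_to_image) / 2

image_embs = [[1.0, 0.1], [0.1, 1.0]]       # made-up image embeddings
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]    # captions that point the same way
swapped_texts = [aligned_texts[1], aligned_texts[0]]

aligned_loss = contrastive_loss(image_embs, aligned_texts)
swapped_loss = contrastive_loss(image_embs, swapped_texts)
print(aligned_loss < swapped_loss)   # prints: True
```

Training pushes the encoders toward the low-loss arrangement, where each image sits next to its own caption and away from everyone else's.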

That contrastive training is the whole trick. After it, the picture of a beach and the words "a sunny beach" land at nearly the same address in the shared space, while "a snowy mountain" lands far away. Once words and images share coordinates, you get astonishing freebies: zero-shot classification (compare an image's embedding to the embeddings of candidate label words — no task-specific training needed), image search by description, and the text-conditioning signal that steers the diffusion model from the last section. CLIP is the reason a text-to-image system knows what "astronaut" should look like.
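Zero-shot classification then needs almost no machinery. In this sketch the three-dimensional embeddings are hand-made stand-ins for what a trained text encoder and image encoder would output, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

# Hand-made stand-ins for encoder outputs in the shared space.
label_embs = {
    "a sunny beach": [0.9, 0.1, 0.2],
    "a snowy mountain": [0.1, 0.9, 0.3],
    "a city street": [0.2, 0.2, 0.9],
}
beach_photo_emb = [0.8, 0.2, 0.1]

prediction = zero_shot_classify(beach_photo_emb, label_embs)
print(prediction)   # prints: a sunny beach
```

Swapping in a different set of candidate labels changes the classifier without any retraining, which is the "no task-specific training needed" part.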

The same idea generalizes far beyond images. Audio can be turned into spectrogram "pictures" and embedded the same way; video adds a time axis; even protein structures and tabular data can be projected into shared spaces. The recipe is remarkably uniform: pick an encoder for each modality, train so that things which mean the same thing end up nearby, and suddenly one model can translate between senses. That uniformity — different doors into one room — is what people mean when they call a model genuinely *multimodal* rather than just bolting an image caption onto a text model.

Foundation models: one base, many jobs

Step back and a pattern across this whole rung snaps into focus. A language model, CLIP, and a diffusion model all share a method: train one very large network on a vast, mostly unlabelled pile of data with a self-supervised objective (or, for CLIP, the weak supervision of captions that already came with the images), then adapt that single base to a swarm of downstream tasks. A model used this way is a [[foundation-model|foundation model]]; the term names the *role*, not the architecture. The same trained CLIP becomes a classifier, a search engine, and a diffusion steering-wheel, just by being pointed at different jobs.

This is also why the cutting edge increasingly *is* multimodal by default. The frontier systems behind ChatGPT-style products no longer keep a separate text model and image model in different boxes; they take text, images, and audio in through one network and can produce several of those out. Foundation models also reach well beyond chat: they fold proteins, predict weather, and read medical scans. The lesson of the era is less "build a clever model for each problem" and more "build one broad base and adapt it," an instinct sometimes summarized, after Rich Sutton's essay "The Bitter Lesson," as letting scale and general methods do the heavy lifting.

It's worth being precise about *why* this works, because the honest answer is partly "we don't fully know." Empirically, loss falls smoothly and predictably, roughly as a power law, as you grow data, parameters, and compute together. These are the [[scaling-laws-capability|scaling laws]] that quietly underwrite the whole field. But those laws describe loss going down, not understanding going up; they are an observed regularity, not a guarantee of any particular capability, and they say nothing about hitting human-level reasoning. Treat them as a reliable engineering trend, not a prophecy.
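"Smoothly and predictably" has a concrete meaning: if loss follows a power law in model size, it forms a straight line on a log-log plot, and a straight line can be fitted and extended. The sketch below assumes that simple form, loss = coefficient × size^(−exponent). The data points are synthetic, generated from a made-up law purely to show the mechanics; they are not measurements from any real model.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares line through (log size, log loss); returns (coefficient, exponent)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

# Synthetic points generated from loss = 10 * N**-0.1 (a made-up law).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.1 for n in sizes]

coefficient, exponent = fit_power_law(sizes, losses)
predicted = coefficient * 1e10 ** -exponent   # extrapolate one step beyond the data
print(f"exponent={exponent:.3f} predicted_loss={predicted:.3f}")
```

The fit recovers the exponent it was generated with and extends the line to a larger size. Notice what the output is: a loss value. Nothing in the curve says which abilities a model at that loss will or won't have.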

What these models do — and what they don't

It is easy to overread a stunning generated image as evidence of understanding. A diffusion model has learned an extraordinarily rich statistical map of what images tend to look like, but it has no explicit model of physics, anatomy, or fact. That's why early generators drew hands with six fingers, turned text in images into gibberish, and produced reflections that disobey geometry: the model is matching the *look* of plausible pixels, not reasoning about a world. The same caution applies to multimodal chat models, which can confidently describe things that aren't in an image, the visual cousin of the hallucination you met in the LLM rung.

There are real ethical weights here too, and they deserve plain naming rather than either panic or hand-waving. These systems are trained on enormous scrapes of human-made images and text, which raises unresolved questions about consent and copyright for the creators whose work is in the pile. The same generative power that makes art tools delightful also makes convincing fakes of real people cheap to produce. None of this requires the models to be malicious or "intelligent" — these are ordinary tools with broad reach, and the responsibility sits with how people build and deploy them.