Vision Transformers

From the grid to a sentence of patches

You already know two big ideas from earlier in this ladder. From computer vision, the convolutional neural network reads an image by sliding small filters across the pixel grid, building up features edge by edge, region by region. From the language rungs, the transformer reads a sentence by letting every word look at every other word through self-attention. The Vision Transformer asks a deceptively simple question: what if we feed an image to a transformer the same way we feed it a sentence?

The catch is that a transformer expects a fairly short list of tokens, but a single 224x224 image has about 50,000 pixels (224 x 224 = 50,176), or over 150,000 numbers once the three color channels are counted. That is far too many to let every pixel attend to every other one. The trick is to chop the image into a grid of small square patches, say 16x16 pixels each. A 224x224 image then becomes a tidy 14x14 grid, just 196 patches. Each patch is flattened into a vector (16 x 16 x 3 = 768 numbers for an RGB image) and passed through a small linear layer that turns it into an embedding, exactly the kind of dense vector a transformer eats. A patch has become a word.
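To make the patch arithmetic concrete, here is a minimal sketch in plain Python. The function name patchify and the nested-list image layout (height x width x channels) are assumptions made for illustration; a real pipeline would do this with array reshapes and would follow the split with a learned linear projection.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Patches are read in row-major order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    vectors = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    vec.extend(image[row][col])  # append the C channel values
            vectors.append(vec)
    return vectors


# A blank 224x224 RGB image stands in for real data.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch=16)
print(len(patches), len(patches[0]))  # 196 768
```

Each of the 196 vectors holds 768 numbers, which is the input a patch projection layer would receive.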

Inside the Vision Transformer

The Vision Transformer (ViT) assembles the pieces in a way that should feel familiar from the language side. First, each patch embedding gets a positional encoding added to it — a learned vector that says "this is patch 5, in row 1, column 5" — so the network can recover the layout it lost when it made the list. Then the whole sequence flows through a stack of standard transformer blocks, each one applying multi-head self-attention followed by a small feed-forward network.

Here is where ViT differs sharply from a convolutional network. A CNN only ever combines nearby pixels in its early layers; it takes many stacked layers before a filter's receptive field is wide enough to relate a corner of the image to its center. Self-attention has no such limit. In the very first block, the patch covering a dog's ear can directly attend to the patch covering its tail, no matter how far apart they sit. The whole image is "in view" from layer one.
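The "whole image in view" claim can be checked on a toy example. The sketch below is a deliberately simplified single-head attention: it assumes identity query, key, and value projections (a real block learns these matrices and uses several heads), and the four two-dimensional "patch embeddings" are made up.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    weights, outputs = [], []
    for query in tokens:
        # Score this token against every token in the sequence, near or far.
        scores = [dot(query, key) / math.sqrt(dim) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * tok[i] for w, tok in zip(row, tokens)) for i in range(dim)]
        )
    return outputs, weights


# Four toy patch embeddings; think of index 0 as the ear and index 3 as the tail.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.2]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

Every entry of weights[0] is strictly positive, so token 0 draws on token 3 in a single step. Nothing in the computation depends on how far apart the two patches sit in the image.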

To produce a single answer for the whole image — say a class label — ViT borrows another trick from language models: it prepends one extra learnable token, often called the [CLS] token, that belongs to no patch. As attention runs, this token gathers information from all the real patches, and its final vector is fed to a small classifier head. It is a designated note-taker whose only job is to summarize the picture.

patches  = split_image(img, patch=16)        # 196 patches of 16x16
tokens   = linear_project(flatten(patches))  # each patch -> embedding vector
tokens   = [CLS] + tokens                    # prepend the summary token
tokens   = tokens + positional_encoding      # add back the 'where'
z        = transformer_blocks(tokens)        # attention all the way down
label    = classifier_head(z[CLS])           # read the answer off [CLS]

A Vision Transformer in six lines: patchify, embed, prepend [CLS], add position, attend, read out.
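The two middle steps of the listing, prepending [CLS] and adding position, are easy to get wrong by one, so here is a small runnable sketch of just that bookkeeping. All names are hypothetical, the embedding width of 8 is chosen only to keep the example tiny, and the random values stand in for parameters a real model would learn.

```python
import random

random.seed(0)  # reproducible toy values

NUM_PATCHES = 196  # 14 x 14 grid from a 224x224 image with 16x16 patches
EMBED_DIM = 8      # tiny for illustration; real models use far wider vectors


def random_vector(dim):
    return [random.gauss(0.0, 0.02) for _ in range(dim)]


# Stand-ins for the output of the patch projection layer.
patch_embeddings = [random_vector(EMBED_DIM) for _ in range(NUM_PATCHES)]

# Learnable parameters (randomly initialised here, trained in a real model).
cls_token = random_vector(EMBED_DIM)
position_embeddings = [random_vector(EMBED_DIM) for _ in range(NUM_PATCHES + 1)]

# Prepend [CLS], then add one position vector per slot, [CLS] included.
sequence = [cls_token] + patch_embeddings
sequence = [
    [value + offset for value, offset in zip(token, position)]
    for token, position in zip(sequence, position_embeddings)
]

CLS_INDEX = 0  # the classifier head reads the final vector at this index
print(len(sequence), len(sequence[CLS_INDEX]))  # 197 8
```

The sequence has 197 tokens, not 196, and the position table needs 197 rows to match.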

Why CNNs don't simply disappear

It is tempting to declare the convolution dead, but the honest story is more interesting. A CNN is built with strong inductive biases baked in: it assumes nearby pixels are related (locality) and that a cat is a cat wherever it appears in the frame (translation equivariance). Those assumptions are gifts — they let a CNN learn from a modest dataset. A ViT makes almost none of those assumptions; attention treats all patches as equally connectable. That freedom is exactly its weakness on small data.
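Translation equivariance is easy to see in one dimension. The sketch below slides a hand-picked edge-detecting kernel over a signal and over a shifted copy of it; the kernel and signals are made up for illustration, and the helper conv1d is a plain-Python stand-in for a convolution layer without padding.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation, the operation CNN layers use."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1, 0, 1]                  # a simple edge detector
signal = [0, 0, 1, 1, 0, 0, 0, 0]    # a "bump" near the left
shifted = [0, 0, 0, 0, 1, 1, 0, 0]   # the same bump moved two steps right

response = conv1d(signal, kernel)
shifted_response = conv1d(shifted, kernel)
print(response)          # [1, 1, -1, -1, 0, 0]
print(shifted_response)  # [0, 0, 1, 1, -1, -1]
```

Moving the input two steps right moves the response two steps right and changes nothing else. The same filter weights detect the bump wherever it appears, which is the prior a CNN gets for free and a ViT has to learn from data.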

The original 2020 ViT paper made this brutally clear. Trained from scratch on ImageNet (about 1.3 million images), ViT actually lost to a comparable ResNet. It only pulled ahead once it was pretrained on a vastly larger dataset of hundreds of millions of images and then fine-tuned on the smaller target task. With enough data, the transformer learns the locality a CNN was handed for free, and then keeps going. The lesson is a recurring one in this field: a flexible model plus enormous data can beat a hand-crafted prior, but only past a scale threshold.

CLIP: teaching pictures and words to share a space

Once images and text both live as sequences of tokens, a tantalizing possibility opens up: could one model understand both at once? CLIP (Contrastive Language–Image Pre-training, 2021) is the breakthrough answer. It uses two encoders, an image encoder (often a ViT) and a text encoder (a transformer), and trains them to place a picture and its caption close together in a shared embedding space. No hand-labeled categories are required; the supervision comes almost for free, from roughly 400 million image–caption pairs collected from the web.

The training objective is contrastive learning, and the idea is gorgeous in its simplicity. Take a batch of, say, 256 image–caption pairs. Encode every image and every caption into vectors. Now the model is shown all 256x256 possible image–text combinations and told: the 256 real pairs should score high (pull them together), and all the mismatched pairs should score low (push them apart). It never learns a fixed list of labels. It learns the far richer skill of judging how well any caption matches any image.
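Here is a minimal sketch of that objective on a batch of three pairs instead of 256. It assumes the common formulation: cosine similarities scaled by a temperature, then a cross-entropy averaged over both directions (image to text and text to image). The function names, the fixed temperature of 0.07, and the toy vectors are assumptions for illustration; in a real model the vectors come from the two encoders and the temperature is typically learned.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i matches caption i."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # logits[i][j]: scaled cosine similarity of image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick out its own caption among all captions...
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ...and each caption should pick out its own image among all images.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # every caption mismatched

print(contrastive_loss(images, aligned) < contrastive_loss(images, shuffled))  # True
```

The loss is low when the diagonal of the similarity grid (the real pairs) dominates each row and column, and high when it does not. Training pushes the encoders toward the first situation.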

The payoff is striking: CLIP can do zero-shot image classification on categories it was never explicitly trained on. To ask "is this a photo of a corgi?", you simply encode the text "a photo of a corgi" and several other candidate sentences, encode the image, and pick whichever caption sits closest to the image in the shared space. You can invent new categories at test time just by writing new sentences — no retraining, no labeled examples.
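The zero-shot recipe comes down to a nearest-neighbor lookup. In the sketch below the three-dimensional vectors are invented stand-ins for real encoder outputs, and zero_shot_classify is a hypothetical helper, not part of any library.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_vec, text_vecs):
    """Return the prompt whose embedding is most similar to the image embedding."""
    return max(text_vecs, key=lambda prompt: cosine(image_vec, text_vecs[prompt]))


# Made-up embeddings standing in for the text encoder's outputs.
text_vecs = {
    "a photo of a corgi": [0.9, 0.1, 0.2],
    "a photo of a cat": [0.1, 0.9, 0.1],
    "a photo of a bicycle": [0.0, 0.2, 0.9],
}
image_vec = [0.8, 0.2, 0.1]  # stand-in for the image encoder's output

print(zero_shot_classify(image_vec, text_vecs))  # a photo of a corgi
```

Adding a category means adding one more entry to text_vecs. Neither encoder is retrained.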

What this unlocks — and what it really can't do

CLIP's shared image–text space turned out to be a quiet foundation under much of modern AI. It is the bridge inside many multimodal models that can describe a photo or answer questions about it, and in several text-to-image systems a CLIP text encoder is what lets a written prompt steer image generation, the topic of the very next guide on diffusion. Treating patches as tokens did not just give us a new classifier; it gave vision a common language with text.

Now the honest limits. CLIP recognizes; it does not localize. It will happily tell you a fire hydrant is in the image but, on its own, cannot draw a box around it — that still needs detection methods. It is also surprisingly weak at counting ("three cats" often looks much like "two cats" to it) and at fine spatial relations ("the cup left of the plate"). And because its training captions were scraped from the open web, CLIP absorbs the web's biases and blind spots — it knows the visual world the internet chose to photograph and caption, not the world as it is.

Keep one more thing in proportion. "Zero-shot" sounds like magic, but it only works for concepts that were richly represented somewhere in those hundreds of millions of training captions. Ask CLIP about a rare medical condition or a niche industrial part it never saw described, and the magic evaporates. The model is not reasoning about novel categories from first principles; it is retrieving and recombining patterns it has, in some diffuse sense, already seen. That is genuinely powerful — and genuinely not the same as understanding.