From BERT to Generative Models

The problem: where do labels come from?

By this point in the rung you have seen the pieces: text broken into tokens, words turned into dense vectors, and the Transformer that lets every token look at every other through self-attention. But a Transformer is just a powerful empty vessel — millions of parameters waiting to be set. To set them, classic supervised learning needs labeled examples, and labeling text is brutally expensive: someone has to mark up sentences by hand.

The breakthrough was to notice that ordinary text already contains its own answers. The internet is full of sentences nobody labeled — but every sentence quietly tells you which word comes next, or which word fits a gap. If you can turn that free structure into a training signal, you no longer need annotators; you need a hard drive. This is the heart of self-supervised learning, and it is what made modern language models possible.

Masked language modeling, and BERT

Earlier in this rung you met language modeling in its original form: predict the *next* word, left to right. BERT (2018) changed the task. Instead of reading in one direction, it randomly selects about 15% of the tokens, hides most of them behind a [MASK] placeholder, and asks the network to recover the originals using the words on both sides at once. This is masked language modeling (MLM), and the word "both" is the whole point: meaning often depends on what comes after, not just before.

input:  the cat sat on the [MASK] and purred
target:                     mat

# the model sees the WHOLE sentence (left + right)
# and must guess the hidden token from context

Masked language modeling: hide a token, predict it from both sides.
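To make the masking step concrete, here is a minimal sketch of how training pairs could be built from unlabeled text. The function name `mask_tokens` and its parameters are made up for illustration. It is also a simplification: the original BERT recipe replaces only most of the selected tokens with [MASK] and swaps or keeps the rest, which this sketch ignores.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide each token with probability mask_prob.

    Returns (masked_input, targets). targets[i] holds the original token
    where one was hidden and None everywhere else, so the loss is only
    computed at the hidden positions.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees the placeholder
            targets.append(tok)   # and must recover the original
        else:
            masked.append(tok)
            targets.append(None)  # nothing to predict here
    return masked, targets

tokens = "the cat sat on the mat and purred".split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(targets)
```

Notice that no human wrote a label: the targets are just the words that were hidden, which is why this counts as self-supervised.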

Why does this teach anything useful? To guess "mat" the network has to absorb grammar, common sense, and the relationships between words — all squeezed out of the single task of filling gaps. Run it over billions of sentences and the embeddings it learns stop being generic; they become context-aware. The same word "bank" gets a different internal vector in "river bank" than in "savings bank," because attention reshapes each token using its neighbors.
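The "bank" example can be sketched with a toy version of self-attention. Everything here is an assumption chosen for readability: the two-dimensional vectors in `STATIC` are invented, and the attention uses no learned projection matrices, which a real Transformer would have. The only point is that the same starting vector comes out different depending on its neighbors.

```python
import math

# Invented 2-d static embeddings, for illustration only.
STATIC = {
    "river":   [1.0, 0.0],
    "savings": [0.0, 1.0],
    "bank":    [0.5, 0.5],
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head attention with identity projections (no learned weights)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Scaled dot-product score of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # New vector = attention-weighted average of all token vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out

def contextual(sentence, word):
    words = sentence.split()
    vecs = self_attention([STATIC[w] for w in words])
    return vecs[words.index(word)]

river_bank = contextual("river bank", "bank")
savings_bank = contextual("savings bank", "bank")
print(river_bank)
print(savings_bank)
```

Both calls start from the identical `STATIC["bank"]` vector, yet the outputs lean toward "river" in one case and "savings" in the other.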

This unfolds in two stages. First comes pre-training: a long, compute-heavy run of masked-word guessing on raw text (days of TPU time for the original BERT), building a general-purpose language understander. Then comes fine-tuning: bolt a tiny output layer on top and train briefly on a small labeled set for your actual task, such as sentiment analysis, named-entity recognition, or question answering. Because the heavy lifting already happened in pre-training, this is just transfer learning for text, and it slashed the amount of labeled data each task needs.
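The "tiny output layer on top" idea can be sketched as follows. This is a toy under loud assumptions: `PRETRAINED` is an invented lookup table standing in for a real pre-trained encoder, the encoder is kept fixed while only the small head is trained (in practice the encoder weights are often updated too), and the four labeled sentences are made up.

```python
import math

# Invented word vectors standing in for a pre-trained encoder.
PRETRAINED = {
    "great": [1.0, 0.2], "love": [0.9, 0.1],
    "awful": [-1.0, 0.3], "hate": [-0.8, 0.2],
    "movie": [0.0, 1.0], "this": [0.0, 0.5],
}

def encode(text):
    """Fixed 'encoder': average the vectors of the known words."""
    vecs = [PRETRAINED[w] for w in text.split() if w in PRETRAINED]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def train_head(data, epochs=200, lr=0.5):
    """Train only a logistic-regression head; encode() is never changed."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in data:
            x = encode(text)
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            g = p - label  # gradient of the log loss w.r.t. the logit
            w = [w[i] - lr * g * x[i] for i in range(2)]
            b -= lr * g
    return w, b

def predict(text, w, b):
    x = encode(text)
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# A deliberately tiny labeled set: 1 = positive, 0 = negative.
labeled = [("love this movie", 1), ("great movie", 1),
           ("hate this movie", 0), ("awful movie", 0)]
w, b = train_head(labeled)
print(predict("great great movie", w, b), predict("awful awful movie", w, b))
```

Four examples are enough here only because the features already separate the classes, which is the role pre-training plays in the real setting.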

Encoder, decoder, or both

The original Transformer had two halves — an encoder that reads the input and a decoder that writes the output — joined for sequence-to-sequence jobs like translation. Later models often kept only one half, and which half they keep is the single most useful thing to understand about a model's personality.

An encoder (like BERT) is bidirectional: every token attends to the entire sentence at once. That makes it excellent at *understanding* (classifying, tagging, searching), but it cannot naturally generate fluent text, because it was never trained to produce one word after another. A decoder (like the GPT family) is the opposite. Its attention is restricted by a causal mask, so each token can only see what came before it, and it is trained purely on next-word prediction. (This mask is a different thing from BERT's [MASK] token: it hides the future from every position rather than hiding a few words from the input.) That left-to-right constraint is exactly what lets it write: it generates one token, appends it, and repeats. This is autoregressive decoding.
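The two ideas in this paragraph, who may look at whom and the generate-append-repeat loop, can be sketched in a few lines. The names `attention_mask`, `generate`, and `toy_next_token` are illustrative, and the lookup table `NEXT` is a made-up stand-in for a trained decoder, which would instead compute a probability for every token in its vocabulary.

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

# Made-up stand-in for a trained decoder: last token -> next token.
NEXT = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}

def toy_next_token(tokens):
    return NEXT.get(tokens[-1], "</s>")

def generate(next_token, prompt, max_new_tokens=10, eos="</s>"):
    """Autoregressive decoding: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # sees only the tokens produced so far
        if tok == eos:
            break
        tokens.append(tok)
    return tokens

print(attention_mask(3, causal=False))  # encoder style: everything visible
print(attention_mask(3, causal=True))   # decoder style: lower triangle only
print(generate(toy_next_token, ["<s>"]))
```

The lower-triangular pattern is the whole difference: training with it means the model never relies on words it will not have at generation time.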

The bridge to large language models

Here is the plot twist that surprised the whole field. BERT's bidirectional reading was widely thought to be the smarter design — and for understanding tasks, it often is. But the boring left-to-right decoder turned out to scale further. As researchers grew the decoders and fed them more text, something unplanned happened: a model trained only to predict the next word started to *follow instructions*, answer questions, and do tasks nobody fine-tuned it for, simply by being shown a few examples in its prompt. That last ability is in-context learning.
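In-context learning needs no training code at all, only a prompt. The sketch below builds a few-shot prompt as a plain string. The "Review:" and "Sentiment:" layout and the helper name `few_shot_prompt` are assumptions for illustration, and no model is called here.

```python
def few_shot_prompt(examples, query):
    """Lay out solved examples, then the unsolved case for the model to finish."""
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

examples = [
    ("I loved every minute.", "positive"),
    ("A complete waste of time.", "negative"),
]
prompt = few_shot_prompt(examples, "Surprisingly good.")
print(prompt)
```

A decoder that continues this text with the next most likely token is, in effect, doing sentiment classification, even though its weights were never updated for that task.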

So a large language model is, at its core, the decoder you just met, grown enormous and trained on a staggering amount of text. The recipe that links BERT to ChatGPT is short: same Transformer block, same self-supervised pre-training idea, but next-word prediction instead of masked-word filling, scaled until new behaviors appear. (Chat products such as ChatGPT then add a further round of instruction tuning on top of that pre-trained base.) Because the same pre-trained base can be prompted or adapted for countless downstream uses, people call it a foundation model.

What does "scale until new behaviors appear" actually mean? Empirically, loss falls in smooth, predictable curves as you add data, parameters, and compute — these are the scaling laws. The surprising part is that certain skills seem to switch on only past a size threshold rather than improving gradually. Be careful here: a lot of that "sudden emergence" can be an artifact of how we measure success, and it is one of the most contested claims in the field. Bigger reliably means lower prediction loss; it does not reliably mean a leap to a new cognitive ability.
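Both halves of this paragraph can be illustrated with a little arithmetic. In the sketch below, the power-law form is a common way to write a scaling law, but the constants `scale` and `alpha` are placeholders, not fitted values. The second function shows one way a measurement artifact can arise: it assumes token errors are independent, so an answer of 20 tokens is fully correct only if every token is.

```python
def power_law_loss(n_params, scale=1e13, alpha=0.08):
    """Illustrative scaling law: loss = (scale / n_params) ** alpha."""
    return (scale / n_params) ** alpha

def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token of the answer is right (independence assumed)."""
    return per_token_accuracy ** answer_length

# Loss falls smoothly as the (hypothetical) parameter count grows.
losses = [power_law_loss(n) for n in (1e8, 1e9, 1e10, 1e11)]
for n, loss in zip((1e8, 1e9, 1e10, 1e11), losses):
    print(f"{n:.0e} params -> loss {loss:.2f}")

# Per-token accuracy also improves smoothly, but an all-or-nothing
# metric over 20 tokens stays near zero and then shoots up.
rates = [exact_match_rate(p, 20) for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95)]
for p, r in zip((0.5, 0.6, 0.7, 0.8, 0.9, 0.95), rates):
    print(f"per-token {p:.2f} -> exact match {r:.4f}")
```

The underlying skill improves gradually in both loops. Only the strict pass-or-fail score makes it look like a switch flipping.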

Honest limits, and what to remember

It is tempting to watch a model that "predicts the next word" write fluent paragraphs and conclude that it understands language the way you do. Resist that. A decoder is optimizing one objective: make the next token statistically likely given the previous ones. Out of that pressure comes astonishing fluency, but also confident errors: a model will produce smooth, plausible text that is simply false, because plausibility, not truth, is what it was trained for. That failure is called hallucination, a name you will meet often, and it is not a bug to be patched away: it is built into the objective.
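A tiny sketch shows why a likely continuation and a true one are not the same thing. The probabilities in `NEXT_TOKEN_PROBS` are invented for illustration and were not measured from any real model. They imagine a model that has seen "Sydney" next to "Australia" more often than "Canberra".

```python
import random

# Invented probabilities for illustration; not taken from a real model.
NEXT_TOKEN_PROBS = {"Sydney": 0.55, "Canberra": 0.40, "Melbourne": 0.05}

def greedy(probs):
    """Pick the single most likely next token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw a next token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

prompt = "The capital of Australia is"
print(prompt, greedy(NEXT_TOKEN_PROBS))
print(prompt, sample(NEXT_TOKEN_PROBS, random.Random(0)))
```

Nothing in either function checks facts. With these numbers the most likely token is the wrong city, and the decoding step has no way to know that.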

Pull the rung together and the whole arc is one idea repeated at growing scale. You went from counting words, to tokenizing them, to static word vectors, to context-aware vectors from attention, to two ways of pre-training on free text. The encoder branch gave us BERT and deep language understanding; the decoder branch, grown vast, gave us the generative models now reshaping how people work. The later rungs of this ladder — using LLMs, evaluation, safety — all build on exactly the machinery you just traced.