The Genetic Code: Reading Codons

A four-letter alphabet, a twenty-letter language

By the end of the last rung we had a finished, edited messenger RNA — a single-stranded copy of a gene, written in the four RNA letters A, C, G, and U. That message now has to be turned into a protein, and proteins are written in a completely different alphabet: a chain of amino acids drawn from a set of twenty. So the cell faces a translation problem in the most literal sense. It must convert a text written in four letters into a text written in twenty. How?

Count the possibilities and the answer almost falls out on its own. If the cell read one letter at a time, four letters could name only four amino acids — far too few. Reading two at a time gives sixteen combinations (four times four), still short of twenty. But reading three letters at a time gives sixty-four combinations (four times four times four) — comfortably more than enough. That is exactly what life settled on. The message is read in non-overlapping groups of three, and each triplet is called a codon. A codon is the basic word of the genetic code.
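
The counting argument above is easy to check for yourself. The short Python sketch below is only an illustration; the names `RNA_LETTERS`, `count_words`, and `all_triplets` are made up for this example and are not part of any biology library.

```python
from itertools import product

RNA_LETTERS = "ACGU"  # the four letters of the mRNA alphabet


def count_words(word_length: int) -> int:
    """Number of distinct words of a given length over four letters."""
    return len(RNA_LETTERS) ** word_length


# Cross-check the triplet count by listing every possible codon explicitly.
all_triplets = ["".join(letters) for letters in product(RNA_LETTERS, repeat=3)]

for n in (1, 2, 3):
    verdict = "enough" if count_words(n) >= 20 else "too few"
    print(f"words of length {n}: {count_words(n)} ({verdict} for 20 amino acids)")
```

Only the three-letter case clears the bar of twenty, which is the whole argument in three lines of output.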

The reading frame: where you start changes everything

Here is the catch that makes the triplet idea subtle. The mRNA does not come with spaces between its codons. It is just one long run of letters, like a sentence written WITHNOGAPSATALL. So *where* you begin grouping into threes — the reading frame — completely changes what the message says. Shift your starting point by even one letter and every codon downstream is redrawn into a different word.

mRNA letters:  A U G C C U A C G G G A U A A

frame +0:     AUG CCU ACG GGA UAA   -> Met-Pro-Thr-Gly-STOP
frame +1:     A UGC CUA CGG GAU AA  -> Cys-Leu-Arg-Asp (different words, no stop)
frame +2:     AU GCC UAC GGG AUA A  -> Ala-Tyr-Gly-Ile (different words, no stop)

One strand of letters, three possible reading frames. Only one of them spells the intended protein; the cell must lock onto the right starting point.
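
The three groupings in the diagram can be reproduced with a few lines of Python. This is a minimal sketch, and `split_into_codons` is a hypothetical helper written for this guide, not a function from an existing package. It keeps only complete triplets and ignores any leftover letters at either end.

```python
def split_into_codons(mrna: str, frame: int = 0) -> list[str]:
    """Group an mRNA string into non-overlapping triplets.

    `frame` is the number of letters skipped before grouping starts (0, 1, or 2).
    Leftover letters that do not fill a whole triplet are dropped.
    """
    return [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]


message = "AUGCCUACGGGAUAA"  # the example strand from the diagram

for frame in (0, 1, 2):
    print(f"frame +{frame}:", " ".join(split_into_codons(message, frame)))
# frame +0: AUG CCU ACG GGA UAA
# frame +1: UGC CUA CGG GAU
# frame +2: GCC UAC GGG AUA
```

The letters never change; only the starting offset does, and yet no codon is shared between any two frames.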

This is why the next two ideas, a fixed starting codon and clear stopping codons, are not optional housekeeping. The start codon is what nails the reading frame in place, and the stop codons mark where the message ends. Without an agreed start, the cell would have no way to know which of the three frames is the real message, and the same letters could be read three different (and mostly meaningless) ways.

Start, stop, and the punctuation of a gene

Reading almost always begins at one special codon: AUG. This is the start codon, and it does double duty. It sets the reading frame — telling the ribosome "begin counting threes from here" — and it also codes for an amino acid, methionine. So nearly every freshly made protein begins with a methionine (which the cell often trims off afterward). That single AUG is the anchor that resolves the whole frame problem from the previous section.

Reading ends at any one of three codons, UAA, UAG, and UGA, collectively the stop codons. These are different from all the others: they do not name any amino acid. In the standard code there is no tRNA that matches them. When the reading machinery arrives at a stop codon, no amino acid can be delivered; instead a protein called a release factor recognizes the stop codon, the finished chain is released, and translation halts. So a stop codon works like a period at the end of a sentence: it carries no "letter" of its own, it just marks the end.

Notice how this divides up the sixty-four codons. Three of them are stops, and the remaining sixty-one all name amino acids. AUG belongs to the sixty-one: besides marking the start, it names methionine. Sixty-one codons sharing the work of just twenty amino acids: that imbalance is the next big idea, and it is the reason the code is so robust.
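
Here is a rough sketch of the start-to-stop rule in Python. It builds the standard codon table from a compact 64-letter string of one-letter amino acid codes (M is methionine, `*` marks a stop), then reads from the first AUG until it meets a stop codon. The names `CODON_TABLE` and `translate` are illustrative, and "start at the first AUG" is a deliberate simplification of how a real cell picks its start.

```python
from itertools import product

BASES = "UCAG"
# Standard code, one letter per codon, in UCAG order (UUU, UUC, UUA, UUG, UCU, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read triplets from the first AUG until a stop codon (simplified model)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is read
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


stops = sorted(codon for codon, aa in CODON_TABLE.items() if aa == "*")
print(stops)                          # ['UAA', 'UAG', 'UGA']
print(len(CODON_TABLE) - len(stops))  # 61 codons name an amino acid
print(translate("AUGCCUACGGGAUAA"))   # MPTG, i.e. Met-Pro-Thr-Gly
```

Adding stray letters in front of the AUG, as in `translate("GGAUGCCUACGGGAUAA")`, gives the same result, because the start codon rather than the first letter of the string sets the frame.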

Redundancy and wobble: why sixty-one for twenty

Sixty-one codons (the sixty-four minus the three stops) specify twenty amino acids, so most amino acids are named by more than one codon. The amino acid leucine, for example, has six different codons; methionine and tryptophan have just one each. This many-codons-per-amino-acid property is called redundancy (or degeneracy). It is not sloppiness — it is a feature. Because the spare codons usually differ only in their *third* letter, a typo in that last position often still spells the same amino acid, so the protein comes out unchanged. The code has a built-in shock absorber against small copying errors.
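
The redundancy numbers can be tallied directly from the standard codon table. The sketch below is illustrative only; `codons_per_amino_acid` and `fourfold_prefixes` are names invented for this example. It counts how many codons name each amino acid, then finds the two-letter prefixes for which the third letter makes no difference at all.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard code, one-letter amino acid symbols, "*" = stop, in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons name each amino acid (stops excluded)?
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
print(codons_per_amino_acid["L"])  # leucine: 6
print(codons_per_amino_acid["M"])  # methionine: 1
print(codons_per_amino_acid["W"])  # tryptophan: 1

# Two-letter prefixes where all four third letters give the same amino acid.
fourfold_prefixes = [
    first + second
    for first, second in product(BASES, repeat=2)
    if len({CODON_TABLE[first + second + third] for third in BASES}) == 1
]
print(fourfold_prefixes)
```

For every prefix in that last list, any typo in the third letter leaves the protein unchanged, which is the shock absorber described above.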

There is a clever twist in how this is read. Each amino acid is delivered by a transfer RNA whose three-letter anticodon pairs with the codon. You might expect the cell to keep sixty-one different tRNAs, one for every coding codon, but it does not. The pairing at the third codon position is loose: a single tRNA can recognize several codons that differ only in that last letter. Francis Crick proposed this slack and named the idea the wobble hypothesis. It is why a cell gets away with far fewer tRNAs than codons, and it dovetails perfectly with the redundancy we just met: the third letter is the forgiving one in both the code and the reading of it.
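
To make wobble concrete, here is a small sketch of Crick's original pairing rules in simplified form. Two things here go beyond the text above and should be read as assumptions of the example: the letter I stands for inosine, a modified base found in some anticodons, and real cells use further base modifications that this toy model ignores. The names `WATSON_CRICK`, `WOBBLE`, and `codons_read_by` are hypothetical. Anticodons are written 5' to 3', so the anticodon's first letter faces the codon's third letter.

```python
# Strict pairing, used at the first two codon positions.
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Loose pairing at the codon's third position, keyed by the anticodon's
# first (5') base. Simplified from Crick's wobble rules; "I" is inosine.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}


def codons_read_by(anticodon: str) -> list[str]:
    """List the codons one anticodon (written 5' to 3') can pair with."""
    first, second, third = anticodon
    codon_pos1 = WATSON_CRICK[third]   # strands pair antiparallel
    codon_pos2 = WATSON_CRICK[second]
    return [codon_pos1 + codon_pos2 + codon_pos3 for codon_pos3 in WOBBLE[first]]


print(codons_read_by("GAA"))  # ['UUC', 'UUU'], both phenylalanine
print(codons_read_by("IGC"))  # ['GCU', 'GCC', 'GCA'], all alanine
print(codons_read_by("CAU"))  # ['AUG'], methionine only
```

In each case the codons a single anticodon can read all name the same amino acid, so the looseness costs nothing in accuracy.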

One codebook for nearly all of life

Now the part that should genuinely stop you in your tracks. The assignment of codons to amino acids — AUG means methionine, UUU means phenylalanine, and so on — is almost the same in a bacterium, a redwood tree, a mushroom, and you. We call the code nearly universal. The word "nearly" matters and we will come back to it, but first sit with the main fact: a single shared codebook runs through essentially the whole living world.

Why is that astonishing? Because there is nothing about chemistry that *forces* AUG to mean methionine. The pairing of codon to amino acid is more like a convention than a law of physics — many other codebooks would have worked just as well. So the fact that every branch of life uses the *same* arbitrary convention is the strongest single clue that all of it descends from one common ancestor. The code was fixed long ago in a shared ancestral cell and then was so deeply baked into how life works that changing it would scramble every protein at once — far too costly to ever undo. This is the molecular echo of the shared-ancestry idea you met back when we first compared bacteria, archaea, and eukaryotes.

Now the honest "nearly." The code is not perfectly universal. A handful of organisms and, more often, the mitochondria inside our own cells have a few reassigned codons. For instance, UGA means "stop" in the standard code but is read as the amino acid tryptophan in human mitochondria. These exceptions are rare and small, and they are exactly what we would expect from a shared code that occasionally drifted in an isolated corner. They do not undo the universality; they confirm it has a history. (And this universality is precisely what makes genetic engineering possible: the coding sequence of a human gene placed into a bacterium is translated into the same protein, because the bacterium uses the same codebook.)
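
To see just how small the exceptions are, the sketch below writes the vertebrate mitochondrial code as the standard table plus four edits and lists the codons that differ. The four reassignments are the commonly tabulated ones (UGA reads as tryptophan, AUA as methionine, AGA and AGG as stops); the variable names are invented for this illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard code, one-letter amino acid symbols, "*" = stop, in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code: the standard table with four reassignments.
VERTEBRATE_MITO = {**STANDARD, "UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"}

# Map each differing codon to (standard meaning, mitochondrial meaning).
differences = {
    codon: (STANDARD[codon], VERTEBRATE_MITO[codon])
    for codon in STANDARD
    if STANDARD[codon] != VERTEBRATE_MITO[codon]
}

for codon, (standard_aa, mito_aa) in sorted(differences.items()):
    print(f"{codon}: {standard_aa} in the standard code, {mito_aa} in mitochondria")
print(f"{len(differences)} of {len(STANDARD)} codons differ")
```

Sixty of the sixty-four entries are untouched, which is what a shared codebook with a little local drift should look like.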

Putting it together

Step back and the genetic code is just a lookup table with a few rules of grammar. We have now spelled out those rules — read in triplets, in a fixed frame, from a start to a stop, with redundancy as a safety net and one near-universal codebook for almost all of life. This completes the *information* half of the story: we can now say exactly what an mRNA *means*. What we have not yet seen is the machine that physically does the reading.

That machine is the ribosome, and the codon-by-codon reading it performs is the second half of the central dogma — translation proper. In the next guide we will watch the ribosome lock onto a start codon, pull in the matching tRNAs one codon at a time, and stitch their amino acids into a growing chain. Everything in this guide is the rulebook; what comes next is the reading aloud.