The Genetic Code

The mismatch that demanded a code

You arrive at this rung already holding the two ends of the story. From the transcription rung you know how a gene becomes a strand of RNA, written in just four letters — A, C, G, U. From the protein rung you know that a finished protein is a chain of amino acids, drawn from a palette of twenty. The question that defines this rung is the bridge between them: how does a message in a four-letter alphabet name twenty different things? The answer is the [[molbio-genetic-code|genetic code]], the cell's lookup table from RNA to protein.

Count the possibilities and the design almost falls out on its own. If one RNA letter named one amino acid, you could spell only 4 — far too few. Reading the letters in pairs gives 4 x 4 = 16, still short of twenty. But reading them three at a time gives 4 x 4 x 4 = 64 combinations, comfortably more than enough. So the code reads the message in blocks of three. Each three-letter block is a [[molbio-codon|codon]], and one codon names one amino acid. AUG, GCA, UUU — each triplet is a single word in the protein language.
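The counting argument above can be checked by brute force. The sketch below is only an illustration: the names `BASES` and `words` are made up for this example, and it simply enumerates every RNA word of length one, two, and three.

```python
from itertools import product

BASES = "ACGU"  # the four RNA letters

def words(length):
    """Return every possible RNA word of the given length."""
    return ["".join(letters) for letters in product(BASES, repeat=length)]

for length in (1, 2, 3):
    count = len(words(length))
    verdict = "enough" if count >= 20 else "too few"
    print(f"{length}-letter words: {count} ({verdict} for 20 amino acids)")
```

Only the three-letter words clear the bar of twenty, which is the whole case for a triplet code.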

Sixty-four codons for twenty amino acids leaves a generous surplus, and the code spends it in two ways. Three of the sixty-four are reserved as stop signals, full stops that say "the protein ends here". The other sixty-one all name amino acids, and one of them, AUG, does double duty as both the start signal and the codon for the amino acid methionine. With twenty amino acids sharing sixty-one codons, most amino acids get more than one codon each. That surplus is not waste; as you will see, it is one of the code's quiet safety features.
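The bookkeeping can be tallied directly from the standard codon table. This is a minimal sketch: the table is packed into one 64-character string of one-letter amino acid codes (with `*` for stop), listed with the bases in U, C, A, G order, and the variable names are illustrative rather than taken from any particular library.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = {aa for aa in CODON_TABLE.values() if aa != "*"}

print(len(CODON_TABLE), "codons in total")
print(len(stop_codons), "stop codons:", stop_codons)
print(len(sense_codons), "sense codons, AUG included")
print(len(amino_acids), "distinct amino acids")
```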

Reading the dictionary

By convention the dictionary is written for the messenger RNA, read 5'-to-3' — the same direction the ribosome will travel. The very first codon a cell uses is almost always [[molbio-start-codon|AUG]], which sets the spot where reading begins and contributes the protein's first amino acid (methionine). From there the cell steps along three letters at a time, looking up codon after codon, until it hits one of the three [[molbio-stop-codon|stop codons]] — UAA, UAG, or UGA — which name no amino acid at all. There the chain is finished and released.

mRNA   5'- A U G   G C A   A A A   U U U   U A A -3'
            Met     Ala     Lys     Phe    STOP
             |       |       |       |       |
           start                           stop (no amino acid)

  reading frame = where you start cutting into triplets
  same letters, frame shifted by 1:
       5'- A   U G G   C A A   A A U   U U U   A A -3'   -> different protein
                Trp     Gln     Asn     Phe

A short mRNA read 5'-to-3' as codons: AUG starts, a stop codon ends, and shifting where the triplets begin reads out a completely different message.
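The reading procedure (find AUG, step three letters at a time, halt at a stop codon) can be sketched as a short function. This is a simplified illustration, not a model of how a ribosome actually picks its start site: `translate` is a hypothetical helper that begins at the first AUG it finds, and it reports amino acids by their one-letter codes (M = Met, A = Ala, K = Lys, F = Phe).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to read
    protein = []
    # Step through complete triplets only; a trailing 1-2 letters are ignored.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: the chain is finished
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCAAAAUUUUAA"))  # the mRNA from the diagram -> MAKF
```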

It is worth pausing on what the code is and is not. It is a pure lookup table — UUU always means phenylalanine, in your liver, in a banana, in a soil bacterium. It carries no punctuation between words: there are no commas marking codon boundaries, so the only thing that keeps the triplets aligned is where reading first began. And it is read in one fixed direction. Everything else in this rung — the adapter tRNA you will meet next, and the ribosome that holds the message — exists to physically carry out this table, codon by codon.

Spare words: degeneracy and the wobble

Because sixty-one codons share the work of naming twenty amino acids, almost every amino acid is spelled by several different codons. Leucine has six codons; alanine has four; only methionine and tryptophan get exactly one each. This many-codons-to-one-amino-acid property is called [[code-degeneracy|degeneracy]] (or redundancy). Crucially, degeneracy does not make the code ambiguous: any one codon still means exactly one amino acid. It is a one-way fan-out — many spellings, one meaning — never a word with two meanings.
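Inverting the table makes the fan-out visible. The sketch below groups codons by the amino acid they name; `codons_for` is an illustrative name, and amino acids appear as one-letter codes (L = leucine, A = alanine, M = methionine, W = tryptophan). Because the table is a dictionary keyed by codon, each codon can only ever map to one amino acid, which is the "never ambiguous" half of the claim.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of codons that spell it.
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        codons_for[aa].append(codon)

for aa in "LAMW":
    print(aa, len(codons_for[aa]), codons_for[aa])

single = sorted(aa for aa, codons in codons_for.items() if len(codons) == 1)
print("amino acids with exactly one codon:", single)
```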

Look closely and the redundancy is not random — it clusters in the third letter. Codons for the same amino acid usually agree in their first two positions and differ only in the third: GCU, GCC, GCA, and GCG all mean alanine. Francis Crick explained why with the [[wobble-hypothesis|wobble hypothesis]]. The tRNA adapter reads a codon by pairing its own three-letter anticodon against it, but the pairing at the third position is loose — it "wobbles" — so a single tRNA can recognize several codons that differ only there. That is why a cell needs far fewer than sixty-one tRNAs to read all sixty-one sense codons.
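Wobble pairing can be sketched as a small rule table. The example below assumes Crick's original rules for the anticodon's first (5') base and ignores the many other base modifications real tRNAs carry, so treat it as a simplification. The names `WOBBLE` and `codons_read_by` are hypothetical, and "I" stands for inosine, a modified base found at the wobble position of some tRNAs.

```python
# Strict Watson-Crick partners, used for codon positions 1 and 2.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Crick's wobble rules: anticodon 5' base -> codon 3' bases it can pair with.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}

def codons_read_by(anticodon):
    """List the codons (5'-to-3') that an anticodon (5'-to-3') can read."""
    wobble_base, middle, last = anticodon
    # Pairing is antiparallel: the anticodon's 3' base meets codon position 1.
    first = COMPLEMENT[last]
    second = COMPLEMENT[middle]
    return [first + second + third for third in WOBBLE[wobble_base]]

print(codons_read_by("IGC"))  # three alanine codons from a single tRNA
print(codons_read_by("CAU"))  # the methionine anticodon reads only AUG
```

Under these rules one anticodon, IGC, covers GCU, GCC, and GCA, which is the sense in which a cell can get by with fewer tRNAs than sense codons.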

How the dictionary was cracked

None of this was obvious in 1960. Researchers were sure a code existed but had no idea which triplet meant what. The breakthrough came in 1961 from Marshall Nirenberg and Heinrich Matthaei, who fed a cell-free protein-making mixture an artificial RNA made entirely of one letter: poly-U, just ...UUUUU... The mixture produced a protein chain made entirely of phenylalanine. UUU meant phenylalanine: the first word in the dictionary, read out by experiment rather than guessed.

Two further advances filled in the rest. Nirenberg and Philip Leder devised a trick in which short, defined three-letter RNAs each snagged just the matching tRNA, letting them assign codons one at a time. And Har Gobind Khorana learned to chemically synthesize RNAs with exact repeating patterns (UCUCUC..., AAGAAGAAG...), whose protein products confirmed codon meanings and showed how the same letters read differently in each reading frame. Between them, by 1966, all sixty-four codons had meanings. Nirenberg and Khorana shared the 1968 Nobel Prize in Physiology or Medicine with Robert Holley, who had worked out the structure of a tRNA.

The reading frame, and why a single base matters

Since the code has no commas, where you start cutting the message into triplets is everything. That starting offset is the [[molbio-reading-frame|reading frame]]. The same string of letters can be read in three different frames depending on whether you begin at the first, second, or third base, and each frame produces an entirely different sequence of codons. Think of the English string THEFATCATATETHEBIGRAT: grouped from the start it reads THE FAT CAT ATE THE BIG RAT, but begin one letter over and you get gibberish: HEF ATC ATA TET HEB IGR. The letters never changed; only the grouping did.
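The three frames are easy to see by cutting one string at each of the three possible offsets. The sketch below reuses the short mRNA from the earlier diagram; `split_into_triplets` is an illustrative helper, and leftover letters that do not fill a whole triplet are dropped.

```python
def split_into_triplets(message, frame):
    """Cut message into complete triplets, starting at offset frame (0, 1, 2)."""
    return [message[i:i + 3] for i in range(frame, len(message) - 2, 3)]

message = "AUGGCAAAAUUUUAA"
for frame in range(3):
    print(frame, " ".join(split_into_triplets(message, frame)))
```

Frame 0 gives AUG GCA AAA UUU UAA, frame 1 gives UGG CAA AAU UUU, and frame 2 gives GGC AAA AUU UUA: the same letters, three unrelated sets of codons.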

This is why inserting or deleting a base is so much more drastic than swapping one. Adding or removing a single letter mid-gene shoves every codon downstream over by one, a [[molbio-reading-frame|frameshift]], so from that point on the ribosome reads a garbled string of unrelated codons and usually soon hits a premature stop codon, leaving a truncated, nonsensical protein. Connecting back to the mutation rung: a frameshift is usually far more damaging than a point substitution precisely because it corrupts not one word but the entire rest of the sentence. (Deleting or adding three bases is gentler: it removes or inserts one amino acid's worth of letters and keeps the downstream frame intact.)
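The contrast between the three kinds of change can be sketched on a toy gene. Everything here is hypothetical: the 18-letter `gene` was invented for this example so that a one-base deletion exposes an early stop codon, and `translate` is a simplified helper that reads from the first AUG to the first in-frame stop, reporting one-letter amino acid codes.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

gene = "AUGAAACUGACCGUUUAA"                 # AUG AAA CUG ACC GUU UAA (made up)

substitution = gene[:4] + "G" + gene[5:]    # swap one base: AAA -> AGA
one_deleted = gene[:3] + gene[4:]           # delete one base: frameshift
three_deleted = gene[:3] + gene[6:]         # delete one whole codon (AAA)

print("original:     ", translate(gene))
print("substitution: ", translate(substitution))
print("1-base delete:", translate(one_deleted))
print("3-base delete:", translate(three_deleted))
```

In this toy case the substitution changes one amino acid (MKLTV becomes MRLTV), the single-base deletion cuts the protein down to MN, and the three-base deletion just drops one amino acid (MLTV).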