Sanger Sequencing

Copying is easy; reading is the hard part

By this rung you can do remarkable things to DNA. You can cut it, paste it, and slip it into a bacterium; with PCR you can take one faint stretch of a sequence and copy it a billionfold in an afternoon. But copying is not the same as reading. A tube holding a billion copies of a gene still does not *tell* you its sequence — the actual order of letters, A-T-G-C-C-A and so on down the strand. That order is the whole point: it is the information the cell reads as DNA -> RNA -> protein, the thing a mutation changes, the message you ultimately want. So the question that defines this guide is simple to ask and was, for a long time, brutally hard to answer: given a strand of DNA, how do you find out the exact order of its bases?

The trouble is that a single base is unimaginably small, and the four letters are chemically almost identical — A, T, G and C differ only in a ring or two of atoms. You cannot put a strand under a microscope and squint at the letters; nothing is that sharp. The breakthrough, invented by Frederick Sanger in 1977, sidesteps the problem entirely. Instead of trying to *see* the bases, it converts the invisible question "what is the next letter?" into a visible one: "how long is this fragment?" Lengths you can measure. The genius is the bridge between the two — a way to make a strand stop growing exactly when a particular letter is added, so that the length of the stopped fragment tells you where that letter sits.

The sabotaged building block that stops the chain

To see the trick you need one fact from an earlier rung. When DNA polymerase copies a strand, it adds each new nucleotide onto the same spot every time: a chemical hook on the previous sugar called the 3'-hydroxyl, the 3'-OH. That hook is what the next nucleotide bonds to. No 3'-OH, no place to attach — the chain simply cannot grow another letter. This is exactly why strands extend in the 5'-to-3' direction, the rule you already know. Hold on to this: the 3'-OH is the growing tip, the place the next base hooks onto.

Now the heart of the method. Alongside the normal building blocks, Sanger sequencing slips in a tiny fraction of sabotaged ones called dideoxynucleotides, or ddNTPs. A dideoxynucleotide is almost a perfect forgery: it looks so much like a real nucleotide that the polymerase happily picks it up and adds it to the strand. But it is missing exactly one thing: that 3'-OH hook. The name says it. An ordinary DNA nucleotide is already "deoxy", lacking one oxygen at the 2' position of its sugar; "di-deoxy" means *two* oxygens gone, the second being the oxygen of the 3'-OH. So the moment a ddNTP is added, the chain is poisoned at the tip. There is nowhere for the next nucleotide to attach, and synthesis on that strand stops dead, frozen at that letter. One missing oxygen atom is the whole basis of reading DNA.
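The stopping rule can be sketched in a few lines of Python. This is a toy model, not a chemical simulation: the function name `extend_strand`, the parameter `dd_fraction` and the value 0.3 are illustrative assumptions, and a lowercase letter marks the dideoxy terminator.

```python
import random

# Watson-Crick pairing used to build the new strand from the template.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend_strand(template, dd_fraction=0.05, rng=random):
    """Copy `template` base by base; stop as soon as a ddNTP goes in.

    `template` is written in the order the polymerase reads it.
    `dd_fraction` is the assumed chance that any single addition is a
    dideoxy terminator rather than a normal nucleotide.
    """
    strand = []
    for base in template:
        new_base = PAIR[base]
        if rng.random() < dd_fraction:
            # Terminator: no 3'-OH, so nothing more can be attached.
            strand.append(new_base.lower())
            break
        strand.append(new_base)
    return "".join(strand)

rng = random.Random(1)  # seeded only so the demo is repeatable
for _ in range(5):
    print(extend_strand("TACGGTC", dd_fraction=0.3, rng=rng))
```

Each printed strand is a prefix of the full copy, and any lowercase letter is always the last one: once a terminator is in, nothing follows it.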

A ladder of fragments, smallest to largest

Picture the pile of fragments this produces. The reaction runs on a vast number of template copies at once, and because terminators are rare and slip in at random, each copy's new strand stops at a different point. All of them start from the same place: a short primer that, just as in PCR, gives the polymerase somewhere to begin. One molecule's strand happens to stop 1 base past the primer, another after 2, another after 3, and so on, all the way up. Because termination hits every position in some molecule, you end up holding fragments of length 1, 2, 3, 4, 5... (counting from the end of the primer), a continuous staircase with each step exactly one base taller than the one below it. The crucial extra fact is that you know *which letter* each fragment ends in, because the terminator that stopped it is the last base it carries.
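Why does every length turn up, and why must the terminators be only a small fraction of the mix? Suppose each addition independently has the same chance p of being a terminator (a simplifying assumption; real polymerases are not perfectly even-handed). Then the chance that a strand stops exactly at position k is p(1-p)^(k-1). A minimal sketch, with illustrative values of p:

```python
def stop_probability(k, p):
    """Chance that a strand terminates exactly at position k (1-based),
    assuming every addition independently has probability p of being a ddNTP."""
    return p * (1 - p) ** (k - 1)

# Compare a heavy dose of terminator with a light one.
for p in (0.5, 0.01):
    probs = [stop_probability(k, p) for k in (1, 10, 100, 500)]
    print(p, ["%.2e" % x for x in probs])
```

With p = 0.5 essentially no strand survives to position 100, so the long rungs of the ladder would be missing. With p = 0.01 a small but usable share of strands still stops at position 500, which is why the terminators are kept scarce.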

How do you sort millions of these by length, when neighbouring fragments differ by a single base out of hundreds? With gel electrophoresis, a tool you have met before. DNA carries a uniform negative charge along its sugar-phosphate backbone, so an electric field drags every fragment toward the positive end; the gel is a molecular sieve that holds back long fragments more than short ones. Shorter pieces slip through faster and travel further. Modern machines run this in ultra-thin capillaries with resolution so fine they separate a 200-base fragment from a 201-base one — single-base resolution, which is exactly what reading one letter at a time demands.

Template being copied (3'->5'):           T A C G G T C ...
Complement built by polymerase (5'->3'):  A T G C C A G ...

Each fragment STOPS at its terminator (shown lowercase):

  a                <- stops at base 1, ends in A
  a t              <- stops at base 2, ends in T
  a t g            <- stops at base 3, ends in G
  a t g c          <- stops at base 4, ends in C
  a t g c c        <- stops at base 5, ends in C
  a t g c c a      <- stops at base 6, ends in A
  a t g c c a g    <- stops at base 7, ends in G

Sort by length (short -> long) and read the END letter of each rung:

  A  T  G  C  C  A  G  ...   <- the sequence, read straight off

Each terminated fragment is one rung of a ladder; line them up shortest to longest and the final letter of each rung, read in order, spells the sequence.
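The same ladder can be built and read in code. This is a sketch of the logic only; `make_ladder` and `read_ladder` are hypothetical helper names, and the template is written in the order the polymerase reads it.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def make_ladder(template):
    """Return one terminated fragment for every position of the new strand."""
    new_strand = "".join(PAIR[b] for b in template)
    return [new_strand[:n] for n in range(1, len(new_strand) + 1)]

def read_ladder(fragments):
    """Sort fragments short to long and read the last base of each."""
    return "".join(f[-1] for f in sorted(fragments, key=len))

ladder = make_ladder("TACGGTC")
mixed = ladder[::-1]        # order in the tube carries no information; here simply reversed
print(read_ladder(mixed))   # ATGCCAG
```

Sorting by length is the step the gel performs; reading `f[-1]` is the step that relies on knowing which terminator ended each fragment.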

From a ladder of colours to a read

The modern, automated version adds one elegant touch that makes the whole thing readable by a machine. The four terminators each carry a *different fluorescent dye*: ddA glows green, say, ddT red, ddG yellow, ddC blue (the exact colours vary). Now every fragment is not only a particular length but also tipped with the colour of its final base. As the capillary separates the fragments by length and they file past a laser one by one — shortest first — a detector reads off the colour of each in turn. The string of colours, shortest to longest, *is* the sequence: green-red-yellow-blue-blue-green spells A-T-G-C-C-A. A graph of those coloured peaks marching across the screen is the famous chromatogram, the raw face of Sanger data.

1. Set up one reaction. Mix the single-stranded template, a primer, DNA polymerase, all four normal nucleotides, and a small dose of the four dye-labelled dideoxy terminators.
2. Copy and terminate. The polymerase extends the primer; at each base it usually adds a normal nucleotide but sometimes a terminator, stopping that strand and tipping it with one colour.
3. Sort by length. Run the mixture through a capillary gel; shorter fragments come out first, so the fragments line up in length order, one base apart.
4. Read the colours. A laser and detector record the colour of each fragment as it passes; shortest to longest, the run of colours spells the sequence, and that string is your read.
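The last two steps, sorting by length and reading colours, can be sketched as follows. The dye assignment is the illustrative one used in the text (real kits differ), and `detect` and `call_bases` are hypothetical names for this example, not part of any instrument's software.

```python
# Dye assignment from the example in the text; real kits differ.
DYE = {"A": "green", "T": "red", "G": "yellow", "C": "blue"}
BASE_FOR_COLOUR = {colour: base for base, colour in DYE.items()}

def detect(fragments):
    """Mimic the capillary: fragments pass the laser shortest first,
    and the detector records only the colour of each one's last base."""
    return [DYE[f[-1]] for f in sorted(fragments, key=len)]

def call_bases(colours):
    """Translate the recorded colours back into letters."""
    return "".join(BASE_FOR_COLOUR[c] for c in colours)

fragments = ["ATGCCA", "A", "ATG", "AT", "ATGCC", "ATGC"]  # unsorted, as in the tube
colours = detect(fragments)
print(colours)               # ['green', 'red', 'yellow', 'blue', 'blue', 'green']
print(call_bases(colours))   # ATGCCA
```

Note that the detector never sees the whole fragment, only its arrival order and its colour; those two pieces of information are enough to recover the sequence.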

The decoded string of letters that comes out is called a read — the basic unit of every sequencing technology, the same word you will meet again for newer methods. A good Sanger read runs roughly 500 to 1000 bases before the fragments get too long for the gel to resolve cleanly and the colours start to blur. That length is a real strength: a single Sanger read is long enough to span a small gene or confirm a cloned fragment in one go, and each base usually comes with a quality score saying how confident the call is.
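The text does not say which scale the quality score uses. The most widely used one is the Phred scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. Assuming that scale, the conversion in both directions looks like this (the function names are illustrative):

```python
import math

def phred_quality(p_error):
    """Phred score for an estimated probability that the call is wrong."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Inverse conversion: probability of a wrong call for Phred score q."""
    return 10 ** (-q / 10)

for q in (10, 20, 30):
    print(q, error_probability(q))
```

On this scale every 10 points means a tenfold drop in the chance of error: Q20 corresponds to 1 wrong call in 100, Q30 to 1 in 1000.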

Still the gold standard — and its honest limits

Sanger sequencing was the engine of the original Human Genome Project, the international effort that read a human genome for the first time. Reading three billion bases roughly a thousand at a time meant millions of reads, more than a decade, and billions of dollars: heroic, but plainly too slow and costly to repeat for every patient or species. That pressure is exactly what drove the next-generation methods you will meet shortly, which trade Sanger's careful one-at-a-time reading for reading millions of short fragments in parallel and cut the cost per genome by roughly a millionfold.
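The "millions of reads" figure follows from simple division. The sketch below uses the two round numbers from the text; the `coverage` parameter and the value 8 are illustrative assumptions, standing in for the fact that each base has to be read several times over for reads to overlap and errors to be caught.

```python
GENOME_BASES = 3_000_000_000   # "three billion bases", from the text
READ_LENGTH = 1_000            # "roughly a thousand at a time", from the text

def reads_needed(genome_bases, read_length, coverage=1):
    """Reads required to cover the genome `coverage` times over."""
    total_bases = genome_bases * coverage
    return (total_bases + read_length - 1) // read_length  # ceiling division

print(reads_needed(GENOME_BASES, READ_LENGTH))              # 3000000
print(reads_needed(GENOME_BASES, READ_LENGTH, coverage=8))  # 24000000
```

Even the bare minimum is three million separate reads, and any realistic redundancy pushes the count into the tens of millions.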

But here is the honest twist, and a common misconception worth correcting: "superseded" does not mean "obsolete". Sanger sequencing is still the everyday gold standard for reading one targeted stretch accurately. When you need to be *sure* of one stretch (to confirm a single gene, check that a clone came out right, or double-check a suspicious variant flagged by a next-gen run), Sanger is the method labs trust to settle it. Its read is the careful, definitive one, not the mass-produced one. A next-generation result is often considered confirmed only once it has been re-read by Sanger.