Next-Generation Sequencing

The bottleneck Sanger could not break

In the previous guide you met Sanger sequencing, and it is genuinely beautiful: a polymerase copies your strand, and every so often a chain-terminating dideoxynucleotide caps the growing copy, so you end up with a nested ladder of fragments of every length, each labelled by its final letter. Run that ladder out and read off the colours, and you get the sequence. The catch is the word *one*. One reaction reads one fragment — a few hundred to about a thousand bases — down a single lane. To read a whole human genome of three billion bases this way, you have to clone it into millions of pieces, run millions of separate reactions, and stitch the answers together. That is exactly how the Human Genome Project did it, and it took roughly a decade and billions of dollars.

So the bottleneck was never accuracy — Sanger reads are excellent. It was throughput: how many letters you can read per dollar and per day. Sanger is a craft process, one tube and one capillary at a time, and you cannot meaningfully scale a craft to billions of bases. The obvious dream was to stop reading fragments one after another and instead read a huge number *simultaneously*, side by side, in the same small space. That dream is what next-generation sequencing — also called *massively parallel sequencing* — finally delivered, and it is the reason a genome that once cost a nation's research budget now costs about as much as a smartphone.

Reading by synthesis, a million spots at once

The dominant flavour of next-generation sequencing works by a trick called sequencing by synthesis, and the core idea is one you already half-know: watch a polymerase build a complementary strand and write down each letter *as it is added*. First the genome is shredded into short pieces and millions of those fragments are scattered and stuck across a glass slide, each one in its own tiny spot. Each lone fragment is then copied in place — amplified, much like a tiny local PCR — into a dense cluster of a thousand or so identical molecules, so that whatever signal one molecule gives, the whole cluster gives a thousandfold and you can actually see it. Now you have a slide carrying *millions of clusters*, each a pure colony of one fragment, all ready to be read at the same time.

1. Add all four bases at once, but each one carries a coloured tag and a chemical 'cap' that blocks the next base from joining. The polymerase adds exactly one correct base to every cluster, then stalls.
2. Take a photograph of the whole slide. Each cluster glows in one of four colours, telling you which single letter was just added there: millions of letters read in one snapshot.
3. Chemically snip off the colour tags and the caps, freeing every chain to accept its next base.
4. Repeat the add-photograph-snip cycle a few hundred times. Stack the photos in order and each cluster spells out its fragment, letter by letter.
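
The add-photograph-snip cycle can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: the function name and inputs are hypothetical, every base is called perfectly, and each template is written in the order the polymerase meets it, so strand orientation is ignored.

```python
# Toy model of sequencing by synthesis: one "photograph" per cycle.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sequence_by_synthesis(templates, cycles):
    """Read every cluster in parallel, one base per cycle.

    templates: one template strand per cluster (hypothetical input).
    Returns the complementary strand built on each cluster.
    """
    reads = ["" for _ in templates]
    for cycle in range(cycles):
        # The "photograph": the base just added on every cluster at once.
        snapshot = [
            COMPLEMENT[t[cycle]] if cycle < len(t) else None
            for t in templates
        ]
        # The "snip" is implicit: the loop simply moves on to the next position.
        for i, base in enumerate(snapshot):
            if base is not None:
                reads[i] += base
    return reads

clusters = ["ACGT", "TTGA", "GGCA"]
print(sequence_by_synthesis(clusters, cycles=4))  # ['TGCA', 'AACT', 'CCGT']
```

The point of the sketch is the loop order: the outer loop runs over cycles and the inner work touches every cluster, which is why read length is set by the number of cycles while throughput is set by the number of clusters.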

Reads, coverage, and putting the puzzle back together

What rolls off the machine is not a genome but a flood of short pieces. Each fragment you read gives one read — a string of letters, typically only 100 to 300 bases long for the synthesis method. A single human-genome run produces hundreds of millions, even billions, of these reads. Crucially, the genome was shattered *randomly*, so the reads overlap one another at random, like tearing many copies of the same book into confetti and tipping them all into one heap. The fact that the pieces overlap is the whole key to reassembling them, and it is why you deliberately read far more total letters than the genome actually contains.

That deliberate excess has a name: coverage, or *depth*. If you read enough fragments that, on average, every position in the genome is covered by 30 different overlapping reads, you have *30x coverage*. Depth is your safety net. Each individual read carries some error, and any one base might land in a read that misread it; but when 30 reads independently agree that a given spot is a G, you can trust it, and when they split 15-for-A and 15-for-G, you have caught a real variant — one copy of the chromosome differs from the other. Low coverage is cheap but leaves gaps and uncertain calls; high coverage costs more but buys confidence. Choosing a depth is the everyday economics of a sequencing experiment.
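
The arithmetic behind coverage, and the vote at a single position, can be sketched in a few lines. The function names are hypothetical, and the 25% allele threshold is an assumption chosen for illustration, not a standard setting.

```python
import math
from collections import Counter

def mean_coverage(num_reads, read_length, genome_size):
    """Average depth: total bases read divided by genome length."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """How many reads of a given length reach the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)

def call_genotype(bases, min_fraction=0.25):
    """Return the alleles supported by at least min_fraction of the reads.

    bases: the letters that the overlapping reads report at one position.
    min_fraction is an illustrative assumption, not a recommended value.
    """
    counts = Counter(bases)
    depth = len(bases)
    return sorted(b for b, n in counts.items() if n / depth >= min_fraction)

HUMAN_GENOME = 3_000_000_000
print(reads_needed(30, 150, HUMAN_GENOME))           # 600000000
print(mean_coverage(600_000_000, 150, HUMAN_GENOME)) # 30.0
print(call_genotype("G" * 29 + "A"))                 # ['G']
print(call_genotype("A" * 15 + "G" * 15))            # ['A', 'G']
```

With 150-base reads, 30x of a three-billion-base genome needs about 600 million reads. The last two calls mirror the two cases in the text: one stray A among 29 Gs is outvoted, while an even split is reported as two alleles.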

GENOME: ...A C G T T A G C C A T G A C ...   (the truth we want)

Reads (short, overlapping, error-prone):
           A C G T T A G
               G T T A G C C A
                     A G C C A T G
                           C A T G A C
           --------------------------------
ALIGN +    A C G T T A G C C A T G A C   <- overlaps let us
VOTE       every column read many times      rebuild the sequence

Average coverage is about 2x here (28 read bases over 14 positions; each base sits under 1 to 3 reads). At realistic depths such as 30x, a single misread loses the vote.

Many short reads overlap by chance; aligning them and taking a majority vote at each position both reconstructs the sequence and corrects random errors. More overlap means higher coverage and more confidence.
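
A minimal sketch of the align-and-vote step, assuming the reads have already been placed at known offsets (the hard part, which this sketch skips). The function name and the injected error are illustrative.

```python
from collections import Counter

def consensus(placed_reads, length):
    """Majority vote per column over reads already placed on the sequence.

    placed_reads: list of (offset, sequence) pairs.
    Columns that no read covers are reported as 'N'.
    """
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )

reads = [
    (0, "ACGTTAG"),
    (2, "GTTCGCCA"),  # one deliberate misread: the true base here is A, not C
    (5, "AGCCATG"),
    (8, "CATGAC"),
]
print(consensus(reads, 14))  # ACGTTAGCCATGAC
```

The misread C sits in a column that two other reads also cover, so it is outvoted two to one. In a column covered by only one or two reads the same error would survive or tie, which is the practical argument for depth.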

Turning that heap of overlapping reads back into a continuous sequence is genome assembly, a giant jigsaw solved by computer. When you have a known reference genome for the species, the job is easier: you simply find where each read best matches the reference and lay it down there, a step usually called read mapping or alignment, like sorting confetti against a finished picture on the box lid. Building a genome *from scratch*, with no reference (*de novo* assembly), is harder, because you must find the overlaps purely by matching read to read, and repetitive stretches that look identical everywhere can stall the puzzle. This is the moment biology becomes computing: the wet lab hands off to software and statistics, and reading a genome turns into a problem of bioinformatics.
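
Placing a read against a reference can be sketched as a brute-force search for the offset with the fewest mismatches. This is a toy under stated assumptions: the function name is hypothetical, no gaps are allowed, and real mapping software relies on indexed search rather than trying every offset.

```python
def best_placement(reference, read):
    """Return (offset, mismatches) for the best ungapped placement of read.

    Tries every offset and keeps the first one with the fewest mismatches.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best

reference = "ACGTTAGCCATGAC"
print(best_placement(reference, "AGCCATG"))   # (5, 0)
print(best_placement(reference, "GTTCGCCA"))  # (2, 1)
```

The second read carries one error and still lands in the right place, because one mismatch is far better than any other offset can offer. Tolerating a few mismatches is what lets mapping cope with both sequencing errors and real variants.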

Long reads: a strand through a nanopore

The synthesis method has one stubborn weakness: its reads are *short*. A few hundred bases is fine for spotting single-letter differences against a reference, but it is hopeless for spanning a long repetitive region, because a 150-base read that lands inside a stretch of identical repeats could have come from anywhere in that stretch — the puzzle has many identical pieces. The answer is a completely different, *third-generation* approach: nanopore sequencing, which reads a single DNA molecule directly, with no copying and no synthesis at all.
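
The repeat problem is easy to reproduce on a made-up sequence. In this sketch the genome, the repeat unit, and the read lengths are all invented for illustration, and matching is exact.

```python
def placements(genome, read):
    """All offsets where read matches genome exactly."""
    return [
        i for i in range(len(genome) - len(read) + 1)
        if genome[i:i + len(read)] == read
    ]

repeat = "CAG" * 10                      # a 30-base stretch of identical units
genome = "TTGACC" + repeat + "GGATCA"    # unique flanks on either side

short_read = "CAGCAGCAG"                 # lies entirely inside the repeat
long_read = "ACC" + repeat + "GGA"       # spans the repeat into both flanks

print(len(placements(genome, short_read)))  # 8
print(placements(genome, long_read))        # [3]
```

The short read fits in eight places and nothing in the read itself can choose between them. The long read reaches unique sequence on both sides, so it has exactly one home and also fixes how many repeat units lie between the flanks.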

Picture a membrane with a single protein pore in it, just wide enough for one strand of DNA to thread through, and a tiny electric voltage pushing ions across that hole as a steady current. Now feed a DNA molecule through the pore. As each base passes the narrowest point, its particular shape and size choke the current by a characteristic amount — the four letters squeeze it differently — so the strand writes itself out as a wobbling electrical trace, and software decodes that trace back into A, T, G, and C. Because you simply keep threading the same molecule, the read does not stop at a few hundred bases: nanopore reads routinely run to tens of thousands of bases, sometimes more than a million, long enough to stride right across the repeats that defeat short reads.
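
As a deliberately oversimplified sketch of decoding a trace, assume each base produces exactly one current sample at its own level plus a little noise. The levels below are made-up numbers, not real pore measurements, and the function names are hypothetical. In real devices the signal depends on several neighbouring bases at once and on how long each sits in the pore, so actual basecalling software is far more elaborate than this.

```python
import random

# Made-up current levels (arbitrary units), one per base. Purely illustrative.
LEVELS = {"A": 60.0, "C": 80.0, "G": 100.0, "T": 120.0}

def simulate_trace(strand, noise=5.0, seed=0):
    """One noisy current sample per base (a strong simplifying assumption)."""
    rng = random.Random(seed)
    return [LEVELS[b] + rng.uniform(-noise, noise) for b in strand]

def basecall(trace):
    """Assign each sample to the base whose level is nearest."""
    return "".join(
        min(LEVELS, key=lambda b: abs(LEVELS[b] - x)) for x in trace
    )

trace = simulate_trace("GATTACA")
print(basecall(trace))  # GATTACA
```

Decoding is perfect here only because the noise (at most 5 units) is smaller than half the gap between levels (10 units). Once levels overlap, individual calls start to go wrong.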

Why cheap sequencing changed biology

Once you can read DNA by the billions of bases cheaply, the same machine reads far more than genomes. Convert a cell's RNA back into DNA (using reverse transcriptase, the same enzyme you met in the central-dogma rung) and sequence that, and you are doing RNA-seq: instead of asking *what genes are in this cell*, you ask *which genes has this cell switched on, and how loudly*, by counting how many reads land on each gene. Push the same idea down to one cell at a time and you get single-cell sequencing, which has revealed that tissues we once treated as uniform are in fact mosaics of distinct cell states. The reader of letters became a universal meter for what cells are doing.
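
The counting step behind RNA-seq can be sketched as follows, assuming the reads have already been mapped and each is summarised by a single start coordinate. The gene names, coordinates, and function name are invented for illustration.

```python
from collections import Counter

def count_reads_per_gene(genes, read_positions):
    """Count mapped reads falling inside each gene.

    genes: dict of name -> (start, end), with end exclusive.
    read_positions: mapped start coordinate of each read.
    Reads outside every gene are ignored.
    """
    counts = Counter({name: 0 for name in genes})
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
                break
    return dict(counts)

genes = {"geneA": (0, 1000), "geneB": (1000, 3000), "geneC": (3000, 3500)}
reads = [10, 250, 999, 1500, 3100, 3200, 3300, 3400, 5000]
print(count_reads_per_gene(genes, reads))
# {'geneA': 3, 'geneB': 1, 'geneC': 4}
```

Raw counts like these are not directly comparable between genes or samples: a longer gene or a deeper run collects more reads for the same expression level, so in practice counts are normalised before being interpreted.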

It reaches the clinic too. Sequencing a tumour reveals exactly which mutations drive it, pointing toward a drug aimed at that specific change. A pregnant person's blood carries traces of fetal DNA that sequencing can screen for chromosomal conditions without a needle near the womb. During a pandemic, reading the genome of a circulating virus from thousands of samples is how new variants get spotted and tracked within days. None of this would be affordable at Sanger throughput; it is precisely the millions-at-once parallelism that makes sequencing a routine diagnostic rather than a heroic one-off project.

One last honest note, so you carry a true picture up the ladder. The famous phrase 'the thousand-dollar genome' refers to the cost of generating the raw reads — it does not include the assembly, the bioinformatics analysis, the storage, or the hard work of interpreting what a variant actually *means* for a person. Cheap reading did not make biology simple; it shifted the bottleneck downstream, from the bench to the data. We can now read genomes faster than we can understand them, and the science of turning a torrent of letters into knowledge is, in many ways, where molecular biology is busiest today.