Sequencing & Assembling Genomes

The problem: machines read snippets, not chromosomes

On the previous rung you met the machines that read DNA letter by letter — Sanger's elegant chain-termination method, and the massively parallel next-generation sequencers that read millions of molecules at once. But there is a catch those guides hinted at and we must now confront head-on: no sequencer reads a whole chromosome. Each one reads only a short stretch — a read — typically a few hundred bases for short-read machines, perhaps tens of thousands for long-read ones. A human chromosome is hundreds of *millions* of bases long. So a sequencer never hands you the genome; it hands you a colossal pile of overlapping snippets and leaves you to put them in order.

Picture it as a library catastrophe. Take one fat encyclopedia, photocopy it many times over, then shred each copy at different random places into thousands of strips, and shuffle the lot into one enormous heap. No strip is labelled with a page number. Your job is to reassemble the original book from the strips alone, using nothing but the fact that strips cut from the same region of different copies share overlapping text. That, almost exactly, is the computational puzzle of building a genome from sequencing reads. The reason we shred so many copies is that the overlaps between strips are the *only* clue to how they fit together; without overlaps there is nothing to glue.

Shotgun sequencing: break it up on purpose

The strategy that beat this puzzle has a wonderfully blunt name: [[shotgun-sequencing|shotgun sequencing]]. Instead of trying to read a chromosome end to end, you deliberately blast many copies of the genome into random fragments — as if firing a shotgun at the book — sequence the fragments, and then let a computer reconstruct the order from their overlaps. The randomness is the point. Because each copy shatters in different places, a read that *ends* in the middle of one fragment sits squarely in the *middle* of another, and that staggering is exactly what stitches the puzzle back together.

How many copies must you shred? This is where coverage — a number you met last rung — earns its keep. Coverage is just the average number of reads that pile up over any one position: if your reads, laid end to end, add up to thirty times the genome's length, you have thirty-fold (30x) coverage, meaning each base is read about thirty times on average. You need redundancy for two reasons. First, errors: any single read can miscall a base, but if thirty independent reads cover that spot, the true letter wins the vote. Second, gaps: reads land randomly, so to be confident that *every* stretch got hit at all, you must oversample heavily — like needing far more than fifty raffle tickets to be sure you have covered all fifty numbers.
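The arithmetic behind coverage can be sketched in a few lines. This is a minimal illustration, assuming reads land uniformly at random (an idealised Poisson model that real data only approximate); the function names and the example numbers are invented for illustration.

```python
import math

def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average depth: total bases sequenced divided by genome length."""
    return num_reads * read_length / genome_length

def expected_uncovered_fraction(depth: float) -> float:
    """Under uniform random sampling, the chance that a given base is hit
    by zero reads is approximately e^(-depth)."""
    return math.exp(-depth)

# Hypothetical numbers: a 3-billion-base genome read with 150-base reads.
genome = 3_000_000_000
depth = coverage(num_reads=600_000_000, read_length=150, genome_length=genome)

for c in (1, 5, depth):
    missed = expected_uncovered_fraction(c) * genome
    print(f"{c:>4.0f}x coverage: about {missed:,.0f} bases expected unread")
```

Under this model, 1x coverage still leaves roughly a third of the genome unread (e^-1 is about 0.37), while at 30x the expected number of unread bases falls below one. Real genomes do worse than the model predicts, because some regions are systematically harder to sequence than others.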

Assembly: from reads to contigs to scaffolds

Now the heap of reads goes to a computer, and [[genome-assembly|genome assembly]] begins. The core move is the obvious one: find reads whose ends overlap, and where two reads share a long, exact run of letters, infer that they came from the same place and merge them. Chain enough overlaps together and a long, continuous stretch of sequence emerges — a [[contig-scaffold|contig]] (from *contig*uous). A contig is a piece of the genome the assembler is confident about: an unbroken sequence with no gaps inside it. A first assembly is not one contig but thousands of them, each a solid island of certainty.

READS (random fragments, overlapping):
  ...ATGCCAGTTAC
        CAGTTACGGATC
              ACGGATCTTGAA

  overlaps line up the shared letters:
  ATGCCAGTTAC
      CAGTTACGGATC
           ACGGATCTTGAA
  ---------------------
CONTIG (merged, gap-free):
  ATGCCAGTTACGGATCTTGAA

SCAFFOLD (contigs ordered + oriented; ? = sized gap not yet read):
  [contig 1]----?NNNN?----[contig 2]----?NN?----[contig 3]

Overlapping reads merge into a gap-free contig; paired-end links then order and orient several contigs into a scaffold, leaving sized but unread gaps (written as runs of N).
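The merge shown in the diagram can be sketched as a toy greedy assembler. This is a minimal sketch under strong assumptions: reads are error-free, all come from the same strand, and contain no repeats. Production assemblers build overlap or de Bruijn graphs and tolerate errors; the function names here are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if the best match is shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs  # no overlaps left: each remaining sequence is a contig
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)

# The three reads from the diagram, deliberately out of order.
reads = ["ACGGATCTTGAA", "ATGCCAGTTAC", "CAGTTACGGATC"]
print(greedy_assemble(reads))  # ['ATGCCAGTTACGGATCTTGAA']
```

When no pair overlaps by at least `min_len` letters, the function returns several sequences rather than one, which is the toy analogue of an assembly that ends as many separate contigs.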

Why do the islands not just join into continents? Because of repeats. Genomes are full of sequences that recur almost identically in many places — you met repetitive DNA and transposons on an earlier rung — and a repeat longer than your reads is a trap. The assembler sees the same stretch of letters arriving from a dozen unrelated locations and cannot tell which copy any read belongs to, so rather than guess wrong it stops the contig at the edge of the repeat. Repeats are the single biggest reason short reads leave a genome in pieces, and the deepest reason long reads matter: a read long enough to span a whole repeat, anchored in unique sequence on both sides, walks straight across the trap that defeats a short one.
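The repeat trap is easy to demonstrate on a toy sequence. The sketch below assumes exact matching and an invented genome in which one repeat appears twice; it only illustrates why a read inside a repeat cannot be placed, while a read spanning the repeat and its unique flanks can.

```python
def occurrences(genome: str, read: str) -> list[int]:
    """All start positions where `read` matches `genome` exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits

# Toy genome (made up): the same repeat twice, flanked by unique sequence.
repeat = "TTAGGCATCGAA"
genome = "CCGTA" + repeat + "GTGCG" + repeat + "AACGT"

short_read = repeat[2:10]            # lies wholly inside the repeat
long_read = "GTA" + repeat + "GTG"   # spans the repeat plus both unique flanks

print(occurrences(genome, short_read))  # [7, 24]: two equally good placements
print(occurrences(genome, long_read))   # [2]: the flanks pin it to one copy
```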

To bridge the gaps without reading them, assemblers use a clever trick called paired-end reads: you sequence both ends of a longer fragment whose total length you know, even though the middle stays unread. If one end lands in contig A and the other in contig B, you have learned that A and B are neighbours and roughly how far apart — a rope thrown across a canyon you cannot yet see the bottom of. Ordering and orienting contigs this way produces a scaffold: contigs placed in their correct sequence and direction, with the still-unread gaps between them recorded as runs of the letter N, each run sized as accurately as the paired-end ropes allow. A scaffold is the genome's skeleton — the right bones in the right order, with some joints still to be filled in.
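The gap-sizing step can be sketched with simple subtraction. This is an illustration only, assuming both contigs are already in the same orientation and that every fragment has exactly the stated length (real libraries have a spread of fragment sizes, which is why several links are combined). All names and numbers are hypothetical.

```python
from statistics import median

def gap_estimate(fragment_len: int, len_a: int, start_in_a: int, end_in_b: int) -> int:
    """Gap implied by one read pair: the fragment length minus the part of
    the fragment lying in contig A and the part lying in contig B."""
    in_a = len_a - start_in_a  # from read 1's start to the end of contig A
    in_b = end_in_b            # from the start of contig B to read 2's far end
    return fragment_len - in_a - in_b

# Hypothetical pairs linking contig A (10,000 bases) to contig B:
# (start of read 1 in A, end of read 2 in B), fragments of 3,000 bases.
LEN_A, FRAGMENT = 10_000, 3_000
links = [(9_200, 1_700), (9_500, 2_050), (8_900, 1_420)]

gaps = [gap_estimate(FRAGMENT, LEN_A, s, e) for s, e in links]
gap = median(gaps)
print(gaps, "-> median gap:", gap)  # [500, 450, 480] -> median gap: 480

# The scaffold records the unread stretch as a run of N of that size.
scaffold = "[contig A]" + "N" * gap + "[contig B]"
```

Taking the median of several links is one simple way to dampen the effect of a single odd fragment; it is a design choice for this sketch, not a claim about what any particular assembler does.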

Reference genomes and annotation: from letters to meaning

Once a species has been assembled well, with every chromosome a long, ordered scaffold, that result becomes a [[molbio-reference-genome|reference genome]]: an agreed, high-quality template that everyone in the field shares. Its power is that it converts assembly into mere alignment. Once a good human reference exists, you no longer have to assemble every human genome from scratch; you sequence a new person's reads and simply *map* each read onto the reference, like laying transparent strips over a master copy and reading off only where they differ. This is one reason later human genomes were vastly cheaper than the first (cheaper sequencing machines are the other). One honest caveat, though: a reference is a representative, not a universal truth. Any single reference under-represents the diversity of a whole species, which is exactly why the field is now moving toward *pangenomes*: references built from many individuals rather than one.
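Mapping a read onto a reference and reading off the differences can be sketched by brute force. This is a minimal sketch assuming substitutions only (no insertions or deletions), a single strand, and a tiny reference; real mappers use indexes to avoid scanning every position. The function names and the sample read are invented.

```python
def best_hit(reference: str, read: str) -> tuple[int, int]:
    """Slide the read along the reference and return (position, mismatches)
    for the placement with the fewest mismatches (first one on ties)."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(r != q for r, q in zip(window, read))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm

def differences(reference: str, read: str, pos: int) -> list[tuple[int, str, str]]:
    """(reference position, reference base, read base) for each mismatch."""
    return [(pos + i, reference[pos + i], base)
            for i, base in enumerate(read) if reference[pos + i] != base]

reference = "ATGCCAGTTACGGATCTTGAA"
read = "GTTACGCATC"  # hypothetical read from a new individual, one base changed

pos, mismatches = best_hit(reference, read)
print(pos, mismatches, differences(reference, read, pos))  # 6 1 [(12, 'G', 'C')]
```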

But a finished sequence is still just three billion letters — A, T, G, C with no labels. Knowing the letters is not knowing what they *do*, and closing that gap is [[molbio-genome-annotation|annotation]]: finding the genes and features hidden in the raw text. Annotation works partly by signal-reading and partly by comparison. From the sequence alone, software hunts for the tell-tale marks you have studied — a promoter upstream, a start codon, the canonical splice sites at exon-intron borders, an in-frame stop — to predict where genes lie. Then it leans on evidence: it aligns the new sequence against known genes from other species and against real RNA reads from RNA-seq, because a stretch that is actually transcribed and conserved is far likelier to be a real gene than one merely guessed from the letters. Annotation is an interpretation layered on top of the sequence, not a property of the DNA itself — and like any interpretation, it gets revised as evidence improves.
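The simplest of those signals, a start codon followed by an in-frame stop, can be sketched as an open reading frame scan. This is a deliberately naive illustration: it assumes the forward strand only and no introns, and ignores promoters and splice sites, so it is nowhere near a real gene predictor. The function name, threshold and test sequence are made up.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) spans, end exclusive, that begin with ATG and run
    to the first in-frame stop codon, checked in all three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate reading frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return orfs

seq = "CCATGGCTAAAGGTTGACCATGTAA"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])  # 2 17 ATGGCTAAAGGTTGA
```

The second ATG in the toy sequence is followed immediately by a stop, so it falls under the `min_codons` threshold and is discarded. That is a crude stand-in for the way predictors weigh length and other evidence before calling something a gene.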

The Human Genome Project and the lesson it left

Everything above was forged in one extraordinary effort: the [[molbio-human-genome-project|Human Genome Project]], a roughly thirteen-year international push, launched in 1990 and declared finished in 2003 after a draft was published in 2001, to read a complete human genome for the first time. It cost on the order of billions of dollars and took the coordinated work of labs across many countries; today an equivalent genome can be sequenced for a few hundred dollars in a day or two. That collapse in cost, far steeper than the famous Moore's law of computing, is the project's most tangible legacy, and it is what makes everything else on this rung, from comparing thousands of genomes to sequencing a single cell, even thinkable.

It also left a subtler legacy worth being honest about. The 2001 "complete" genome was neither complete nor truly one person's: it was a draft, a mosaic from several donors, and even after later polishing perhaps eight percent of the genome (the hardest, most repetitive regions, exactly the ones short reads stumble on) remained unfinished. Those gaps were only filled in 2022, by a separate effort using long reads, more than two decades after the celebration. And the project's most quoted finding was a humbling one: only about 20,000 protein-coding genes. The genome turned out to be smaller in gene count and larger in mystery than almost anyone had expected.

Which brings us to the deepest point of this whole guide, the line you should carry up the rest of the ladder: sequencing a genome is not the same as understanding it. Finishing the human sequence was like finally obtaining a complete list of every word in a language you cannot yet read — an indispensable start, and almost nothing on its own. The letters do not tell you which stretches are genes, how those genes are switched on and off, how their proteins cooperate, or why one base change causes disease while a million others do nothing. Turning that raw text into biological understanding is precisely the work of bioinformatics, comparative genomics, and systems biology — the disciplines the rest of this rung is about. The genome is not the answer; it is the dictionary you finally get to start reading.