Reading History in Sequences

The genome as a historical document

You climbed this whole ladder to see how a genome is copied, transcribed, translated and repaired: a beautiful machine for getting a sequence faithfully from parent to child. But that very faithfulness has a side effect, and this rung is built on it. Because DNA is copied so accurately and changes so slowly, every genome alive today is also an archive: a running record of the changes passed down its lineage since the dawn of life. The same molecule that carries instructions for *building* an organism also carries the story of *where that organism came from*. The central idea of molecular evolution is to read that second message, treating a sequence not only as a blueprint but as a manuscript, copied and re-copied across billions of years and still legible if you know how to look.

What makes the manuscript readable is the one process you already know as the engine of change: [[mutation-fitness-spectrum|mutation]]. A miscopied base here, a swapped letter there: these are the edits that accumulate down the generations. And the crucial fact, met earlier in this ladder, is that most of the mutations that persist are neutral or nearly so: they neither help nor harm, so they ride along and pile up almost like clockwork. That steadiness is what turns mutation from mere damage into a *measuring tool*. The more time two lineages have spent apart, the more independent edits each has collected, and so the more their sequences will have drifted apart. Difference, in other words, is a proxy for elapsed time, and that single insight is the foundation of everything in this rung.

Comparing two sequences tells a story of descent

Take the same protein from two species — say the oxygen-carrying haemoglobin of a human and of a horse — and write the two amino-acid sequences out one above the other so that matching positions line up. This is an alignment, and a single one already tells a story. Most columns match exactly; a scattered few differ. The matches are not coincidence: two unrelated sequences would agree only about as often as random chance allows, whereas these agree almost everywhere. That overwhelming similarity is the signature of [[sequence-homology|homology]] — the two proteins are not similar because they happen to do similar jobs, they are similar because they are *literally the same ancestral protein*, inherited down two diverging lines from a creature that lived long ago. The differences are the edits each lineage made on its own copy in the time since.
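Counting matches and mismatches in an alignment can be sketched in a few lines. This is a minimal illustration, not a real alignment tool: it assumes the two sequences are already aligned, equal in length and free of gaps. The function names and the two short fragments are hypothetical, chosen only for the example.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count the columns at which two pre-aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of columns at which the two sequences match."""
    matches = len(seq_a) - count_differences(seq_a, seq_b)
    return 100.0 * matches / len(seq_a)


# Illustrative fragments only (assumed, not real database entries).
human = "MVHLTPEEKSAV"
horse = "MVHLTPEEKTAV"

print(count_differences(human, horse))           # 1
print(round(percent_identity(human, horse), 1))  # 91.7
```

For scale: if all 20 amino acids were equally common, two unrelated sequences would be expected to match at only about one column in twenty (5%), so an identity above 90% is very hard to explain by chance.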

Now count those differences, and the alignment becomes a clock. If neutral edits accumulate at a roughly steady pace, then the number of differences between two sequences estimates how long ago their lineages split — the time back to their last common ancestor. This is the logic of the [[molecular-clock|molecular clock]], and it is what sequence comparison buys you: not just *that* two species are related, but a rough *when*. Human and chimp proteins differ by very little, so their split was recent; human and yeast proteins differ a great deal, so that branching is ancient. Be honest about the assumption, though — the clock is not a precise stopwatch. Different genes tick at different rates, the rate can speed up or slow down between lineages, and so a molecular date is always an estimate, best trusted when it is calibrated against an independent anchor such as a well-dated fossil.
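The arithmetic behind a calibrated clock is short enough to sketch. The version below assumes a strict clock (one constant rate on both branches) and distances given as differences per site. Every number is invented for illustration, and the function names are hypothetical. The factor of 2 is there because edits accumulate independently along *both* lineages after the split.

```python
def calibrate_rate(distance: float, split_time_myr: float) -> float:
    """Rate per site per million years along one lineage.

    `distance` is the per-site difference between two species whose split
    is dated independently, for example by a fossil (assumed known).
    """
    return distance / (2.0 * split_time_myr)


def estimate_split_time(distance: float, rate: float) -> float:
    """Time back to the common ancestor, in million years, under a strict clock."""
    return distance / (2.0 * rate)


# Hypothetical calibration pair: 10% different, fossil-dated split 50 Myr ago.
rate = calibrate_rate(distance=0.10, split_time_myr=50.0)

# Hypothetical query pair: 2% different, so about one fifth as old.
print(round(estimate_split_time(distance=0.02, rate=rate), 3))
```

Because the rate is fitted to a single anchor, any shift in rate between lineages feeds straight into the estimated date, which is why the result is best read as a rough figure rather than a measurement.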

Conserved versus variable: where selection leaves its mark

Look harder at that alignment and you notice the differences are not spread evenly. Some columns are *identical* in every species you add — frozen solid across hundreds of millions of years — while others change at almost every branch. The frozen columns are [[conserved-variable-sites|conserved sites]]; the churning ones are variable sites. Why the difference? Not because mutation avoids the conserved spots — mutation strikes everywhere blindly. The conserved columns stay the same because nearly every mutation that lands there breaks something essential, and the organism carrying it leaves fewer offspring, so the change is quietly removed from the population. That filtering-out of harmful variants is [[purifying-selection|purifying selection]], and a conserved site is its fingerprint: a position so important that evolution could not afford to let it move.

For protein-coding genes there is an even sharper reading, and it rides on the redundancy of the genetic code you already know. Because the code is degenerate, some DNA changes swap the amino acid (a *non-synonymous* change) while others leave the protein untouched (a *synonymous* change). Synonymous changes barely feel selection, so they accumulate near the neutral rate; non-synonymous changes alter the protein and are filtered. Count each kind per site where it could have happened, giving the rates dN (non-synonymous) and dS (synonymous), and the ratio of the two, the [[dn-ds-ratio|dN/dS ratio]], turns selection into a single number you can read straight off an alignment of the coding DNA.
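To make the counting concrete, here is a simplified sketch loosely modelled on the Nei-Gojobori counting approach. It is a swapped-in illustration, not the method any particular tool uses. It assumes two aligned, gap-free coding sequences in the standard genetic code. It counts only codons that differ at exactly one base, and it reports uncorrected proportions (pN, pS) rather than properly corrected dN and dS. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon: str) -> float:
    """Expected number of synonymous sites in one codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                total += 1.0 / 3.0  # one of three possible changes at this position
    return total


def pn_ps(seq_a: str, seq_b: str):
    """Return (pN, pS, pN/pS); the ratio is None when pS is zero."""
    if not seq_a or len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("need equal-length, non-empty, whole-codon sequences")
    syn_sites = nonsyn_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, len(seq_a), 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        s = (synonymous_sites(a) + synonymous_sites(b)) / 2.0
        syn_sites += s
        nonsyn_sites += 3.0 - s
        # Simplification: codons differing at two or three bases are skipped.
        if sum(x != y for x, y in zip(a, b)) == 1:
            if CODON_TABLE[a] == CODON_TABLE[b]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    p_n = nonsyn_diffs / nonsyn_sites
    p_s = syn_diffs / syn_sites if syn_sites else 0.0
    return p_n, p_s, (p_n / p_s if p_s > 0 else None)


# Invented sequences: two silent changes (GTG>GTT, CTG>CTC), no protein change.
p_n, p_s, ratio = pn_ps("ATGGTGCACCTGACT", "ATGGTTCACCTCACT")
print(p_n, round(p_s, 3), ratio)
```

In this toy pair every difference is silent, so pN is zero and the ratio sits at the purifying-selection end of the scale. Real estimates also correct for multiple hits and handle codons with several differences, which this sketch deliberately leaves out.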

One protein, three species, aligned column by column (schematic sequences):

  human   M  V  H  L  T  P  E  E  K  S  A  V
  horse   M  V  H  L  T  P  E  E  K  T  A  V
  yeast   M  V  H  L  S  G  Q  E  K  N  A  V
          |  |  |  |  ^  ^  ^  |  |  ^  |  |
          |  conserved site (purifying selection)
          ^  variable site (drift tolerated)

  few differences  -> recent common ancestor
  many differences -> ancient split

  dN/dS < 1  purifying selection (site / gene matters)
  dN/dS ~ 1  drifting, little constraint
  dN/dS > 1  positive selection (change favoured)

A single alignment does double duty: the count of differences estimates time since the common ancestor, while the pattern of which columns stay frozen reveals which residues selection refuses to let change.
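The marker row in the figure can be reproduced mechanically. The sketch below assumes the three rows are already aligned and gap-free, and it calls a column conserved only if every sequence carries the same residue there. With so few species that is a crude rule of thumb rather than a test for selection. The function name is hypothetical.

```python
# The three rows from the figure above.
alignment = {
    "human": "MVHLTPEEKSAV",
    "horse": "MVHLTPEEKTAV",
    "yeast": "MVHLSGQEKNAV",
}


def classify_columns(seqs):
    """Mark each column '|' if all residues agree, '^' if any differ."""
    return "".join("|" if len(set(column)) == 1 else "^"
                   for column in zip(*seqs))


marks = classify_columns(list(alignment.values()))
variable_positions = [i + 1 for i, m in enumerate(marks) if m == "^"]

print(marks)               # ||||^^^||^||
print(variable_positions)  # [5, 6, 7, 10]
```

Adding more species to the dictionary is what gives the classification its force: a column that stays identical across many distant lineages is far stronger evidence of constraint than one that matches in just three.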

Why molecules often beat fossils

It is tempting to think bones are the gold standard of history and sequences a clever afterthought. Often it is the other way round. Fossils are rare, patchy, and biased: soft-bodied creatures, microbes and deep-sea life leave almost nothing behind, and whole branches of the tree of life have *no* fossil record at all. A genome, by contrast, is carried by every living descendant, so a single drop of blood or a leaf holds a near-complete archive of that lineage's history. Where a fossil gives you a handful of bones at one frozen moment, a sequence gives you hundreds or thousands of independent characters, every base a separate little witness, and lets you compare organisms that have no anatomy in common at all, like a bacterium and a redwood. That sheer quantity of evidence is why a phylogenetic tree built from molecules is usually far better resolved than one drawn from anatomy alone.

What a sequence can and cannot say

Reading history off sequences is powerful, but it carries real caveats, and a good molecular historian states them out loud. Similarity must reflect true homology inherited down the species' own line of descent, not a coincidental resemblance or a borrowed gene: bacteria and other microbes swap DNA sideways through horizontal gene transfer, so a single gene's tree can disagree with the species' tree, and you must compare many genes before trusting a branch. The clock can mislead when rates shift between lineages. And saturation eventually blurs the deepest comparisons: over enough time a site may mutate, mutate back, and mutate again, hiding changes so that very ancient distances are systematically underestimated. None of this breaks the method; it just means a sequence is evidence to be weighed, not an oracle to be obeyed.
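One standard way to see how saturation hides changes is the Jukes-Cantor (JC69) correction for DNA, introduced here purely as an illustration and not as something the method above depends on. It assumes all four bases are equally frequent and all substitutions equally likely, which real sequences only approximate. Given an observed proportion of differing sites p, it estimates the number of substitutions per site that actually occurred.

```python
import math


def jc69_distance(p: float) -> float:
    """Estimated substitutions per site from the observed proportion p of differing sites.

    Under JC69 assumptions, d = -(3/4) * ln(1 - (4/3) * p).
    """
    if not 0.0 <= p < 0.75:
        # At p = 0.75 two DNA sequences look as different as random ones.
        raise ValueError("JC69 distance is undefined for p outside [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


for observed in (0.05, 0.30, 0.60):
    print(observed, round(jc69_distance(observed), 3))
```

For small p the corrected distance barely differs from the raw count, but the gap widens quickly: an observed 30% difference corresponds to roughly 0.38 substitutions per site, and 60% to about 1.2. As p approaches 0.75 the estimate runs off to infinity, which is the numerical face of a signal that has been overwritten.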

Step back and the rung's promise comes into focus. A genome is a historical document written in four letters, edited by mutation, proofread by selection, and never fully erased. A single alignment is enough to tell a story of descent: how many letters differ measures *when* two lineages parted, and which letters refuse to differ reveals *what* in them matters most. From here the rest of this rung opens up. The next guides build this idea outward: into how new genes are *born* rather than merely conserved, into the great trees that organise all of life, and into the molecular fingerprints that let a single sequence identify the species it came from.