Comparative & Functional Genomics

From one genome to a shelf of them

In the previous guide you assembled and annotated a single genome: you stitched the reads into chromosomes and drew best-guess lines around the genes. But a lone genome is like a book in a language you barely speak: you can see the words, yet you cannot tell which ones carry the meaning and which are filler. A human genome has about three billion letters and only roughly 20,000 protein-coding genes, whose protein-coding sequence occupies under two percent of the total. So the burning question of this rung is not *what is in the genome* but *which parts matter, and what do they do*. [[molbio-comparative-genomics|Comparative genomics]] answers the first half of that question with a beautifully simple move: instead of staring harder at one genome, you line several up side by side and let evolution tell you what it cared about.

The logic rests on something you met far down this ladder: most mutations are neutral, and changes accumulate steadily over millions of years. Run that forward across many species descending from a shared ancestor and a pattern emerges. Sequence that does nothing important drifts freely; it collects mutations like an old wall collects graffiti, until two species' versions barely resemble each other. But sequence that *does* something vital cannot drift: most changes there damage the function and are quietly weeded out by [[purifying-selection|purifying selection]], the slow eviction of harmful variants. The result is that important DNA stays stubbornly the same across species, while unimportant DNA scrambles. Conservation is the fingerprint of function, and it is visible only when you compare.

Orthologs, paralogs, and the family tree of genes

Before you can compare genes across species you must match them up correctly, and here a crucial distinction lives. When you find the human gene and the mouse gene that are clearly relatives — descended from the *same* gene in the last common ancestor of humans and mice — those are orthologs. They are the 'same gene in two species', the ones you compare when you want to learn function, because they usually still do the same job. But genes also multiply *within* a genome: a stretch of DNA is occasionally duplicated, leaving two copies side by side, and those copies and their descendants are paralogs — relatives born of duplication rather than speciation. Telling orthologs from paralogs is the first careful step of any comparison, because confusing them quietly corrupts everything built on top.
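One common first-pass heuristic for proposing ortholog pairs is the reciprocal best hit: two genes are paired only if each is the other's top-scoring match across the two genomes. The sketch below assumes you already have pairwise similarity scores from some alignment step; the function name, gene names, and scores are all invented for illustration, and careful ortholog calls usually also lean on gene trees and synteny.

```python
def reciprocal_best_hits(scores):
    """Pair genes that are each other's top-scoring match.

    scores: dict mapping (gene_in_a, gene_in_b) -> similarity score,
    higher meaning more similar (hypothetical precomputed input).
    """
    best_for_a = {}  # gene in A -> (best partner in B, score)
    best_for_b = {}  # gene in B -> (best partner in A, score)
    for (a, b), score in scores.items():
        if a not in best_for_a or score > best_for_a[a][1]:
            best_for_a[a] = (b, score)
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    # Keep only the pairs where the preference is mutual.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Invented scores: humanX1 and humanX2 are paralogs, both similar to mouseX1.
toy_scores = {
    ("humanX1", "mouseX1"): 0.92,
    ("humanX2", "mouseX1"): 0.80,
    ("humanX1", "mouseY"): 0.31,
    ("humanY", "mouseY"): 0.88,
    ("humanY", "mouseX1"): 0.29,
}
pairs = reciprocal_best_hits(toy_scores)
print(pairs)  # [('humanX1', 'mouseX1'), ('humanY', 'mouseY')]
```

The paralog humanX2 is left unpaired because mouseX1 prefers humanX1, which is exactly the ortholog/paralog confusion the heuristic tries to avoid.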

Paralogs are not noise, though — they are how evolution invents new things. After a gene is duplicated, one copy can keep doing the original job while the spare is free to drift and pick up mutations that would have been fatal in a sole copy. Most spares simply rot into a [[gene-families-and-pseudogenes|pseudogene]], a broken relic that no longer makes a protein. But once in a long while the freed copy stumbles into a useful new role, and a gene family is born — like the cluster of globin genes, all paralogs of one ancestor, now specialised for carrying oxygen in the embryo, the foetus, and the adult. So duplication followed by divergence is one of biology's chief engines of novelty, and you can read the whole history straight off the sequence-similarity pattern of a gene family.

Reading selection: conserved sites and dN/dS

Once orthologs are lined up, you can read selection down to the single letter. Stack the same gene from a dozen mammals and look column by column: some positions are *identical* in every species, others vary freely. The frozen columns are the [[conserved-variable-sites|conserved sites]] — the active-site residue of an enzyme, the very base an essential regulator grips — places where change was lethal and so never survived. The variable columns tolerated change and so collected it. This alignment, read as a heat-map of conservation, is the single most powerful way to point at *which letters in a gene actually do the work*, long before you run a single experiment.
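As a rough illustration of reading an alignment column by column, the sketch below scores each column by the fraction of species that share its most common letter. It uses the four toy sequences from the alignment diagram in this section; the function name is an assumption, and the score is deliberately simple (it treats a gap character like any other letter and ignores the species tree).

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Per column, the fraction of sequences sharing the most common letter."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


alignment = [
    "ATGCACGGTAAATCC",  # human
    "ATGCATGGCAAAACC",  # mouse
    "ATGCACGGTAAGTCC",  # chimp
    "ATGCATGGAAAATCT",  # dog
]
scores = column_conservation(alignment)
frozen = [i for i, s in enumerate(scores) if s == 1.0]
print(frozen)  # [0, 1, 2, 3, 4, 6, 7, 9, 10, 13]
```

The columns missing from `frozen` are the variable positions, here every third codon position after the start codon plus the first base of the last codon.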

For protein-coding genes there is an even sharper instrument, and it leans on the redundancy of the genetic code you already know. Because the code is degenerate, some DNA changes swap the amino acid (*non-synonymous* changes) while others leave the protein untouched (*synonymous* changes). Synonymous changes are nearly invisible to selection, so they pile up at close to the neutral background rate; non-synonymous changes alter the protein and so are filtered by selection. Count each kind per site where it could have occurred, which gives the non-synonymous rate dN and the synonymous rate dS, and compare the two as a ratio: the [[dn-ds-ratio|dN/dS ratio]] turns that filtering into a number. A dN/dS well below 1 means protein-changing mutations were being purged: the gene is under purifying selection, conserved, important. Around 1 means protein-changing mutations survive about as often as silent ones, hinting the sequence is not constrained. And the rare value *above* 1 is a red flag for the opposite force: positive selection, change actively favoured, the signature of a gene being driven to evolve, as in an immune protein racing against a pathogen.

align one gene across species, read each column:

  human   ... A T G  C A C  G G T  A A A  T C C ...
  mouse   ... A T G  C A T  G G C  A A A  A C C ...
  chimp   ... A T G  C A C  G G T  A A G  T C C ...
  dog     ... A T G  C A T  G G A  A A A  T C T ...
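              | | |      ^      ^      ^  *   ^

  |  identical in all four species (the ATG start codon)
  ^  varies without changing the amino acid: silent, counts toward dS
  *  varies and changes the amino acid (Ser -> Thr in mouse): counts toward dN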

  dN/dS  <  1   ->  purifying selection  (gene matters, conserved)
  dN/dS  ~~ 1   ->  little constraint    (drifting / neutral)
  dN/dS  >  1   ->  positive selection   (change favoured)

Stacking orthologs turns evolution into a readout: frozen columns mark functional sites, and the ratio of protein-changing (dN) to silent (dS) substitutions scores the selection acting on the whole gene.
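To make the ratio concrete, here is a minimal sketch that estimates dN and dS between two aligned coding sequences by counting synonymous and non-synonymous sites and differences per codon. It is a simplified counting approach under stated assumptions, not a full method: codons that differ at more than one position are skipped, mutations to stop codons count as non-synonymous, and no correction for multiple hits is applied. The function names are hypothetical, and dedicated software should be preferred for real analyses.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (0 to 3)."""
    amino_acid = CODON_TABLE[codon]
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == amino_acid:
                    total += 1 / 3  # one of three possible changes here
    return total


def dn_ds(seq_a, seq_b):
    """Return (dN, dS, dN/dS) for two ungapped, in-frame, uppercase sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("need equal-length sequences of whole codons")
    syn_sites = syn_diffs = nonsyn_diffs = 0.0
    codons_used = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff > 1:
            continue  # simplification: skip codons with several changes
        codons_used += 1
        syn_sites += (synonymous_sites(codon_a) + synonymous_sites(codon_b)) / 2
        if n_diff == 1:
            if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    nonsyn_sites = 3 * codons_used - syn_sites
    d_s = syn_diffs / syn_sites if syn_sites else 0.0
    d_n = nonsyn_diffs / nonsyn_sites if nonsyn_sites else 0.0
    ratio = d_n / d_s if d_s else float("nan")
    return d_n, d_s, ratio


# Human and mouse rows from the toy alignment above.
d_n, d_s, ratio = dn_ds("ATGCACGGTAAATCC", "ATGCATGGCAAAACC")
print(round(d_n, 3), round(d_s, 3), round(ratio, 3))  # 0.081 0.75 0.108
```

Five codons are far too few for the number to mean anything statistically; the point is only that two silent changes against one protein-changing change, once normalised per site, land well below 1.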

Synteny and conserved non-coding islands

Comparison works above the level of single genes, too. When you align whole chromosomes between two species you find long blocks where the *same genes sit in the same order* — a shared gene neighbourhood inherited intact from the common ancestor. This preserved gene order is called synteny, and it is enormously useful: it lets you carry knowledge from a well-studied genome onto a freshly sequenced one ('the gene next to this landmark in mouse should be the gene next to the matching landmark in human'), and the places where synteny *breaks* mark the chromosomal rearrangements — inversions, fusions, translocations — that reshaped genomes over evolutionary time. Synteny is the large-scale grammar that survives even as individual letters churn.
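A toy version of synteny detection can be sketched as follows: walk the genes of one species in chromosomal order, look up where each ortholog sits in the other species, and group runs whose positions advance by exactly one step in a consistent direction. The function name, gene names, and the strict no-gaps rule are assumptions for illustration; practical tools tolerate missing genes and small local shuffles.

```python
def synteny_blocks(order_a, order_b, orthologs, min_genes=2):
    """Find runs of genes that keep the same neighbours in both species.

    order_a, order_b: gene names in chromosomal order for each species.
    orthologs: dict mapping a gene in A to its ortholog in B.
    Returns a list of (genes_in_a, orientation), with orientation "+" for
    the same order and "-" for reversed order (an inversion).
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    # Genes without a mapped ortholog are simply skipped.
    anchors = [
        (gene, pos_b[orthologs[gene]])
        for gene in order_a
        if gene in orthologs and orthologs[gene] in pos_b
    ]
    blocks = []
    current, direction = [], 0

    def flush():
        if len(current) >= min_genes:
            genes = [gene for gene, _ in current]
            blocks.append((genes, "+" if direction >= 0 else "-"))

    for gene, pos in anchors:
        if current:
            step = pos - current[-1][1]
            if step in (1, -1) and direction in (0, step):
                direction = step
                current.append((gene, pos))
                continue
            flush()  # synteny breaks here
        current, direction = [(gene, pos)], 0
    flush()
    return blocks


# Invented example: the last three genes are inverted in species B.
order_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
order_b = ["b1", "b2", "b3", "b6", "b5", "b4"]
orthologs = {f"a{i}": f"b{i}" for i in range(1, 7)}
blocks = synteny_blocks(order_a, order_b, orthologs)
print(blocks)  # [(['a1', 'a2', 'a3'], '+'), (['a4', 'a5', 'a6'], '-')]
```

The boundary between the two blocks is the kind of breakpoint the paragraph above describes, here an inversion of three genes.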

The most striking payoff of comparison, though, lands in the *non-coding* genome. Recall the premature label 'junk DNA': the assumption that the 98 percent of our genome outside protein-coding sequence was inert filler. Comparative genomics dismantled that idea elegantly. Scanning aligned mammal genomes turned up thousands of stretches that code for no protein yet are as conserved as the most essential genes, some barely changed across hundreds of millions of years. These conserved non-coding elements are very unlikely to have stayed frozen by accident; conservation that relentless is the expected signature of purifying selection guarding a function. And indeed many of them turned out to be regulatory switches: enhancers and other control elements that decide when and where genes turn on. Evolution had been quietly flagging the regulatory genome for us all along; we just had to compare to see the flags.
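The scan described above can be sketched as a search for long runs of highly conserved positions that fall outside annotated coding sequence. The example below assumes a per-base conservation score between 0 and 1 is already available; the function name, threshold, minimum length, and all numbers are invented for illustration rather than taken from any published pipeline.

```python
def conserved_noncoding_elements(scores, coding_intervals,
                                 threshold=0.9, min_length=4):
    """Return half-open (start, end) runs that are conserved and non-coding.

    scores: per-base conservation scores (hypothetical input, 0 to 1).
    coding_intervals: half-open (start, end) spans of coding sequence.
    """
    coding = [False] * len(scores)
    for start, end in coding_intervals:
        for i in range(max(start, 0), min(end, len(scores))):
            coding[i] = True

    elements = []
    run_start = None
    # Iterate one step past the end so a run touching the end is closed.
    for i in range(len(scores) + 1):
        keep = i < len(scores) and scores[i] >= threshold and not coding[i]
        if keep and run_start is None:
            run_start = i
        elif not keep and run_start is not None:
            if i - run_start >= min_length:
                elements.append((run_start, i))
            run_start = None
    return elements


# Invented track: a conserved exon at 3..9, then two conserved non-coding runs.
toy_scores = [0.2] * 3 + [1.0] * 6 + [0.3] * 2 + [0.95] * 4 + [0.1] * 2 + [1.0] * 5
elements = conserved_noncoding_elements(toy_scores, coding_intervals=[(3, 9)])
print(elements)  # [(11, 15), (17, 22)]
```

The exon is just as conserved as the two reported runs, but it is masked out, leaving only the candidates that cannot be explained by protein-coding constraint.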

Functional genomics: asking the genome directly

Comparison tells you *that* a stretch matters; it rarely tells you *what it does*. For that, functional genomics takes the opposite approach: rather than inferring function from evolution, it walks across the genome and measures activity directly, position by position. Where does a regulatory protein actually land on the DNA? Which regions are transcribed into RNA, even if they make no protein? Which stretches are wrapped tight in silent chromatin and which lie open and accessible? Each of these is a real, measurable signal, and reading them genome-wide turns a static letter sequence into a living map of what every part is *doing* in a given cell.

The landmark effort here is the [[encode-project|ENCODE project]], the Encyclopedia of DNA Elements: a vast, multi-lab campaign to catalogue the functional elements of the human genome by stacking dozens of these assays across many cell types. ENCODE mapped where transcription factors bind, which histone marks decorate which regions, where chromatin is open, and how much of the genome is copied into RNA. Its headline finding made a splash and a controversy: a large fraction of the genome, about 80 percent in the consortium's 2012 analysis, shows *some* biochemical activity in at least one cell type. That sounds like the final death of 'junk DNA', but here honesty matters: 'biochemically active' is a much weaker claim than 'functional in the sense that natural selection maintains it'. Some pervasive activity is genuine regulation; some is incidental noise, the low-level transcription a busy genome throws off. The two views, conservation and measured activity, are complementary, and the most trustworthy functional elements are the ones flagged by *both*.
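Finding the elements flagged by both lines of evidence amounts to intersecting two sets of genomic intervals. A minimal sketch, assuming each track is a list of non-overlapping half-open (start, end) coordinates on the same chromosome; the function name and coordinates are invented for illustration.

```python
def intersect_intervals(track_a, track_b):
    """Return the regions covered by both tracks.

    Each track is a list of half-open (start, end) intervals that do not
    overlap one another within the same track.
    """
    a, b = sorted(track_a), sorted(track_b)
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


# Invented coordinates for a conservation track and an open-chromatin track.
conserved = [(100, 200), (500, 650), (900, 950)]
active = [(150, 300), (600, 700), (1000, 1100)]
both = intersect_intervals(conserved, active)
print(both)  # [(150, 200), (600, 650)]
```

The third conserved element has no measured activity in this toy track, which in practice might mean it is active only in a cell type that was not assayed.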

None of this is a single person reading sequence by eye — it is the work of [[molbio-bioinformatics|bioinformatics]], the discipline of aligning millions of reads, scoring conservation across genomes, and overlaying functional tracks with software. And a humbling honesty runs through the whole enterprise: knowing that an element is conserved, or that a protein binds it, still does not tell you the full story of *what it accomplishes in the organism*. Even predicting a protein's shape from its sequence — long the field's hardest problem — leapt forward enormously with AlphaFold, yet is not fully solved, and shape is only a hint toward function. Comparative and functional genomics narrow the search dramatically and point you at the right letters; the final word on function still comes from experiment.

Putting it together: history and function from sequence

Stand back and the two halves of this guide click together. Comparative genomics uses evolution as a free, billion-year experiment: by asking what selection refused to change, it tells you *which* parts of a genome are important — conserved coding sites, low dN/dS genes, frozen non-coding islands, preserved synteny. Functional genomics then asks the genome directly what those important parts *do* — where proteins bind, what is transcribed, what lies open — building the regulatory map that ENCODE pioneered. One reads history off the sequence; the other reads activity off the cell; and the elements that light up in both are the ones you can most trust.

This also quietly reframes the old riddle of complexity. Humans carry only about 20,000 protein-coding genes, fewer than some plants and roughly as many as a tiny nematode worm, so the gene *list* cannot be what makes us elaborate. Comparative and functional genomics point to the answer: the difference lives largely in the regulatory genome, in the vast web of switches deciding when and where each gene fires. The genes are a parts list shared widely across animals; the wiring diagram is where much of the divergence hides. That is the perfect handoff to the rest of this rung, where single-cell methods and systems thinking take this static map of *what can happen* and turn it into the dynamic story of *what is happening*, gene by gene, cell by cell.