Phylogenetic Trees & the Tree of Life

Every sequence carries the memory of its ancestors. This guide shows how to turn a stack of aligned DNA into a tree of life — how it is built, how to read it honestly, and the deep history it has already revealed.

From an alignment to a branching history

Earlier in this rung you learned to line two sequences up and read their differences as evolutionary distance: more substitutions, more time apart. A [[phylogenetic-tree|phylogenetic tree]] simply takes that idea and runs it across many species at once. Picture a dozen versions of the same gene stacked in a sequence alignment — one row per species, the letters in tidy columns. Two species whose rows differ in only a handful of columns are close relatives; two whose rows have drifted far apart are distant ones. The tree is just the family history that best explains *that whole pattern of similarities and differences at once* — a branching diagram in which every split is a moment when one ancestral lineage became two.
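To make the idea concrete, here is a minimal Python sketch that counts the differing columns between every pair of rows. The four short sequences are invented for illustration (they are not real gene data), and the names `alignment` and `count_differences` are placeholders rather than part of any library.

```python
from itertools import combinations

# Hypothetical toy alignment: one row per species, all rows the same length.
alignment = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTAT",
    "frog":  "TCGAACTTGG",
}


def count_differences(row_a, row_b):
    """Number of alignment columns where two rows carry different letters."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(a != b for a, b in zip(row_a, row_b))


# Compare every pair of species once.
pairwise = {
    (x, y): count_differences(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}

# Print the closest pairs first.
for (x, y), n in sorted(pairwise.items(), key=lambda item: item[1]):
    print(f"{x:>5} vs {y:<5}: {n} differing columns")
```

In this toy alignment human and chimp differ in one column while human and frog differ in five. That raw pattern of small and large differences is what a tree has to explain.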

There are several ways to build the tree, but they share one logic. The simplest, *distance methods*, boil each pair of sequences down to a single number — how different they are, corrected for the fact that the same site can mutate twice and hide changes — and then cluster the closest pairs together step by step. The more powerful *character-based methods* keep every column and search for the tree that best fits all of them: maximum parsimony prefers the tree requiring the fewest total mutations, while maximum likelihood and Bayesian methods adopt an explicit model of how letters change over time and ask which tree makes the observed data most probable. They are slower but far more honest about the messiness of real sequence change, and they are the workhorses of the field today.
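As a rough illustration of a distance method, the sketch below computes the fraction of differing columns, applies the Jukes-Cantor correction, and then clusters the closest pairs step by step with UPGMA-style average linkage. Both choices are assumptions made for the example: the text above does not name a specific correction or clustering rule, and real analyses more often use neighbor joining or the character-based methods. The sequences are invented.

```python
import math
from itertools import combinations

# Hypothetical toy alignment (invented sequences, equal length).
alignment = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTAT",
    "frog":  "TCGAACTTGG",
}


def p_distance(a, b):
    """Fraction of aligned columns at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Estimated substitutions per site, allowing for repeat hits at one site."""
    if p >= 0.75:
        raise ValueError("p >= 0.75 is saturated; the correction is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def upgma(names, dist):
    """Average-linkage clustering; returns a nested tuple of tip names.

    dist maps frozenset({name1, name2}) to a distance.
    """
    clusters = {name: (name, 1) for name in names}  # label -> (subtree, tip count)
    d = dict(dist)
    while len(clusters) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to every remaining cluster is a size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[merged] = ((tree_a, tree_b), n_a + n_b)
    (tree, _), = clusters.values()
    return tree


names = list(alignment)
dist = {
    frozenset((x, y)): jukes_cantor(p_distance(alignment[x], alignment[y]))
    for x, y in combinations(names, 2)
}
tree = upgma(names, dist)
print(tree)
```

On this toy data the nesting joins human with chimp first, then adds mouse, then frog. UPGMA quietly assumes that all lineages change at the same rate, which is one reason it serves here as a teaching device rather than a recommendation.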

How to read a tree (and what it does not say)

A tree is read at its tips and its joints. The tips (the leaves) are the species or sequences you actually have; the internal nodes are inferred common ancestors you never see; and a group consisting of an ancestor *plus all of its descendants* — a single branch you could snip off whole — is a clade. Clades are the real claim a tree makes: 'these organisms share a common ancestor not shared by anything outside the group'. Crucially, a bare branching tree is unrooted — it shows who is related to whom but not which way time flows. To give it a direction you add an outgroup, a species you know branched off earlier than everything else; the point where it joins becomes the root, the deepest ancestor, and now the whole tree reads as a flow of time from root to tips.
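One way to see what a clade is: write a rooted tree as nested tuples and list the set of tips under every internal node. The sketch below does that. The tuple representation and the function names are assumptions made for illustration, and the topology is a toy chosen for the example, not a statement about real mammal relationships.

```python
def leaves(node):
    """Set of tip names below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(tree):
    """The tip set under every internal node, the root included."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return  # a single tip has no internal node of its own
        found.add(leaves(node))
        for child in node:
            walk(child)

    walk(tree)
    return found


def is_clade(tree, group):
    """True if `group` is exactly an ancestor plus all of its descendants."""
    return frozenset(group) in clades(tree)


# Toy rooted tree: frog is the outgroup, so the root separates it from the rest.
rooted = ("frog", (("human", "chimp"), ("mouse", "dog")))

print(is_clade(rooted, {"human", "chimp"}))   # one branch you can snip off whole
print(is_clade(rooted, {"human", "mouse"}))   # leaves out chimp and dog
```

The first check succeeds because human and chimp hang from a single node with nothing else below it. The second fails because the smallest branch containing human and mouse also contains chimp and dog.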

Because a tree is an *inference* from limited data, every branch deserves a confidence score, and the standard one is [[bootstrap-support|bootstrap support]]. The trick is wonderfully simple: take your alignment's columns and randomly resample them — with replacement, so some columns appear twice and others drop out — to build a slightly scrambled fake dataset, then rebuild the tree. Do this a thousand times and ask, for each branch in your original tree, *in what fraction of the thousand replicates did this exact grouping reappear?* A branch that shows up in 98 percent of them is robust; one that appears in 55 percent is a shrug — the data barely prefer it to the alternatives. So a published tree without support values is half a result. Honest trees wear their uncertainty out loud.
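The resampling step can be sketched in a few lines of Python. To keep the example short, "rebuild the tree" is replaced by a deliberately simple stand-in that works only for four taxa: it picks whichever of the three possible pairings has the smallest summed within-pair difference. That shortcut, the invented sequences, and every function name here are assumptions for illustration; real software rebuilds a full tree for each replicate.

```python
import random

# Hypothetical toy alignment (invented sequences, exactly four taxa).
alignment = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTAT",
    "frog":  "TCGAACTTGG",
}


def resample_columns(aln, rng):
    """One bootstrap replicate: draw columns with replacement, same length."""
    length = len(next(iter(aln.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in aln.items()}


def differences(a, b):
    """Number of columns at which two rows differ (no correction applied)."""
    return sum(x != y for x, y in zip(a, b))


def best_quartet_split(aln):
    """For four taxa, choose the pairing with the smallest within-pair sum."""
    a, b, c, d = sorted(aln)
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def score(split):
        (p, q), (r, s) = split
        return differences(aln[p], aln[q]) + differences(aln[r], aln[s])

    return frozenset(frozenset(pair) for pair in min(options, key=score))


def bootstrap_support(aln, replicates=1000, seed=0):
    """Fraction of replicates that recover the grouping seen in the real data."""
    rng = random.Random(seed)
    observed = best_quartet_split(aln)
    hits = sum(
        best_quartet_split(resample_columns(aln, rng)) == observed
        for _ in range(replicates)
    )
    return observed, hits / replicates


split, support = bootstrap_support(alignment)
print(sorted(sorted(pair) for pair in split), f"support = {support:.2f}")
```

With only ten columns the support value swings noticeably from one seed to the next, which is itself a useful lesson: short alignments rarely justify confident branches.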

An unrooted tree shows relationship; an outgroup adds time:

   unrooted (who is related to whom)        rooted with an outgroup

                                                           +-- human
   human         mouse                                +----+
        \       /                                     |    +-- chimp
         *--*--*                                 +----+
        /   |   \                                |    |    +-- mouse
   chimp  frog   dog                        ROOT-+    +----+
                                                 |         +-- dog
                                                 +------------ frog (outgroup)

   clade = a node + ALL of its descendants (one branch you can snip off whole)
   bootstrap: resample columns 1000x, count how often each branch reappears
An unrooted tree states relationships; adding a known-early outgroup roots it and sets the direction of time. A clade is any branch you could cut off whole, and a bootstrap value is the fraction of resampled datasets in which that branch reappeared.

The three-domain tree of life

The single most consequential tree ever drawn came from one cleverly chosen molecule. To compare *all* of life — a bacterium, a mushroom, a human, a pond alga — you need a gene that every cell on Earth carries, that does a job so essential it has barely changed in billions of years, yet still varies enough to record the deepest splits. Carl Woese realised the small-subunit ribosomal RNA (the RNA at the heart of the ribosome you met when you learned translation) is exactly that universal yardstick: every living thing builds proteins, so every living thing has it. When he sequenced it across the living world in the 1970s, the result overturned a textbook certainty.

For decades life had been split in two by appearance: cells with a nucleus and cells without. The ribosomal-RNA tree revealed instead [[molbio-three-domains-of-life|three primary domains]]. The 'simple bacteria' actually fell into two profoundly separate groups: true Bacteria, and a second lineage of microbes called Archaea that, despite looking just like bacteria under a microscope, run their molecular machinery differently and are, astonishingly, our *closer* relatives. The third domain, the Eukarya (us, plants, fungi, amoebae), branches off near the archaea. The lesson is humbling: the visible diversity of plants and animals is a thin twig, while the real sweep of life's history lives in the microbial world we cannot see. This [[three-domain-tree|three-domain tree]] is molecular phylogenetics' founding triumph: a fact about deep history that no fossil or microscope could have delivered, read straight out of a sequence.

Trees at work: pathogens, people, and barcodes

Phylogenetics is not only about billion-year deep time; it works just as well on a timescale of weeks. When a new pathogen spreads, sequencing its genome from many patients and building a tree turns the outbreak into a readable history. Because a virus accumulates mutations at a roughly steady pace as it replicates and spreads, samples that sit close together on the tree probably share a recent source of infection, while distant ones diverged longer ago. This *molecular epidemiology*, phylogenetics in fast motion, can show that two hospital cases came from the same chain of transmission, estimate roughly when a virus first jumped into humans, and trace which variant seeded which wave. It leans on the same logic as the [[molecular-clock|molecular clock]] you met earlier: counting substitutions and reading them as elapsed time.
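The clock arithmetic behind such estimates is short enough to write out. The sketch below assumes a strict clock (one constant rate), two samples collected at the same time, and no repeat mutations at the same site. The genome length, rate, and difference count are hypothetical numbers chosen to make the arithmetic easy, not measurements of any real virus.

```python
def years_to_common_ancestor(differences, genome_length, rate_per_site_per_year):
    """Strict-clock estimate of the time back to a shared ancestor.

    Changes pile up along BOTH lineages after they split, so the observed
    differences reflect twice the elapsed time; hence the factor of 2.
    """
    if genome_length <= 0 or rate_per_site_per_year <= 0:
        raise ValueError("genome length and rate must be positive")
    expected_per_year = 2 * rate_per_site_per_year * genome_length
    return differences / expected_per_year


# Hypothetical inputs: 6 differences between two genomes of 30,000 sites,
# with an assumed rate of 0.001 substitutions per site per year.
years = years_to_common_ancestor(6, 30_000, 1e-3)
print(f"about {years * 12:.1f} months to the common ancestor")
```

Under these assumed numbers the two samples share an ancestor roughly a tenth of a year back. Real estimates also use sampling dates and report wide uncertainty intervals, because mutations arrive at random rather than on schedule.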

The same machinery reconstructs *our own* story. Building trees from human DNA — especially from mitochondrial DNA and the Y chromosome, which pass down through only the mother or only the father and so are not shuffled each generation — shows that the deepest branches of humanity are all African, with non-African populations sitting on younger twigs that split off later. That branching pattern is the molecular signature of an out-of-Africa expansion: a tree of human migration, read from blood and cheek swabs rather than from bones. The very same tree-thinking, run on the individual-level DNA differences between people, is what underlies tracing ancestry and relatedness within our species.

Phylogenetics also gives biology a barcode scanner. [[molecular-barcoding|DNA barcoding]] picks one short, standard gene — a stretch of a mitochondrial gene for animals, a chloroplast gene for plants, a ribosomal region for fungi — that varies just enough to differ between species while staying nearly constant within one. Sequence that one region from an unknown sample, compare it to a reference library, and you can name the species: the fish in a mislabelled fillet, the insect larva too young to identify by eye, the mix of organisms in a scoop of seawater or a swab of soil. It is fast and powerful, but honest about its limits — barcoding works only as well as the reference database behind it, can stumble on very recently split species, and is a tool for *identification*, not for resolving deep evolutionary trees.
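A barcode lookup amounts to "find the closest reference and refuse to guess if nothing is close enough". The sketch below shows that logic on pre-aligned sequences of equal length. The reference sequences are invented (they are not real barcodes for these fish), and the 90 percent cutoff is an arbitrary assumption chosen for the toy data.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def identify(query, library, min_identity):
    """Return (name, identity) for the best match, or (None, identity) if too weak."""
    best_name, best_identity = max(
        ((name, percent_identity(query, ref)) for name, ref in library.items()),
        key=lambda hit: hit[1],
    )
    if best_identity < min_identity:
        return None, best_identity  # nothing in the library is close enough
    return best_name, best_identity


# Hypothetical reference library: made-up 20-letter sequences.
library = {
    "cod":    "ACGTTGCAAGCTTAGGCTAA",
    "salmon": "ACGTAGCTAGGTTAGCCTAA",
    "tuna":   "TCGTTGGAAGCATAGGATAA",
}

fillet = "ACGTTGCAAGCTTAGGCTAT"  # hypothetical sequence from an unlabelled sample
print(identify(fillet, library, min_identity=90.0))
print(identify("G" * 20, library, min_identity=90.0))
```

The first query matches the toy cod reference at 19 of 20 positions and is named; the second matches nothing well and comes back unnamed. That second branch is the database limit in miniature: a species missing from the library cannot be identified, however clean the sequence.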

Reading a deep event: how mitochondria joined the cell

The most spectacular thing a tree can do is testify to an event no one witnessed and no fossil records. Your cells run on mitochondria, the little compartments that burn food for energy. They are odd in a telling way: a mitochondrion carries its *own* small circular genome, separate from the DNA in the nucleus, and it builds its own ribosomes. The radical explanation, [[molecular-evidence-endosymbiosis|endosymbiosis]], says a mitochondrion is the domesticated descendant of a free-living bacterium that, well over a billion years ago, was engulfed by an ancestral host cell and never digested — two organisms fused into one. For a long time this was a bold story. Phylogenetics turned it into a near-certainty.

  1. Take the ribosomal RNA gene from the mitochondrion's own little genome and place it on the universal tree of life, using the same yardstick Woese used.
  2. It does not land near its host's nuclear genes, where you might expect. It lands deep inside the Bacteria, specifically within the Alphaproteobacteria, a group that includes many free-living species, and the clade is strongly supported by bootstrap.
  3. Cross-check with more genes: the mitochondrion's gene-reading machinery and the layout of its tiny genome also look bacterial, not eukaryotic. Multiple independent lines of sequence evidence all point to the same bacterial ancestry.
  4. Conclusion: the mitochondrion is a former bacterium, now a permanent resident. Most of its original genes migrated into the nucleus over time, leaving only the small remnant genome it still keeps. The chloroplasts of plants tell a parallel story, tracing back to a captured photosynthetic bacterium (a cyanobacterium).

Sit with what just happened. A diagram built from sequence alone reached back more than a billion years and identified the long-lost free-living cousins of the powerhouse humming inside every one of your cells right now. No fossil could have done this; the event left no bones, only a paper trail written in DNA. That is the deep claim of this whole rung made vivid: *every genome is a historical document*, and the phylogenetic tree is the tool that lets us read it. From here the ladder turns from history to the present and the clinic, where these same sequences become a way to understand, diagnose, and treat disease.