Anatomy of a Genome

Opening the whole cookbook

In the earlier guides you met the gene as both a unit of heredity and a physical stretch of DNA, and you met the genome as the whole cookbook rather than a single recipe. This rung opens that cookbook and asks a structural question: what is actually written inside, and how is so much of it crammed into a space you cannot see? Start with the sheer scale. The human genome is roughly three billion base pairs in each set, and a cell carries two sets, about six billion base pairs in all. Stretched end to end, the DNA in one human cell would run about two metres, yet it folds into a nucleus a few millionths of a metre across. Scaled up, that is like packing forty kilometres of very fine thread into a tennis ball, and doing it so neatly that any gene can still be found and read on demand.
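The two-metre figure can be checked with a line of arithmetic. The sketch below assumes a spacing of about 0.34 nanometres per base pair, the usual textbook value for the common form of the double helix; that number and the variable names are illustrative and do not come from this guide.

```python
# Back-of-envelope check of the "two metres of DNA per cell" figure.
BASE_PAIRS_PER_SET = 3_000_000_000   # rough human genome size, from the text
SETS_PER_CELL = 2                    # a cell carries two sets
RISE_PER_BP_NM = 0.34                # assumed: textbook spacing per base pair, in nanometres

total_bp = BASE_PAIRS_PER_SET * SETS_PER_CELL
length_m = total_bp * RISE_PER_BP_NM * 1e-9   # convert nanometres to metres

print(f"{total_bp:,} base pairs per cell")
print(f"about {length_m:.1f} m of DNA per cell")
```

Under that assumed spacing the total comes out very close to two metres, which is where the headline number comes from.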

Three billion is a big number, but it is also strangely small. Written out one letter per second it would take about a century to recite, yet at two bits per letter the same information fits in about 750 megabytes of computer storage, roughly what a single CD holds. The genome is not impressive because it is large; it is impressive because of how densely it is organized, how it is read selectively, and how reliably it is copied. This guide maps the anatomy: what fraction is genes, what the rest is, and how prokaryotes and eukaryotes lay it all out so differently.
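Both comparisons are simple enough to work through. The sketch below assumes a plain encoding of two bits per letter (enough to tell A, C, G and T apart) with no compression, and counts megabytes in the decimal sense; the variable names are illustrative.

```python
# Two ways to get a feel for three billion letters.
GENOME_LETTERS = 3_000_000_000            # one set, from the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Reading aloud at one letter per second, without ever stopping.
years_to_recite = GENOME_LETTERS / SECONDS_PER_YEAR

# Storage: four possible letters need 2 bits each.
BITS_PER_LETTER = 2                       # assumed: simple 2-bit code, no compression
size_bytes = GENOME_LETTERS * BITS_PER_LETTER / 8
size_mb = size_bytes / 1_000_000          # decimal megabytes

print(f"about {years_to_recite:.0f} years to recite")
print(f"about {size_mb:.0f} MB to store")
```

The recital takes roughly 95 years, while the storage cost is about 750 MB for one set of chromosomes.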

Genome sizes are wildly different — and the sizes lie

Sweep your eye across the tree of life and the first thing you notice is how enormously genome sizes vary. A small bacterium might carry under two million base pairs; a typical fungus a few tens of millions; a fruit fly about 140 million; a human three billion. So far this might look like a tidy ladder, with more DNA for fancier creatures. Then the ladder collapses. The marbled lungfish carries a genome roughly forty times the size of ours. A modest flowering plant, Paris japonica, holds about fifty times more DNA than a human. Some single-celled amoebas have been reported to dwarf us by a hundredfold or more, though those older measurements are less certain. Clearly, the amount of DNA in a cell is not a measure of how sophisticated the organism is.
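To see how badly the ladder breaks, it helps to put the figures on one scale. The sketch below uses the rough sizes quoted above; the fungus value of 30 million is only a stand-in for "a few tens of millions", and the lungfish and Paris japonica entries are derived from the stated multiples of the human genome rather than from measured sizes.

```python
HUMAN_BP = 3_000_000_000

# Rough genome sizes in base pairs per set, based on the figures in the text.
genome_bp = {
    "small bacterium": 2_000_000,        # "under two million"; upper bound used
    "typical fungus": 30_000_000,        # assumed stand-in for "a few tens of millions"
    "fruit fly": 140_000_000,
    "human": HUMAN_BP,
    "marbled lungfish": 40 * HUMAN_BP,   # "roughly forty times" ours
    "Paris japonica": 50 * HUMAN_BP,     # "about fifty times" ours
}

# Print from smallest to largest, with each size relative to human.
for name, bp in sorted(genome_bp.items(), key=lambda item: item[1]):
    ratio = bp / HUMAN_BP
    print(f"{name:<18} {bp:>16,} bp   {ratio:>9.4f} x human")

span = max(genome_bp.values()) / min(genome_bp.values())
print(f"largest / smallest: {span:,.0f}-fold")
```

On these rough numbers the largest genome in the list is tens of thousands of times bigger than the smallest, and the human entry sits in the middle of the list, not at the top.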

This mismatch has a name: the C-value paradox. The C-value is simply the amount of DNA in one set of an organism's chromosomes. The 'paradox' is that C-values do not track complexity, and worse, two organisms that look equally complex can differ in genome size by a factor of tens. For a long while this was genuinely baffling, because biologists half-expected DNA quantity to mean something about organismal complexity. The resolution, which became clear once we could actually read genomes, is the heart of this guide: most of a large genome is not extra genes. It is non-coding DNA, and especially repeated sequence, that has accumulated for reasons that have little to do with making the organism more elaborate.

Inside the genome: a little coding, a lot of everything else

Open the human genome and break it down by what each stretch does. The broadest cut is into genes and the DNA between them, but the surprise is the proportions. Only about one to two percent of our genome directly codes for protein. The rest is non-coding DNA: regulatory switches that decide when and where a gene is read, genes whose product is a working RNA that never becomes protein, vast tracts of repetitive sequence, and the fossilized remains of viruses that inserted themselves into our ancestors' DNA long ago. Repeats alone make up roughly half the human genome — copies of short motifs tandemly stacked, and mobile elements that have spread themselves throughout it over evolutionary time.

Human genome (~3,000,000,000 bp per set), very roughly by category:

  protein-coding sequence (exons) ...... ~1-2%   <- spells out proteins
  regulatory + RNA genes + introns ..... varies  <- controls / non-protein RNA
  repetitive & transposon-derived DNA .. ~50%    <- repeats, mobile-element relics

  ~20,000 protein-coding genes total (about the same as a tiny worm)

A rough anatomy of the human genome — coding sequence is a thin slice; repeats are huge.
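The percentages in the diagram are easier to weigh once they are turned back into base pairs. The sketch below does only that, using the rounded figures from the text; the per-gene value is a crude average implied by those figures, not a measured gene length.

```python
GENOME_BP = 3_000_000_000        # per set, from the text
PROTEIN_CODING_GENES = 20_000    # rough count, from the text

# Integer arithmetic keeps the rounded figures exact.
coding_low_bp = GENOME_BP * 1 // 100    # 1% of the genome
coding_high_bp = GENOME_BP * 2 // 100   # 2% of the genome
repeat_bp = GENOME_BP * 50 // 100       # roughly half

# Average protein-coding sequence per gene implied by those figures.
per_gene_low = coding_low_bp // PROTEIN_CODING_GENES
per_gene_high = coding_high_bp // PROTEIN_CODING_GENES

print(f"coding sequence: {coding_low_bp:,} to {coding_high_bp:,} bp")
print(f"repeats:         {repeat_bp:,} bp")
print(f"coding bp per gene (average): {per_gene_low:,} to {per_gene_high:,}")
print(f"repeats outweigh coding sequence {repeat_bp // coding_high_bp} to "
      f"{repeat_bp // coding_low_bp} times over")
```

Seen this way, the protein-coding part of the genome is a few tens of millions of letters, while repeats account for about one and a half billion.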

For decades, much of this non-coding mass was dismissed as 'junk DNA.' That was a premature label, and being precise about why matters. We now know a great deal of the non-coding genome does real work — above all in regulation, deciding which genes switch on in which cell and when. At the same time, the opposite over-correction is also wrong: it is not true that every base is functional. Some sequence genuinely is inert filler or selfish repeats riding along for free. The honest statement is the careful one: 'non-coding' means 'not translated into protein,' not 'useless,' and the functional fraction is somewhere between the old 'almost none' and the breathless 'all of it.'

And the gene count itself is the most humbling number of all. After the genome was first read, the tally settled at only about twenty thousand protein-coding genes — roughly the same as a millimetre-long roundworm, and fewer than some plants. Being a human does not require many more parts in the list than being a worm. What differs is how the parts are deployed: spliced into multiple proteins, switched on and off in different cells, and wired into networks where genes regulate one another. Complexity lives in the orchestration, not in the length of the parts list.
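One way to make the worm comparison concrete is to compute gene density, the number of protein-coding genes per million base pairs. The sketch below assumes a roundworm genome of about 100 million base pairs, the approximate size usually quoted for the millimetre-long worm C. elegans; that figure is not given in this guide, and the function name is illustrative.

```python
# name: (genome size in bp per set, rough protein-coding gene count)
organisms = {
    "human": (3_000_000_000, 20_000),
    "roundworm": (100_000_000, 20_000),   # genome size is an assumption, see lead-in
}

def genes_per_megabase(genome_bp: int, gene_count: int) -> float:
    """Protein-coding genes per million base pairs of genome."""
    return gene_count / (genome_bp / 1_000_000)

density = {name: genes_per_megabase(bp, genes)
           for name, (bp, genes) in organisms.items()}

for name, value in density.items():
    print(f"{name:<10} {value:6.1f} genes per million bp")
```

With the same parts list spread over a genome about thirty times larger, the human genome ends up far less gene-dense than the worm's, which is another sign that the extra DNA is not extra genes.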

Two ways to lay out a genome: bacteria versus us

You already met the prokaryote–eukaryote divide as the deepest split in cellular life. That divide shows up vividly in how each kind of cell stores its genome. A bacterium has no nucleus, so its genome sits loose in the cytoplasm as a single, usually circular chromosome, supercoiled and bundled into a dense region called the nucleoid — a clump of DNA, not a membrane-walled room. Bacterial genomes are compact and gene-dense: little spacing between genes, few interruptions inside them, and very little repetitive filler. On top of the main chromosome, many bacteria also carry plasmids, small circles of extra DNA that can be passed between cells and often carry handy traits like antibiotic resistance.

A eukaryotic cell does it the opposite way. Our genome is sealed inside a membrane-bound nucleus, broken into several linear chromosomes (23 pairs in humans), and — crucially — wrapped around protein spools. This is the great packaging trick that the rest of this rung explores in detail: the DNA winds around histone proteins to form beads called nucleosomes, which coil and fold into chromatin, which folds again and again until two metres of DNA fits inside a microscopic nucleus. That folding is not just storage; it is also control, because how tightly a region is packed helps decide whether its genes can be read at all. Eukaryotic genomes are also far roomier than bacterial ones: spread out, full of regulatory DNA, interrupted genes, and repeats.
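The contrast between the two layouts can be summarized as a pair of records. The sketch below is only a study aid: the class name, field names and wording are hypothetical, and the values restate what the two paragraphs above describe.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class GenomeLayout:
    cell_type: str
    has_nucleus: bool
    location: str
    chromosome_shape: str
    chromosome_count: str
    packaging: str
    gene_density: str

PROKARYOTE = GenomeLayout(
    cell_type="prokaryote (bacterium)",
    has_nucleus=False,
    location="loose in the cytoplasm, in the nucleoid",
    chromosome_shape="usually circular",
    chromosome_count="one main chromosome, often plus plasmids",
    packaging="supercoiled and bundled",
    gene_density="compact: little spacing, few interruptions, few repeats",
)

EUKARYOTE = GenomeLayout(
    cell_type="eukaryote (for example, human)",
    has_nucleus=True,
    location="inside a membrane-bound nucleus",
    chromosome_shape="linear",
    chromosome_count="several (23 pairs in humans)",
    packaging="wound around histones into nucleosomes and chromatin",
    gene_density="roomy: regulatory DNA, interrupted genes, repeats",
)

def differences(a: GenomeLayout, b: GenomeLayout) -> list[tuple[str, object, object]]:
    """Return (field name, value in a, value in b) for every field that differs."""
    return [(f.name, getattr(a, f.name), getattr(b, f.name))
            for f in fields(a)
            if getattr(a, f.name) != getattr(b, f.name)]

for name, bacterial, ours in differences(PROKARYOTE, EUKARYOTE):
    print(f"{name}:\n  prokaryote: {bacterial}\n  eukaryote:  {ours}")
```

Every field differs between the two records, which is the point: the two kinds of cell store their genomes differently at each level, from where the DNA sits to how it is packed.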

Putting the anatomy together

Pull the pieces into one picture by walking through what you would find if you dissected a genome from the outside in.

Start at the whole genome: the complete set of an organism's DNA, ranging from under two million letters in a small bacterium to billions in a plant or animal — and remember, that size predicts neither gene count nor complexity.
Find where it lives: loose in the cytoplasm as a single circular chromosome plus plasmids in a prokaryote, or sealed in a nucleus across several linear chromosomes in a eukaryote.
Zoom into the contents: in us, only ~1-2% codes for protein; the rest is regulatory DNA, RNA genes, and especially repeats — roughly half the human genome is repetitive sequence.
Count the genes: only about 20,000 protein-coding genes in a human — and resist the urge to read that number as a complexity score, since orchestration, not count, is what matters.

With that anatomy in hand, the rest of this rung becomes a tour of the parts you have just laid out. Next you will zoom into a single gene to see its exons, introns, and regulatory signals up close. Then you will tackle the 'junk' DNA question head-on, and finally watch the packaging trick in action as DNA wraps around histones into chromatin — the answer to how all of this fits inside a nucleus while still being readable on demand.