
The Structure of a Gene

A gene is far more than its protein-coding letters. Walk along the DNA and meet the promoter, exons and introns, the untranslated ends, and the distant enhancers that decide when it speaks.

A gene is a neighbourhood, not just a sentence

In the previous guide you met the gene as both a unit of heredity and a physical stretch of DNA, and you learned the deflating truth that being human takes only about 20,000 protein-coding genes. Now we zoom all the way in and walk along one of those stretches, base by base, to see what it is actually made of. The first thing to unlearn is the idea that a gene is just the run of letters that spells out a protein. The protein-coding part is the headline, but a real gene is more like a whole neighbourhood: the house where the protein recipe is written, plus the doorbell, the address, the on-switches near the door, and other switches that can sit surprisingly far down the street.

To navigate this neighbourhood we need a sense of direction. Recall that a DNA strand runs from a 5' end to a 3' end, like a one-way street, and that the two strands are antiparallel, running in opposite directions. When a gene is transcribed, one strand serves as the template and the new RNA grows in the 5'-to-3' direction. By convention we draw a gene with its start on the left and write positions relative to where transcription begins: the first transcribed base is +1, everything before it is 'upstream' (negative numbers, with no position 0), and everything after it is 'downstream' (positive numbers). Hold onto that map; every part we meet sits at a definite place on it.
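The numbering convention above can be sketched in a few lines of Python. This is only an illustration: the function names and the choice of 0-based string indices are assumptions made for the example, not part of any standard tool.

```python
def relative_position(index: int, tss_index: int) -> int:
    """Convert a 0-based index along the drawn strand into a position
    relative to the transcription start site (TSS).

    The TSS itself is +1 and the base just before it is -1,
    so position 0 never occurs.
    """
    offset = index - tss_index
    return offset + 1 if offset >= 0 else offset


def side(position: int) -> str:
    """Name the side of the start site that a relative position falls on."""
    return "upstream" if position < 0 else "downstream (transcribed side)"


if __name__ == "__main__":
    tss = 10  # hypothetical index of the first transcribed base
    for i in (0, 9, 10, 12):
        pos = relative_position(i, tss)
        print(i, pos, side(pos))
```

With the start site at index 10, index 9 maps to -1 and index 10 maps to +1, which is the jump over zero described above.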

Walking the length of a eukaryotic gene

Let us take the tour, from upstream to downstream, of a typical eukaryotic gene. Just before the gene proper lies the promoter: a stretch of DNA that does not get copied into the message but acts as the launch pad for transcription. It is where the RNA-making machinery is recruited and aimed. A famous landmark inside many promoters is the TATA box (a short A-and-T-rich sequence like TATAAA), which helps position the start. The promoter settles the question 'should this gene be read, and from exactly which base?' rather than supplying any of the recipe itself.
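To make the idea of a landmark concrete, here is a minimal sketch that searches an upstream sequence for the TATAAA motif mentioned above and reports where it sits relative to the start. The function name and the example sequence are made up for illustration, and real promoter analysis allows for variation in the motif rather than requiring an exact match.

```python
def find_tata_boxes(upstream: str, motif: str = "TATAAA") -> list[int]:
    """Return the relative positions (negative numbers) at which `motif`
    begins in `upstream`.

    `upstream` is assumed to end immediately before the transcription
    start site, so its last base is position -1.
    """
    seq = upstream.upper()
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append(start - len(seq))  # index from the left -> distance from the start
        start = seq.find(motif, start + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical promoter: 6 filler bases, the motif, then 25 more bases.
    promoter = "GCGCGC" + "TATAAA" + "G" * 25
    print(find_tata_boxes(promoter))  # [-31]
```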

At the transcription start site the copying begins, and the rest of the gene is transcribed into a long RNA. But the very first part of that RNA is not protein recipe either: it is the 5' untranslated region, or 5' UTR. The ribosome will later land here and scan along it before it reaches the start codon (the three letters A-T-G in the DNA, AUG in the RNA) where protein-building actually begins. Think of the 5' UTR as the cover note and address label on the message: it carries signals about how efficiently and where the message should be translated, but it is not part of the protein.
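The scan from the 5' end to the start codon can be sketched as follows. This is a deliberately simplified model: it assumes the first AUG is the one used, whereas real ribosomes also weigh the sequence around it. The function names and the example sequence are illustrative only.

```python
def transcribe(coding_strand: str) -> str:
    """Write the RNA version of the coding strand: same letters, with U for T."""
    return coding_strand.upper().replace("T", "U")


def split_five_prime_utr(mrna: str) -> tuple[str, str]:
    """Split a message into (5' UTR, everything from the start codon on).

    Simplifying assumption: the first AUG met while scanning from the
    5' end is treated as the start codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        raise ValueError("no start codon found")
    return mrna[:start], mrna[start:]


if __name__ == "__main__":
    message = transcribe("GGCACCATGGCTTAA")  # hypothetical toy sequence
    utr, rest = split_five_prime_utr(message)
    print(utr)   # GGCACC
    print(rest)  # AUGGCUUAA
```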

Now comes the strangest part of the eukaryotic layout. The coding stretch is not continuous. It is broken into pieces called exons (the parts that stay in the final message) separated by introns (the parts that get cut out). In the human genome introns are often vastly longer than the exons they interrupt: a gene can sprawl across many thousands of bases of DNA, yet only a small fraction of that ends up specifying protein. This exon-intron organization is why the freshly made RNA must be edited before it can be read, a topic the transcription rungs ahead will unpack in detail. After the last exon comes the 3' UTR, another untranslated stretch, and somewhere in it a polyadenylation signal that tells the cell where to end the message and add a protective poly-A tail.

upstream <----- transcription start (+1) -----> downstream

  [enhancer] .... [PROMOTER] | 5'UTR [EXON1]~intron~[EXON2]~intron~[EXON3] 3'UTR [polyA signal]
     far away      launch pad |  ATG (start codon).....stop codon
     not copied   not copied  |  <-------- transcribed into one long RNA -------->
                              |  <-- introns later cut out, exons spliced together -->
A typical eukaryotic gene from upstream to downstream: only the exons (minus the UTRs) end up specifying protein.
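The exon-intron layout in the diagram can be expressed as a tiny data structure. The sketch below is a toy model, not a real annotation format: the `Exon` class, the coordinates, and the sequence are all invented for the example, and the intron is written in lowercase only to make it easy to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int  # 0-based, inclusive
    end: int    # exclusive


def splice(pre_mrna: str, exons: list[Exon]) -> str:
    """Join the exon segments in order, dropping everything between them."""
    ordered = sorted(exons, key=lambda e: e.start)
    return "".join(pre_mrna[e.start:e.end] for e in ordered)


def exon_fraction(pre_mrna: str, exons: list[Exon]) -> float:
    """Fraction of the unspliced transcript that survives into the message."""
    return sum(e.end - e.start for e in exons) / len(pre_mrna)


if __name__ == "__main__":
    # Two 6-base exons around a 13-base intron (lowercase).
    pre_mrna = "AUGGCC" + "guaaguuuuucag" + "GAAUAA"
    exons = [Exon(0, 6), Exon(19, 25)]
    print(splice(pre_mrna, exons))         # AUGGCCGAAUAA
    print(exon_fraction(pre_mrna, exons))  # 0.48
```

In this toy case about half the transcript is exon; in real human genes the exon share is often far smaller, as the text notes.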

The switches that sit far from the gene

The promoter says where transcription starts, but it says little about how often, in which tissue, or at what stage of life. That decision is made largely by separate regulatory elements, the most famous being enhancers. An enhancer is a short stretch of DNA that binds regulatory proteins and dials a gene's transcription up. Its astonishing feature is that it can sit thousands, even hundreds of thousands, of bases away from the promoter it controls, sometimes inside an intron, sometimes downstream of the whole gene. Because DNA is a flexible, bendable molecule rather than a rigid ladder, the strand can loop so that a distant enhancer is brought physically next to the promoter, like folding a long ribbon so that two faraway points touch.

Enhancers do not act alone. Silencers turn transcription down, and insulators act like fences, stopping an enhancer from reaching across to genes it should not touch. A single gene is often governed by several such elements at once, each responding to different signals, and their combined vote sets the final rate. This is why the same gene can be loud in one cell type and silent in another even though the DNA letters are identical: the difference lies in which regulatory proteins are present to read these switches. We will see in the regulation rungs that this distributed, combinatorial control is the main reason a modest set of genes can build a richly varied organism.
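The 'combined vote' idea can be illustrated with a toy calculation. Everything here is an assumption made for the sake of the example: the multiplicative rule, the effect sizes, and the names `factor_A` and `factor_B` are hypothetical, and real regulatory logic is far more intricate. The point is only that identical DNA gives different outputs when different regulatory proteins are present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    factor: str    # hypothetical regulatory protein that must be present
    effect: float  # > 1 acts like an enhancer, < 1 acts like a silencer


def transcription_rate(basal: float, elements: list[Element], present: set[str]) -> float:
    """Toy model: each element whose protein is present scales the basal rate."""
    rate = basal
    for element in elements:
        if element.factor in present:
            rate *= element.effect
    return rate


if __name__ == "__main__":
    # The same set of switches sits in the DNA of every cell type...
    switches = [
        Element("enhancer", "factor_A", 10.0),
        Element("silencer", "factor_B", 0.1),
    ]
    # ...but each cell type carries different proteins to read them.
    print(transcription_rate(1.0, switches, {"factor_A"}))  # loud
    print(transcription_rate(1.0, switches, {"factor_B"}))  # nearly silent
```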

Bacteria do it differently — and more compactly

You met the prokaryote-eukaryote divide back in the foundations rung; here it shows up in the very architecture of a gene. A typical bacterial gene is strikingly streamlined. There are almost no introns, so the coding region usually runs continuously from start codon to stop codon. The bacterial genome is dense, with little spacer DNA, and genes are packed close together. The promoter is simpler too: instead of a TATA box read by a large committee of proteins, a bacterial promoter is recognized directly by a sigma factor, a swappable subunit of the RNA-making enzyme.

There is a deeper structural twist. Bacteria frequently bundle several related genes in a row under a single promoter, transcribing them all onto one shared RNA. This arrangement is the operon, and it lets a cell switch a whole set of related jobs (say, every enzyme needed to digest one sugar) on or off with one decision. Eukaryotes almost never do this; each of their genes typically gets its own promoter and its own message. So the contrast is sharp: a bacterial gene is a lean, continuous, often-shared instruction, while a eukaryotic gene is a long, interrupted, individually-regulated one with its switches scattered across the surrounding DNA.
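The 'several genes on one shared RNA' arrangement can be sketched by scanning a single message for successive open reading frames. This is a simplification: it treats any AUG as a start, whereas real bacterial ribosomes rely on a ribosome-binding site just upstream of each start codon. The sequences and the function name are invented for the example.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(mrna: str) -> list[str]:
    """Return successive non-overlapping reading frames (AUG ... stop).

    Scanning resumes right after each stop codon, so one message can
    yield several separate recipes.
    """
    orfs = []
    i = 0
    while i <= len(mrna) - 3:
        if mrna[i:i + 3] == "AUG":
            # Read codon by codon until an in-frame stop codon appears.
            for j in range(i + 3, len(mrna) - 2, 3):
                if mrna[j:j + 3] in STOP_CODONS:
                    orfs.append(mrna[i:j + 3])
                    i = j + 3
                    break
            else:
                i += 1  # no stop codon in frame; keep scanning
        else:
            i += 1
    return orfs


if __name__ == "__main__":
    operon_rna = "GG" + "AUGAAAUAA" + "CCGG" + "AUGCCCUGA" + "GG" + "AUGGGGUAG"
    print(find_orfs(operon_rna))  # ['AUGAAAUAA', 'AUGCCCUGA', 'AUGGGGUAG']
```

A typical eukaryotic message run through the same scan would yield a single reading frame, which is the contrast the paragraph draws.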

Why all this extra DNA? A gene is more than its protein

Step back and add it up. Promoter, two UTRs, several introns, and a constellation of distant enhancers, silencers, and insulators: in a human gene, the letters that actually specify the protein are usually a minority of the DNA involved. This is the gene-level face of something you met in the last guide, the gulf between coding and non-coding DNA. The extra material is not waste. It is the apparatus of control: it decides whether, when, where, and how much a gene speaks. Once 'junk DNA' was a fashionable label for everything non-coding; today we know much of it is doing exactly this regulatory work, even if some of it truly is inert.

The split structure pays a second dividend. Because the coding region is parcelled into exons, the cell can stitch them together in more than one way. Through alternative splicing, one gene's exons can be combined into several different final messages, each yielding a distinct protein. This is the molecular reason the old slogan 'one gene, one protein' is retired: the average human gene gives rise to more than one protein. The intron-exon layout is not merely tolerated clutter; it is what makes this versatility possible, letting roughly 20,000 genes encode a far larger repertoire of proteins.
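A rough way to see how exons multiply a gene's output is to count the messages available if internal exons may be either kept or skipped. The sketch below assumes exactly that simple scheme (first and last exon always kept, order preserved); real cells use only a regulated subset of such combinations and also have other splicing modes, so treat the count as an illustration rather than a prediction.

```python
from itertools import combinations


def isoforms(exons: list[str]) -> list[tuple[str, ...]]:
    """List the messages obtained by keeping the first and last exon and
    including any subset of the internal exons, in their original order.

    Requires at least two exons.
    """
    first, *internal, last = exons
    result = []
    for k in range(len(internal) + 1):
        for subset in combinations(internal, k):
            result.append((first, *subset, last))
    return result


if __name__ == "__main__":
    for message in isoforms(["E1", "E2", "E3", "E4"]):
        print("-".join(message))
    # E1-E4
    # E1-E2-E4
    # E1-E3-E4
    # E1-E2-E3-E4
```

Under this toy rule a gene with n exons allows 2^(n-2) messages, which shows how quickly the repertoire can outgrow the gene count.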

  1. Find the promoter and any enhancers: the launch pad and the volume knobs that decide if and how loudly the gene is read.
  2. Mark the transcription start, then the 5' UTR: the cover note read before the recipe begins.
  3. Trace the exons and introns: only the exons (minus the UTRs) carry protein recipe; the introns get cut out.
  4. End at the 3' UTR and polyadenylation signal: the closing label that says where to stop and how long the message survives.
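Step 3 in the list above, that only the exons minus the UTRs carry protein recipe, amounts to a small piece of interval arithmetic. The sketch below uses made-up coordinates and a hypothetical function name; it simply measures how much of each exon falls between the start codon and the end of the stop codon.

```python
def coding_length(exons: list[tuple[int, int]], cds_start: int, cds_end: int) -> int:
    """Count the exon bases that lie inside the coding window.

    Exons are (start, end) pairs with end exclusive. The coding window
    runs from the first base of the start codon (cds_start) to just past
    the stop codon (cds_end). Exon bases outside it are UTR.
    """
    total = 0
    for start, end in exons:
        overlap = min(end, cds_end) - max(start, cds_start)
        total += max(0, overlap)
    return total


if __name__ == "__main__":
    # Hypothetical gene spanning 2,300 bases from transcription start to end.
    exons = [(0, 100), (500, 650), (2000, 2300)]
    coding = coding_length(exons, cds_start=40, cds_end=2090)
    print(coding)         # 300
    print(coding / 2300)  # share of the transcribed span that encodes protein
```

Here a 40-base 5' UTR and a 210-base 3' UTR are trimmed from the outer exons, and only 300 of the 2,300 transcribed bases remain as protein recipe.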

Why dwell on anatomy now, before we have even watched transcription happen? Because every later chapter is a story about these parts. Transcription is the machinery reading the promoter and copying the exons-and-introns into RNA. RNA processing is the editing that removes the introns and splices the exons. Regulation is the conversation between enhancers, silencers, and the proteins that read them. By learning the layout first, you will recognize each player when it walks on stage — and you will already understand the deepest point: a gene is not its protein-coding sequence alone, it is that sequence plus all the instructions that govern when and how it is used.