Finding one address in a billion-letter book
In the previous guide you met transcription as a whole (DNA copied into RNA, in three acts of initiation, elongation, and termination) and you met [[molbio-rna-polymerase|RNA polymerase]], the crab-claw enzyme that does the actual writing. But that left a genuinely hard question hanging. A bacterial chromosome is a few million base pairs long; a large human one runs into the hundreds of millions. Every one of those base pairs is built from the same four letters. So how does the polymerase know *where* a gene begins, out of all that look-alike sequence? It cannot afford to read through the whole genome letter by letter hoping to stumble on a likely spot.
The answer is that genes do not begin silently. Just before a gene there sits a short, recognizable stretch of DNA, a posted address, and the polymerase is built to spot exactly that pattern. This signpost is the [[molbio-promoter|promoter]]. It is a piece of DNA, not a protein, and it is not itself copied into the useful part of the RNA; it is pure instruction. A promoter says three things at once: *start here*, *read this strand*, and *go in this direction*. Because its sequence is asymmetric and so has a fixed orientation, the way a promoter is placed automatically decides which of the two strands is the template the polymerase reads, and which way the enzyme will travel.
Two boxes: the bacterial promoter up close
The cleanest place to learn how a promoter works is in bacteria such as E. coli, the workhorse model organism from the foundations rung. A bacterial promoter is compact, and almost all of its recognition rides on two short motifs of DNA. One sits around 10 base pairs upstream of the start site — the -10 box, also called the Pribnow box after the scientist who spotted it. The other sits around 35 base pairs upstream — the -35 box. The polymerase does not need to read the whole gene to find its start; it just needs to find these two little landmarks the right distance apart, and the start site falls predictably just downstream of them.
Each box has a 'typical' sequence, called the consensus. For the common E. coli promoter the -10 box reads close to 5'-TATAAT-3' and the -35 box close to 5'-TTGACA-3', written on the coding strand. The word consensus is honest about something important: hardly any real promoter matches these letters exactly. The consensus is a summary of many promoters, the most common letter at each position, and each real promoter resembles it to a greater or lesser degree. That A-T-rich -10 box is no accident. Recall from the nucleic-acid rungs that an A-T pair is held by only two hydrogen bonds while a G-C pair has three, so an A-T-rich stretch is the easiest place to peel the two strands apart, which is exactly what must happen right here for copying to begin.
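The idea of a consensus can be made concrete with a few lines of Python. This is a minimal sketch, not a real motif-finding method: the six -10 boxes below are invented for illustration (they are not measured promoter data), and the function name `consensus` is simply a label chosen here.

```python
from collections import Counter

def consensus(aligned_sites):
    """Return the most common base at each position of equal-length sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    # zip(*sites) walks the sites column by column (position 1, position 2, ...)
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_sites))

# Hypothetical -10 boxes, made up for this example (assumption, not real data).
example_minus10_boxes = ["TATAGT", "TATGAT", "TACAAT",
                         "GATAAT", "TATACT", "TATAAA"]

summary = consensus(example_minus10_boxes)
print(summary)                           # consensus of the six boxes
print(summary in example_minus10_boxes)  # is any single box a perfect match?
```

In this toy set every box differs from the summary by one letter, which mirrors the point in the text: the consensus describes the family without any one member having to match it exactly.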
       -35 box      17 bp spacer      -10 box      +1
5'...T T G A C A....................T A T A A T....N N N...gene-->3'  coding strand
3'...A A C T G T....................A T A T T A....N N N...gene-->5'  template strand
     ^^^^^^^^^^^                    ^^^^^^^^^^^    ^
     sigma reads here               Pribnow box    start site (first RNA base)
upstream <----------------------------------------> downstream
Notice the gap between the boxes in that sketch. The spacing matters as much as the sequences. The two motifs sit roughly 17 base pairs apart, and that distance is no accident either: it is the spacing that lets one polymerase molecule touch *both* boxes at the same time, the way a single hand can grip two rungs of a ladder only if they are the right distance apart. A promoter with the boxes too close together or too far apart binds the polymerase poorly even if both sequences are otherwise perfect.
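The "two landmarks the right distance apart" rule can be sketched as a simple scan over a coding-strand sequence. This is a toy illustration under stated assumptions: the allowed spacer lengths (16 to 18 bp), the tolerance of two mismatches per box, and the test sequences are all choices made here, not values from this guide beyond the 17 bp spacing and the two consensus boxes.

```python
def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoter_candidates(seq, minus35="TTGACA", minus10="TATAAT",
                             spacers=(16, 17, 18), max_mismatches=2):
    """Return (start_of_-35, spacer_length, start_of_-10) for each candidate.

    The spacer range and mismatch tolerance are illustrative assumptions.
    """
    hits = []
    n35, n10 = len(minus35), len(minus10)
    for i in range(len(seq) - n35 + 1):
        if mismatches(seq[i:i + n35], minus35) > max_mismatches:
            continue  # no plausible -35 box here
        for gap in spacers:
            j = i + n35 + gap  # where a -10 box would have to start
            if j + n10 > len(seq):
                continue
            if mismatches(seq[j:j + n10], minus10) <= max_mismatches:
                hits.append((i, gap, j))
    return hits

# Hypothetical sequences: identical boxes, different spacing.
good_spacing = "GGCC" + "TTGACA" + "C" * 17 + "TATAAT" + "GGCCGGA"
bad_spacing = "GGCC" + "TTGACA" + "C" * 12 + "TATAAT" + "GGCCGGA"

print(find_promoter_candidates(good_spacing))
print(find_promoter_candidates(bad_spacing))
```

Both test sequences carry perfect boxes, yet only the one with the 17 bp spacer yields a candidate, which is the point of the ladder-rung analogy.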
Sigma: the part that does the reading
Here is a subtlety that catches many people: the core RNA polymerase, the part that builds RNA, cannot actually find a promoter on its own. Left to itself the core enzyme sticks to DNA almost anywhere, with no idea where genes start. The promoter-reading is done by a separate, detachable protein called the [[sigma-factor|sigma factor]] (written with the Greek letter σ). Snap a sigma factor onto the core enzyme and you get the complete, search-capable machine: the [[bacterial-promoter-and-sigma-factor|holoenzyme]]. The core writes; sigma reads the address.
Sigma physically recognizes the -10 and -35 boxes. Parts of it reach into the DNA's major groove — the wider of the two spiral channels you met in the double-helix guide, where the edges of the base pairs are readable from outside without prying the strands apart — and make contacts that 'feel' the right sequence, much like a key feeling the shape of a lock. Crucially, sigma is detachable for a reason: a single core enzyme can pair with *different* sigma factors, and each sigma reads a different flavour of promoter. E. coli's everyday sigma (called sigma-70) handles most housekeeping genes, but when the cell is heat-shocked or starving it deploys alternative sigmas that recognize different boxes, switching on whole emergency programs of genes at once. Swapping the address-reader is itself a way to control which genes get transcribed.
Strong, weak, and the loudness of a gene
Now the payoff, and it is the deepest idea in this guide. Promoters are not simply 'present or absent.' How *closely* a promoter matches the consensus sequence sets how readily the holoenzyme grabs it — and therefore how often that gene gets transcribed. A promoter whose boxes are near-perfect copies of TATAAT and TTGACA, the right 17 bp apart, is a strong promoter: the polymerase binds it eagerly and fires again and again, churning out many RNA copies. A promoter whose boxes are sloppier matches is a weak promoter: the polymerase binds it rarely, so the gene is transcribed only now and then. The sequence itself is a volume knob.
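The volume-knob idea can be sketched as a crude score: count how many letters of the two boxes match the consensus, and subtract a penalty when the spacer departs from 17 bp. Everything here is a hedged toy. The scoring rule, the one-point-per-base-pair penalty, and the three example promoters are assumptions made for illustration; real promoter strength is measured experimentally and depends on more than letter matches.

```python
CONSENSUS_35 = "TTGACA"
CONSENSUS_10 = "TATAAT"
IDEAL_SPACER = 17

def match_count(box, consensus):
    """Number of positions where a box agrees with the consensus."""
    return sum(a == b for a, b in zip(box, consensus))

def toy_strength(minus35, spacer_len, minus10):
    """Crude strength proxy: up to 12 matches, minus 1 per bp of spacer error.

    The penalty weight is arbitrary (an assumption for this sketch).
    """
    matches = (match_count(minus35, CONSENSUS_35)
               + match_count(minus10, CONSENSUS_10))
    return matches - abs(spacer_len - IDEAL_SPACER)

# Hypothetical promoters: (-35 box, spacer length, -10 box).
promoters = {
    "near_consensus": ("TTGACA", 17, "TATAAT"),
    "middling": ("TTGACT", 17, "TATGAT"),
    "weak": ("CTGTCA", 18, "TACGCT"),
}

ranked = sorted(promoters,
                key=lambda name: toy_strength(*promoters[name]),
                reverse=True)
for name in ranked:
    print(name, toy_strength(*promoters[name]))
```

The ranking, not the absolute numbers, is what carries over to biology: closer matches and better spacing mean the holoenzyme engages more readily.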
This is why the very *sequence* of a promoter is a layer of built-in regulation, set before any regulatory protein ever shows up. A cell wants buckets of ribosomal RNA all the time, so the genes for it sit behind blazingly strong promoters. It wants only a trickle of certain regulatory proteins, so those hide behind deliberately weak ones. And the knob is not fixed at one setting: regulatory proteins you will meet soon — activators that help the polymerase bind, repressors that block it — work largely by tweaking how well the polymerase engages this same promoter. This is the concrete reason that the start of transcription is the cell's main control point: change how easily a gene's start gets read, and you change how much of that gene's product the cell makes.
Engineers borrow this knob shamelessly. When a lab wants a bacterium to crank out gobs of a useful protein — insulin, say — they place the gene behind a famously strong promoter, and to make it switchable they often add an operator the cell can block, so the gene stays off until they flip it on. That whole trick, which you will see in detail in the gene-regulation rung, only works because promoter strength is a real, tunable, sequence-encoded quantity.
Opening the helix: the bubble and the hybrid
Recognizing the promoter is only the first move. Finding the address does not yet copy anything — to read a base you must expose it, and the bases are hidden on the inside of the double helix, paired up and stacked like the rungs buried in the middle of a twisted ladder. So once the holoenzyme is locked onto the promoter, it pries the two strands apart over a short stretch — about a dozen base pairs — turning closed double-stranded DNA into a little open pocket of unpaired single strands. That melted pocket is the [[molbio-transcription-bubble|transcription bubble]].
Two honest details about the bubble. First, the polymerase opens it by itself — unlike DNA replication, transcription needs no separate helicase to unzip the strands; the enzyme is its own unwinder. Second, the bubble does not stay parked. Once copying gets going the whole bubble travels with the enzyme down the gene, melting fresh DNA at its front edge and letting the strands snap closed again behind, so only a short window is ever open at one time. Picture a small moving zone of unzipped fabric sliding along a long closed zipper — opening just ahead, re-closing just behind.
Inside the bubble, something neat happens. As the polymerase reads the template strand and lays down RNA, the newest few RNA letters stay base-paired to the template they were just copied from. For a stretch of roughly 8 or 9 base pairs, you have one strand of DNA paired with a strand of RNA: a short RNA-DNA hybrid. It is held by the same base-pairing logic as ordinary DNA, just with RNA's uracil standing in for thymine, so the pairs are A-U and G-C rather than A-T and G-C. This hybrid is what keeps the growing end of the RNA correctly registered against its template while each new nucleotide is being joined on. A little further back the RNA peels off the template and threads out of the enzyme, the two DNA strands re-pair behind the bubble, and the single-stranded RNA goes on its way.
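The pairing rule behind the hybrid is easy to write down. The sketch below takes a template strand written 3'-to-5' (left to right), builds the complementary RNA 5'-to-3', and then picks out the newest letters as the stretch still paired to the template. The example template is invented, and the default hybrid length of 9 is just the upper end of the "8 or 9" range mentioned above.

```python
# Template DNA base -> RNA base laid down opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5):
    """RNA (5'->3') complementary to a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5)

def hybrid_window(rna, hybrid_len=9):
    """The newest RNA letters, i.e. those still paired with the template."""
    return rna[-hybrid_len:]

# Hypothetical template strand, written 3'->5' (assumption for illustration).
template = "TACGGCATTAGCCGA"

rna = transcribe(template)
print(rna)                 # full transcript so far, 5'->3'
print(hybrid_window(rna))  # the 3' end that is still in the RNA-DNA hybrid
```

Note that the transcript reads the same as the coding strand would, with U in place of T, and that only its most recent end is held in the hybrid; the older 5' portion has already peeled away.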
Putting it together: from address to first letter
Let us walk the whole opening sequence once, in the order it happens at a bacterial gene. Each step sets up the next, and together they are exactly what transcription initiation means at the molecular level.
- The core enzyme picks up a sigma factor, forming the holoenzyme — the complete machine that can recognize promoters.
- The holoenzyme slides and hops along the DNA until sigma recognizes a -35 box and a -10 box the right distance apart, and binds — this loose docking on closed double-stranded DNA is the 'closed complex.'
- The enzyme melts open about a dozen base pairs around the start site, exposing the template strand — the 'open complex,' which is the transcription bubble.
- Reading the exposed template, the polymerase joins the first few ribonucleotides into RNA, building 5'-to-3' and usually starting with a purine (A or G) at +1 — a short RNA-DNA hybrid forms inside the bubble.
- Once a real transcript is underway, sigma lets go and drifts off to find another core enzyme; the core, now committed, clears the promoter and switches into steady elongation down the gene.
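The five steps above can be summarised as an ordered series of states. This is only a bookkeeping sketch: the stage names and the helper functions `next_stage` and `sigma_bound` are labels invented here, and the real process is not strictly one-way (the enzyme can fall back or abort along the way).

```python
from enum import Enum

class Stage(Enum):
    CORE = "core enzyme, no sigma"
    HOLOENZYME = "core + sigma, searching DNA"
    CLOSED_COMPLEX = "bound at promoter, DNA still double-stranded"
    OPEN_COMPLEX = "about a dozen bp melted: the transcription bubble"
    INITIAL_TRANSCRIPTION = "first few ribonucleotides joined"
    ELONGATION = "sigma released, core clears the promoter"

ORDER = list(Stage)  # Enum members iterate in definition order

def next_stage(stage):
    """Return the stage that follows, in the simplified linear picture."""
    index = ORDER.index(stage)
    if index == len(ORDER) - 1:
        raise ValueError("elongation is the last stage in this sketch")
    return ORDER[index + 1]

def sigma_bound(stage):
    """Sigma is on the enzyme from holoenzyme formation until elongation."""
    return stage not in (Stage.CORE, Stage.ELONGATION)

stage = Stage.CORE
while stage is not Stage.ELONGATION:
    print(f"{stage.name:22} sigma bound: {sigma_bound(stage)}")
    stage = next_stage(stage)
print(f"{stage.name:22} sigma bound: {sigma_bound(stage)}")
```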
One last honest nuance, because it is a classic stumbling point. Getting started is the slow, hard part — finding the promoter, melting the DNA, and escaping the promoter are the rate-limiting hurdles, and the polymerase often stutters here, making and dropping a few useless tiny RNAs before it succeeds. Once it is past that and elongating happily, it can add tens of nucleotides per second. That is exactly why initiation, not elongation, is where regulation concentrates: it is the bottleneck, and a bottleneck is the natural place to install a valve. With sigma gone and the core enzyme striding into the gene, the next guide picks up the story — how the elongating polymerase reads on, and how it eventually knows to stop.