Turning Words into Numbers

A model only speaks numbers

You already know from the earlier rungs that a neural network is, at heart, a stack of matrix multiplies with simple nonlinearities in between: it eats vectors of numbers and emits vectors of numbers. It has no slot for the letter *q* or the word *cat*. So the very first thing any text system must do is convert human writing into numbers the math can chew on. That conversion is the unglamorous front door of all natural language processing, and how it is done shapes everything downstream.

The trick is that text is a *sequence of discrete symbols*, not a continuous quantity. A temperature of 21.5 degrees is already a number; the word *however* is not. So we play a two-stage game: first we cut the text into units (this is tokenization), then we map each unit to an integer ID by looking it up in a fixed list. The integers themselves carry no meaning yet — they are just addresses. Meaning gets attached later, when each ID indexes into an embedding table the network learns during training.
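A minimal sketch of that two-stage game, assuming the crudest possible splitter (lowercase, then split on whitespace) and a hypothetical five-entry vocabulary. The names `tokenize`, `vocab`, and `encode` are illustrative and do not come from any library:

```python
# Stage 1: cut the text into units. Whitespace splitting is used here
# only because it is the simplest thing that works for this sentence.
def tokenize(text: str) -> list[str]:
    return text.lower().split()

# Stage 2: a fixed list of known units; each unit's position is its ID.
# This tiny vocabulary is hypothetical.
vocab = ["the", "cat", "sat", "on", "mat"]
token_to_id = {token: i for i, token in enumerate(vocab)}

def encode(text: str) -> list[int]:
    # A plain lookup: the integers are addresses, not meanings.
    return [token_to_id[token] for token in tokenize(text)]

ids = encode("The cat sat on the mat")
print(ids)  # [0, 1, 2, 3, 0, 4]
```

Note that both occurrences of *the* get the same ID, and that nothing about the number 1 says "cat"; any other assignment of integers would serve equally well.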

Where do you cut? Words, characters, and the trouble with both

The obvious idea is to split on spaces and treat each *word* as a token. Early NLP did exactly this, and it pairs naturally with the count-based methods you met before — bag-of-words and TF-IDF both assume a clean list of words. But word-level tokenization has two stubborn problems. First, the vocabulary explodes: English alone has millions of word forms once you count plurals, tenses, typos, and names. Second, you will *always* meet words at test time that were never in your list — the dreaded out-of-vocabulary problem — and a pure word model has no choice but to throw them away as a single "unknown" symbol.
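The out-of-vocabulary problem is easy to reproduce. The sketch below assumes a toy three-sentence training corpus and the conventional placeholder name `<unk>`; both are assumptions made for illustration, as are the helper names:

```python
from collections import Counter

UNK = "<unk>"  # a common placeholder name, chosen here by convention

def build_word_vocab(corpus: list[str], max_size: int) -> dict[str, int]:
    """Keep the most frequent words; reserve ID 0 for the unknown symbol."""
    counts = Counter(word for line in corpus for word in line.lower().split())
    words = [word for word, _ in counts.most_common(max_size - 1)]
    return {token: i for i, token in enumerate([UNK] + words)}

def encode_words(text: str, vocab: dict[str, int]) -> list[int]:
    # Any word missing from the vocabulary is replaced by the <unk> ID.
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]

train_corpus = ["the cat sat", "the dog sat", "the cat ran"]
word_vocab = build_word_vocab(train_corpus, max_size=10)

# "zebra" and "sprinted" never appeared in training, so both collapse to
# ID 0 and the model can no longer tell them apart.
test_ids = encode_words("the zebra sprinted", word_vocab)
print(test_ids)  # [1, 0, 0]
```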

The opposite extreme is to use *characters* as tokens. Now your vocabulary is tiny (a few hundred symbols for an alphabetic script, several thousand if you also cover a script like Chinese) and almost nothing is ever out-of-vocabulary, because you can spell anything. The price is that sequences become very long, and the model has to relearn from scratch that the letters *c-a-t* tend to travel together. Retreating to words is not a universal escape either: in Chinese, Japanese, or Thai there are no spaces to split on at all, so "just use words" was never even an option. The field needed a middle path.
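To make the length cost concrete, here is a small sketch that tokenizes one sentence both ways. The sentence and variable names are illustrative assumptions:

```python
text = "the cat sat on the mat"

word_tokens = text.split()
char_tokens = list(text)  # every character, spaces included, is a token

# Character vocabulary: the distinct symbols seen, sorted for stable IDs.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

print(len(word_tokens))  # 6
print(len(char_tokens))  # 22
print(len(char_vocab))   # 10

# A word that never appeared in the text can still be spelled out,
# as long as each of its letters is in the character vocabulary.
moth_ids = [char_vocab[ch] for ch in "moth"]
print(moth_ids)  # [5, 7, 9, 4]
```

The same sentence is almost four times longer at the character level, while the vocabulary shrinks to ten symbols and the unseen word *moth* encodes without trouble.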

Notice this is really a tradeoff in disguise. Word tokens are information-dense, but the vocabulary is huge and brittle; character tokens are robust and need only a small vocabulary, but they force long, low-level sequences. Older pre-processing habits do not rescue either side: removing stop-words (the *the*, *of*, *and* that count-based methods loved to drop) made sense for word counting, but it actively hurts modern models, which need every word to model fluent language. The winning answer keeps frequent words whole and splits rare ones into pieces.

Subwords: byte-pair encoding splits the difference

The dominant compromise is subword tokenization, and its most famous recipe is byte-pair encoding (BPE). The idea is delightfully simple, borrowed from a 1990s data-compression trick. Start with every text broken into individual characters. Then repeatedly find the most frequent adjacent pair of symbols and merge it into a new single symbol. Do this thousands of times, and common chunks like *th*, *ing*, *tion*, and whole frequent words like *the* naturally crystallize into their own tokens, while a rare word gets left as a handful of smaller pieces.

start: l o w e r _   n e w e s t _
# suppose (e, s) is the most frequent pair across the whole training corpus -> merge
step1: l o w e r _   n e w es t _
# suppose the next most frequent pair is (es, t) -> merge
step2: l o w e r _   n e w est _
# ... after many merges:
final: low er _   new est _
# 'lower' -> [low, er]   'newest' -> [new, est]   (_ marks the end of a word)

BPE in miniature: merge the most frequent adjacent pair, repeat. Rare words end up as reusable pieces instead of a single "unknown".
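The merge loop above can be sketched in a few lines of Python. This is a toy version under stated assumptions: the word counts are hypothetical, words are pre-split on whitespace, `_` is used as the end-of-word marker as in the trace, and ties are broken by whichever pair was seen first. The function names `merge_pair` and `learn_bpe` are illustrative. Production tokenizers are considerably more optimized and differ in details.

```python
from collections import Counter

def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace each left-to-right occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn an ordered list of merges from a word-frequency table."""
    # Each word starts as its characters plus the end-of-word marker "_".
    corpus = {tuple(word) + ("_",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, n in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += n
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first seen wins
        merges.append(best)
        corpus = {merge_pair(symbols, best): n for symbols, n in corpus.items()}
    return merges

# Hypothetical toy corpus, given as word -> number of occurrences.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
learned = learn_bpe(word_counts, num_merges=5)
print(learned[:2])  # [('e', 's'), ('es', 't')]
```

With these counts the first two merges match the trace: (e, s) wins because *newest* and *widest* together contribute nine occurrences, and (es, t) follows for the same reason. The number of merges is the knob that sets the final vocabulary size.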

This single idea solves both earlier problems at once. There is essentially no out-of-vocabulary problem anymore: in the worst case a strange word falls back to its characters or raw bytes, and with a byte-level base vocabulary *any* string can be encoded. And the vocabulary stays a fixed, manageable size that you choose up front by deciding how many merges to perform: typically 30,000 to 100,000 tokens, though some recent models use larger ones. Relatives of BPE (WordPiece, used by the BERT family, and the Unigram method, commonly run through the SentencePiece library) differ in *how* they pick the pieces, but they all share the same spirit: a learned middle ground between letters and words.
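Encoding a new word means replaying the learned merges, in order, on that word's characters. The sketch below assumes a hypothetical five-entry merge list and drops the end-of-word marker for brevity; it shows only the character fallback, not the byte-level one. The name `apply_merges` is illustrative:

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into characters, then replay the merges in learned order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order it was learned.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("lowest", merges))  # ['low', 'est']
print(apply_merges("lower", merges))   # ['low', 'er']
print(apply_merges("xylem", merges))   # ['x', 'y', 'l', 'e', 'm']
```

The word *lowest* is assembled entirely from pieces learned on other words, while *xylem*, which matches no merge, degrades gracefully to single characters instead of becoming "unknown".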

The vocabulary and the full pipeline to model inputs

The output of training BPE is a vocabulary: an ordered list that assigns each token a fixed integer ID, plus the list of merge rules. This vocabulary is frozen once and shipped with the model — encoding and decoding are deterministic lookups, not learning. A few special tokens get reserved slots too, for example a padding token to fill out short sequences and markers for the start or end of a passage. Decoding is just the reverse: take the model's output IDs, look up their pieces, and glue them back into text.
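A frozen vocabulary with reserved special tokens, plus the encode and decode lookups, might look like the sketch below. The token names `<pad>`, `<bos>`, and `<eos>`, the eight-entry vocabulary, and the use of `_` as a word boundary are all assumptions for illustration; real tokenizers choose their own special tokens and boundary conventions.

```python
# Hypothetical frozen vocabulary: special tokens first, then learned pieces.
id_to_token = ["<pad>", "<bos>", "<eos>", "low", "er", "new", "est", "_"]
token_to_id = {token: i for i, token in enumerate(id_to_token)}
SPECIALS = {"<pad>", "<bos>", "<eos>"}

def encode_tokens(tokens: list[str]) -> list[int]:
    """Wrap already-tokenized pieces in start/end markers and look up IDs."""
    body = [token_to_id[t] for t in tokens]
    return [token_to_id["<bos>"]] + body + [token_to_id["<eos>"]]

def decode_ids(ids: list[int]) -> str:
    """Drop special tokens, glue the pieces, and turn "_" back into spaces."""
    pieces = [id_to_token[i] for i in ids if id_to_token[i] not in SPECIALS]
    return "".join(pieces).replace("_", " ").strip()

ids = encode_tokens(["low", "er", "_", "new", "est", "_"])
print(ids)              # [1, 3, 4, 7, 5, 6, 7, 2]
print(decode_ids(ids))  # lower newest
```

Nothing here is learned at encoding time: both directions are dictionary or list lookups against a table that was fixed when the tokenizer was trained.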

Put together, the path from raw text to model input has five steps:
1. Normalize: lowercase if the model expects it, fix Unicode quirks, sometimes strip accents. This is light, model-specific cleanup.
2. Tokenize: apply the learned merge rules to split the text into subword tokens.
3. Map to IDs: look each token up in the vocabulary to get its integer ID.
4. Pad or truncate: make the sequence a fixed length so many examples stack into one batch.
5. Embed: each ID indexes a learned vector; now the network finally has real numbers to compute on.
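The whole pipeline fits in a short sketch. Several pieces are stand-ins and should be read as assumptions: the six-entry vocabulary is hypothetical, the `presplit` table replaces the real merge-replay step to keep the example short, and the embedding table is filled with random numbers where a real model would have learned values.

```python
import random
import unicodedata

vocab = {"<pad>": 0, "<unk>": 1, "low": 2, "er": 3, "new": 4, "est": 5}
# Stand-in for replaying merge rules: a hypothetical table of pre-split words.
presplit = {"lower": ["low", "er"], "newest": ["new", "est"], "low": ["low"]}

def normalize(text: str) -> str:
    # NFKC folds Unicode compatibility forms; lowercasing is a model choice.
    return unicodedata.normalize("NFKC", text).lower()

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.split():
        tokens.extend(presplit.get(word, ["<unk>"]))
    return tokens

def to_ids(tokens: list[str]) -> list[int]:
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

def pad_or_truncate(ids: list[int], length: int) -> list[int]:
    # Append padding, then cut to the target length.
    return (ids + [vocab["<pad>"]] * length)[:length]

# Random numbers stand in for a learned table: one row per vocabulary entry.
random.seed(0)
embed_dim = 4
embedding_table = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in vocab]

def embed(ids: list[int]) -> list[list[float]]:
    return [embedding_table[i] for i in ids]

texts = ["Lower", "newest LOW lower"]
batch_ids = [pad_or_truncate(to_ids(tokenize(normalize(t))), 4) for t in texts]
vectors = [embed(ids) for ids in batch_ids]

print(batch_ids)  # [[2, 3, 0, 0], [4, 5, 2, 2]]
print(len(vectors), len(vectors[0]), len(vectors[0][0]))  # 2 4 4
```

The first example is padded with two `<pad>` IDs and the second loses its final piece to truncation, which is why both rows come out the same length and can be stacked into one batch.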

That ID-to-vector step is worth pausing on, because it connects directly to the next guides in this rung. Mathematically, looking up row *i* of the embedding table is identical to multiplying a one-hot vector by that table — the integer ID is just a compact way to name a row. The vectors that come out are the famous word vectors; word2vec and GloVe were early standalone ways to learn them, and you will meet those next. Modern models simply learn the embedding table jointly with everything else.
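The one-hot equivalence can be checked directly. The sketch uses a small hypothetical table of four tokens with three dimensions each, and plain Python lists instead of a numerical library; `one_hot` and `vec_mat` are illustrative helper names:

```python
# A tiny hypothetical embedding table: 4 tokens, 3 dimensions each.
E = [
    [0.1, 0.2, 0.3],
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
    [-0.2, 0.4, 0.9],
]

def one_hot(i: int, size: int) -> list[float]:
    return [1.0 if j == i else 0.0 for j in range(size)]

def vec_mat(v: list[float], M: list[list[float]]) -> list[float]:
    """Row vector times matrix: (1 x V) times (V x D) gives (1 x D)."""
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

token_id = 2
by_lookup = E[token_id]
by_matmul = vec_mat(one_hot(token_id, len(E)), E)
print(by_lookup == by_matmul)  # True
```

The multiplication does a full pass over the table only to zero out every row but one, which is why implementations index the row directly.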

Why this dull step quietly matters

It is tempting to treat tokenization as plumbing and move on, but its choices leak into everything. Because tokens are not words, a model can struggle with tasks that need a character-level view: counting the letters in a word, reversing a string, or doing arithmetic where digits split awkwardly. These are often tokenization artifacts rather than deep reasoning failures. Likewise, languages that were rare or absent in the tokenizer's training data get chopped into many more tokens, so the same text costs more, fits less into the context window, and can perform worse. That is a real, measurable fairness issue, not a rumor.
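One rough way to see where the cost gap starts is to count UTF-8 bytes, on the assumption that the tokenizer is byte-level, so bytes are its starting symbols before any merges. This is only a proxy: actual token counts depend on the merges a particular tokenizer learned, which the sketch does not model. The sample greetings are illustrative.

```python
# One short greeting per script; the choice of words is illustrative.
samples = {
    "English": "hello",
    "Hindi": "नमस्ते",
    "Thai": "สวัสดี",
}

stats = {}
for language, word in samples.items():
    n_chars = len(word)                  # Unicode code points
    n_bytes = len(word.encode("utf-8"))  # starting symbols for byte-level BPE
    stats[language] = (n_chars, n_bytes)
    print(f"{language}: {n_chars} code points, {n_bytes} bytes")
```

ASCII letters take one byte each, while Devanagari and Thai code points take three, so text in those scripts begins with roughly three times as many symbols. If the merges were learned mostly from English text, few of them will apply to shrink that count back down.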

There is a healthy honesty to keep here. Subword tokenization is a clever engineering hack, not a theory of language — it has no idea what a morpheme is, and the pieces it finds (*est*, *tion*) only sometimes line up with real linguistic units. Researchers keep asking whether we could drop it entirely and feed raw bytes to the model; a few systems do, at the cost of longer sequences. For now, BPE-style tokenization remains the quiet, near-universal first step that turns your sentence into the row of integers a large language model actually consumes.