Word2Vec & the Geometry of Meaning

From symbols to coordinates

In the previous guide you learned to chop text into tokens. But a token like "dog" is still just a symbol, a label with no inner structure. The oldest fix was one-hot encoding: give every word its own slot in a giant vector, a single 1 surrounded by zeros. With a 50,000-word vocabulary, "dog" and "cat" become 50,000-dimensional vectors that are perfectly, uselessly orthogonal. The machine has no way to know they are both animals; every pair of distinct words is exactly the same distance apart, so distance tells you nothing.
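To make the orthogonality concrete, here is a minimal sketch with a hypothetical four-word vocabulary (the words and helper names are chosen for illustration, not taken from any library). Whatever two distinct words you pick, the dot product is 0 and the straight-line distance is the square root of 2.

```python
import math

# Hypothetical toy vocabulary; a real one would have tens of thousands of entries.
vocab = ["dog", "cat", "democracy", "mat"]

def one_hot(word):
    """Return a vector with a single 1 in the word's slot and 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

dog, cat, democracy = one_hot("dog"), one_hot("cat"), one_hot("democracy")

# Both dot products are 0: the vectors share no nonzero slot.
print(dot(dog, cat), dot(dog, democracy))
# Both distances are sqrt(2): "cat" is no closer to "dog" than "democracy" is.
print(euclidean(dog, cat), euclidean(dog, democracy))
```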

The leap of word embeddings is to replace that sparse, meaningless slot with a short, dense vector of real numbers — say 300 of them — learned from data. Each word becomes a point in a 300-dimensional space, and the crucial promise is that distance reflects meaning: "dog" and "cat" sit near each other, "dog" and "democracy" sit far apart. An embedding is exactly this: a learned map from discrete symbols into a continuous space where geometry carries information.

This also tackles a quieter problem from earlier rungs: the curse of dimensionality. One-hot vectors grow with the vocabulary and stay sparse forever; a dense 300-dimensional embedding is a compact summary that the rest of a model can actually compute with. Fewer dimensions, more meaning per dimension.

"You shall know a word by the company it keeps"

Where do these coordinates come from? Nobody hand-labels them. The answer rests on the distributional hypothesis, the linguist J. R. Firth's 1957 slogan: "You shall know a word by the company it keeps." Words that appear in similar contexts tend to mean similar things. You have never been told the definition of "wug," but read "I fed the hungry wug" and "the wug curled up by the fire" and you already suspect it is a pet. Meaning, the hypothesis claims, leaks out of context.

Earlier NLP already half-used this idea. Bag-of-words and n-gram counts, sharpened by TF-IDF, notice which words co-occur, but they treat each word as a separate column and never compress that co-occurrence into a shared geometry. Embeddings take the distributional hypothesis literally and turn it into a learning objective: arrange the vectors so that a word's position predicts its neighbors.

How word2vec actually learns

In 2013, Mikolov and colleagues at Google released word2vec, and it made embeddings cheap and shockingly good. The most popular variant, skip-gram, plays a simple guessing game. Slide a window over the text; at each position, take the center word and try to predict the words around it. "The cat sat on the mat" — given "sat," the model should make "cat," "on," "the" likely. Nobody supplies labels; the text labels itself. This is self-supervised learning — supervision conjured from raw data.
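The window-sliding step can be sketched in a few lines. The function name `skipgram_pairs` and the window size of 2 are assumptions made for this example; the point is only that the training pairs fall out of the raw text with no labels supplied.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every position in the token list."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                yield center, tokens[j]

tokens = "the cat sat on the mat".split()
pairs = list(skipgram_pairs(tokens, window=2))

# Context words the model should learn to predict from the center word "sat".
sat_contexts = [ctx for center, ctx in pairs if center == "sat"]
print(sat_contexts)   # ['the', 'cat', 'on', 'the']
```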

Mechanically it is the shallowest possible neural network: each word has an input vector and an output vector, and the model scores a context word by the dot product of the two. A high dot product means "these two belong together." The scores run through a softmax to become probabilities, and training nudges the vectors — by ordinary gradient descent — so real neighbors score high and random words score low. After millions of windows, words that keep similar company drift into the same region of space. The embedding is not the output; it is the by-product, the rows of the input matrix.
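As a rough illustration of the scoring step, the sketch below takes one input vector and three output vectors, scores each candidate by dot product, and runs the scores through a softmax. The 2-dimensional vectors are hand-picked for the example, not learned, and the variable names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hand-picked toy vectors, for illustration only.
vec_in_sat = [1.0, 0.5]                      # input vector of the center word "sat"
vec_out = {
    "cat":       [0.9, 0.6],
    "on":        [0.8, 0.2],
    "democracy": [-0.7, 0.1],
}

words = list(vec_out)
scores = [dot(vec_in_sat, vec_out[w]) for w in words]
probs = dict(zip(words, softmax(scores)))

# Higher dot product means higher probability of being a neighbor.
print(probs)
```

Training would adjust the vectors so that the probabilities of real neighbors rise; here they are fixed, so the block shows only the forward pass.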

One practical trick made it scale: instead of computing softmax over the whole 50,000-word vocabulary every step (expensive), word2vec uses negative sampling: push up the score for the true neighbor, push down the scores for a handful of random "negative" words. A year later GloVe (Stanford, 2014) reached similar vectors from the other direction: rather than scanning local windows, it factorizes a global word-co-occurrence count matrix. Two routes, one destination: both are flavors of representation learning.
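To see what a global co-occurrence count matrix looks like, here is a small sketch that builds one as a dictionary of pair counts. The two-sentence corpus, the window of 2, and the function name are assumptions for illustration. This is only the counting step; GloVe itself then fits vectors to the logarithm of such counts, which is not shown here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(corpus)

# "sat" and "on" are neighbors in both sentences; "cat" and "dog" never meet.
print(counts[("sat", "on")], counts[("cat", "dog")])   # 2 0
```

Notice that "cat" and "dog" never co-occur directly, yet they share the same neighbors ("the", "sat", "on"). That shared company is what pulls their vectors together.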

# skip-gram, in one breath
for (center, context) in slide_window(corpus):
    # raise the true neighbor, lower a few random words
    score      = dot(vec_in[center], vec_out[context])
    neg_scores = [dot(vec_in[center], vec_out[w]) for w in sample_negatives()]
    loss       = -log_sigmoid(score) - sum(log_sigmoid(-s) for s in neg_scores)
    update(loss)            # tiny gradient-descent step
# the embedding you keep = the rows of vec_in

Skip-gram with negative sampling, stripped to its core: predict the company a word keeps.
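For readers who want to run the loop, here is one possible pure-Python version of the pseudocode above. It is a teaching sketch under stated assumptions, not the reference word2vec implementation: the corpus is twelve words, the hyperparameters are illustrative, and negatives are drawn uniformly from the vocabulary, whereas word2vec draws them from a smoothed word-frequency distribution.

```python
import math
import random

random.seed(0)

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
dim, window, num_neg, lr, epochs = 8, 2, 3, 0.1, 200   # illustrative values

vec_in = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
vec_out = {w: [0.0] * dim for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairs(tokens):
    """Yield (center, context) pairs from a sliding window."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def train_pair(center, context):
    """One gradient step on the negative-sampling loss; returns that loss."""
    # Simplification: uniform negatives, skipping any draw equal to the true context.
    negatives = [w for w in random.choices(vocab, k=num_neg) if w != context]
    v = vec_in[center]
    grad_v = [0.0] * dim
    loss = 0.0
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = vec_out[word]
        p = sigmoid(dot(v, u))
        loss -= math.log(p if label == 1.0 else 1.0 - p)
        g = p - label                        # derivative of the loss w.r.t. the score
        for k in range(dim):
            grad_v[k] += g * u[k]            # accumulate before u[k] changes
            u[k] -= lr * g * v[k]
    for k in range(dim):
        v[k] -= lr * grad_v[k]
    return loss

def run_epoch():
    return sum(train_pair(center, context) for center, context in pairs(corpus))

first = run_epoch()
for _ in range(epochs):
    last = run_epoch()

print(first, last)          # total loss in the first and the final epoch
embedding = vec_in          # the embedding you keep = the input vectors
```

On a corpus this small the learned vectors mean very little; the sketch is only meant to show that the whole procedure is dot products, a sigmoid, and small gradient steps.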

Vector arithmetic: meaning you can add and subtract

Here is the result that made headlines. Take the vectors for "king," "man," "woman," and compute king − man + woman. The nearest word to the resulting point, once the three input words themselves are set aside, is "queen." The same trick gives Paris − France + Italy ≈ Rome, and walked − walk + swim ≈ swam. It feels like the model *understands* gender, capitals, and verb tense. What is really going on?

The honest explanation is geometric, not magical. Because training tied each word to its contexts, a single consistent *difference* — the "male → female" shift, or the "country → capital" shift — shows up as roughly the same direction and length across many pairs. The arrow from "man" to "woman" is nearly parallel to the arrow from "king" to "queen." So subtracting "man" and adding "woman" slides you along that gender direction; landing near "queen" is the payoff. Relationships became directions in space.
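The parallel-arrows picture can be checked on a toy example. The 2-dimensional vectors below are built by hand so that one axis loosely plays the role of "royalty" and the other of "gender"; they are not learned embeddings, and the `analogy` helper is a hypothetical name for this sketch.

```python
import math

# Hand-built toy vectors, NOT learned: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.6, 0.05],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# The two "gender" arrows are the same direction and length in this toy space.
man_to_woman = [w - m for m, w in zip(vectors["man"], vectors["woman"])]
king_to_queen = [q - k for k, q in zip(vectors["king"], vectors["queen"])]
print(man_to_woman, king_to_queen)

print(analogy("king", "man", "woman"))   # queen
```

In learned embeddings the arrows are only roughly parallel, so the target lands near "queen" rather than exactly on it. Leaving the three query words out of the candidate list is a common convention, since one of them is often the closest point to the target.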

To find the closest word you don't use straight-line distance but cosine similarity — the angle between vectors, which ignores length and asks only "do these point the same way?" Two words pointing in nearly the same direction are judged similar even if one vector is longer. This angle-based search is the engine behind "find me related words," and it is the same idea that later powers a vector search over whole documents.
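A short sketch shows how cosine similarity and straight-line distance can disagree. The three vectors are made up for the example: `long_` points the same way as `short` but is three times as long, while `other` is perpendicular to `short`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short = [1.0, 2.0]
long_ = [3.0, 6.0]      # same direction as `short`, three times the length
other = [2.0, -1.0]     # perpendicular to `short`

# Cosine: `long_` is a perfect match (about 1.0), `other` is unrelated (0.0).
print(cosine_similarity(short, long_), cosine_similarity(short, other))

# Euclidean: `other` is actually the closer point, despite pointing elsewhere.
print(euclidean_distance(short, long_), euclidean_distance(short, other))
```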

Where the geometry breaks down

Word2vec and GloVe have one hard limit you must internalize: every word gets exactly one vector, forever. So "bank" — the riverbank and the money bank — is forced into a single muddy point, a blurry average of all its senses. There is no way for context to sharpen it. The very thing that lets the vectors exist (learning from all contexts at once) is the thing that prevents them from disambiguating any single use.

These are static embeddings: "bank" has the same coordinates in "river bank" and "central bank." The fix — giving each occurrence a vector shaped by its actual sentence — is exactly what the next guides build toward through language modeling and, eventually, the attention-based models where a word's representation is recomputed from its neighbors every time. Word2vec is the doorway; contextual embeddings are the room beyond it.
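In code, a static embedding is nothing more than a lookup table, which is why context cannot help. The 3-dimensional vectors and the `embed` helper below are made up for illustration.

```python
# Hypothetical 3-d vectors, made up for illustration (not learned).
embedding = {
    "river":   [0.7, 0.1, -0.3],
    "central": [-0.2, 0.8, 0.4],
    "bank":    [0.3, 0.5, 0.1],
}

def embed(sentence):
    """Look up one fixed vector per word; the surrounding words are never consulted."""
    return [embedding[word] for word in sentence.split()]

river_bank = embed("river bank")
central_bank = embed("central bank")

# "bank" gets identical coordinates in both phrases.
print(river_bank[1] == central_bank[1])   # True
```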

There is also an ethical edge worth facing squarely. Because embeddings absorb the statistics of human text, they absorb its biases too: asked to complete "man is to computer programmer as woman is to what?", word2vec vectors famously answered "homemaker." The geometry that captures "king → queen" captures stereotypes with the same machinery; it has no notion of which directions are harmless and which are harmful. Useful representations and unwanted bias are learned by the very same process, which is why large language models trained on human text can still inherit this problem.

Why this idea outlived the algorithm

You rarely train word2vec from scratch today, and that is the point. Its lasting gift was not the specific algorithm but the conviction that meaning can be a vector — that the right move is to learn a dense space where geometry does the work. That conviction now runs through everything downstream: every token entering a Transformer first becomes an embedding, and "embeddings" of sentences, images, and users power search, recommendation, and retrieval across the industry.

Tokens are still just symbols — meaningless on their own and crippled by one-hot sparsity.
The distributional hypothesis says context reveals meaning: similar company implies similar sense.
Word2vec / GloVe turn that into a self-supervised game, yielding a dense vector per word.
Distances and directions then encode similarity and relationships — vector arithmetic and cosine search.
But one static vector per word can't handle ambiguity — which is the cliffhanger the next guides resolve.