Four words for "a bunch of numbers"
If the phrase "linear algebra" makes your shoulders tense, relax them. Almost everything in this guide is just organizing numbers into boxes and agreeing on rules for moving them around. The math behind machine learning is mostly that, repeated very fast on very large boxes. Let us name the boxes first: scalar, vector, matrix and tensor are four answers to one question, "how many directions of numbers do I have?".
A scalar is a single number, plain and alone: a temperature of 21, a price of 4.99, a learning rate. A vector is a list of numbers in a fixed order, like `[21, 4.99, 0]` — a single point with several coordinates. A matrix is a grid of numbers, rows and columns, like a spreadsheet. A tensor is just the umbrella word for "an array of numbers with any number of dimensions" — a scalar is a 0-D tensor, a vector a 1-D tensor, a matrix a 2-D tensor, and you can keep stacking from there.
Here is a concrete ladder. One grayscale pixel is a scalar. One row of pixels is a vector. A whole grayscale image is a matrix. A color image adds red, green and blue channels — now it is a 3-D tensor. And a batch of 64 color images you feed to a model at once is a 4-D tensor. Nothing mysterious happened; you just kept adding "another direction to count along".
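If it helps to see the ladder as data, here is a minimal sketch using plain nested Python lists. The helper `ndim` is a hypothetical name made up for this illustration, and it assumes every list is non-empty and rectangular; the tiny 2-by-3 image size is also just an assumption to keep things readable.

```python
def ndim(x):
    """Count how many levels of list nesting x has (0 for a bare number)."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]  # assumes non-empty, rectangular nesting
    return depth

pixel = 0.5                                  # scalar: one grayscale value
row = [0.1, 0.5, 0.9]                        # vector: one row of 3 pixels
image = [row, row]                           # matrix: 2 rows x 3 columns
color = [[[0.1, 0.2, 0.3] for _ in range(3)]
         for _ in range(2)]                  # 3-D: rows x columns x (R, G, B)
batch = [color for _ in range(64)]           # 4-D: 64 color images stacked

print([ndim(t) for t in (pixel, row, image, color, batch)])  # [0, 1, 2, 3, 4]
```

Each extra level of nesting is one more "direction to count along". Array libraries track this count for you, but the idea is no deeper than this.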
A vector is an arrow, and an address
It helps to hold two pictures of a vector at once. Picture one: a vector is an arrow from the origin to a point in space — `[3, 2]` is "go 3 right, 2 up". Picture two: a vector is an address, a list of coordinates that pins down exactly one location. They are the same thing seen two ways, and which picture is handier depends on the moment. The arrow makes direction and length feel real; the address makes it something a computer can store.
Two everyday operations come almost for free. Adding two vectors means adding them slot by slot: `[3, 2] + [1, 4] = [4, 6]` — lay the second arrow's tail at the first arrow's head and see where you end up. Scaling by a number stretches or shrinks the whole arrow: `2 × [3, 2] = [6, 4]`, same direction, twice as long. The length of an arrow has its own name, the vector norm, and it answers "how big is this thing?" — a question you will meet again whenever you measure how far off a prediction is.
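The three moves above fit in a few lines of plain Python. This is only a sketch: the function names `add`, `scale` and `norm` are chosen here for illustration, and `norm` assumes the ordinary straight-line (Euclidean) length, which is the square root of the sum of the squared slots.

```python
import math

def add(a, b):
    """Add two vectors slot by slot."""
    return [x + y for x, y in zip(a, b)]

def scale(k, a):
    """Multiply every slot of a by the number k."""
    return [k * x for x in a]

def norm(a):
    """Euclidean length: square root of the sum of squared slots."""
    return math.sqrt(sum(x * x for x in a))

print(add([3, 2], [1, 4]))  # [4, 6]
print(scale(2, [3, 2]))     # [6, 4]
print(norm([3, 4]))         # 5.0
```

Nothing in these functions mentions how many slots there are, so the same code works unchanged on a vector with 2 slots or 768.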
Here is the leap that makes vectors matter for machine learning: there is no rule that says you must stop at 2 or 3 directions. A vector can live in 10 dimensions, or 768, or 4096. You cannot picture a 768-dimensional arrow, and that is fine — nobody can. You keep the *rules* (add slot by slot, scale every slot, measure a length) and quietly drop the *mental image*. That single move is what lets the same humble arithmetic describe a point, a sentence, or a face.
The dot product: how much do two things agree?
The single most important operation between two vectors is the dot product. The recipe is almost insultingly simple: multiply the matching slots and add everything up. For `[1, 2, 3]` and `[4, 0, 5]`, that is `1×4 + 2×0 + 3×5 = 4 + 0 + 15 = 19`. Two vectors go in; a single scalar comes out. That collapse — many numbers down to one — is exactly why it appears everywhere.
But what does that one number *mean*? Think of it as a score of agreement. When two arrows point the same way, their dot product is large and positive. When they are at right angles — utterly unrelated — it is zero. When they point opposite ways, it goes negative. So the dot product quietly measures "how much do these two vectors pull in the same direction, weighted by how long they are?" That is why it shows up as a similarity score between an embedding of your search query and an embedding of every document, or between two word meanings.
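You can watch the sign of the score change with a small experiment. The vectors below are made-up examples picked so the arithmetic is easy to check by hand.

```python
def dot(a, b):
    """Multiply matching slots and add everything up."""
    return sum(x * y for x, y in zip(a, b))

a = [3, 2]
print(dot(a, [6, 4]))    # 26  (same direction: large and positive)
print(dot(a, [-2, 3]))   # 0   (right angles: no agreement)
print(dot(a, [-3, -2]))  # -13 (opposite direction: negative)
```

Notice that `[6, 4]` is just `[3, 2]` doubled. The score is large partly because the arrows agree and partly because the second one is long, which is the "weighted by how long they are" part.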
This is also the heartbeat of a single artificial neuron. A neuron takes its inputs as one vector and its weights as another, takes their dot product, and that one number is the neuron's raw response before it decides whether to "fire". Every time you hear that a network does "billions of multiply-adds", it is doing this — dot products, by the truckload. Master this one operation and a surprising amount of the rest stops being scary.
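As a rough sketch of that idea, here is one neuron in plain Python. The input and weight values are invented, and the "fire if the raw response is above a threshold" rule is a deliberately simple assumption made for this illustration. Real networks use a variety of activation functions.

```python
def dot(a, b):
    """Multiply matching slots and add everything up."""
    return sum(x * y for x, y in zip(a, b))

def neuron(inputs, weights, threshold=0.0):
    """Return the raw response and whether it clears the threshold.

    The hard threshold is an illustrative assumption, not a standard choice.
    """
    raw = dot(inputs, weights)
    return raw, raw > threshold

inputs = [1.0, 2.0, -1.0]    # hypothetical input features
weights = [0.5, 0.25, 0.5]   # hypothetical weights

raw, fires = neuron(inputs, weights)
print(raw, fires)  # 0.5 True
```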
```python
def dot(a, b):
    total = 0
    for i in range(len(a)):       # walk both lists together
        total += a[i] * b[i]      # multiply matching slots, add up
    return total

print(dot([1, 2, 3], [4, 0, 5]))  # -> 19 (one number out)
```
Matrix multiplication is just many dot products
If the dot product is a single handshake between two vectors, matrix multiplication is a whole room of handshakes done in one swoop. To multiply two matrices, you take every row of the first and dot it with every column of the second. Each of those dot products becomes one entry in the result. That is the entire rule — there is no extra magic hiding underneath.
There is one rule of etiquette that trips up almost every beginner: the inner sizes must match. A matrix that is 3-by-2 can only be multiplied by a matrix that has 2 rows, because each row of the first (length 2) must line up slot-for-slot with each column of the second (length 2). The result then takes the *outer* sizes: 3-by-2 times 2-by-4 gives a 3-by-4. If those inner numbers disagree, the multiplication is simply undefined, and a mismatched-shape error is one of the most common bugs you will hit in this field.
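Here is a minimal sketch of the row-times-column rule together with the shape check, using lists of rows as matrices. The name `matmul` and the example numbers are chosen for this illustration, and the code assumes both matrices are non-empty and rectangular.

```python
def matmul(A, B):
    """Multiply matrix A by matrix B, both given as lists of rows."""
    inner = len(A[0])                    # number of columns in A
    if inner != len(B):                  # must equal number of rows in B
        raise ValueError(
            f"shape mismatch: {len(A)}x{inner} times {len(B)}x{len(B[0])}"
        )
    columns = list(zip(*B))              # pull out the columns of B
    # One dot product per (row of A, column of B) pair.
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]             # 3-by-2
B = [[1, 0, 2, 1], [0, 1, 3, 1]]         # 2-by-4

C = matmul(A, B)
print(len(C), len(C[0]))  # 3 4
print(C[0])               # [1, 2, 8, 3]
```

Calling `matmul(A, A)` raises the `ValueError`, because a 3-by-2 matrix cannot be followed by another 3-by-2: the inner sizes are 2 and 3.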
The deeper way to read matrix multiplication is as a transformation: a matrix is a machine that takes a vector in and hands a (usually different) vector out. Feed it a point and it rotates, stretches, squashes or projects that point into a new place. A full layer of a neural network does exactly this — it multiplies your input vector by a matrix of weights to recombine the inputs into a new set of features, where every output is its own weighted blend of all the inputs. Matrix multiplication is, at heart, the operation of *mixing things together in a chosen proportion*.
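To see a matrix act as a machine, feed the same point through three small ones. This is a sketch: `matvec` is a name chosen here, and the three matrices are standard textbook examples of a quarter-turn rotation, a horizontal stretch and a projection onto the horizontal axis.

```python
def matvec(M, v):
    """Apply matrix M to vector v: one dot product per row of M."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

rotate_90 = [[0, -1],
             [1,  0]]   # quarter turn counter-clockwise
stretch_x = [[2, 0],
             [0, 1]]    # double the horizontal coordinate
project_x = [[1, 0],
             [0, 0]]    # flatten onto the horizontal axis

v = [3, 2]
print(matvec(rotate_90, v))  # [-2, 3]
print(matvec(stretch_x, v))  # [6, 2]
print(matvec(project_x, v))  # [3, 0]
```

Each output slot is a weighted blend of the input slots, with the weights read straight off one row of the matrix.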
Why everything becomes a vector
A model cannot do anything with the word "cat" or a photo of a sunset directly. It only knows how to multiply and add numbers. So the first job, always, is to turn the messy real thing into a vector of numbers — a list the model can compute on. You already met the idea of a feature in an earlier rung: each slot of the vector is one measurable quality of the thing. A house might become `[area, bedrooms, age, distance_to_school]`; that single vector is the house, as far as the math is concerned.
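In code, "turning the house into a vector" can be as plain as the sketch below. The `House` class, the `encode` function and the units are all hypothetical, invented here to mirror the four features named above.

```python
from dataclasses import dataclass

@dataclass
class House:
    area: float                # assumed unit: square meters
    bedrooms: int
    age: float                 # assumed unit: years
    distance_to_school: float  # assumed unit: kilometers

def encode(house):
    """Flatten a House into a fixed-order list of numbers."""
    return [
        float(house.area),
        float(house.bedrooms),
        float(house.age),
        float(house.distance_to_school),
    ]

print(encode(House(120.0, 3, 15.0, 0.8)))  # [120.0, 3.0, 15.0, 0.8]
```

The order of the slots carries the meaning: slot 0 is always area, slot 1 is always bedrooms, and so on for every house you encode.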
The beautiful payoff is geometric. Once everything is a vector, "similar things" become "points that sit close together", and "this is unlike that" becomes "these arrows point in different directions" — which, you now know, is just a dot product. A learned embedding does precisely this for words and images: it places them in a high-dimensional space so that meaning turns into distance and direction. Searching, recommending, clustering and comparing all reduce to measuring between vectors.
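A toy search makes this concrete. The three "document embeddings" and the query below are invented numbers, not the output of any real model. The score used is cosine similarity, which is the dot product divided by both lengths; that is a common choice when you want direction to count and length not to, and it is an assumption of this sketch.

```python
import math

def dot(a, b):
    """Multiply matching slots and add everything up."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by both lengths, so only direction counts."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical 3-slot embeddings, made up for this example.
docs = {
    "cats":  [0.9, 0.1, 0.0],
    "dogs":  [0.8, 0.3, 0.1],
    "taxes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)  # ['cats', 'dogs', 'taxes']
```

The query points almost the same way as "cats" and "dogs" and nearly at right angles to "taxes", so the ranking falls out of the geometry alone.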
Putting the toolbox together
Let us trace one tiny pass through a model to see the pieces click together. It is the same shape, scaled up, whether you are running linear regression or a giant network.
- Encode the input as a vector — each slot is one feature of the example (the house, the pixel patch, the word).
- Multiply that vector by a weight matrix — one dot product per output, recombining the inputs into new features.
- Read off the result vector as the model's answer — a score, a prediction, or the input to the next layer.
- Stack many such steps and you have a deep network; the numbers in the matrices are what training learns.
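The four steps above can be sketched in a handful of lines. All the numbers are invented, and the weights are written by hand where a trained model would have learned them. The sketch assumes each row of a weight matrix holds the weights for one output.

```python
def matvec(W, x):
    """One dot product per row of W: recombine the inputs into outputs."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x = [1.0, 2.0, 3.0]            # step 1: the input, encoded as 3 features

W1 = [[0.5,  0.0, 0.5],        # step 2: 2 outputs, each a blend of 3 inputs
      [1.0, -1.0, 0.0]]
W2 = [[1.0, 0.5]]              # a second layer: 1 output from 2 inputs

hidden = matvec(W1, x)         # step 3: result vector, fed to the next layer
output = matvec(W2, hidden)    # step 4: stacking a second multiplication

print(hidden)  # [2.0, -1.0]
print(output)  # [1.5]
```

Real networks also apply a nonlinear function between layers, which this sketch leaves out to keep the linear algebra in view.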
That is genuinely most of the linear algebra you need to read the rest of this ladder. Data becomes vectors; vectors get arranged into matrices and tensors; the dot product measures agreement; matrix multiplication recombines and transforms; and a model is a tall stack of these multiplications whose numbers are tuned by training. You do not need proofs or clever tricks to follow along — you need these few moves, held loosely and used often.