Cleaning & Preparing Data

Raw data is never clean

In the last two guides you saw what a dataset is, met features and labels, and split your data into train, validation, and test. But that split assumed a tidy table of rows and columns. Real data almost never arrives that way. A survey has blank answers; a sensor logs a temperature of -999 when it disconnects; one column says "Yes/No" and another says "yes", "y", and "YES". Before a model can learn anything, someone has to make the table honest. That work is data cleaning, and on most real projects it eats far more time than the modeling itself.

It helps to know what "clean" even means. A clean table has one meaning per column, one row per thing you are studying, consistent units, and values that are what they claim to be. Cleaning is the unglamorous craft of getting there: spotting impossible values, unifying spellings, fixing units, and deciding what to do about the inevitable gaps. None of it is fancy math. All of it decides whether your later models are built on rock or on sand.
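To make "unifying spellings" and "spotting impossible values" concrete, here is a minimal sketch in plain Python. The helper names, the list of accepted spellings, and the treatment of -999 as a "sensor disconnected" code are assumptions for illustration, not a fixed recipe.

```python
# Hypothetical helpers for illustration; the accepted spellings and the
# sentinel value are assumptions about one particular dataset.
YES_SPELLINGS = {"yes", "y"}
NO_SPELLINGS = {"no", "n"}
SENTINEL = -999  # assumed "sensor disconnected" code


def clean_yes_no(raw):
    """Map the many spellings of yes/no to True/False; anything else becomes None."""
    if raw is None:
        return None
    text = raw.strip().lower()
    if text in YES_SPELLINGS:
        return True
    if text in NO_SPELLINGS:
        return False
    return None  # keep it as an explicit gap rather than guessing


def clean_temperature(value):
    """Turn the sentinel reading into an explicit missing value."""
    return None if value == SENTINEL else value


answers = ["Yes", "y", "YES", " no ", "maybe"]
readings = [21.5, -999, 22.1]

cleaned_answers = [clean_yes_no(a) for a in answers]
cleaned_readings = [clean_temperature(r) for r in readings]

print(cleaned_answers)   # [True, True, True, False, None]
print(cleaned_readings)  # [21.5, None, 22.1]
```

Note that the unrecognized answer and the sentinel both end up as None: the sketch makes the gaps visible instead of hiding them inside plausible-looking values.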

Missing values: do not just delete them

The most common defect is a hole — a value that simply is not there. The lazy fix is to drop every row with a gap, but that can throw away most of your data, and worse, it can be biased: if low-income people skip the salary question more often, deleting those rows quietly tilts your dataset toward the wealthy. The thoughtful alternative is imputation — filling the gap with a sensible guess. The simplest guesses are the column's mean or median for numbers, or its most common value for categories.

But why is a value missing? Sometimes it is random noise (a form page glitched). Sometimes the absence is itself a signal: a blank "date of last purchase" might mean the customer never bought anything. In that case the honest move is not to quietly invent a number but to add a flag column, say "was-this-missing?", so the model can use the fact of absence. When you do fill a numeric gap, median is usually safer than mean, because a few extreme values can drag a mean far from typical, and those extremes lead us straight to the next problem.
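The two ideas above, a median fill and a "was-this-missing?" flag, can be combined. The sketch below is one possible way to do it with the standard library; the function names and the salary figures are made up for illustration.

```python
import statistics


def fit_median(train_values):
    """Learn the fill value from the non-missing training values only."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)


def impute_with_flag(values, fill_value):
    """Return (filled values, was-missing flags) without changing the input."""
    filled = [fill_value if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


# Made-up salaries: None marks a skipped answer, 900_000 is one extreme value.
salaries = [30_000, None, 42_000, 38_000, None, 900_000]

fill = fit_median(salaries)
filled, was_missing = impute_with_flag(salaries, fill)
observed_mean = statistics.mean(v for v in salaries if v is not None)

print(fill)           # median of the observed values: 40000.0
print(observed_mean)  # mean of the observed values: 252500
print(was_missing)    # [0, 1, 0, 0, 1, 0]
```

One extreme salary pulls the mean above 250,000, a value no typical respondent earns, while the median stays at 40,000. The flag list keeps a record of which rows were filled, so the guess never masquerades as a real answer.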

Outliers: errors or important rare cases?

An outlier is a value far outside the usual range — a person listed as 200 years old, a house priced at one dollar, a transaction of ten million. The crucial first question is *why*. A 200-year-old is almost surely a data-entry error and should be corrected or removed. But a genuinely huge transaction might be the very fraud your model is supposed to catch. Deleting it because it "looks weird" would erase the signal you most need.

So treat outlier handling as a decision, not a reflex. If it is clearly impossible, fix or drop it. If it is real but extreme, you have gentler options: cap it at a reasonable bound ("clip" everything above the 99th percentile down to that percentile), or transform the column (taking a logarithm pulls a long tail of big values back toward the pack). The goal is never to make the data prettier — it is to stop a handful of points from dominating everything the model learns.
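Both gentle options can be sketched in a few lines. The code below assumes a simple nearest-rank definition of a percentile (other definitions interpolate and give slightly different caps) and uses the 90th percentile only because the made-up sample has ten points; on a realistic dataset the 99th percentile from the text would be the analogous choice.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def clip_upper(values, cap):
    """Pull every value above the cap down to the cap."""
    return [min(v, cap) for v in values]


def log_transform(values):
    """log1p(x) = log(1 + x): defined at zero, but still requires x > -1."""
    return [math.log1p(v) for v in values]


# Made-up transaction amounts with one extreme value.
train_amounts = [12, 25, 40, 18, 33, 27, 51, 45, 30, 10_000_000]

cap = percentile_nearest_rank(train_amounts, 90)  # learned from training data
clipped = clip_upper(train_amounts, cap)
logged = log_transform(train_amounts)

print(cap)                   # 51
print(clipped[-1])           # 51
print(round(logged[-1], 2))  # 16.12
```

Clipping throws away how extreme the value was; the logarithm keeps the ordering and merely compresses the gap. Which one suits a given column is a judgment call that depends on whether the size of the extreme matters to the problem.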

Putting numbers on the same scale

Imagine two features: age (roughly 0–100) and annual income (roughly 0–1,000,000). To many algorithms, income looks ten thousand times "bigger" simply because its numbers are larger, so it drowns out age, even if age matters more. The cure is feature scaling: rewriting columns so their magnitudes are comparable. This is not optional for distance-based methods (k-nearest neighbors), gradient-based training, or anything with regularization; without it they are quietly skewed by your choice of units.
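A tiny experiment shows the effect on a distance-based method. The three people below are invented, and dividing by the rough ranges from the paragraph above (100 for age, 1,000,000 for income) is only a crude stand-in for proper scaling.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rescale(person, age_range=100, income_range=1_000_000):
    """Crude scaling: divide each feature by its rough range."""
    age, income = person
    return (age / age_range, income / income_range)


# (age, annual income): made-up people for illustration.
alice = (25, 50_000)
bob = (65, 50_500)    # 40 years older, nearly the same income
carol = (26, 52_000)  # almost the same age, slightly higher income

d_alice_bob = euclidean(alice, bob)      # about 502
d_alice_carol = euclidean(alice, carol)  # about 2000

scaled_alice_bob = euclidean(rescale(alice), rescale(bob))      # about 0.40
scaled_alice_carol = euclidean(rescale(alice), rescale(carol))  # about 0.01

print(d_alice_bob < d_alice_carol)            # True: raw units say Bob is nearer
print(scaled_alice_carol < scaled_alice_bob)  # True: after scaling, Carol is nearer
```

On the raw numbers, a 40-year age gap counts for less than a 2,000 difference in income, so Alice's "nearest neighbor" is Bob. Once both features live on comparable scales, Carol is the closer match.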

The two workhorse recipes are normalization and standardization. Normalization (min-max scaling) squeezes a column into a fixed range, usually 0 to 1: subtract the minimum, divide by the range (maximum minus minimum). Standardization rescales a column to have mean 0 and standard deviation 1: subtract the mean, divide by the standard deviation. Normalization keeps things in tidy bounds but is sensitive to outliers (one giant value stretches the whole range and squashes everything else toward 0); standardization depends on every value rather than just the two extremes, so a single stray point usually distorts it less, and it is the more common default. Neither changes the *shape* of the data, only its scale.
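Both recipes are a line or two of arithmetic. The sketch below applies them to one made-up column using the standard library; it uses the population standard deviation (`statistics.pstdev`), which is an assumption, since some tools divide by n - 1 instead. For brevity it computes the statistics on the column it transforms, whereas in a real pipeline those numbers would come from the training data only.

```python
import statistics


def min_max_scale(values):
    """Normalization: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo


def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation; assumes std > 0
    return [(v - mean) / std for v in values]


ages = [20, 30, 40, 50, 60]
normalized = min_max_scale(ages)
standardized = standardize(ages)

# The same column with one extreme value swapped in.
ages_with_outlier = [20, 30, 40, 50, 200]
normalized_outlier = min_max_scale(ages_with_outlier)

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(v, 2) for v in normalized_outlier])
```

In the second column the four ordinary ages all land below 0.2 after min-max scaling, because the single value of 200 defines the top of the range.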

# Standardize, the leak-free way
# (assumes train, val, test are NumPy arrays with one column per feature)
mean, std = train.mean(axis=0), train.std(axis=0)   # fit on TRAIN only
train = (train - mean) / std
val   = (val   - mean) / std       # reuse train's numbers
test  = (test  - mean) / std       # never refit here

Fit the scaler on training data, then apply the same mean and std everywhere.

Turning categories into numbers

Scaling assumes numbers, but plenty of columns are words: city, blood type, product category. Knowing the difference between categorical and numerical features is half the battle. Numerical features have meaningful order and arithmetic — 30 degrees is genuinely hotter than 20, and the gap is 10. Categorical features are labels: "red", "blue", "green" have no inherent order, and no, blue is not "two times" red. Some categories are ordered (small < medium < large) — those are ordinal, and you can map them to ranks. Most are not.
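For the ordered case, mapping to ranks is a dictionary lookup. The category names and the helper below are illustrative assumptions; note also that ranks 0, 1, 2 quietly assume the steps between sizes are equal, which may or may not hold for a given column.

```python
# Assumed category names for illustration; the order is what carries meaning.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}


def encode_ordinal(values, ranks):
    """Map ordered categories to their ranks; unknown values become None."""
    return [ranks.get(v) for v in values]


sizes = ["medium", "small", "large", "medium", "xl"]
encoded_sizes = encode_ordinal(sizes, SIZE_RANK)

print(encoded_sizes)  # [1, 0, 2, 1, None]
```

The unexpected "xl" comes back as None instead of a made-up rank, which turns it into an ordinary missing value to be handled deliberately.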

The classic trap is to label "red=1, blue=2, green=3" and feed that in. The model then believes green > blue > red and that blue sits exactly halfway — a fiction you invented by accident. The honest fix is one-hot encoding: replace one category column with several yes/no columns, one per value. A row that is "blue" becomes is-red=0, is-blue=1, is-green=0. No fake ordering, no fake distances — just an unambiguous flag for each possibility.
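One-hot encoding is short enough to write by hand. In this sketch the list of categories is learned from the training column and then reused; encoding an unseen value as all zeros is one possible policy, chosen here as an assumption, and the function names are hypothetical.

```python
def fit_categories(train_values):
    """Learn the category list from training data; sorted for a stable column order."""
    return sorted(set(train_values))


def one_hot(values, categories):
    """One 0/1 column per known category; unseen values become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]


train_colors = ["red", "blue", "green", "blue"]
test_colors = ["blue", "purple"]  # "purple" never appeared in training

categories = fit_categories(train_colors)
train_encoded = one_hot(train_colors, categories)
test_encoded = one_hot(test_colors, categories)

print(categories)     # ['blue', 'green', 'red']
print(train_encoded)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(test_encoded)   # [[1, 0, 0], [0, 0, 0]]
```

Every row has at most a single 1, and no column is "bigger" than another, so the model cannot read an order or a distance into the colors.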

The golden rule: fit on train, apply everywhere

Every step here — the imputation value, the scaling mean and std, the outlier cap, even which categories exist — is a parameter learned from data. And there is one iron rule that ties this guide to your splits: learn those parameters from the training set only, then apply the frozen result to validation and test. Peek at the test set's statistics and you have let the answer key bleed into preparation; your scores will look great and then collapse in the real world. This is the single most common rookie mistake in the whole pipeline.
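The rule is easier to follow when "fitting" and "applying" are two separate functions. The sketch below is one way to structure that for a single numeric column, assuming median imputation followed by standardization; the function names and the tiny dataset are invented for illustration.

```python
import statistics


def fit_preparation(train_values):
    """Learn every parameter from TRAIN only and return them as a plain dict."""
    observed = [v for v in train_values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in train_values]
    return {
        "fill": fill,
        "mean": statistics.mean(filled),
        "std": statistics.pstdev(filled),  # population standard deviation
    }


def apply_preparation(values, params):
    """Impute, then standardize, using only the stored training parameters."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


train_col = [10, 20, None, 30, 40]
test_col = [None, 50]

params = fit_preparation(train_col)               # the only place statistics are computed
train_ready = apply_preparation(train_col, params)
test_ready = apply_preparation(test_col, params)  # no refitting on test

print(params)
print(train_ready)
print(test_ready)
```

Because `apply_preparation` never computes a statistic of its own, there is no way for the test column's values to leak into the fill value, the mean, or the standard deviation.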

Understand the holes first: figure out *why* each value is missing before you fill anything.
Inspect outliers one type at a time — fix impossible ones, keep meaningful rare ones, gently cap or transform the rest.
One-hot encode unordered categorical columns and map ordered ones to ranks; normalize or standardize numerical ones.
Fit every transformation on training data, then apply those exact same numbers to validation and test.

Cleaning is not glamorous and it is rarely "finished" — you will revisit it as you learn more about your problem. But it is where good models are quietly won or lost. With a clean, scaled, honestly encoded table in hand, you are ready for the next step: shaping the columns you already have into features that make the pattern *easier* for the model to see.