Data Is the Real Fuel

The thing the model actually learns from

By now you know a model is a function with tunable parameters, and that supervised learning adjusts those parameters by looking at examples. But where do the examples come from, and what are they made of? That is what this rung is about. A [[dataset|dataset]] is simply a collection of examples — rows in a spreadsheet, photos in a folder, customer records in a database. The model never sees the real world; it only ever sees the dataset. So the dataset *is* the world, as far as the model is concerned.

Each individual entry is an example (also called a sample, instance, or row). One example might be a single email, one house listing, or one photo of a cat. A dataset is just many of these stacked together. The interesting question is always the same: what does each example *say*, and what do we *want to predict* from it?

Features and labels: the input and the answer

Inside each example, we split the information into two roles. The [[feature|features]] are the inputs: the measurable, describable facts the model is allowed to look at. For a house, features might be square footage, number of bedrooms, and neighborhood. The [[label|label]] is the answer we want the model to produce; for the house, perhaps its sale price. Features go in, a prediction comes out, and training pushes that prediction toward the label.

# One labeled example: features -> label
features = {
    "sqft": 1200,
    "bedrooms": 2,
    "neighborhood": "riverside",
}
label = 450000  # the sale price we want to predict

# A dataset is just many (features, label) pairs like this one
dataset = [
    (features, label),
    # ... more (features, label) pairs go here
]

Features are what the model reads; the label is what it tries to output. A dataset stacks thousands of these pairs.
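
To make the two roles concrete, here is a minimal sketch that separates a small dataset into inputs and answers. The extra rows are made-up values, and the names `X` and `y` are an assumption borrowed from common convention rather than anything this guide requires.

```python
# Hypothetical toy dataset: each row is a (features, label) pair.
dataset = [
    ({"sqft": 1200, "bedrooms": 2, "neighborhood": "riverside"}, 450000),
    ({"sqft": 1800, "bedrooms": 3, "neighborhood": "hilltop"}, 610000),
    ({"sqft": 950, "bedrooms": 1, "neighborhood": "riverside"}, 330000),
]

# Separate what the model reads (X) from what it should output (y).
X = [features for features, _ in dataset]
y = [label for _, label in dataset]

# Row i of X always belongs with entry i of y.
for features, label in zip(X, y):
    print(features["sqft"], "->", label)
```

Keeping the two lists in the same order is the whole contract: if a row of `X` and its entry in `y` ever drift apart, the model learns from wrong answers.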

Not every dataset has labels. In supervised learning every example carries a label, and getting those labels often means paying humans to do [[data-annotation|annotation]] — clicking through thousands of images to mark which ones contain a tumor, or rating reviews as positive or negative. That labelling is slow and expensive, which is exactly why unlabelled data is so much more abundant. Choosing good features, by contrast, is the craft of feature engineering, a later guide in this rung.

Structured vs unstructured data

Data comes in two broad shapes. Structured data already lives in neat tables: columns with clear meanings, like age, price, or country. A bank's transaction log or a hospital's patient records are structured — each column is a ready-made feature. Unstructured data is everything else: raw text, images, audio, video. A photo is just a grid of pixels; a tweet is just a run of characters. There are no labelled columns telling the model 'this part is the sky' or 'this word is the verb'.
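
The contrast is easy to see in code. The sketch below uses invented values: a structured record whose columns can be used as features directly, next to a tiny grayscale "photo" and a short text that arrive as bare numbers and characters.

```python
# Structured: every column already has a name and a meaning.
transaction = {"amount": 42.50, "country": "DE", "age": 31}
structured_features = [transaction["amount"], transaction["age"]]

# Unstructured image: a 2x3 grayscale "photo" is only a grid of
# brightness values (0 = black, 255 = white). No cell is named "sky".
image = [
    [200, 210, 205],
    [30, 25, 40],
]
pixels = [value for row in image for value in row]

# Unstructured text: a run of characters with no grammar marked up.
tweet = "cats are great"
char_codes = [ord(ch) for ch in tweet]

print(structured_features)
print(pixels)
print(char_codes[:4])
```

Nothing in `pixels` or `char_codes` says what the numbers mean; any useful structure has to be learned or engineered on top of them.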

This distinction explains a lot of AI history. Classic methods like decision trees thrive on tidy structured tables. The reason deep learning caused such a stir is that it learned to extract useful features directly from unstructured pixels and text — work that humans used to do by hand. But 'unstructured' never means 'no work needed'; it means the structure is hidden, and someone has to coax it out.

Garbage in, garbage out

Here is the most honest law in the field: a model can only be as good as the data it learns from. If half your house prices were typed in wrong, or your cat photos are secretly all from one breed, no amount of clever architecture will save you. The model faithfully learns whatever patterns are in the data — including the mistakes. This is garbage in, garbage out, and it is not a cliché; it is the daily reality of real AI work.
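
A toy illustration of the effect, assuming a deliberately simple one-parameter "model" (the average price per square foot) and invented house data. The only difference between the two training sets is one price typed with an extra zero.

```python
def fit_price_per_sqft(rows):
    """'Train' a one-parameter model: the average price per square foot."""
    rates = [price / sqft for sqft, price in rows]
    return sum(rates) / len(rates)

# (sqft, price) pairs; hypothetical values chosen for easy arithmetic.
clean = [(1000, 300000), (1500, 450000), (2000, 600000)]
# Same houses, but the middle price was typed with an extra zero.
dirty = [(1000, 300000), (1500, 4500000), (2000, 600000)]

clean_rate = fit_price_per_sqft(clean)
dirty_rate = fit_price_per_sqft(dirty)

# Predict the price of the same 1200 sqft house with each model.
print(round(clean_rate * 1200))  # 360000
print(round(dirty_rate * 1200))  # 1440000
```

The code path is identical in both cases. One bad row out of three is enough to quadruple the prediction, and the model has no way of knowing anything went wrong.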

That is why so much real-world effort goes into data cleaning: fixing typos, removing duplicates, deciding what to do with missing values, catching the row where someone's age is listed as 999. It is unglamorous, it rarely makes the demo video, and practitioners routinely describe it as most of the job (80% is the figure usually quoted, a rule of thumb rather than a measurement). The later guides in this rung dig into the subtle traps (leakage, imbalance, bias) that ruin models even when the data *looks* clean.
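
A minimal cleaning pass might look like the sketch below. The rows, the `clean` helper, and the `MAX_PLAUSIBLE_AGE` threshold are all hypothetical; real projects usually do this with a dataframe library, but plain Python is enough to show the idea.

```python
# Hypothetical raw records with the usual problems.
raw_rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ana", "age": 34},   # exact duplicate
    {"name": "Bo", "age": None},  # missing value
    {"name": "Cy", "age": 999},   # impossible value
    {"name": "Di", "age": 58},
]

MAX_PLAUSIBLE_AGE = 120  # assumed sanity threshold, not a standard

def clean(rows):
    """Drop exact duplicates and set aside rows that need a decision."""
    seen = set()
    cleaned, flagged = [], []
    for row in rows:
        key = (row["name"], row["age"])
        if key in seen:
            continue  # skip a row we have already kept or flagged
        seen.add(key)
        age = row["age"]
        if age is None or not 0 <= age <= MAX_PLAUSIBLE_AGE:
            flagged.append(row)  # missing or implausible: review by hand
        else:
            cleaned.append(row)
    return cleaned, flagged

cleaned, flagged = clean(raw_rows)
print([r["name"] for r in cleaned])
print([r["name"] for r in flagged])
```

Suspicious rows are flagged rather than silently deleted or "fixed", because what to do with a missing or impossible value is a judgment call that depends on the project.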

Data-centric AI: stop tweaking the model

For years the instinct when a model underperformed was to reach for a fancier architecture. [[data-centric-ai|Data-centric AI]] is the now-mainstream counter-idea: hold the model fixed and improve the data instead. In many real projects, relabelling the 200 examples your model gets confused on, or fixing inconsistent annotation guidelines, improves results far more than swapping in a bigger network. The leverage often lives in the data, not the model.

1. Look at your data by hand. Actually read examples and inspect images; many data bugs are visible to the naked eye before any training.
2. Find where the model fails, then inspect those specific examples: are the labels wrong, or genuinely ambiguous?
3. Fix the labels and the annotation guidelines behind them, not just the offending rows; consistent labels matter more than perfect ones.
4. Retrain on the improved data and compare; let the measured change in results, not a hunch, decide whether you helped.
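
The last three steps above (find failures, fix labels, retrain and compare) can be sketched end to end. Everything here is an illustrative assumption: the animal weights, the deliberately simple nearest-average classifier that stays fixed throughout, and the `reviewed` lookup that stands in for a human checking the suspect rows.

```python
def train(rows):
    """Fixed 'model': one average weight per label."""
    by_label = {}
    for weight, label in rows:
        by_label.setdefault(label, []).append(weight)
    return {label: sum(ws) / len(ws) for label, ws in by_label.items()}

def predict(model, weight):
    # Pick the label whose average weight is closest to this one.
    return min(model, key=lambda label: abs(model[label] - weight))

def accuracy(model, rows):
    hits = sum(predict(model, w) == label for w, label in rows)
    return hits / len(rows)

# (weight_kg, label) pairs; the 28 and 26 rows are mislabelled dogs.
train_rows = [(3, "cat"), (4, "cat"), (5, "cat"), (28, "cat"), (26, "cat"),
              (20, "dog"), (25, "dog"), (30, "dog")]
# A small, trusted set used only for comparison.
validation_rows = [(4, "cat"), (7, "cat"), (16, "dog"), (18, "dog"), (27, "dog")]

before = train(train_rows)

# Find where the model disagrees with the training labels.
suspects = [(w, label) for w, label in train_rows
            if predict(before, w) != label]

# Simulated human review of the suspects (hypothetical corrections).
reviewed = {28: "dog", 26: "dog"}
fixed_rows = [(w, reviewed.get(w, label)) for w, label in train_rows]

# Retrain the same model on the fixed data and compare.
after = train(fixed_rows)
print(suspects)
print(accuracy(before, validation_rows), accuracy(after, validation_rows))
```

The model code never changes between the two runs; only two labels do. In this toy setup that alone moves validation accuracy from 0.6 to 1.0, which is the kind of measured comparison the final step asks for.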

So as you climb the rest of this rung, keep one frame in mind: the model is the engine everyone admires, but the data is the fuel. A magnificent engine running on dirty fuel sputters and stalls — and no horsepower spec on the brochure will fix that. The next guides show you, step by step, how to refine that fuel.