Privacy & Data Rights

Why data-hungry models strain privacy

By now you know that a machine-learning system improves with more data — that's the whole shape of the field you've been climbing. [[privacy|Privacy]] is the principle that you should keep some control over what is known about you and how it's used, and modern AI strains it in a specific way: because models get better with more examples, there is constant pressure to collect everything, keep it forever, and reuse it for purposes nobody originally agreed to. Privacy isn't about having something to hide; it's the same instinct as closing a door or sealing an envelope.

Two failure modes are worth naming precisely. First, re-identification: data that looks anonymous often isn't. Your full date of birth, postal code, and sex together are enough to single out a large share of people (an often-cited estimate from US census data is that this trio is unique for most of the population), so stripping names is not anonymization: unique combinations of ordinary facts still single you out. Second, memorization: a trained model can absorb and later regurgitate specific examples from its training set. A large language model coaxed the right way can sometimes emit a real person's phone number or a verbatim passage that appeared in its training data. These aren't hypothetical; they're documented behaviors of the systems you've been studying.
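To make re-identification concrete, here is a minimal Python sketch that counts how many rows in a name-free table are still unique on date of birth, postal code, and sex. The records, the function names, and the coarsening step (keeping only birth year and a postal prefix) are all illustrative assumptions, not data or a method from any real system.

```python
from collections import Counter

# Hypothetical toy records with names already stripped:
# (date_of_birth, postal_code, sex)
records = [
    ("1984-03-12", "02139", "F"),
    ("1984-09-02", "02138", "F"),
    ("1990-07-01", "02139", "M"),
    ("1990-07-01", "02139", "M"),
    ("1990-01-15", "02141", "M"),
    ("1975-11-30", "02140", "F"),
]

def unique_fraction(rows):
    """Return the share of rows whose combination of fields appears exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

def coarsen(row):
    """Generalize a row: keep only the birth year and the first three postal digits."""
    dob, postal, sex = row
    return (dob[:4], postal[:3], sex)

fraction_raw = unique_fraction(records)
fraction_coarse = unique_fraction([coarsen(r) for r in records])

print(f"unique on exact fields:   {fraction_raw:.2f}")
print(f"unique after coarsening:  {fraction_coarse:.2f}")
```

In this toy table most rows are unique on the exact fields even though no names are present. Coarsening the fields reduces that share but does not eliminate it, which is why removing names alone should not be called anonymization.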

Differential privacy: a guarantee, not a hope

[[differential-privacy|Differential privacy]] is one of the few privacy ideas with a formal mathematical definition and provable guarantees behind it. The core promise is precise: whatever the system publishes should look essentially the same whether or not your data was included. So an analyst looking at the output cannot confidently tell whether you were in the dataset at all, and can therefore learn almost nothing about you that they could not have learned without your data, even if they already know everything about everyone else. It turns "trust us" into a property you can actually prove.
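For readers who want the formal statement behind "essentially the same," the standard definition can be written as one inequality. The symbols are introduced here for illustration and are not used elsewhere in this text: M is the randomized system that produces the published output, D and D' are any two datasets that differ only in one person's record, S is any set of possible outputs, and ε (epsilon) is the privacy parameter, a non-negative number.

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

When ε is small, e^ε is close to 1, so every possible output is almost equally likely with or without your record. That is the sense in which the output "looks essentially the same."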

The trick is carefully calibrated random noise. Before releasing an answer — say, "how many people in this town have a certain disease" — the system adds a small, mathematically tuned dollop of randomness. Across a large crowd the noise mostly washes out and the aggregate stays useful; but for any one person it provides deniability, like adding faint static so no single voice can be picked out of a choir. A knob called the privacy budget (written as epsilon, ε) sets the trade-off: less noise means more accurate answers but weaker privacy, and the reverse.

RAW count:    diabetic = 412
ADD NOISE  +  random draw  ~  (-3, +5, -1, ...)
PUBLISHED:    diabetic ~= 410   useful for the town,
                                useless for outing any one person

epsilon small  ->  more noise  ->  stronger privacy, less accuracy
epsilon large  ->  less noise  ->  weaker privacy,  more accuracy

The aggregate survives the noise; the individual hides inside it. Epsilon tunes the dial.
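The diagram above can be sketched in a few lines of Python. This is a minimal illustration of one common way to add calibrated noise to a count, the Laplace mechanism, assuming a noise scale of 1/ε because adding or removing one person changes a count by at most 1. The function names are made up for this example, and Python's `random` module is not suitable for real privacy guarantees; production systems rely on vetted libraries.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a zero-centered Laplace distribution with this scale."""
    # The difference of two independent exponential draws is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Release a count plus Laplace noise of scale sensitivity / epsilon."""
    sensitivity = 1.0  # one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(true_count, epsilon, rng, trials=2000):
    """Average distance between the published and true count over many releases."""
    total = sum(abs(noisy_count(true_count, epsilon, rng) - true_count)
                for _ in range(trials))
    return total / trials

rng = random.Random(0)  # fixed seed only so the demo is repeatable
true_count = 412        # the raw count from the diagram

error_strict = mean_abs_error(true_count, epsilon=0.1, rng=rng)  # small epsilon
error_loose = mean_abs_error(true_count, epsilon=2.0, rng=rng)   # large epsilon

print(f"epsilon = 0.1 -> typical error about {error_strict:.1f}")
print(f"epsilon = 2.0 -> typical error about {error_loose:.1f}")
```

The typical error tracks 1/ε: roughly ten people off at ε = 0.1, and about half a person off at ε = 2.0. Either is small next to a town-level count of 412, but only the noisier setting gives strong deniability to any single resident.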

Federated learning: keep the data home

[[federated-learning|Federated learning]] flips the usual training recipe. Instead of hauling everyone's raw data into one central pile and learning there, you send the model out to where the data already lives — your phone, a hospital's own servers — let it learn locally, and bring back only the lessons (small mathematical updates to the model), never the raw data. Picture a chef who wants to learn from a hundred home kitchens but never enters them: each family cooks in their own kitchen and mails back only their tweaks to the recipe.

(1) A central server sends the current shared model out to thousands of devices.
(2) Each device improves the model a little using its own private examples: your photos, messages, typing.
(3) Devices send back only their parameter updates, never the raw data.
(4) The server averages the updates into a better shared model, and the loop repeats.

This is why a phone keyboard can learn your slang overnight while charging, uploading only a tiny update and never your messages, and why hospitals that legally cannot share patient files can still jointly train a diagnostic model. But here is the honest catch many pitches gloss over: federated does not automatically mean private. The updates themselves can leak information about the data that produced them; researchers have reconstructed training examples from updates alone. So real systems combine federation with differential privacy or with secure aggregation, where cryptography lets the server see only the combined sum of updates rather than any single device's contribution. Keeping data on the device is a strong start, not a finished guarantee.
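The four-step loop described above can be sketched as a toy simulation in plain Python. Everything here is an assumption made for illustration: the model is a single weight w in y = w * x, each "device" is just a list of (x, y) pairs, and the learning rate and step counts are arbitrary. The sketch deliberately leaves out the protections mentioned above (noise or secure aggregation), so the updates it sends are exactly the kind that could leak.

```python
def local_update(global_w, data, lr=0.05, steps=5):
    """Train briefly on one device and return only the change in the weight."""
    w = global_w
    for _ in range(steps):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w - global_w  # the update; the raw (x, y) pairs are never returned

def federated_round(global_w, devices):
    """One round: send the model out, collect updates, average them in."""
    updates = [local_update(global_w, data) for data in devices]
    return global_w + sum(updates) / len(updates)

# Hypothetical private datasets, one per device, each roughly following y = 3x.
devices = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(1.5, 4.4), (0.5, 1.6)],
    [(2.0, 6.2), (1.0, 2.9)],
]

w = 0.0  # the server's initial shared model
for _ in range(20):
    w = federated_round(w, devices)

print(f"shared weight after 20 rounds: {w:.3f}")
```

The server only ever handles the numbers returned by `local_update`, yet the shared weight ends up close to 3, the pattern hidden in all three private datasets. Real deployments typically also weight each update by how much data the device holds, which this sketch skips for brevity.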

Consent, governance, and the limits of "I Agree"

Techniques only get you so far; the harder questions are human. [[data-governance-consent|Data governance]] is the set of rules and accountable roles that decide what an organization collects, where it lives, who may touch it, how long it's kept, and when it's deleted. Consent is one pillar: people should knowingly and freely agree to how their data is used — not have it taken because they clicked "Accept" on a wall of fine print they never read. Good consent is informed (you understand it), specific (tied to a stated purpose), and revocable (you can change your mind).
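One way to see what "informed, specific, and revocable" asks of a system is to sketch the record it would need to keep. The class, field names, and purposes below are hypothetical, invented for this example rather than taken from any law or library. Note the limit of the sketch: storing which notice a person was shown does not prove they understood it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    """Hypothetical record of one person's consent for one stated purpose."""
    user_id: str
    purpose: str               # specific: tied to a single stated purpose
    notice_version: str        # informed: which explanation the person was shown
    granted_at: datetime
    revoked_at: Optional[datetime] = None  # revocable: set when consent is withdrawn

    def revoke(self, when: datetime) -> None:
        self.revoked_at = when

def may_use(record: ConsentRecord, purpose: str, when: datetime) -> bool:
    """Allow use only for the consented purpose and only while consent is active."""
    if purpose != record.purpose:
        return False
    if when < record.granted_at:
        return False
    return record.revoked_at is None or when < record.revoked_at

t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 6, 1, tzinfo=timezone.utc)
t2 = datetime(2025, 9, 1, tzinfo=timezone.utc)

record = ConsentRecord("user-123", "keyboard_personalization", "v2", t0)
allowed_before = may_use(record, "keyboard_personalization", t1)  # consented purpose
allowed_other = may_use(record, "ad_targeting", t1)               # different purpose
record.revoke(t1)
allowed_after = may_use(record, "keyboard_personalization", t2)   # after withdrawal
```

The design choice worth noticing is that the purpose is part of the check: consent given for one use does not silently extend to another, and withdrawing it stops future use.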

The honest difficulty is that meaningful consent is hard at scale — nobody reads the terms, and "take it or leave it" isn't a free choice when the service is essential. So good governance is less about a perfect consent form and more about collecting less in the first place, being truthful about purpose, and accepting accountability when things go wrong. Many data-protection laws now encode this: the EU's GDPR demands purpose limitation, data minimization, and a real right to deletion, and the EU AI Act adds obligations on top for higher-risk systems.

Copyright: who owns what a model learned from

Today's biggest models learn by ingesting staggering amounts of text, images, code, and music — much of it scraped from the open web, much of that protected by copyright. [[copyright-training-data|Copyright and training data]] is the tangle this raises, and it helps to split it into separate questions with different answers. (1) The input question: was copying the works to train the model an infringement, or permitted as fair use / text-and-data-mining? (2) The memorization question: does the model sometimes emit near-verbatim copies of specific works — which is a clearer problem? (3) The output question: who owns what the model generates, and can an image closely imitating a living artist infringe? (4) The authorship question: can purely AI-made work be copyrighted at all?

Two of these get conflated constantly, and the distinction matters: *training on* copyrighted data is legally murky and hotly contested, while a model *reproducing* a near-exact copy of a specific protected work is a much clearer infringement risk. The honest state of things in the mid-2020s is unsettled — no global consensus, courts issuing their first and often conflicting rulings, reasonable people disagreeing. Be wary of confident claims in either direction: "it's obviously theft" and "it's obviously fair use" both overstate. And "the model learns like a human would" is a rhetorical analogy, not an established legal principle.
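The difference between learning from a work and reproducing it can at least be measured on the output side. The sketch below is a simple illustration, not a legal test: it reports what share of a model output's five-word sequences appear verbatim in a reference text. The function names, the five-word window, and the example strings are assumptions chosen for the demo, and the reference sentence is from a public-domain novel.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(output, reference, n=5):
    """Share of the output's n-grams that also appear verbatim in the reference."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out_grams & word_ngrams(reference, n)) / len(out_grams)

reference = ("it was the best of times it was the worst of times "
             "it was the age of wisdom")
copied = "it was the best of times it was the worst of times"
paraphrase = "those years were wonderful and terrible at once for everyone"

copied_score = overlap_fraction(copied, reference)
paraphrase_score = overlap_fraction(paraphrase, reference)

print(f"verbatim copy overlap: {copied_score:.2f}")
print(f"paraphrase overlap:    {paraphrase_score:.2f}")
```

A high score flags possible regurgitation worth a closer look; a low score says nothing about whether the training itself was lawful, which is the separate and still contested input question.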

Putting it together

Privacy and data rights are where the technical and the human meet head-on. The risks are concrete — re-identification and memorization, not science fiction. The defenses are real but partial: differential privacy gives a provable guarantee at a cost in accuracy, and federated learning keeps raw data home but needs help to actually be private. None of this is solved by a single clever algorithm; it takes deliberate governance choices about what to collect and why, and honesty about a legal landscape that genuinely hasn't settled.

Carry one habit forward as you keep climbing: when someone promises a system is "private" or "anonymized," ask *how*, and ask what it costs. Privacy is rarely all-or-nothing. It's a set of deliberate choices (what to collect, how long to keep it, who can see it, and what they're allowed to do with it), and the most trustworthy systems are the ones honest about exactly where their protections stop.