ROC, AUC & Thresholds

A score, not a verdict

By now you know how to read a confusion matrix and compute precision and recall. But there is a hidden assumption buried in all of those numbers: that the model already decided *yes* or *no*. In reality, almost every classifier — a logistic regression, a random forest, a neural net ending in a sigmoid — does not hand you a verdict. It hands you a score: a number, often between 0 and 1, expressing how strongly it leans toward the positive class.

To turn that score into a decision you need a threshold — a cutoff line. Score above the line, call it positive; below, call it negative. The usual default of 0.5 feels natural, but it is just a convention, not a law. Slide the threshold up and you become stricter (fewer positives); slide it down and you become more generous. Every metric you learned earlier — precision, recall, F1 — silently depends on where you put that line.
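To make the dependence on the threshold concrete, here is a minimal sketch in plain Python. The scores and labels are invented for illustration, the helper names `confusion_counts` and `precision_recall` are hypothetical, and treating a score equal to the threshold as positive is an assumption about where ties fall.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are called positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(scores, labels, threshold):
    """Precision and recall at one threshold."""
    tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
    # Convention (an assumption): precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and true labels (1 = positive), for illustration only.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t in (0.25, 0.50, 0.75):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 moves precision from about 0.67 up to 1.00 while recall falls from 1.00 to 0.50: same model, same scores, different metrics.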

The ROC curve: every threshold at once

Instead of committing to one threshold, what if we tried *all* of them and plotted the result? That is exactly what the [[roc-curve|ROC curve]] does. For each possible cutoff, it measures two things: the true positive rate (recall — the fraction of real positives you caught) and the false positive rate (the fraction of real negatives you wrongly flagged). Plot false positive rate on the x-axis and true positive rate on the y-axis, and sweep the threshold from strict to lenient. The trail of points is the ROC curve.

Read the corners and the picture clears up. At a very strict threshold (top score required), you flag almost nothing: both rates near 0, the bottom-left corner. At a very lenient threshold, you flag everything: both rates near 1, the top-right corner. A useless model that guesses randomly traces the diagonal line from corner to corner. A good model *bows toward the top-left* — catching many true positives while keeping false positives low. The closer the curve hugs that upper-left corner, the better.

TPR
 1 |          ____------  good model (bows to top-left)
   |      _--/
   |    _/        ....  random guess (diagonal)
   |  _/      ....
   | /    ....
 0 |/ ...______________________
   0                          1   FPR

The ROC curve sweeps every threshold. The diagonal is a coin flip; the bow toward the top-left is skill.
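The sweep described above can be sketched in a few lines of plain Python. This is an illustrative implementation rather than any library's method: the function name `roc_points` and the toy data are assumptions, and the code assumes both classes are present so that neither rate divides by zero.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, sweeping the threshold from strict to lenient."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos  # assumes at least one positive and one negative
    # Highest score first: each step lowers the threshold past one more example.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # strictest threshold: nothing is flagged
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Tied scores cross the threshold together, so emit one point per tie group.
        last_of_tie_group = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_tie_group:
            points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up scores and labels (1 = positive), for illustration only.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first point is the bottom-left corner (0, 0) and the last is the top-right corner (1, 1). Every positive encountered moves the trail up; every negative moves it right.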

AUC: squeezing the curve into one number

A whole curve is hard to put in a spreadsheet, so people summarize it with the [[auc|AUC]], the *area under the ROC curve*. A perfect model fills the whole box: AUC = 1.0. Random guessing gives the diagonal, which cuts the box in half: AUC = 0.5. Formally AUC can fall anywhere from 0 to 1, but a useful model lives between 0.5 (no better than chance) and 1.0 (flawless), and bigger is better. An AUC clearly below 0.5 means the model ranks worse than chance, which is usually a sign that your labels or scores are flipped: invert the predictions and the AUC becomes 1 minus its old value.

AUC has a lovely, intuitive meaning: it is the probability that the model gives a *randomly chosen positive example* a higher score than a *randomly chosen negative example*. In other words, it measures ranking quality — how well the model sorts positives above negatives — and it does so *regardless of any threshold*. That is exactly why AUC is so popular for answering "is the model good?": it judges the scores themselves, not one arbitrary cutoff.
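The ranking interpretation can be checked directly by counting pairs. The sketch below is a brute-force illustration, not an efficient implementation; the name `auc_by_pairs`, the toy data, and the choice to count tied scores as half a win are assumptions.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up scores and labels (1 = positive), for illustration only.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = auc_by_pairs(scores, labels)
flipped_auc = auc_by_pairs(scores, [1 - y for y in labels])
print(auc, flipped_auc)
```

On this data 13 of the 16 positive-negative pairs are ordered correctly, so the function returns 0.8125. Flipping the labels gives 0.1875, which is 1 minus the original value. No threshold appears anywhere in the calculation.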

When ROC lies: imbalance and the PR curve

ROC has a blind spot, and it shows up exactly where real life gets hard: [[class-imbalance|class imbalance]]. Imagine fraud detection where 1 in 1,000 transactions is fraud. Out of a million transactions, 999,000 are legitimate, and the false positive *rate* divides by that huge pile of negatives, so 10,000 false alarms register as an FPR of only about 1%. A model can post a gorgeous 0.95 AUC and still bury the real frauds under a mountain of false alarms, because the ROC curve never looks at how many of your *flagged* cases were actually right.
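A little arithmetic shows how the same ROC operating point turns into very different precision once prevalence changes. In this sketch the function name `precision_from_rates` and the operating point (90% true positive rate, 1% false positive rate) are hypothetical numbers chosen for illustration.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a (TPR, FPR) operating point at a given positive rate."""
    true_alarms = tpr * prevalence            # share of all cases: real positives flagged
    false_alarms = fpr * (1.0 - prevalence)   # share of all cases: negatives flagged
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical operating point: catch 90% of fraud, wrongly flag 1% of legitimate cases.
balanced = precision_from_rates(tpr=0.90, fpr=0.01, prevalence=0.5)
rare = precision_from_rates(tpr=0.90, fpr=0.01, prevalence=0.001)

print(f"precision with 50% positives:  {balanced:.3f}")
print(f"precision with 0.1% positives: {rare:.3f}")
```

With balanced classes this point gives a precision of about 0.989. At 1 fraud in 1,000 the identical point gives about 0.083, or roughly eleven false alarms for every real fraud caught, even though its position on the ROC curve has not moved.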

This is where the [[precision-recall-curve|precision-recall curve]] earns its keep. It plots precision (of everything I flagged, how much was right?) against recall (of all real positives, how many did I catch?), again sweeping every threshold. Because precision focuses on your positive predictions, it does *not* get diluted by the ocean of easy negatives. On a rare-positive problem, a mediocre model that looked fine on ROC will reveal a sad, sagging PR curve — an honest portrait of how often your alarms are wrong.
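As a rough illustration, the sketch below computes precision-recall points for a small made-up dataset and again after every negative is duplicated ten times. Duplicating negatives leaves each true and false positive rate unchanged, so the ROC curve stays where it was, while precision drops. The name `pr_points` and all data are assumptions for the example.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs, sweeping the threshold from strict to lenient."""
    n_pos = sum(labels)  # assumes at least one positive
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Tied scores cross the threshold together, so emit one point per tie group.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((tp / n_pos, tp / (tp + fp)))
    return points


# Made-up balanced data: 4 positives, 4 negatives.
pos_scores = [0.95, 0.80, 0.60, 0.30]
neg_scores = [0.70, 0.40, 0.20, 0.10]

balanced_scores = pos_scores + neg_scores
balanced_labels = [1] * 4 + [0] * 4

# Same model behaviour, but ten times as many negatives (4 positives, 40 negatives).
rare_scores = pos_scores + neg_scores * 10
rare_labels = [1] * 4 + [0] * 40

balanced_curve = pr_points(balanced_scores, balanced_labels)
rare_curve = pr_points(rare_scores, rare_labels)
for (r, p_bal), (_, p_rare) in zip(balanced_curve, rare_curve):
    print(f"recall={r:.2f}  precision balanced={p_bal:.2f}  rare={p_rare:.2f}")
```

At a recall of 0.75, precision is 0.75 on the balanced data and about 0.23 (3 of 13 flagged) on the imbalanced data.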

Choosing the threshold for your real problem

Curves and AUC tell you the model's *potential*. Deployment still needs one number: the threshold. And the right threshold is not a math question — it is a *cost* question. Ask which mistake hurts more. A spam filter that dumps a real job offer into junk (a false positive) is far worse than letting one spam through, so you raise the threshold to protect precision. A cancer screen that misses a real tumor (a false negative) is catastrophic, so you lower the threshold to protect recall, accepting more false alarms that a second test can rule out.

1. Write down the cost of each error type for your problem, in money, harm, or user trust. Be concrete; vague intuition leads to a vague threshold.
2. Pick the metric those costs imply: recall if misses are deadly, precision if false alarms are expensive, F1 or a weighted blend if both matter.
3. Sweep thresholds on the validation set, record that metric at each, and pick the threshold that maximizes it. If the metric is recall or precision alone, hold the other at a minimum acceptable level; otherwise the winning threshold is the trivial one that flags everything or almost nothing.
4. Lock that threshold, then report final numbers on the untouched test set so the choice is not quietly overfit.
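The steps above can be sketched as a small cost-minimizing sweep. This is one possible reading of the "weighted blend" option, not the only way to do it: the helper names `total_cost` and `pick_threshold`, the cost values, and the validation and test data are all hypothetical.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of the errors made when scores >= threshold are flagged."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def pick_threshold(scores, labels, cost_fp, cost_fn):
    """Try each observed score as a cutoff, plus 'flag nothing', and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Made-up validation and test data (1 = positive), for illustration only.
val_scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
val_labels = [1, 1, 0, 1, 0, 1, 0, 0]
test_scores = [0.90, 0.65, 0.55, 0.35, 0.25, 0.15]
test_labels = [1, 0, 1, 1, 0, 0]

# Screening-style costs (assumed): a miss is ten times worse than a false alarm.
screening_t = pick_threshold(val_scores, val_labels, cost_fp=1, cost_fn=10)
# Spam-style costs (assumed): a false alarm is ten times worse than a miss.
spam_t = pick_threshold(val_scores, val_labels, cost_fp=10, cost_fn=1)
print(screening_t, spam_t)

# Lock the threshold chosen on validation data, then report on the test set.
test_cost = total_cost(test_scores, test_labels, screening_t, cost_fp=1, cost_fn=10)
print(test_cost)
```

On this toy data the screening-style costs select a low threshold (0.30) and the spam-style costs select a high one (0.80), matching the intuition in the two examples above. The test set is only touched after the threshold is fixed.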

Keep the layers straight and you become much harder to fool with a single number. AUC and the curves measure how well the model ranks, independent of any single threshold; the threshold turns a score into an action and must reflect real-world costs; precision and recall at that chosen threshold are what your users actually live with. A model that scores 0.99 AUC can still be the wrong tool if its operating point quietly trades away the error you could least afford.