A score, not a verdict
By now you know how to read a confusion matrix and compute precision and recall. But there is a hidden assumption buried in all of those numbers: that the model already decided *yes* or *no*. In reality, almost every classifier — a logistic regression, a random forest, a neural net ending in a sigmoid — does not hand you a verdict. It hands you a score: a number, often between 0 and 1, expressing how strongly it leans toward the positive class.
To turn that score into a decision you need a threshold — a cutoff line. Score above the line, call it positive; below, call it negative. The usual default of 0.5 feels natural, but it is just a convention, not a law. Slide the threshold up and you become stricter (fewer positives); slide it down and you become more generous. Every metric you learned earlier — precision, recall, F1 — silently depends on where you put that line.
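To make the dependence on the cutoff concrete, here is a minimal sketch in plain Python. The scores and labels are made up for illustration, the helper names are hypothetical, and treating a score exactly equal to the threshold as positive is an assumption, not a rule.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when every score >= threshold is called positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(scores, labels, threshold):
    """Precision and recall at a single threshold."""
    tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
    # Convention (an assumption): precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.50, 0.75):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 lifts precision from 0.67 to 1.00 while recall falls from 1.00 to 0.50: same model, same scores, different verdicts.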
The ROC curve: every threshold at once
Instead of committing to one threshold, what if we tried *all* of them and plotted the result? That is exactly what the [[roc-curve|ROC curve]] does. For each possible cutoff, it measures two things: the true positive rate (recall — the fraction of real positives you caught) and the false positive rate (the fraction of real negatives you wrongly flagged). Plot false positive rate on the x-axis and true positive rate on the y-axis, and sweep the threshold from strict to lenient. The trail of points is the ROC curve.
Read the corners and the picture clears up. At a very strict threshold (top score required), you flag almost nothing: both rates near 0, the bottom-left corner. At a very lenient threshold, you flag everything: both rates near 1, the top-right corner. A useless model that guesses randomly traces the diagonal line from corner to corner. A good model *bows toward the top-left* — catching many true positives while keeping false positives low. The closer the curve hugs that upper-left corner, the better.
[Figure: ROC curve sketch. x-axis: FPR from 0 to 1; y-axis: TPR from 0 to 1. The good model's curve bows toward the top-left corner; random guessing follows the diagonal.]
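The sweep described above can be sketched in a few lines of plain Python. This is an illustrative implementation, not a library API: `roc_points` is a hypothetical helper and the data is made up. It sorts examples by score and lowers the threshold one distinct score at a time, recording (FPR, TPR) after each step.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, sweeping the threshold from strict to lenient."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Highest score first: lowering the threshold admits examples in this order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current_score = ranked[i][0]
        # Examples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == current_score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up model scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first point is always (0, 0) and the last is always (1, 1), matching the two corners; everything interesting happens in between.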
AUC: squeezing the curve into one number
A whole curve is hard to put in a spreadsheet, so people summarize it with the [[auc|AUC]]: the *area under the ROC curve*. A perfect model fills the whole box: AUC = 1.0. Random guessing gives the diagonal, which cuts the box in half: AUC = 0.5. Strictly speaking, AUC can fall anywhere from 0 to 1, but any useful model lands between 0.5 (worthless) and 1.0 (flawless), and bigger is better. An AUC below 0.5 means your model ranks worse than chance, which is usually a sign that your labels or scores are flipped.
AUC has a lovely, intuitive meaning: it is the probability that the model gives a *randomly chosen positive example* a higher score than a *randomly chosen negative example*. In other words, it measures ranking quality — how well the model sorts positives above negatives — and it does so *regardless of any threshold*. That is exactly why AUC is so popular for answering "is the model good?": it judges the scores themselves, not one arbitrary cutoff.
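The ranking interpretation can be checked directly by counting pairs. The sketch below is a brute-force illustration on made-up data; `pairwise_auc` is a hypothetical helper, and counting ties as half a win is the usual convention rather than something stated above. It is quadratic in the number of examples, so it suits small demonstrations, not large datasets.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which is the usual convention.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up model scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = pairwise_auc(scores, labels)
print(f"AUC = {auc}")  # 13 of the 16 pairs are ranked correctly: 0.8125

# Negating every score reverses the ranking, so the AUC becomes 1 - AUC.
flipped = [-s for s in scores]
print(f"AUC with flipped scores = {pairwise_auc(flipped, labels)}")  # 0.1875
```

No threshold appears anywhere in this computation, which is the point: only the ordering of the scores matters.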
When ROC lies: imbalance and the PR curve
ROC has a blind spot, and it shows up exactly where real life gets hard: [[class-imbalance|class imbalance]]. Imagine fraud detection where 1 in 1,000 transactions is fraud. The false positive *rate* divides by the huge pile of negatives, so even thousands of false alarms barely move it: out of a million transactions, 999,000 are legitimate, and roughly 10,000 false alarms amount to a false positive rate of only about 1%. A model can post a gorgeous 0.95 AUC and still bury the real fraud it catches under a mountain of false alarms, because the ROC curve never looks at how many of your *flagged* cases were actually right.
This is where the [[precision-recall-curve|precision-recall curve]] earns its keep. It plots precision (of everything I flagged, how much was right?) against recall (of all real positives, how many did I catch?), again sweeping every threshold. Because precision focuses on your positive predictions, it does *not* get diluted by the ocean of easy negatives. On a rare-positive problem, a mediocre model that looked fine on ROC will reveal a sad, sagging PR curve — an honest portrait of how often your alarms are wrong.
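A quick back-of-the-envelope calculation shows the gap between the two views. Every count below is hypothetical, chosen only to match the 1-in-1,000 fraud rate in the example: the model is assumed to catch 900 of 1,000 frauds while flagging 9,990 legitimate transactions at some threshold.

```python
# Hypothetical counts at one operating point (illustration only).
n_fraud = 1_000          # real positives
n_legit = 999_000        # real negatives (1 fraud per 1,000 transactions)
true_positives = 900     # frauds the model flags
false_positives = 9_990  # legitimate transactions the model flags

tpr = true_positives / n_fraud    # recall: the y-axis of the ROC curve
fpr = false_positives / n_legit   # the x-axis of the ROC curve
precision = true_positives / (true_positives + false_positives)

print(f"TPR (recall): {tpr:.2%}")
print(f"FPR:          {fpr:.2%}")
print(f"Precision:    {precision:.2%}")
```

On the ROC plot this operating point sits near (0.01, 0.90), very close to the top-left corner. Precision tells the other half of the story: only about 8% of the flagged transactions are actually fraud, so roughly 11 of every 12 alarms are false.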
Choosing the threshold for your real problem
Curves and AUC tell you the model's *potential*. Deployment still needs one number: the threshold. And the right threshold is not a math question — it is a *cost* question. Ask which mistake hurts more. A spam filter that dumps a real job offer into junk (a false positive) is far worse than letting one spam through, so you raise the threshold to protect precision. A cancer screen that misses a real tumor (a false negative) is catastrophic, so you lower the threshold to protect recall, accepting more false alarms that a second test can rule out.
- Write down the cost of each error type for your problem — in money, harm, or user trust. Be concrete; vague intuition leads to a vague threshold.
- Pick the metric those costs imply: high-recall if misses are deadly, high-precision if false alarms are expensive, F1 or a weighted blend if both matter.
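- Sweep thresholds on the validation set, record that metric at each, and pick the threshold that maximizes it. If the metric is recall or precision alone, pair it with a floor on the other one; otherwise the "best" threshold is the trivial one that flags everything or nothing.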
- Lock that threshold, then report final numbers on the untouched test set so the choice is not quietly overfit.
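One way to turn the steps above into code is to score each candidate threshold by total error cost and keep the cheapest. This is a sketch under stated assumptions: the cost values, the validation data, and the helper names `total_cost` and `best_threshold` are all hypothetical, and minimizing total cost is just one of the metrics the costs might imply.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of the errors made when scores >= threshold are flagged."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Sweep every distinct score as a cutoff and return the cheapest one."""
    # float("inf") stands for "flag nothing at all".
    candidates = sorted(set(scores)) + [float("inf")]
    # min() keeps the first minimum, so ties go to the lowest threshold.
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn),
    )


# Made-up validation scores and labels (1 = positive, 0 = negative).
val_scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
val_labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Misses are ten times worse than false alarms (screening-style costs).
print(best_threshold(val_scores, val_labels, cost_fp=1, cost_fn=10))

# False alarms are ten times worse than misses (spam-filter-style costs).
print(best_threshold(val_scores, val_labels, cost_fp=10, cost_fn=1))
```

On this toy data the first setting picks a low cutoff (0.30) that catches every positive, and the second picks a high one (0.80) that raises no false alarms. Once chosen, the threshold is frozen and applied unchanged to the test set.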
Keep the layers straight and you will never be fooled by a single number again. AUC and the curves rank the model and survive any threshold; the threshold turns a score into an action and must reflect real-world costs; precision and recall at that chosen threshold are what your users actually live with. A model that scores 0.99 AUC can still be the wrong tool if its operating point quietly trades away the error you could least afford.