Chapter 9: Model Evaluation, Validation, and Testing
Learning Objectives
Select classification, regression, and ranking metrics that mirror the business cost function.
Design validation splits (k-fold, stratified, group, forward-chaining) that resist leakage and respect time and entity boundaries.
Run slice-based and fairness evaluation with Fairlearn and Aequitas to expose worst-case subgroup performance.
Apply behavioral testing (MFT/INV/DIR) and shadow evaluation as part of a pre-deployment release gate.
1. Choosing Metrics
Pre-Read Quiz — Metrics
A fraud-detection model on a 99.9% negative dataset reports 99.9% accuracy. What does this tell you?
0: The model is production-ready.
1: Accuracy is essentially uninformative under severe imbalance.
2: Precision and recall must both be at least 99%.
3: ROC-AUC must be above 0.99.
When over-prediction is much cheaper than under-prediction (inventory stockouts), which regression loss is most appropriate?
0: RMSE.
1: MAE.
2: Quantile loss at a high quantile.
3: MAPE.
You are evaluating a web search ranker where users click on graded relevance results. Which metric is the standard choice?
0: NDCG@k.
1: F1.
2: Log-loss.
3: MAPE.
Before you compute a single number, ask: what decision does the output drive, what are the asymmetric costs of errors, what kind of output is produced, what constraints apply, and who consumes the metric. Only then choose the primary metric, and after that add secondary metrics to monitor the trade-offs you accepted.
Classification lives off the confusion matrix. Precision answers "of what I flagged, how much was real?"; recall answers "of all real positives, how many did I catch?" F1 is their harmonic mean. AUC-ROC is threshold-independent but can mislead under severe imbalance, because the false-positive rate is diluted by the large pool of negatives; prefer PR-AUC there. Log-loss penalizes confident wrong probabilities and so reflects calibration quality, which matters when downstream logic multiplies probabilities by dollar amounts.
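As a minimal sketch of the confusion-matrix metrics, the following plain-Python example compares two hypothetical fraud "models" on a 99.9% negative dataset. The data, the function names, and the choice to report undefined ratios as 0.0 are illustrative assumptions, not part of the chapter.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical data: 1 fraud case among 1,000 transactions (99.9% negative).
y_true = [1] + [0] * 999

# A "model" that never flags anything.
never_flags = binary_metrics(y_true, [0] * 1000)

# A model that catches the fraud case at the cost of 4 false alarms.
catches_fraud = binary_metrics(y_true, [1] + [1] * 4 + [0] * 995)
```

In this toy setup the model that never flags anything has the higher accuracy (0.999 versus 0.996) but zero recall, which is the pattern the pre-read quiz points at.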
Regression metrics differ in how they weight magnitudes. RMSE punishes large errors quadratically, MAE weights all errors linearly, and MAPE is scale-free but breaks when actual values are near zero. Quantile loss is the right answer when over- and under-prediction have different business costs.
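To make the asymmetry concrete, here is a small sketch of quantile (pinball) loss next to MAE. The demand figures and the choice of q = 0.9 are hypothetical; the point is only that MAE scores a 10-unit shortfall and a 10-unit surplus identically, while quantile loss at a high quantile does not.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss at quantile q, with 0 < q < 1."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction (diff > 0) costs q per unit;
        # over-prediction (diff < 0) costs (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical demand of 100 units per period.
demand = [100, 100, 100, 100]
under_forecast = [90] * 4    # would cause stockouts
over_forecast = [110] * 4    # would cause overstock

mae_under = mae(demand, under_forecast)
mae_over = mae(demand, over_forecast)
q90_under = pinball_loss(demand, under_forecast, q=0.9)
q90_over = pinball_loss(demand, over_forecast, q=0.9)
```

With q = 0.9 the under-forecast is penalized roughly nine times as heavily as the over-forecast, so a model trained on this loss is pushed toward the 90th percentile of demand rather than the median.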
Ranking metrics evaluate ordered lists. NDCG@k handles graded relevance for web search and recommenders. MAP fits retrieval with binary relevance and multiple relevant documents per query. MRR fits single-answer tasks like FAQ retrieval where users stop after the first hit.
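A rough sketch of two of these ranking metrics follows. It assumes the linear-gain form of DCG (an exponential-gain variant, 2^rel - 1, is also common) and assumes the ranked list contains every judged document for the query, so the ideal ordering can be obtained by sorting the same list. The relevance grades are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(hits_per_query):
    """hits_per_query: one list of 0/1 flags per query, in ranked order."""
    total = 0.0
    for hits in hits_per_query:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank   # only the first relevant result counts
                break
    return total / len(hits_per_query)


# Hypothetical graded judgments (3 = perfect, 0 = irrelevant) in ranked order.
perfect_order = ndcg_at_k([3, 2, 1, 0], k=4)
reversed_order = ndcg_at_k([0, 1, 2, 3], k=4)

# Two queries: first relevant hit at rank 2, then at rank 1.
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
```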
Figure 9.1: Confusion matrix anatomy and derived metrics
flowchart TD
A["Actual: Positive"] --> TP["True Positive (TP) Predicted: Positive"]
A --> FN["False Negative (FN) Predicted: Negative"]
B["Actual: Negative"] --> FP["False Positive (FP) Predicted: Positive"]
B --> TN["True Negative (TN) Predicted: Negative"]
TP --> P["Precision = TP / (TP + FP)"]
FP --> P
TP --> R["Recall = TP / (TP + FN)"]
FN --> R
P --> F1["F1 = 2 P R / (P + R)"]
R --> F1
Key Points
Pick the metric whose mathematical structure mirrors the business cost function — accuracy is rarely the answer.
Under severe class imbalance, use PR-AUC, precision@k, or recall@precision-constraint instead of accuracy or ROC-AUC.
Log-loss and Brier score evaluate probability calibration; choose them when downstream decisions multiply scores by dollars.
Asymmetric regression costs (stockouts vs. overstock, blackouts vs. waste) call for quantile loss, not RMSE/MAE.
Always specify the @k cutoff for ranking metrics so it matches user-facing position.
Post-Read Quiz — Metrics
A fraud-detection model on a 99.9% negative dataset reports 99.9% accuracy. What does this tell you?
0: The model is production-ready.
1: Accuracy is essentially uninformative under severe imbalance.
2: Precision and recall must both be at least 99%.
3: ROC-AUC must be above 0.99.
When over-prediction is much cheaper than under-prediction (inventory stockouts), which regression loss is most appropriate?
0: RMSE.
1: MAE.
2: Quantile loss at a high quantile.
3: MAPE.
You are evaluating a web search ranker where users click on graded relevance results. Which metric is the standard choice?
0: NDCG@k.
1: F1.
2: Log-loss.
3: MAPE.
2. Validation Strategies
Pre-Read Quiz — Validation
You repeatedly tune hyperparameters on the test set. What actually happens?
0: Nothing — the test set protects itself.
1: It becomes a second validation set; final estimates are inflated.
2: Cross-validation breaks irreversibly.
3: Accuracy automatically converges to true performance.
For time-series data, why is random k-fold inappropriate?
0: It is computationally expensive.
1: It leaks the future into training; the model learns from tomorrow to predict today.
2: It oversamples positives.
3: It cannot be parallelized.
A scaler is fit on train+test combined before splitting. What is this an instance of?
0: A best practice for normalization.
1: Preprocessing leakage.
2: Stratified sampling.
3: Adversarial validation.
Every round of tuning against the test set turns it into a second validation set and inflates the final estimate. The discipline is to lock the test set away and touch it once, at the end, after all decisions are frozen. Other common errors include splitting before deduplication, splitting rows when the unit of generalization is users or sessions, and shuffling time series so the model trains on the future.
When data is limited, k-fold cross-validation reuses every example by rotating folds. Stratified k-fold is mandatory for imbalanced classification; group k-fold ensures every row tied to one entity (user, hospital, document) stays in the same fold. For temporal data, use forward-chaining: train on weeks 1-4, validate on week 5; train on weeks 1-5, validate on week 6; and so on. Purged k-fold adds a buffer to handle slowly-resolving labels.
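The two splitting schemes that most often go wrong, forward-chaining and group k-fold, can be sketched in a few lines of plain Python. The function names, the `gap` parameter (a simple stand-in for a purge buffer), and the round-robin group assignment are illustrative assumptions; libraries such as scikit-learn ship production versions of both.

```python
def forward_chaining_splits(periods, min_train, gap=0):
    """Yield (train_periods, validation_period) with an expanding training window.

    `gap` drops the most recent periods from training, a simple purge buffer
    for labels that resolve slowly.
    """
    for i in range(min_train + gap, len(periods)):
        yield periods[: i - gap], periods[i]


def group_kfold(groups, n_splits):
    """Return (train_idx, val_idx) pairs where no group spans both sides."""
    fold_of = {g: i % n_splits for i, g in enumerate(sorted(set(groups)))}
    folds = []
    for fold in range(n_splits):
        val_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        folds.append((train_idx, val_idx))
    return folds


# Weeks 1-8: train on 1-4 and validate on 5, then 1-5 and 6, and so on.
weeks = list(range(1, 9))
time_splits = list(forward_chaining_splits(weeks, min_train=4))

# Hypothetical rows tagged with the user they belong to.
row_users = ["u1", "u1", "u2", "u3", "u3", "u3", "u4", "u5", "u6", "u6"]
user_folds = group_kfold(row_users, n_splits=3)
```

Each validation week in `time_splits` is strictly later than every training week, and each user in `user_folds` lands entirely on one side of every split.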
Mature teams also maintain a golden dataset — a slowly evolving, manually verified evaluation set including edge cases and past regression scenarios — and a shadow set refreshed from recent production traffic. Each answers a different question: do we still handle the cases we explicitly care about, and is the world drifting away from our training distribution?
Leakage is the silent killer. Target-derived features, temporal leakage, entity leakage, and preprocessing-statistic leakage all inflate offline metrics. Detection includes single-feature AUC checks, adversarial validation (train a classifier to distinguish train from test rows), and audits asking "would this value actually be known at decision time in production?"
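One of the detection techniques above, the single-feature AUC check, can be sketched as follows. The feature names, the toy values, and the 0.95 flagging threshold are hypothetical; the idea is simply that a lone feature which separates the classes almost perfectly deserves a "known at decision time?" audit.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def suspicious_features(features, labels, threshold=0.95):
    """Flag features whose single-feature AUC is suspiciously far from 0.5."""
    flagged = {}
    for name, values in features.items():
        a = auc(values, labels)
        strength = max(a, 1 - a)   # a strongly inverted feature is just as suspicious
        if strength >= threshold:
            flagged[name] = strength
    return flagged


# Hypothetical fraud labels and two candidate features.
labels = [0, 0, 0, 1, 1, 1, 0, 1]
features = {
    "account_age_days": [400, 30, 250, 35, 500, 20, 90, 310],
    # Populated only after fraud is confirmed, so it is derived from the target.
    "chargeback_amount": [0, 0, 0, 120, 80, 45, 0, 300],
}

flagged = suspicious_features(features, labels)
```

A flag is a prompt for investigation, not proof of leakage: some legitimate features really are that predictive.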
Animation A1 — 5-fold cross-validation rotation
The dataset is split into 5 folds. On each iteration, one fold is held out for validation (red) while the other four train the model (blue). The role rotates until every fold has been held out exactly once.
flowchart TD
F1["Fold 1 Train: W1-W4 to Validate: W5"]
F2["Fold 2 Train: W1-W5 to Validate: W6"]
F3["Fold 3 Train: W1-W6 to Validate: W7"]
F4["Fold 4 Train: W1-W7 to Validate: W8"]
F1 --> F2 --> F3 --> F4
F4 --> AGG["Aggregate per-fold metrics"]
Animation A2 — Time-series forward chaining
The training window expands forward in time while a single test window slides one step ahead. The model is always validated on data strictly later than the training set, mimicking deployment.
Key Points
Lock the test set; touch it only once after all modeling decisions are frozen.
Use stratified k-fold for imbalance, group k-fold for entity-tied generalization, forward-chaining for time series.
Golden datasets answer "do we still handle hard known cases?"; shadow sets answer "is the world drifting?"
Compute every preprocessing statistic on the training fold only — fit transformers there and apply them to val/test.
Detect leakage with single-feature AUC, adversarial validation, and a per-feature audit of "known at decision time?"
Post-Read Quiz — Validation
You repeatedly tune hyperparameters on the test set. What actually happens?
0: Nothing — the test set protects itself.
1: It becomes a second validation set; final estimates are inflated.
2: Cross-validation breaks irreversibly.
3: Accuracy automatically converges to true performance.
For time-series data, why is random k-fold inappropriate?
0: It is computationally expensive.
1: It leaks the future into training; the model learns from tomorrow to predict today.
2: It oversamples positives.
3: It cannot be parallelized.
A scaler is fit on train+test combined before splitting. What is this an instance of?
0: A best practice for normalization.
1: Preprocessing leakage.
2: Stratified sampling.
3: Adversarial validation.
3. Slice-Based and Fairness Evaluation
Pre-Read Quiz — Slice and Fairness
A model has 92% global accuracy. Why might that still be a poor production signal?
0: It cannot be — global accuracy is the gold standard.
1: It can hide a slice (e.g., a demographic group) where accuracy is 71%.
2: It is only meaningful if recall is also reported.
3: Accuracy is not defined for production traffic.
Which fairness criterion requires equal true-positive rates across groups?
A 92% global accuracy can hide a 71% slice. Slice-based evaluation decomposes performance per subgroup so that worst-case rather than average behavior is the headline. Intersectional slices routinely surface failures invisible to single-attribute analysis.
Fairlearn's MetricFrame computes any metric per group and overall in one call. .by_group shows per-group metrics, .difference() shows max-minus-min, and .group_min() surfaces the worst-case group. Passing a DataFrame as sensitive_features produces intersectional slices.
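The disaggregation idea can be illustrated without any library. The sketch below is a plain-Python stand-in, not Fairlearn's implementation: `metric_by_group` plays the role of `.by_group`, and taking the min and max-minus-min of its output mirrors `.group_min()` and `.difference()`. The labels, predictions, and attributes are made up, and intersectional slices are formed by using tuples as group keys.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def metric_by_group(metric, y_true, y_pred, groups):
    """Apply `metric` separately to the rows of each group."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: metric(ts, ps) for g, (ts, ps) in buckets.items()}


# Hypothetical evaluation rows with two sensitive attributes.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
region = ["north"] * 4 + ["south"] * 4
age = ["<40", "<40", "40+", "40+", "<40", "<40", "40+", "40+"]

overall = accuracy(y_true, y_pred)
by_region = metric_by_group(accuracy, y_true, y_pred, region)
by_age = metric_by_group(accuracy, y_true, y_pred, age)
# Intersectional slices: one group per (region, age) combination.
by_cross = metric_by_group(accuracy, y_true, y_pred, list(zip(region, age)))

worst_single = min(min(by_region.values()), min(by_age.values()))
worst_cross = min(by_cross.values())
cross_gap = max(by_cross.values()) - worst_cross
```

In this toy data the worst single-attribute slice sits at 0.5 accuracy, while the ("south", "40+") intersection sits at 0.0, a failure that neither attribute reveals on its own.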
Aequitas expects a DataFrame with score, label_value, and attribute columns. The standard flow: preprocess the frame (preprocess_input_df), then Group.get_crosstabs for per-group ppr/tpr/fpr/fnr, then Bias.get_disparity_predefined_groups for disparity ratios versus a reference group, then Fairness.get_group_value_fairness, which flags groups whose disparities violate the 80% rule.
Three fairness criteria dominate: demographic parity (equal selection rates), equal opportunity (equal TPR), and equalized odds (equal TPR and FPR). When base rates differ, demographic parity and equalized odds cannot both hold for any classifier that beats chance, so choosing a criterion is a policy decision. Mitigation falls into pre-processing (reweight data), in-processing (constrain the loss), and post-processing (per-group thresholds), each with distinct trade-offs. Group-aware decisions at inference may be legally prohibited in domains like lending.
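The three criteria reduce to comparisons of per-group selection rate, TPR, and FPR. The sketch below computes them on made-up data for two hypothetical groups, then uses the identity selection rate = base rate x TPR + (1 - base rate) x FPR to show why equal TPR and FPR force unequal selection rates when base rates differ. It assumes every group contains both classes.

```python
def group_rates(y_true, y_pred, groups):
    """Selection rate, TPR, and FPR per group (each group must contain both classes)."""
    rates = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, member in zip(y_true, y_pred, groups) if member == g]
        preds_on_pos = [p for t, p in rows if t == 1]
        preds_on_neg = [p for t, p in rows if t == 0]
        rates[g] = {
            "selection_rate": sum(p for _, p in rows) / len(rows),
            "tpr": sum(preds_on_pos) / len(preds_on_pos),
            "fpr": sum(preds_on_neg) / len(preds_on_neg),
        }
    return rates


def implied_selection_rate(base_rate, tpr, fpr):
    """Selection rate implied by a group's base rate and the classifier's TPR/FPR."""
    return base_rate * tpr + (1 - base_rate) * fpr


# Hypothetical data: group "a" has base rate 4/6, group "b" has base rate 2/6.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
groups = ["a"] * 6 + ["b"] * 6

rates = group_rates(y_true, y_pred, groups)
selection = [r["selection_rate"] for r in rates.values()]
demographic_parity_ratio = min(selection) / max(selection)    # compare to the 80% rule
equal_opportunity_gap = abs(rates["a"]["tpr"] - rates["b"]["tpr"])
equalized_odds_gap = max(equal_opportunity_gap,
                         abs(rates["a"]["fpr"] - rates["b"]["fpr"]))

# Same TPR and FPR for both groups (equalized odds holds), different base rates.
sel_a = implied_selection_rate(4 / 6, tpr=0.8, fpr=0.1)
sel_b = implied_selection_rate(2 / 6, tpr=0.8, fpr=0.1)
```

Here `sel_a` and `sel_b` differ even though TPR and FPR are identical, so a classifier that satisfies equalized odds on these groups violates demographic parity.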
Figure 9.4: Slice-based evaluation workflow
flowchart LR
PRED["Predictions"]
ATTR["Sensitive attributes"]
PRED --> SLICE["Group by slice"]
ATTR --> SLICE
SLICE --> METRICS["Per-group metrics"]
METRICS --> WORST["Worst-case group"]
METRICS --> DI["Disparate-impact ratio"]
WORST --> GATE{"Meets thresholds?"}
DI --> GATE
GATE -->|Yes| PASS["Pass slice gate"]
GATE -->|No| MITIGATE["Mitigation"]
Key Points
Always disaggregate — global accuracy can mask catastrophically bad subgroup performance.
Use Fairlearn's MetricFrame.group_min() and .difference() to surface worst-case slices.
Aequitas requires columns named score and label_value; rename or it silently produces zeros.
Demographic parity, equal opportunity, and equalized odds cannot all be satisfied when base rates differ — choose one as policy.
Bootstrap confidence intervals on small slices to avoid chasing sampling noise.
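The last point can be sketched with a percentile bootstrap over per-row correctness flags. The slice sizes, the 70% accuracy, the resample count, and the fixed seed are illustrative assumptions; the takeaway is only that the same point estimate carries very different uncertainty on 20 rows than on 1,000.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy from a list of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample rows with replacement and recompute accuracy each time.
    estimates = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = estimates[int(n_resamples * alpha / 2)]
    upper = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Both hypothetical slices have 70% accuracy; only the sample size differs.
small_slice = [1] * 14 + [0] * 6        # 20 rows
large_slice = [1] * 700 + [0] * 300     # 1,000 rows

small_lo, small_hi = bootstrap_accuracy_ci(small_slice)
large_lo, large_hi = bootstrap_accuracy_ci(large_slice)
```

If a slice's interval is wide enough to contain the release threshold, collecting more labeled rows for that slice is usually a better next step than retraining.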
Post-Read Quiz — Slice and Fairness
A model has 92% global accuracy. Why might that still be a poor production signal?
0: It cannot be — global accuracy is the gold standard.
1: It can hide a slice (e.g., a demographic group) where accuracy is 71%.
2: It is only meaningful if recall is also reported.
3: Accuracy is not defined for production traffic.
Which fairness criterion requires equal true-positive rates across groups?
4. Behavioral Testing and Shadow Evaluation
Pre-Read Quiz — Behavioral and Shadow
Adding "very" before "good" must increase positive-class probability. Which test type encodes this?
0: MFT.
1: INV.
2: DIR.
3: Adversarial validation.
What does shadow evaluation NOT do?
0: Run the candidate model on live traffic.
1: Surface latency and input-distribution gaps.
2: Serve the candidate's predictions to users.
3: Compare agreement with the production model.
Aggregate metrics tell you nothing about negation, typos, paraphrases, demographic substitutions, or numerical reasoning. CheckList (Ribeiro et al. 2020) treats a model like software, probing capabilities with unit-test-style assertions on a capability x test-type matrix.
Minimum Functionality Tests (MFTs) are atomic correctness tests (negation, intensifiers, fairness across demographics). Invariance tests (INVs) are metamorphic tests: label-preserving perturbations like typos, synonyms, or name swaps that must not change the prediction. Directional Expectation tests (DIRs) are directional metamorphic tests: adding "very" must make a positive sentence more positive, so the score must move the expected way.
Test suites scale because tests are generated from templates and lexicons, not handwritten. Output is a capability x test-type matrix of pass rates: "Negation MFT: 68%, Spelling INV: 55%, Gender-swap INV: 92%." Each failure points to specific data augmentation or guardrails that might fix it. Adversarial generators (TextAttack, PGD) probe what the team did not think to test; stress tests characterize degradation on noisy or rare inputs.
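The template-and-lexicon approach can be sketched without the CheckList library itself. Everything below is hypothetical: `toy_positive_score` is a deliberately naive lexicon scorer standing in for a real model, and the templates, nouns, and typo table are made up. The sketch shows the three test types producing per-capability pass rates.

```python
from itertools import product

POSITIVE_WORDS = ["good", "delicious", "great"]
NEGATIVE_WORDS = ["bad", "awful"]


def toy_positive_score(text):
    """Hypothetical stand-in for a real model: pseudo-probability of the positive class."""
    tokens = text.lower().strip(".!").split()
    score = 0.5
    for i, tok in enumerate(tokens):
        polarity = 0.3 if tok in POSITIVE_WORDS else -0.3 if tok in NEGATIVE_WORDS else 0.0
        prev = tokens[i - 1] if i > 0 else ""
        if prev == "very":
            polarity *= 1.5        # intensifier strengthens the sentiment word
        elif prev == "not":
            polarity = -polarity   # negation flips it
        score += polarity
    return min(1.0, max(0.0, score))


def label(text):
    return "positive" if toy_positive_score(text) > 0.5 else "negative"


def pass_rate(outcomes):
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)


# Lexicons that fill the templates (hypothetical).
NOUNS = ["food", "meal"]
TYPOS = {"delicious": "deliciuos", "good": "godo"}

results = {
    # MFT: a negated positive adjective must be labeled negative.
    "Negation MFT": pass_rate(
        label(f"The {n} was not {a}.") == "negative"
        for n, a in product(NOUNS, POSITIVE_WORDS)),
    # INV: a typo must not change the predicted label.
    "Spelling INV": pass_rate(
        label(f"The {n} was {a}.") == label(f"The {n} was {t}.")
        for n, (a, t) in product(NOUNS, TYPOS.items())),
    # DIR: adding "very" must raise the positive score.
    "Intensifier DIR": pass_rate(
        toy_positive_score(f"The {n} was very {a}.") > toy_positive_score(f"The {n} was {a}.")
        for n, a in product(NOUNS, POSITIVE_WORDS)),
}
```

The toy scorer passes the negation and intensifier checks but fails every spelling invariance case because misspelled words fall outside its lexicon, which is the kind of capability-specific finding an aggregate accuracy number would not show.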
Shadow evaluation runs the candidate model on live production traffic alongside the existing model, logging both predictions without affecting users. It surfaces real-world input distribution, real-world latency, and the disagreement pattern between candidate and production — a model that disagrees on 8% of cases for "harmless" reasons may still surprise users.
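A shadow comparison ultimately reduces to analysis of the paired prediction log. The sketch below assumes a hypothetical log schema (`live`, `shadow`, `live_ms`, `shadow_ms`) and made-up records; it reports the agreement rate, which label transitions make up the disagreements, and nearest-rank p95 latency for each model.

```python
from collections import Counter


def p95(values):
    """Nearest-rank 95th percentile, computed with integer arithmetic."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def shadow_report(log):
    """Summarize agreement and latency from paired live/shadow log records."""
    disagreements = [r for r in log if r["live"] != r["shadow"]]
    return {
        "agreement_rate": 1 - len(disagreements) / len(log),
        # Which (live -> shadow) label changes make up the disagreements.
        "transitions": Counter((r["live"], r["shadow"]) for r in disagreements),
        "live_p95_ms": p95([r["live_ms"] for r in log]),
        "shadow_p95_ms": p95([r["shadow_ms"] for r in log]),
    }


# Hypothetical mirrored traffic: 10 requests scored by both models.
live_preds = ["approve"] * 6 + ["deny"] * 4
shadow_preds = ["approve", "approve", "deny", "approve", "approve",
                "approve", "deny", "deny", "approve", "deny"]
live_ms = [21, 19, 22, 20, 25, 23, 18, 24, 20, 22]
shadow_ms = [30, 28, 35, 31, 120, 33, 29, 34, 30, 32]

log = [
    {"live": lp, "shadow": sp, "live_ms": lm, "shadow_ms": sm}
    for lp, sp, lm, sm in zip(live_preds, shadow_preds, live_ms, shadow_ms)
]
report = shadow_report(log)
```

Breaking disagreements down by transition matters because an "approve to deny" flip and a "deny to approve" flip usually carry different business risk even when the overall agreement rate looks acceptable.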
Figure 9.5: Shadow evaluation architecture
flowchart LR
USER["Production traffic"] --> ROUTER["Request router"]
ROUTER --> LIVE["Live model (serves user)"]
ROUTER -.->|mirror| SHADOW["Shadow model (no user impact)"]
LIVE --> RESP["Response to user"]
LIVE --> LOG["Prediction log"]
SHADOW --> LOG
LOG --> COMPARE["Compare agreement, latency, drift"]
COMPARE --> REPORT["Shadow eval report"]
Animation A3 — Shadow evaluation flow
Incoming traffic forks at the router: the production model serves a response back to the user (blue), while the shadow model writes its prediction to a log (gray) without user impact. A comparator surfaces disagreement, latency, and drift.
A mature pre-deployment gate combines all the layers in this chapter into a release checklist: aggregate metrics on the golden dataset, per-slice thresholds, fairness bounds, behavioral suite pass rates per capability (with safety-critical MFTs blocking release), shadow agreement and latency, and a final A/B test correlating the offline proxy with the real business KPI.
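The offline part of such a checklist can be expressed as data plus a small evaluator. The check names, thresholds, and measurements below are all hypothetical, and the split into blocking checks and warnings is one possible reading of "safety-critical MFTs blocking release"; the final A/B test happens after this gate and is not modeled.

```python
# Hypothetical gate definition: check name -> (direction, limit, blocks release?)
GATE = {
    "golden_f1":                ("min", 0.85, True),
    "worst_slice_recall":       ("min", 0.70, True),
    "demographic_parity_ratio": ("min", 0.80, True),
    "safety_mft_pass_rate":     ("min", 1.00, True),
    "typo_inv_pass_rate":       ("min", 0.80, False),
    "shadow_agreement":         ("min", 0.90, False),
    "shadow_p95_latency_ms":    ("max", 200, True),
}


def evaluate_gate(measurements, gate):
    """Compare measurements to thresholds; any blocking failure stops the release."""
    blocking_failures, warnings = [], []
    for name, (direction, limit, is_blocking) in gate.items():
        value = measurements[name]
        ok = value >= limit if direction == "min" else value <= limit
        if not ok:
            (blocking_failures if is_blocking else warnings).append(name)
    return {
        "release": not blocking_failures,
        "blocking_failures": blocking_failures,
        "warnings": warnings,
    }


# Hypothetical measurements for a candidate model.
candidate = {
    "golden_f1": 0.88,
    "worst_slice_recall": 0.74,
    "demographic_parity_ratio": 0.83,
    "safety_mft_pass_rate": 0.97,
    "typo_inv_pass_rate": 0.55,
    "shadow_agreement": 0.93,
    "shadow_p95_latency_ms": 140,
}

decision = evaluate_gate(candidate, GATE)
```

Keeping the thresholds in a reviewable data structure rather than scattered through scripts makes it easier to audit which trade-offs a release accepted.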
Key Points
CheckList reframes evaluation as a capability x test-type matrix (MFT, INV, DIR) — far more actionable than one accuracy number.
MFTs are atomic; INVs are label-preserving perturbations; DIRs assert the score moves in a known direction.
Generate tests from templates and lexicons to scale; handwrite only the irreducibly weird cases.
Shadow evaluation surfaces real-world latency, distribution, and disagreement that no held-out set can.
The pre-deployment gate: golden metrics, slice minimums, fairness bounds, behavioral pass rates, shadow agreement, A/B KPI correlation.
Post-Read Quiz — Behavioral and Shadow
In CheckList, you assert "The food was delicious" and "The food was deliciuos" must receive the same label. Which test type is this?
Adding "very" before "good" must increase positive-class probability. Which test type encodes this?
0: MFT.
1: INV.
2: DIR.
3: Adversarial validation.
What does shadow evaluation NOT do?
0: Run the candidate model on live traffic.
1: Surface latency and input-distribution gaps.
2: Serve the candidate's predictions to users.
3: Compare agreement with the production model.