Chapter 9: Model Evaluation, Validation, and Testing


1. Choosing Metrics

Pre-Read Quiz — Metrics

A fraud-detection model on a 99.9% negative dataset reports 99.9% accuracy. What does this tell you?

0: The model is production-ready. 1: Accuracy is essentially uninformative under severe imbalance. 2: Precision and recall must both be at least 99%. 3: ROC-AUC must be above 0.99.

When over-prediction is much cheaper than under-prediction (inventory stockouts), which regression loss is most appropriate?

0: RMSE. 1: MAE. 2: Quantile loss at a high quantile. 3: MAPE.

You are evaluating a web search ranker whose results carry graded relevance labels. Which metric is the standard choice?

0: NDCG@k. 1: F1. 2: Log-loss. 3: MAPE.

Before you compute a single number, ask: what decision does the output drive, what are the asymmetric costs of errors, what kind of output is produced, what constraints apply, and who consumes the metric? Only then choose the primary metric, and add secondary metrics to monitor the trade-offs you accepted.

Classification lives off the confusion matrix. Precision answers "of what I flagged, how much was real?"; recall answers "of all real positives, how many did I catch?" F1 is their harmonic mean. AUC-ROC is threshold-independent but can look deceptively strong under severe imbalance, because the large pool of true negatives keeps the false-positive rate low; prefer PR-AUC there. Log-loss scores the predicted probabilities themselves and penalizes confident mistakes heavily, which matters when downstream logic multiplies probabilities by dollar amounts.
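To make the accuracy trap concrete, the following minimal sketch derives these metrics from raw confusion-matrix counts in plain Python. The counts are invented for illustration (0.1% positives in one million rows), and the helper name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is flagged or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts: a model that predicts "negative" for every row.
always_negative = metrics_from_counts(tp=0, fp=0, fn=1_000, tn=999_000)

# Illustrative counts: a model that catches most fraud at the cost of false alarms.
useful = metrics_from_counts(tp=800, fp=400, fn=200, tn=998_600)

print(always_negative)
print(useful)
```

Both models land within a fraction of a percentage point of each other on accuracy, while recall separates them completely: the first catches no fraud at all.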

Regression metrics differ in how they weight error magnitudes. RMSE penalizes errors quadratically, so a few large misses dominate; MAE weights every unit of error equally; MAPE is scale-free but becomes unstable or undefined when actual values are near zero. Quantile loss is the right answer when over- and under-prediction have different business costs.
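A short sketch of quantile (pinball) loss shows how the asymmetry works. The demand figures and the choice of q = 0.9 are assumptions made for illustration; `pinball_loss` is a hand-written helper, not a library function.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss at quantile q, with 0 < q < 1."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        diff = y - yhat
        # Under-prediction (diff > 0) costs q per unit;
        # over-prediction (diff < 0) costs (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

demand = [100.0, 100.0]
under = pinball_loss(demand, [90.0, 90.0], q=0.9)    # forecast 10 units short
over = pinball_loss(demand, [110.0, 110.0], q=0.9)   # forecast 10 units long

print(under, over)
```

At q = 0.9, being 10 units short costs about nine times as much as being 10 units long, which is the behavior wanted when stockouts are the expensive error.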

Ranking metrics evaluate ordered lists. NDCG@k handles graded relevance for web search and recommenders. MAP fits retrieval with binary relevance and several relevant items per query. MRR fits single-answer tasks like FAQ retrieval where users stop after the first hit.
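The two most commonly confused ranking metrics can be sketched in a few lines. This version assumes the exponential gain form (2^rel - 1); a linear gain is also widely used. It also assumes the ideal ordering is built from the same list of judged results, which is a simplification when relevant items were never retrieved.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances, in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(first_hit_ranks):
    """Mean reciprocal rank. Each entry is the 1-based rank of the first relevant
    result for one query, or None when nothing relevant was returned."""
    return sum(1.0 / r if r else 0.0 for r in first_hit_ranks) / len(first_hit_ranks)

perfect = ndcg_at_k([3, 2, 1, 0], k=4)    # best items first
reversed_ = ndcg_at_k([0, 1, 2, 3], k=4)  # best items last
faq_mrr = mrr([1, 2, None])               # three queries

print(perfect, reversed_, faq_mrr)
```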

Figure 9.1: Confusion matrix anatomy and derived metrics

flowchart TD
    A["Actual: Positive"] --> TP["True Positive (TP)<br/>Predicted: Positive"]
    A --> FN["False Negative (FN)<br/>Predicted: Negative"]
    B["Actual: Negative"] --> FP["False Positive (FP)<br/>Predicted: Positive"]
    B --> TN["True Negative (TN)<br/>Predicted: Negative"]
    TP --> P["Precision = TP / (TP + FP)"]
    FP --> P
    TP --> R["Recall = TP / (TP + FN)"]
    FN --> R
    P --> F1["F1 = 2 P R / (P + R)"]
    R --> F1



2. Validation Strategies

Pre-Read Quiz — Validation

You repeatedly tune hyperparameters on the test set. What actually happens?

0: Nothing — the test set protects itself. 1: It becomes a second validation set; final estimates are inflated. 2: Cross-validation breaks irreversibly. 3: Accuracy automatically converges to true performance.

For time-series data, why is random k-fold inappropriate?

0: It is computationally expensive. 1: It leaks the future into training; the model learns from tomorrow to predict today. 2: It oversamples positives. 3: It cannot be parallelized.

A scaler is fit on train+test combined before splitting. What is this an instance of?

0: A best practice for normalization. 1: Preprocessing leakage. 2: Stratified sampling. 3: Adversarial validation.

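The discipline is to lock the test set away and touch it once, at the end, after all decisions are frozen; tuning against it repeatedly turns it into a second validation set and inflates the final estimate. Other common errors include splitting before deduplication (so copies of one example land on both sides), splitting by row when the unit of generalization is a user or session, and shuffling time series so the model trains on the future.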
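One way to avoid the first two errors is to deduplicate first and then assign whole entities, not rows, to a split. The sketch below hashes the entity ID so the assignment is deterministic and stable across reruns. The `(user_id, event)` row layout, the 20% test fraction, and the helper name `assign_split` are assumptions for illustration.

```python
import hashlib

def assign_split(group_id, test_fraction=0.2):
    """Deterministically map an entity ID to 'train' or 'test'."""
    digest = hashlib.sha256(str(group_id).encode("utf-8")).hexdigest()
    # First 8 hex digits give an integer in [0, 2**32); scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"

# Hypothetical rows: (user_id, event). Note the exact duplicate for u1.
rows = [
    ("u1", "login"), ("u1", "login"), ("u1", "purchase"),
    ("u2", "login"), ("u3", "refund"), ("u3", "login"),
]

# Step 1: deduplicate (order-preserving). Step 2: split by entity, not by row.
unique_rows = list(dict.fromkeys(rows))
train = [r for r in unique_rows if assign_split(r[0]) == "train"]
test = [r for r in unique_rows if assign_split(r[0]) == "test"]

print(len(train), len(test))
```

Because every row of a given user hashes to the same bucket, no user can appear on both sides of the split. With a handful of users the realized test share will be far from 20%; the fraction only holds approximately at scale.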

When data is limited, k-fold cross-validation reuses every example by rotating folds. Stratified k-fold should be the default for imbalanced classification, since it keeps the class ratio roughly constant across folds; group k-fold ensures every row tied to one entity (user, hospital, document) stays in the same fold. For temporal data, use forward-chaining: train on weeks 1-4, validate on week 5; train on weeks 1-5, validate on week 6; and so on. Purged k-fold drops a buffer of observations between the training and validation windows, so labels that resolve slowly cannot overlap both.
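The forward-chaining scheme is simple enough to sketch as a generator. The function name `forward_chaining_splits` and the `gap` parameter (a crude stand-in for a purge buffer) are assumptions for illustration, not part of any particular library.

```python
def forward_chaining_splits(periods, min_train, gap=0):
    """Yield (train, validate) period lists for expanding-window validation.

    periods   -- time-ordered period labels, oldest first
    min_train -- number of periods in the first training window
    gap       -- periods skipped between train and validate (simple purge buffer)
    """
    for end in range(min_train, len(periods) - gap):
        yield periods[:end], [periods[end + gap]]

weeks = [f"W{i}" for i in range(1, 9)]   # W1 .. W8

splits = list(forward_chaining_splits(weeks, min_train=4))
for train, validate in splits:
    print(f"train {train[0]}-{train[-1]}  validate {validate[0]}")

# Same data with a one-week buffer between training and validation.
purged = list(forward_chaining_splits(weeks, min_train=4, gap=1))
```

Every validation period is strictly later than every training period, and the training window only ever grows.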

Mature teams also maintain a golden dataset — a slowly evolving, manually verified evaluation set including edge cases and past regression scenarios — and a shadow set refreshed from recent production traffic. Each answers a different question: do we still handle the cases we explicitly care about, and is the world drifting away from our training distribution?

Leakage is the silent killer. Target-derived features, temporal leakage, entity leakage, and preprocessing-statistic leakage all inflate offline metrics. Detection includes single-feature AUC checks, adversarial validation (train a classifier to distinguish train from test rows), and audits asking "would this value actually be known at decision time in production?"
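A single-feature AUC check can be sketched without any library: treat each raw feature as if it were the model score and flag any feature that separates the classes almost perfectly on its own. The feature names, the toy data, and the 0.95 threshold are all assumptions for illustration. The pairwise loop is quadratic, so this form suits only small samples.

```python
def single_feature_auc(values, labels):
    """AUC of one raw feature used as a score: the probability that a random
    positive row has a higher value than a random negative row (ties count half)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative row")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def suspicious_features(columns, labels, threshold=0.95):
    """Names of features whose standalone AUC is implausibly close to 0 or 1."""
    flagged = []
    for name, values in columns.items():
        auc = single_feature_auc(values, labels)
        if max(auc, 1.0 - auc) >= threshold:
            flagged.append(name)
    return flagged

labels = [0, 0, 0, 1, 1, 1]
columns = {
    "account_age_days": [30, 400, 90, 200, 45, 310],  # ordinary feature
    "chargeback_filed": [0, 0, 0, 1, 1, 1],           # only known after the label
}
flagged = suspicious_features(columns, labels)
print(flagged)
```

A flagged feature is not proof of leakage, only a prompt for the audit question above: would this value be known at decision time?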

Animation A1 — 5-fold cross-validation rotation

The dataset is split into 5 folds. On each iteration, one fold becomes the test set (red) while the others train (blue). The role rotates until every fold has been the test set exactly once.


Figure 9.2: Forward-chaining (expanding-window) time-series cross-validation

flowchart TD
    F1["Fold 1<br/>Train: W1-W4, Validate: W5"]
    F2["Fold 2<br/>Train: W1-W5, Validate: W6"]
    F3["Fold 3<br/>Train: W1-W6, Validate: W7"]
    F4["Fold 4<br/>Train: W1-W7, Validate: W8"]
    F1 --> F2 --> F3 --> F4
    F4 --> AGG["Aggregate per-fold metrics"]

Animation A2 — Time-series forward chaining

The training window expands forward in time while a single test window slides one step ahead. The model is always validated on data strictly later than the training set, mimicking deployment.


Figure 9.3: Four common data leakage paths

flowchart TD
    LABELS["Ground-truth labels"]
    FUTURE["Future observations"]
    USERS["User / entity identity"]
    STATS["Train+test combined statistics"]
    LABELS -->|"Target-derived feature"| TRAIN["Training set"]
    FUTURE -->|"Temporal leakage"| TRAIN
    USERS -->|"Entity leakage"| TRAIN
    STATS -->|"Preprocessing leakage"| TRAIN
    TRAIN --> METRIC["Inflated offline metric"]
    METRIC --> PROD["Production performance collapses"]



3. Slice-Based and Fairness Evaluation

Pre-Read Quiz — Slice and Fairness

A model has 92% global accuracy. Why might that still be a poor production signal?

0: It cannot be — global accuracy is the gold standard. 1: It can hide a slice (e.g., a demographic group) where accuracy is 71%. 2: It is only meaningful if recall is also reported. 3: Accuracy is not defined for production traffic.

Which fairness criterion requires equal true-positive rates across groups?

0: Demographic parity. 1: Equal opportunity. 2: Disparate impact ratio. 3: Predictive parity.

Which Fairlearn object computes per-group metrics and exposes `.difference()` and `.group_min()`?

0: ExponentiatedGradient. 1: MetricFrame. 2: ThresholdOptimizer. 3: GridSearch.

A 92% global accuracy can hide a 71% slice. Slice-based evaluation decomposes performance per subgroup so that worst-case rather than average behavior is the headline. Intersectional slices routinely surface failures invisible to single-attribute analysis.

Fairlearn's MetricFrame computes any metric per group and overall in one call. .by_group shows per-group metrics, .difference() shows max-minus-min, and .group_min() surfaces the worst-case group. Passing a DataFrame as sensitive_features produces intersectional slices.
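To show what those quantities mean, here is a hand-rolled sketch in plain Python. It does not call Fairlearn; it only reproduces the idea of per-group values, the worst group, and the max-minus-min gap. The toy labels and the `region` and `age` attributes are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy per group; `groups` holds one hashable slice key per row."""
    hits, counts = defaultdict(int), defaultdict(int)
    for y, yhat, g in zip(y_true, y_pred, groups):
        counts[g] += 1
        hits[g] += int(y == yhat)
    return {g: hits[g] / counts[g] for g in counts}

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
region = ["north", "north", "north", "south", "south", "south", "south", "south"]
age = ["<40", "<40", "40+", "<40", "<40", "40+", "40+", "40+"]

overall = sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)
by_region = accuracy_by_group(y_true, y_pred, region)

# Intersectional slices: key each row by a tuple of attributes.
by_region_age = accuracy_by_group(y_true, y_pred, list(zip(region, age)))
worst = min(by_region_age.values())
gap = max(by_region_age.values()) - worst

print(overall, by_region, by_region_age)
```

In this toy data the single-attribute view already shows a weaker region, but only the intersectional view reveals that every error falls on one region-and-age combination.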

Aequitas expects a DataFrame with score, label_value, and attribute columns. The standard flow in its classic API: preprocess_input_df to prepare the attribute columns, then Group().get_crosstabs for per-group ppr/tpr/fpr/fnr, then Bias().get_disparity_predefined_groups for disparity ratios versus a reference group, then Fairness().get_group_value_fairness, which flags groups whose disparities violate the 80% rule.
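The disparity-ratio idea behind the 80% rule can be sketched independently of any toolkit. The code below is not the Aequitas API; it is a plain-Python illustration that assumes a symmetric acceptance band of [0.8, 1.25] around the reference group and uses invented predictions.

```python
def selection_rates(y_pred, groups):
    """Fraction of rows predicted positive within each group."""
    totals, selected = {}, {}
    for yhat, g in zip(y_pred, groups):
        totals[g] = totals.get(g, 0) + 1
        selected[g] = selected.get(g, 0) + int(yhat == 1)
    return {g: selected[g] / totals[g] for g in totals}

def disparity_vs_reference(rates, reference, low=0.8, high=1.25):
    """Ratio of each group's rate to the reference group's rate, plus a pass flag."""
    ref_rate = rates[reference]
    if ref_rate == 0:
        raise ValueError("reference group has a zero rate; ratio is undefined")
    return {g: {"ratio": rate / ref_rate,
                "passes": low <= rate / ref_rate <= high}
            for g, rate in rates.items()}

y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a"] * 5 + ["b"] * 5

rates = selection_rates(y_pred, groups)
report = disparity_vs_reference(rates, reference="a")
print(report)
```

Group b is selected at a quarter of group a's rate, well below the 0.8 floor, so it is flagged.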

Three fairness criteria dominate: demographic parity (equal selection rates), equal opportunity (equal TPR), and equalized odds (equal TPR and FPR). When base rates differ, demographic parity and equalized odds cannot both hold except for a classifier whose TPR equals its FPR, which is no better than random guessing, so choosing which criterion to enforce is a policy decision. Mitigation falls into pre-processing (reweight data), in-processing (constrain the loss), and post-processing (per-group thresholds), each with distinct trade-offs. Group-aware decisions at inference may be legally prohibited in domains like lending.
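A small worked example makes the tension between criteria visible. In the sketch below, two invented groups have different base rates (4 of 6 positives versus 2 of 6), and the classifier has identical TPR and FPR in both. The helper names are hypothetical, and the code assumes every group contains at least one positive and one negative row.

```python
def group_rates(y_true, y_pred, groups):
    """Selection rate, TPR, and FPR per group (labels and predictions are 0/1)."""
    stats = {}
    for y, yhat, g in zip(y_true, y_pred, groups):
        s = stats.setdefault(g, {"n": 0, "sel": 0, "pos": 0,
                                 "tp": 0, "neg": 0, "fp": 0})
        s["n"] += 1
        s["sel"] += yhat
        if y == 1:
            s["pos"] += 1
            s["tp"] += yhat
        else:
            s["neg"] += 1
            s["fp"] += yhat
    return {g: {"selection_rate": s["sel"] / s["n"],
                "tpr": s["tp"] / s["pos"],
                "fpr": s["fp"] / s["neg"]} for g, s in stats.items()}

def spread(rates, key):
    """Max-minus-min of one rate across groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

#          group a (base rate 4/6)   group b (base rate 2/6)
y_true = [1, 1, 1, 1, 0, 0,          1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0,          1, 0, 0, 0, 0, 0]
groups = ["a"] * 6 + ["b"] * 6

rates = group_rates(y_true, y_pred, groups)
demographic_parity_gap = spread(rates, "selection_rate")
equal_opportunity_gap = spread(rates, "tpr")
equalized_odds_gap = max(spread(rates, "tpr"), spread(rates, "fpr"))

print(demographic_parity_gap, equal_opportunity_gap, equalized_odds_gap)
```

Equal opportunity and equalized odds are satisfied exactly, yet the selection rates still differ, purely because the base rates do.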

Figure 9.4: Slice-based evaluation workflow

flowchart LR
    PRED["Predictions"]
    ATTR["Sensitive attributes"]
    PRED --> SLICE["Group by slice"]
    ATTR --> SLICE
    SLICE --> METRICS["Per-group metrics"]
    METRICS --> WORST["Worst-case group"]
    METRICS --> DI["Disparate-impact ratio"]
    WORST --> GATE{"Meets thresholds?"}
    DI --> GATE
    GATE -->|Yes| PASS["Pass slice gate"]
    GATE -->|No| MITIGATE["Mitigation"]



4. Behavioral and Robustness Testing

Pre-Read Quiz — Behavioral and Shadow

In CheckList, you assert "The food was delicious" and "The food was deliciuos" must receive the same label. Which test type is this?

0: MFT (Minimum Functionality Test). 1: INV (Invariance Test). 2: DIR (Directional Expectation Test). 3: AUC test.

Adding "very" before "good" must increase positive-class probability. Which test type encodes this?

0: MFT. 1: INV. 2: DIR. 3: Adversarial validation.

What does shadow evaluation NOT do?

0: Run the candidate model on live traffic. 1: Surface latency and input-distribution gaps. 2: Serve the candidate's predictions to users. 3: Compare agreement with the production model.

Aggregate metrics tell you little about how a model handles negation, typos, paraphrases, demographic substitutions, or numerical reasoning. CheckList (Ribeiro et al. 2020) treats a model like software, probing capabilities with unit-test-style assertions on a capability × test-type matrix.

MFTs are atomic correctness tests (negation, intensifiers, fairness across demographics). INVs are metamorphic tests — label-preserving perturbations like typos, synonyms, or name swaps that must not change the prediction. DIRs are directional metamorphic tests — adding "very" must make a positive sentence more positive; the score must move the expected way.
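The three test types reduce to three small assertion shapes. The sketch below does not use the CheckList library; it runs them against a hypothetical lexicon-based scorer (`toy_score`) that stands in for a real model, so the lexicon, weights, and sentences are all assumptions.

```python
POSITIVE = {"good": 1.0, "delicious": 1.5, "great": 1.5}
NEGATIVE = {"bad": -1.0, "awful": -1.5}
INTENSIFIERS = {"very": 1.5}

def toy_score(text):
    """Hypothetical lexicon scorer standing in for a real model; > 0 is positive."""
    score, boost = 0.0, 1.0
    for token in text.lower().replace(".", "").split():
        if token in INTENSIFIERS:
            boost = INTENSIFIERS[token]   # applies to the next sentiment word
            continue
        score += boost * POSITIVE.get(token, NEGATIVE.get(token, 0.0))
        boost = 1.0
    return score

def label(text):
    return "positive" if toy_score(text) > 0 else "negative"

def mft(text, expected):
    """Minimum functionality: a fixed input must get a fixed label."""
    return label(text) == expected

def inv(original, perturbed):
    """Invariance: a label-preserving perturbation must not change the label."""
    return label(original) == label(perturbed)

def dir_increases(original, perturbed):
    """Directional expectation: the score must move up after the edit."""
    return toy_score(perturbed) > toy_score(original)

results = {
    "mft": mft("The food was good", "positive"),
    "inv_typo": inv("The food was delicious", "The food was deliciuos"),
    "dir_very": dir_increases("The food was good", "The food was very good"),
}
print(results)
```

The toy scorer passes the MFT and DIR checks but fails the typo INV, because the misspelled word falls outside its lexicon. That is exactly the kind of failure an aggregate metric would not surface.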

Test suites scale because tests are generated from templates and lexicons, not handwritten. Output is a capability × test-type matrix of pass rates, for example "Negation MFT: 68%, Spelling INV: 55%, Gender-swap INV: 92%." Each failing cell points to a specific remedy, such as targeted data augmentation or a guardrail. Adversarial methods (the TextAttack library for text, PGD for continuous inputs) probe what the team did not think to test; stress tests characterize degradation on noisy or rare inputs.
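Template expansion and pass-rate aggregation can be sketched with the standard library alone. The template, the lexicon, and the deliberately weak `naive_model` are hypothetical; a real suite would plug in the model under test and many more templates.

```python
from itertools import product

def expand(template, lexicon):
    """Fill every {slot} in the template with every combination from the lexicon."""
    slots = sorted(lexicon)
    return [template.format(**dict(zip(slots, combo)))
            for combo in product(*(lexicon[s] for s in slots))]

def pass_rate(cases, check):
    """Share of generated cases for which check(case) is True."""
    outcomes = [bool(check(c)) for c in cases]
    return sum(outcomes) / len(outcomes)

def naive_model(text):
    """Hypothetical stand-in model that ignores negation entirely."""
    return "positive" if any(w in text for w in ("good", "great")) else "negative"

negation_cases = expand(
    "The {noun} was not {adj}.",
    {"noun": ["food", "service", "room"], "adj": ["good", "great"]},
)

# One cell of the capability x test-type matrix: every negated sentence
# should be labeled negative.
matrix = {
    ("Negation", "MFT"): pass_rate(negation_cases,
                                   lambda t: naive_model(t) == "negative"),
}
print(len(negation_cases), matrix)
```

Two short word lists already yield six test cases; adding a slot or a few lexicon entries multiplies coverage without any new handwritten tests.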

Shadow evaluation runs the candidate model on live production traffic alongside the existing model, logging both predictions without affecting users. It surfaces real-world input distribution, real-world latency, and the disagreement pattern between candidate and production — a model that disagrees on 8% of cases for "harmless" reasons may still surprise users.
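Once both models write to the same log, the comparison itself is straightforward. The sketch below assumes a hypothetical log schema (`request_id`, `live_pred`, `shadow_pred`, `live_ms`, `shadow_ms`) and uses a nearest-rank percentile; the records are invented.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def shadow_report(log):
    """Summarize paired live/shadow records from the prediction log."""
    disagreements = [r for r in log if r["live_pred"] != r["shadow_pred"]]
    return {
        "agreement": 1 - len(disagreements) / len(log),
        "live_p95_ms": percentile([r["live_ms"] for r in log], 95),
        "shadow_p95_ms": percentile([r["shadow_ms"] for r in log], 95),
        # Keep the IDs so disagreements can be reviewed case by case.
        "disagreement_ids": [r["request_id"] for r in disagreements],
    }

log = [
    {"request_id": 1, "live_pred": "ok", "shadow_pred": "ok", "live_ms": 12, "shadow_ms": 15},
    {"request_id": 2, "live_pred": "ok", "shadow_pred": "ok", "live_ms": 11, "shadow_ms": 16},
    {"request_id": 3, "live_pred": "fraud", "shadow_pred": "fraud", "live_ms": 14, "shadow_ms": 14},
    {"request_id": 4, "live_pred": "ok", "shadow_pred": "fraud", "live_ms": 13, "shadow_ms": 90},
    {"request_id": 5, "live_pred": "ok", "shadow_pred": "ok", "live_ms": 40, "shadow_ms": 17},
]
report = shadow_report(log)
print(report)
```

Returning the disagreement IDs alongside the rate matters: the aggregate number says how often the models differ, while the individual cases say whether the differences are harmless.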

Figure 9.5: Shadow evaluation architecture

flowchart LR
    USER["Production traffic"] --> ROUTER["Request router"]
    ROUTER --> LIVE["Live model<br/>(serves user)"]
    ROUTER -.->|mirror| SHADOW["Shadow model<br/>(no user impact)"]
    LIVE --> RESP["Response to user"]
    LIVE --> LOG["Prediction log"]
    SHADOW --> LOG
    LOG --> COMPARE["Compare agreement,<br/>latency, drift"]
    COMPARE --> REPORT["Shadow eval report"]

Animation A3 — Shadow evaluation flow

Incoming traffic forks at the router: the production model serves a response back to the user (blue), while the shadow model writes its prediction to a log (gray) without user impact. A comparator surfaces disagreement, latency, and drift.


A mature pre-deployment gate combines all the layers in this chapter into a release checklist: aggregate metrics on the golden dataset, per-slice thresholds, fairness bounds, behavioral suite pass rates per capability (with safety-critical MFTs blocking release), shadow agreement and latency, and a final A/B test correlating the offline proxy with the real business KPI.
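Such a checklist can be encoded as data plus one small function, so the gate is reviewable and versioned like any other config. Every metric name and threshold below is an illustrative assumption, not a recommended value.

```python
def release_gate(report, thresholds):
    """Return (passed, failures) for measured values checked against thresholds.

    Each threshold is ("min", bound) or ("max", bound). A metric missing
    from the report counts as a failure.
    """
    failures = []
    for name, (kind, bound) in thresholds.items():
        value = report.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < bound:
            failures.append(f"{name}: {value} < {bound}")
        elif kind == "max" and value > bound:
            failures.append(f"{name}: {value} > {bound}")
    return (not failures, failures)

THRESHOLDS = {  # illustrative numbers only
    "golden_f1": ("min", 0.85),
    "worst_slice_recall": ("min", 0.75),
    "selection_rate_ratio": ("min", 0.80),
    "safety_mft_pass_rate": ("min", 1.00),   # safety-critical MFTs block release
    "shadow_agreement": ("min", 0.95),
    "shadow_p95_latency_ms": ("max", 120),
}

candidate = {
    "golden_f1": 0.88,
    "worst_slice_recall": 0.79,
    "selection_rate_ratio": 0.84,
    "safety_mft_pass_rate": 0.98,
    "shadow_agreement": 0.96,
    "shadow_p95_latency_ms": 95,
}

passed, failures = release_gate(candidate, THRESHOLDS)
print(passed, failures)
```

The candidate clears every aggregate and fairness check yet is still blocked, because a single safety-critical behavioral suite is below 100%. The final A/B test is not part of this gate; it runs after release to a limited audience.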


