Chapter 3 — Data Validation, Cleaning, and Quality
Learning Objectives
Decompose data quality into five orthogonal dimensions and map each to a downstream ML failure mode.
Choose the right validation framework (TFDV, Great Expectations, Pandera, pydantic) for each layer of an ML pipeline.
Select missing-value, outlier, and label-noise strategies grounded in mechanism and risk.
Distinguish training-serving skew from concept drift and data drift, and apply the right detector (KS, PSI, JS, MMD).
Encode data-quality contracts as code, version them, and gate them with SLAs.
1. Data Quality Dimensions
Pre-Reading Check — Quality Dimensions
1. Which dimension is most directly violated when user_id = 42 has country = "US" in one table and country = "United States" in another?
2. A fraud model is trained on transactions delayed by 48 hours and systematically misses fast-evolving attacks. Which dimension failed?
3. The "1-10-100" rule from data management implies that a defect:
Data quality is not a single binary property. It decomposes into five orthogonal dimensions, each with its own failure signature in ML:
Completeness — fraction of expected values present. A null-rate spike in total_charges indicates upstream pipeline bugs (renamed column, dropped join key, partial backfill).
Accuracy — whether values reflect ground truth. age = -7 and country = "XX" are obvious; wrong timezone or currency unit failures hide for months.
Consistency — same fact represented identically across sources. Dangerous across the train/serve boundary because the model sees mutually incompatible encodings.
Timeliness — lag between event and availability. Stale features produce a fraud model that under-reacts to regime change.
Uniqueness — each entity appears exactly as often as expected. Duplicates inflate certain classes during training and bias loss estimates.
Dimension
Symptom
ML Consequence
Completeness
High null rate
Biased imputation, shifted priors
Accuracy
Out-of-domain values
Garbage-in-garbage-out predictions
Consistency
Two formats for one concept
Training-serving skew, broken one-hot encodings
Timeliness
Stale features
Concept drift, poor reaction to regime change
Uniqueness
Duplicate rows
Inflated metric estimates, leakage
Quality Scoring
Mature platforms persist a per-dimension score and dashboard it next to model metrics. A typical scheme:
Completeness: 1 − mean(null_rate) across critical features
The "1-10-100" rule scales precisely to ML: $1 to prevent at source, $10 to remediate downstream, $100 once it hits a customer. A bad row costs a millisecond at ingestion, hours at training debugging, weeks of business impact when the model misclassifies in production.
Key Points
Quality decomposes into five orthogonal dimensions; treating it as a single binary obscures the tradeoffs ML engineers must reason about.
Each dimension maps one-to-one with a distinct ML failure mode — biased priors, garbage predictions, skew, drift, or inflated metrics.
Compute a per-dimension score per batch and persist it to time-series storage so degradation is measurable, not "felt."
Apply the 1-10-100 cost rule: prevention at source is roughly 100x cheaper than fixing a customer-impacting prediction error.
You can ship missing one dimension only if you compensate explicitly elsewhere — never silently.
Post-Reading Check — Quality Dimensions
1. Which dimension is most directly violated when user_id = 42 has country = "US" in one table and country = "United States" in another?
2. A fraud model is trained on transactions delayed by 48 hours and systematically misses fast-evolving attacks. Which dimension failed?
3. The "1-10-100" rule from data management implies that a defect:
2. Schema and Statistical Validation
Pre-Reading Check — Schema and Statistical Validation
1. Which framework is built on Apache Beam and ships with first-class drift / training-serving skew detection as a native operation?
2. Great Expectations organizes its data contracts as:
3. In a hybrid validation stack, where does pydantic most commonly fit?
Two complementary categories of tools dominate: schema-based validators (enforce types, presence, domains) and statistical anomaly detectors (compare distributions to a baseline). Modern frameworks blend both.
TFDV (TensorFlow Data Validation)
The TFX data-quality component, built on Apache Beam — scales to terabytes on Dataflow, Spark, or Flink. Three-step workflow: generate stats, infer a schema, validate every subsequent batch.
Schemas are protobuf describing types, presence, domains, and nested structure. TFDV's killer feature is built-in training-serving skew and drift detection: hand it two stats artifacts and it emits per-feature comparisons.
Figure 3.1 — Validation pipeline
flowchart LR
A[Raw Batch] --> B[Schema Check types, presence, domains]
B -->|pass| C[Statistical Check ranges, distributions]
B -->|fail| X[Anomaly Report]
C -->|pass| D[Cleaned Dataset]
C -->|fail| X
X --> E[Gate / Alert]
Animation — Validation Pipeline with Reject Path
A raw batch flows through schema check, statistical check, and emerges cleaned. Failed rows divert to the anomaly report.
Tip: hover any stage to focus — the reject branch highlights when stats checks fail.
Great Expectations
GE inverts the philosophy: humans write Expectation Suites — assertions about the data — and every failed expectation is an anomaly.
import great_expectations as ge
import pandas as pd
df = pd.read_csv("train.csv")
ge_df = ge.from_pandas(df)
ge_df.expect_column_values_to_not_be_null("user_id")
ge_df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
ge_df.expect_column_mean_to_be_between("click_count", min_value=0, max_value=10)
result = ge_df.validate()
Suites can be auto-profiled then human-refined — like generating tests via coverage then editing by hand. Data Docs render as browsable HTML reports, doubling as living documentation. First-class Airflow integration via GreatExpectationsOperator.
Pandera and pydantic
Pandera validates Pandas/Polars DataFrames inline; pydantic validates request payloads at FastAPI service boundaries. Both are the last mile.
import pandera as pa
from pandera.typing import Series
class TransactionSchema(pa.DataFrameModel):
user_id: Series[int] = pa.Field(ge=1)
amount: Series[float] = pa.Field(ge=0, le=1_000_000)
country: Series[str] = pa.Field(isin=["US", "CA", "UK", "DE"])
TransactionSchema.validate(df)
Expectations as Code
The unifying idea: version-controlled, PR-reviewed, CI-executed data contracts — the data-quality analog of infrastructure-as-code.
Capability
TFDV
Great Expectations
Pandera
pydantic
Primary artifact
Schema protobuf
Expectation Suite
DataFrameModel
BaseModel
Scale engine
Apache Beam
Pandas / Spark / SQL
Pandas / Polars
Single row
Drift / skew built-in
Yes
Via expectations
No
No
Best for
TFX feature validation
Warehouse contracts
DataFrame steps
API request bodies
The mature hybrid: GE upstream in the warehouse; TFDV downstream for ML-specific skew; pandera/pydantic at service boundaries.
Key Points
TFDV is schema-and-statistics-first; it owns automated drift and training-serving skew at TFX/Beam scale.
Great Expectations is assertion-first; suites are version-controlled, human-readable, and integrate natively with Airflow.
Pandera and pydantic are last-mile validators — DataFrames in services and HTTP payloads at FastAPI boundaries.
Treat expectations as code: PR-reviewed, CI-executed, deployed alongside the pipeline.
Most mature platforms run a hybrid: GE upstream, TFDV downstream, pandera/pydantic at service boundaries.
Post-Reading Check — Schema and Statistical Validation
1. Which framework is built on Apache Beam and ships with first-class drift / training-serving skew detection as a native operation?
2. Great Expectations organizes its data contracts as:
3. In a hybrid validation stack, where does pydantic most commonly fit?
3. Cleaning Strategies
Pre-Reading Check — Cleaning Strategies
1. A feature is missing more often when its underlying value is high (e.g., income missing for high earners). The mechanism is:
2. You auto-delete every outlier flagged by Isolation Forest from a fraud-detection training set. What is the most likely problem?
3. In confident learning (Cleanlab), the predicted probabilities passed to find_label_issues must be:
Validation finds problems; cleaning resolves them. The three perennial issues: missing values, outliers, and label noise.
Missing Values — Mechanism Matters
MCAR — missingness unrelated to data. Dropping rows safe but wasteful.
MAR — depends on other observed variables. Conditional imputation works.
MNAR — depends on the unobserved value itself. Missingness is informative; add a missingness indicator feature.
Figure 3.2 — Imputation decision tree
flowchart TD
A[Missing values detected] --> B{Mechanism?}
B -->|MCAR| C{Missing rate < 5%?}
C -->|Yes| D[Drop rows]
C -->|No| E[Mean/Median impute]
B -->|MAR| F[Conditional impute KNN or MICE]
B -->|MNAR| G[Sentinel + missingness indicator feature]
F --> H{Feature critical?}
G --> H
E --> H
H -->|Yes| I[Model-based imputation]
H -->|No| J[Keep simple imputer]
Imputation strategies in increasing complexity: drop → mean/median/mode → constant+indicator → KNN → MICE → model-based. Non-negotiable rule: fit imputers only on training data, then apply the fitted artifact to validation, test, and production. Fitting on the full dataset leaks information.
DBSCAN — density-based; noise points are unassigned. Sensitive to eps.
LOF — local density vs. neighbors. Good for heterogeneous data with local anomalies.
Treatment: fix, winsorize (cap at 1st/99th percentile), remove (only when confident), down-weight, or use robust models (trees, Huber loss). Critical: in fraud or rare-disease prediction, the "outliers" are the positive class — never auto-delete.
Deduplication and Normalization
Exact dedup is trivial; near-dup (Jaro-Winkler, MinHash, embedding similarity) is harder. Duplicates inflate training-set size, bias the model toward the duplicated examples, and produce optimistic CV scores. Normalization (lowercase, ISO 8601 dates, canonical country codes) is one of the most common sources of train-serve skew.
Label Noise
The most damaging form of corruption — corrupts the learning signal directly. The dominant modern technique is confident learning (Cleanlab):
Train a baseline model and obtain out-of-sample predicted probabilities via 5-fold CV.
Pass labels and probabilities to find_label_issues; it estimates the joint distribution of noisy vs. true labels and flags confident disagreements.
Route flagged examples for review, drop them, or use CleanLearning for noise-aware reweighting.
from cleanlab.filter import find_label_issues
probs = cross_val_predict(base_clf, X, y, cv=5, method="predict_proba")
label_issues = find_label_issues(labels=y, probs=probs)
X_clean, y_clean = X[~label_issues], y[~label_issues]
Complementary: label smoothing, co-teaching, per-example loss tracking. In production, the highest-leverage action is a human-in-the-loop review queue for high-confidence model-vs-label disagreements.
Key Points
Diagnose the missingness mechanism (MCAR / MAR / MNAR) before choosing a treatment; MNAR demands a missingness-indicator feature.
Fit imputers exclusively on training data and ship the serialized pipeline to inference — this eliminates a major class of skew.
Distinguish data-error outliers from rare-but-valid positive class; auto-deletion in fraud or rare-disease tasks deletes the signal.
Label noise caps achievable accuracy and causes memorization — cleaning labels often beats tuning the model.
Confident learning (Cleanlab) requires out-of-sample probabilities from cross-validation, not in-sample fits.
Post-Reading Check — Cleaning Strategies
1. A feature is missing more often when its underlying value is high (e.g., income missing for high earners). The mechanism is:
2. You auto-delete every outlier flagged by Isolation Forest from a fraud-detection training set. What is the most likely problem?
3. In confident learning (Cleanlab), the predicted probabilities passed to find_label_issues must be:
4. Drift, Skew, and Anomalies
Pre-Reading Check — Drift, Skew, and Anomalies
1. A SQL feature query in training and a Python function in serving produce different values for the same row. This is:
2. The de-facto industry-standard binned divergence metric in credit risk, with the heuristic thresholds <0.1 / 0.1–0.25 / ≥0.25, is:
3. Which detector is best suited to multivariate drift on embedding or image-representation vectors?
A clean snapshot still degrades in production because the world moves. Drift detection compares live distributions against a training reference and alarms when divergence matters.
Training-Serving Skew
A systematic difference between training and serving inputs caused by an engineering issue, not natural shift. Examples: SQL vs. Python feature implementations, a categorical encoder fit on training silently mapping unseen serving categories to 0, normalization re-fit per batch. Fixable by engineering: share a single serialized preprocessing pipeline, validate serving against the training schema, continuously compare distributions. Feature stores exist largely to make this easy.
Animation — Training-Serving Skew
Two parallel pipelines transform the same raw source; subtly different encoders produce divergent features and a corrupted prediction.
Tip: arrows reveal sequentially, then the SKEW marker pulses across the boundary — the fix is a shared serialized pipeline.
Concept Drift vs. Data Drift
Data drift (covariate): P(X) changes — new geography, new demographic. Detectable from inputs alone.
Concept drift: P(Y | X) changes — spammers adopt new tactics, the relationship moves. Needs labels (often delayed) or proxies: prediction-distribution shift, performance estimation, confidence shifts.
KS, PSI, JS, and Other Divergence Measures
Animation — Drift Detection
A reference histogram (blue) is overlaid by a shifted live histogram (orange); the KS statistic rises and an alert fires red.
Tip: PSI ≥ 0.25 = significant drift; combine with a significance test (KS p < 0.05) to gate alerts.
Method
Type
Best For
Threshold
KS
Non-parametric test
Univariate continuous
p < 0.05; D > 0.1–0.2
PSI
Binned divergence
Numeric (binned), categorical
<0.1 / 0.1–0.25 / ≥0.25
JS
Symmetric divergence
Categorical / predictions
~0 / 0.05–0.1 / >0.1–0.2
KL
Asymmetric divergence
Binned features
Calibrate vs. baseline
χ²
Parametric test
Categorical features
p < 0.05 + effect size
Wasserstein
Distance metric
Univariate numeric
Scale-dependent
MMD
Kernel two-sample test
Multivariate, embeddings, images
Permutation, α=0.05
PSI = Σ (qᵢ − pᵢ) · ln(qᵢ / pᵢ) — the de-facto credit-risk standard. JS = ½KL(P||M) + ½KL(Q||M), bounded and finite, safer than raw KL for dashboards. MMD is kernel-based in an RKHS — the heavy artillery for multivariate / embedding drift, as in Alibi Detect's MMDDrift.
Two practical heuristics:
Statistical vs. practical significance: p-values answer "real?", effect sizes (PSI, Wasserstein) answer "big enough?" — use both.
Calibrate thresholds on healthy data: compute each metric over historical no-drift windows; alert at the 95th or 99th percentile.
Alerting and SLAs
Drift detection without alerting is theater. A production monitor should:
Run continuously, latest window vs. fixed training reference.
Compute multiple metrics per feature (at least PSI + one significance test).
Aggregate into a drift share across features.
Route by severity: warn on moderate, page on significant.
Tie alerts to operational playbooks: retrain, fallback, kill-switch.
Figure 3.4 — Drift detection and SLA monitoring loop
flowchart TD
REF[Reference Distribution training baseline] --> CMP[Drift Tests PSI, KS, JS, MMD]
LIVE[Live Distribution rolling window] --> CMP
CMP --> AGG[Drift Share aggregate per-feature signals]
AGG --> SEV{Severity?}
SEV -->|None| LOG[Log metrics]
SEV -->|Moderate| WARN[Warn + ticket investigate]
SEV -->|Significant| PAGE[Page on-call fallback / retrain / kill-switch]
LOG --> LIVE
WARN --> LIVE
PAGE --> LIVE
A data-quality SLA template:
Dimension
Metric
SLA
Missingness
% null per critical feature
< 1%
Schema validity
% rows failing schema checks
< 0.1%
Outlier rate
% rows flagged by Isolation Forest
< 2%
Feature drift
PSI vs. training reference
< 0.25 (any critical feature)
Prediction drift
JS divergence on score distribution
< 0.1
Freshness
Lag from real-world event
< 15 min (stream) / < 24 h (batch)
The point is that the response is automatic. Humans should be informed, not interrupted, when the system behaves as designed.
Key Points
Skew (engineering bug, fixable) is fundamentally different from drift (world changing) — the right fix is a shared serialized preprocessing pipeline.
Data drift = P(X) change (detect from inputs); concept drift = P(Y|X) change (needs labels or proxies like prediction-distribution shifts).
Combine a significance test (KS, χ², MMD) with an effect-size metric (PSI, Wasserstein, JS) — significance answers "real?", effect size answers "big enough?".
PSI is the credit-risk standard (<0.1 / 0.1–0.25 / ≥0.25); JS is safer than raw KL for dashboards; MMD is the multivariate / embedding tool.
Calibrate thresholds on historical healthy windows; codify response as data-quality SLAs with automatic, severity-routed playbooks.
Post-Reading Check — Drift, Skew, and Anomalies
1. A SQL feature query in training and a Python function in serving produce different values for the same row. This is:
2. The de-facto industry-standard binned divergence metric in credit risk, with the heuristic thresholds <0.1 / 0.1–0.25 / ≥0.25, is:
3. Which detector is best suited to multivariate drift on embedding or image-representation vectors?