Chapter 3 — Data Validation, Cleaning, and Quality

Learning Objectives

Decompose data quality into five orthogonal dimensions and map each to a downstream ML failure mode.
Choose the right validation framework (TFDV, Great Expectations, Pandera, pydantic) for each layer of an ML pipeline.
Select missing-value, outlier, and label-noise strategies grounded in mechanism and risk.
Distinguish training-serving skew from concept drift and data drift, and apply the right detector (KS, PSI, JS, MMD).
Encode data-quality contracts as code, version them, and gate them with SLAs.

1. Data Quality Dimensions

Pre-Reading Check — Quality Dimensions

1. Which dimension is most directly violated when user_id = 42 has country = "US" in one table and country = "United States" in another?

Completeness Consistency Timeliness Uniqueness

2. A fraud model is trained on transactions delayed by 48 hours and systematically misses fast-evolving attacks. Which dimension failed?

Accuracy Uniqueness Timeliness Completeness

3. The "1-10-100" rule from data management implies that a defect:

Costs the same wherever you catch it Gets cheaper to fix as it moves downstream Costs roughly 100x more once it reaches a customer than at source Cannot be quantified financially

Data quality is not a single binary property. It decomposes into five orthogonal dimensions, each with its own failure signature in ML:

Completeness — fraction of expected values present. A null-rate spike in total_charges indicates upstream pipeline bugs (renamed column, dropped join key, partial backfill).
Accuracy — whether values reflect ground truth. age = -7 and country = "XX" are obvious; wrong timezone or currency unit failures hide for months.
Consistency — same fact represented identically across sources. Dangerous across the train/serve boundary because the model sees mutually incompatible encodings.
Timeliness — lag between event and availability. Stale features produce a fraud model that under-reacts to regime change.
Uniqueness — each entity appears exactly as often as expected. Duplicates inflate certain classes during training and bias loss estimates.

Dimension	Symptom	ML Consequence
Completeness	High null rate	Biased imputation, shifted priors
Accuracy	Out-of-domain values	Garbage-in-garbage-out predictions
Consistency	Two formats for one concept	Training-serving skew, broken one-hot encodings
Timeliness	Stale features	Concept drift, poor reaction to regime change
Uniqueness	Duplicate rows	Inflated metric estimates, leakage

Quality Scoring

Mature platforms persist a per-dimension score and dashboard it next to model metrics. A typical scheme:

Completeness: 1 − mean(null_rate) across critical features
Accuracy: 1 − fraction failing range/domain checks
Timeliness: 1 − min(1, lag / SLA_lag)
Uniqueness: 1 − duplicate_rate

Cost of Poor Quality

The "1-10-100" rule scales precisely to ML: $1 to prevent at source, $10 to remediate downstream, $100 once it hits a customer. A bad row costs a millisecond at ingestion, hours at training debugging, weeks of business impact when the model misclassifies in production.

Key Points

Quality decomposes into five orthogonal dimensions; treating it as a single binary obscures the tradeoffs ML engineers must reason about.
Each dimension maps one-to-one with a distinct ML failure mode — biased priors, garbage predictions, skew, drift, or inflated metrics.
Compute a per-dimension score per batch and persist it to time-series storage so degradation is measurable, not "felt."
Apply the 1-10-100 cost rule: prevention at source is roughly 100x cheaper than fixing a customer-impacting prediction error.
You can ship missing one dimension only if you compensate explicitly elsewhere — never silently.

Post-Reading Check — Quality Dimensions

1. Which dimension is most directly violated when user_id = 42 has country = "US" in one table and country = "United States" in another?

Completeness Consistency Timeliness Uniqueness

2. A fraud model is trained on transactions delayed by 48 hours and systematically misses fast-evolving attacks. Which dimension failed?

Accuracy Uniqueness Timeliness Completeness

3. The "1-10-100" rule from data management implies that a defect:

Costs the same wherever you catch it Gets cheaper to fix as it moves downstream Costs roughly 100x more once it reaches a customer than at source Cannot be quantified financially

2. Schema and Statistical Validation

Pre-Reading Check — Schema and Statistical Validation

1. Which framework is built on Apache Beam and ships with first-class drift / training-serving skew detection as a native operation?

Great Expectations pydantic TFDV pandera

2. Great Expectations organizes its data contracts as:

Protobuf schemas Expectation Suites of human-readable assertions Single-row pydantic models Inline Mermaid diagrams

3. In a hybrid validation stack, where does pydantic most commonly fit?

Warehouse-side business-rule enforcement TFX feature skew detection at scale Request-payload validation at the inference service boundary DataFrame schema validation for offline training jobs

Two complementary categories of tools dominate: schema-based validators (enforce types, presence, domains) and statistical anomaly detectors (compare distributions to a baseline). Modern frameworks blend both.

TFDV (TensorFlow Data Validation)

The TFX data-quality component, built on Apache Beam — scales to terabytes on Dataflow, Spark, or Flink. Three-step workflow: generate stats, infer a schema, validate every subsequent batch.

import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_csv('train.csv')
schema = tfdv.infer_schema(stats)
eval_stats = tfdv.generate_statistics_from_csv('batch.csv')
anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(anomalies)

Schemas are protobuf describing types, presence, domains, and nested structure. TFDV's killer feature is built-in training-serving skew and drift detection: hand it two stats artifacts and it emits per-feature comparisons.

Figure 3.1 — Validation pipeline

Animation — Validation Pipeline with Reject Path

A raw batch flows through schema check, statistical check, and emerges cleaned. Failed rows divert to the anomaly report.

Tip: hover any stage to focus — the reject branch highlights when stats checks fail.

Great Expectations

GE inverts the philosophy: humans write Expectation Suites — assertions about the data — and every failed expectation is an anomaly.

import great_expectations as ge
import pandas as pd

df = pd.read_csv("train.csv")
ge_df = ge.from_pandas(df)
ge_df.expect_column_values_to_not_be_null("user_id")
ge_df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
ge_df.expect_column_mean_to_be_between("click_count", min_value=0, max_value=10)
result = ge_df.validate()

Suites can be auto-profiled then human-refined — like generating tests via coverage then editing by hand. Data Docs render as browsable HTML reports, doubling as living documentation. First-class Airflow integration via GreatExpectationsOperator.

Pandera and pydantic

Pandera validates Pandas/Polars DataFrames inline; pydantic validates request payloads at FastAPI service boundaries. Both are the last mile.

import pandera as pa
from pandera.typing import Series

class TransactionSchema(pa.DataFrameModel):
    user_id: Series[int] = pa.Field(ge=1)
    amount: Series[float] = pa.Field(ge=0, le=1_000_000)
    country: Series[str] = pa.Field(isin=["US", "CA", "UK", "DE"])

TransactionSchema.validate(df)

Expectations as Code

The unifying idea: version-controlled, PR-reviewed, CI-executed data contracts — the data-quality analog of infrastructure-as-code.

Capability	TFDV	Great Expectations	Pandera	pydantic
Primary artifact	Schema protobuf	Expectation Suite	DataFrameModel	BaseModel
Scale engine	Apache Beam	Pandas / Spark / SQL	Pandas / Polars	Single row
Drift / skew built-in	Yes	Via expectations	No	No
Best for	TFX feature validation	Warehouse contracts	DataFrame steps	API request bodies

The mature hybrid: GE upstream in the warehouse; TFDV downstream for ML-specific skew; pandera/pydantic at service boundaries.

Key Points

TFDV is schema-and-statistics-first; it owns automated drift and training-serving skew at TFX/Beam scale.
Great Expectations is assertion-first; suites are version-controlled, human-readable, and integrate natively with Airflow.
Pandera and pydantic are last-mile validators — DataFrames in services and HTTP payloads at FastAPI boundaries.
Treat expectations as code: PR-reviewed, CI-executed, deployed alongside the pipeline.
Most mature platforms run a hybrid: GE upstream, TFDV downstream, pandera/pydantic at service boundaries.

Post-Reading Check — Schema and Statistical Validation

1. Which framework is built on Apache Beam and ships with first-class drift / training-serving skew detection as a native operation?

Great Expectations pydantic TFDV pandera

2. Great Expectations organizes its data contracts as:

Protobuf schemas Expectation Suites of human-readable assertions Single-row pydantic models Inline Mermaid diagrams

3. In a hybrid validation stack, where does pydantic most commonly fit?

Warehouse-side business-rule enforcement TFX feature skew detection at scale Request-payload validation at the inference service boundary DataFrame schema validation for offline training jobs

3. Cleaning Strategies

Pre-Reading Check — Cleaning Strategies

1. A feature is missing more often when its underlying value is high (e.g., income missing for high earners). The mechanism is:

MCAR — Missing Completely At Random MAR — Missing At Random MNAR — Missing Not At Random Stationary

2. You auto-delete every outlier flagged by Isolation Forest from a fraud-detection training set. What is the most likely problem?

The model will train faster but with no accuracy loss You may have deleted most of the positive (fraud) class Isolation Forest cannot flag fraudulent rows Contamination must be set to 1.0 for any flagging to occur

3. In confident learning (Cleanlab), the predicted probabilities passed to find_label_issues must be:

In-sample predictions from the same fit Out-of-sample probabilities, typically from k-fold cross-validation From a separate ensemble of unsupervised models From the test set after final training

Validation finds problems; cleaning resolves them. The three perennial issues: missing values, outliers, and label noise.

Missing Values — Mechanism Matters

MCAR — missingness unrelated to data. Dropping rows safe but wasteful.
MAR — depends on other observed variables. Conditional imputation works.
MNAR — depends on the unobserved value itself. Missingness is informative; add a missingness indicator feature.

Figure 3.2 — Imputation decision tree

flowchart TD A[Missing values detected] --> B{Mechanism?} B -->|MCAR| C{Missing rate < 5%?} C -->|Yes| D[Drop rows] C -->|No| E[Mean/Median impute] B -->|MAR| F[Conditional impute
KNN or MICE] B -->|MNAR| G[Sentinel + missingness
indicator feature] F --> H{Feature critical?} G --> H E --> H H -->|Yes| I[Model-based imputation] H -->|No| J[Keep simple imputer]

Imputation strategies in increasing complexity: drop → mean/median/mode → constant+indicator → KNN → MICE → model-based. Non-negotiable rule: fit imputers only on training data, then apply the fitted artifact to validation, test, and production. Fitting on the full dataset leaks information.

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("scaler", StandardScaler())
])
preprocess = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
clf = Pipeline([("preprocess", preprocess), ("model", RandomForestClassifier(...))])
clf.fit(X_train, y_train)

The serialized pipeline applies identical imputation to both train and inference — eliminating an entire class of training-serving skew.

Outliers

Three flavors: data errors, rare-but-valid cases, distribution shifts. They require different responses.

IQR rule — outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. Univariate, robust.
Z-score — |z| > 3. Affected by the outliers it measures; prefer median + MAD.
Isolation Forest — randomly partitions; scales well; key knob is contamination.
DBSCAN — density-based; noise points are unassigned. Sensitive to eps.
LOF — local density vs. neighbors. Good for heterogeneous data with local anomalies.

Treatment: fix, winsorize (cap at 1st/99th percentile), remove (only when confident), down-weight, or use robust models (trees, Huber loss). Critical: in fraud or rare-disease prediction, the "outliers" are the positive class — never auto-delete.

Deduplication and Normalization

Exact dedup is trivial; near-dup (Jaro-Winkler, MinHash, embedding similarity) is harder. Duplicates inflate training-set size, bias the model toward the duplicated examples, and produce optimistic CV scores. Normalization (lowercase, ISO 8601 dates, canonical country codes) is one of the most common sources of train-serve skew.

Label Noise

The most damaging form of corruption — corrupts the learning signal directly. The dominant modern technique is confident learning (Cleanlab):

Train a baseline model and obtain out-of-sample predicted probabilities via 5-fold CV.
Pass labels and probabilities to find_label_issues; it estimates the joint distribution of noisy vs. true labels and flags confident disagreements.
Route flagged examples for review, drop them, or use CleanLearning for noise-aware reweighting.

from cleanlab.filter import find_label_issues
probs = cross_val_predict(base_clf, X, y, cv=5, method="predict_proba")
label_issues = find_label_issues(labels=y, probs=probs)
X_clean, y_clean = X[~label_issues], y[~label_issues]

Complementary: label smoothing, co-teaching, per-example loss tracking. In production, the highest-leverage action is a human-in-the-loop review queue for high-confidence model-vs-label disagreements.

Key Points

Diagnose the missingness mechanism (MCAR / MAR / MNAR) before choosing a treatment; MNAR demands a missingness-indicator feature.
Fit imputers exclusively on training data and ship the serialized pipeline to inference — this eliminates a major class of skew.
Distinguish data-error outliers from rare-but-valid positive class; auto-deletion in fraud or rare-disease tasks deletes the signal.
Label noise caps achievable accuracy and causes memorization — cleaning labels often beats tuning the model.
Confident learning (Cleanlab) requires out-of-sample probabilities from cross-validation, not in-sample fits.

Post-Reading Check — Cleaning Strategies

1. A feature is missing more often when its underlying value is high (e.g., income missing for high earners). The mechanism is:

MCAR — Missing Completely At Random MAR — Missing At Random MNAR — Missing Not At Random Stationary

2. You auto-delete every outlier flagged by Isolation Forest from a fraud-detection training set. What is the most likely problem?

3. In confident learning (Cleanlab), the predicted probabilities passed to find_label_issues must be:

In-sample predictions from the same fit Out-of-sample probabilities, typically from k-fold cross-validation From a separate ensemble of unsupervised models From the test set after final training

4. Drift, Skew, and Anomalies

Pre-Reading Check — Drift, Skew, and Anomalies

1. A SQL feature query in training and a Python function in serving produce different values for the same row. This is:

Concept drift — P(Y|X) changed Data drift — P(X) changed Training-serving skew — an engineering bug Label noise

2. The de-facto industry-standard binned divergence metric in credit risk, with the heuristic thresholds <0.1 / 0.1–0.25 / ≥0.25, is:

KS test statistic PSI (Population Stability Index) MMD Chi-squared

3. Which detector is best suited to multivariate drift on embedding or image-representation vectors?

Univariate KS test Chi-squared on per-column counts Maximum Mean Discrepancy (MMD) in an RKHS Z-score on each feature

A clean snapshot still degrades in production because the world moves. Drift detection compares live distributions against a training reference and alarms when divergence matters.

Training-Serving Skew

A systematic difference between training and serving inputs caused by an engineering issue, not natural shift. Examples: SQL vs. Python feature implementations, a categorical encoder fit on training silently mapping unseen serving categories to 0, normalization re-fit per batch. Fixable by engineering: share a single serialized preprocessing pipeline, validate serving against the training schema, continuously compare distributions. Feature stores exist largely to make this easy.

Animation — Training-Serving Skew

Two parallel pipelines transform the same raw source; subtly different encoders produce divergent features and a corrupted prediction.

Tip: arrows reveal sequentially, then the SKEW marker pulses across the boundary — the fix is a shared serialized pipeline.

Concept Drift vs. Data Drift

Data drift (covariate): P(X) changes — new geography, new demographic. Detectable from inputs alone.
Concept drift: P(Y | X) changes — spammers adopt new tactics, the relationship moves. Needs labels (often delayed) or proxies: prediction-distribution shift, performance estimation, confidence shifts.

KS, PSI, JS, and Other Divergence Measures

Animation — Drift Detection

A reference histogram (blue) is overlaid by a shifted live histogram (orange); the KS statistic rises and an alert fires red.

Tip: PSI ≥ 0.25 = significant drift; combine with a significance test (KS p < 0.05) to gate alerts.

Method	Type	Best For	Threshold
KS	Non-parametric test	Univariate continuous	p < 0.05; D > 0.1–0.2
PSI	Binned divergence	Numeric (binned), categorical	<0.1 / 0.1–0.25 / ≥0.25
JS	Symmetric divergence	Categorical / predictions	~0 / 0.05–0.1 / >0.1–0.2
KL	Asymmetric divergence	Binned features	Calibrate vs. baseline
χ²	Parametric test	Categorical features	p < 0.05 + effect size
Wasserstein	Distance metric	Univariate numeric	Scale-dependent
MMD	Kernel two-sample test	Multivariate, embeddings, images	Permutation, α=0.05

PSI = Σ (qᵢ − pᵢ) · ln(qᵢ / pᵢ) — the de-facto credit-risk standard. JS = ½KL(P||M) + ½KL(Q||M), bounded and finite, safer than raw KL for dashboards. MMD is kernel-based in an RKHS — the heavy artillery for multivariate / embedding drift, as in Alibi Detect's MMDDrift.

Two practical heuristics:

Statistical vs. practical significance: p-values answer "real?", effect sizes (PSI, Wasserstein) answer "big enough?" — use both.
Calibrate thresholds on healthy data: compute each metric over historical no-drift windows; alert at the 95th or 99th percentile.

Alerting and SLAs

Drift detection without alerting is theater. A production monitor should:

Run continuously, latest window vs. fixed training reference.
Compute multiple metrics per feature (at least PSI + one significance test).
Aggregate into a drift share across features.
Route by severity: warn on moderate, page on significant.
Tie alerts to operational playbooks: retrain, fallback, kill-switch.

Figure 3.4 — Drift detection and SLA monitoring loop

flowchart TD REF[Reference Distribution
training baseline] --> CMP[Drift Tests
PSI, KS, JS, MMD] LIVE[Live Distribution
rolling window] --> CMP CMP --> AGG[Drift Share
aggregate per-feature signals] AGG --> SEV{Severity?} SEV -->|None| LOG[Log metrics] SEV -->|Moderate| WARN[Warn + ticket
investigate] SEV -->|Significant| PAGE[Page on-call
fallback / retrain / kill-switch] LOG --> LIVE WARN --> LIVE PAGE --> LIVE

A data-quality SLA template:

Dimension	Metric	SLA
Missingness	% null per critical feature	< 1%
Schema validity	% rows failing schema checks	< 0.1%
Outlier rate	% rows flagged by Isolation Forest	< 2%
Feature drift	PSI vs. training reference	< 0.25 (any critical feature)
Prediction drift	JS divergence on score distribution	< 0.1
Freshness	Lag from real-world event	< 15 min (stream) / < 24 h (batch)

The point is that the response is automatic. Humans should be informed, not interrupted, when the system behaves as designed.

Key Points

Skew (engineering bug, fixable) is fundamentally different from drift (world changing) — the right fix is a shared serialized preprocessing pipeline.
Data drift = P(X) change (detect from inputs); concept drift = P(Y|X) change (needs labels or proxies like prediction-distribution shifts).
Combine a significance test (KS, χ², MMD) with an effect-size metric (PSI, Wasserstein, JS) — significance answers "real?", effect size answers "big enough?".
PSI is the credit-risk standard (<0.1 / 0.1–0.25 / ≥0.25); JS is safer than raw KL for dashboards; MMD is the multivariate / embedding tool.
Calibrate thresholds on historical healthy windows; codify response as data-quality SLAs with automatic, severity-routed playbooks.

Post-Reading Check — Drift, Skew, and Anomalies

1. A SQL feature query in training and a Python function in serving produce different values for the same row. This is:

Concept drift — P(Y|X) changed Data drift — P(X) changed Training-serving skew — an engineering bug Label noise

2. The de-facto industry-standard binned divergence metric in credit risk, with the heuristic thresholds <0.1 / 0.1–0.25 / ≥0.25, is:

KS test statistic PSI (Population Stability Index) MMD Chi-squared

3. Which detector is best suited to multivariate drift on embedding or image-representation vectors?

Univariate KS test Chi-squared on per-column counts Maximum Mean Discrepancy (MMD) in an RKHS Z-score on each feature

Chapter 3 — Data Validation, Cleaning, and Quality

Learning Objectives

1. Data Quality Dimensions

Quality Scoring

Cost of Poor Quality

Key Points

2. Schema and Statistical Validation

TFDV (TensorFlow Data Validation)

Figure 3.1 — Validation pipeline

Animation — Validation Pipeline with Reject Path

Great Expectations

Pandera and pydantic

Expectations as Code

Key Points

3. Cleaning Strategies

Missing Values — Mechanism Matters

Figure 3.2 — Imputation decision tree

Outliers

Deduplication and Normalization

Label Noise

Key Points

4. Drift, Skew, and Anomalies

Training-Serving Skew

Animation — Training-Serving Skew

Concept Drift vs. Data Drift

KS, PSI, JS, and Other Divergence Measures

Animation — Drift Detection

Alerting and SLAs

Figure 3.4 — Drift detection and SLA monitoring loop

Key Points

Your Progress

Answer Explanations