Chapter 3 — Data Validation, Cleaning, and Quality

Learning Objectives

1. Data Quality Dimensions

Pre-Reading Check — Quality Dimensions

1. Which dimension is most directly violated when user_id = 42 has country = "US" in one table and country = "United States" in another?

2. A fraud model is trained on transactions delayed by 48 hours and systematically misses fast-evolving attacks. Which dimension failed?

3. The "1-10-100" rule from data management implies that a defect:

Data quality is not a single binary property. It decomposes into five orthogonal dimensions, each with its own failure signature in ML:

DimensionSymptomML Consequence
CompletenessHigh null rateBiased imputation, shifted priors
AccuracyOut-of-domain valuesGarbage-in-garbage-out predictions
ConsistencyTwo formats for one conceptTraining-serving skew, broken one-hot encodings
TimelinessStale featuresConcept drift, poor reaction to regime change
UniquenessDuplicate rowsInflated metric estimates, leakage

Quality Scoring

Mature platforms persist a per-dimension score and dashboard it next to model metrics. A typical scheme:

Cost of Poor Quality

The "1-10-100" rule scales precisely to ML: $1 to prevent at source, $10 to remediate downstream, $100 once it hits a customer. A bad row costs a millisecond at ingestion, hours at training debugging, weeks of business impact when the model misclassifies in production.

Key Points

Post-Reading Check — Quality Dimensions

1. Which dimension is most directly violated when user_id = 42 has country = "US" in one table and country = "United States" in another?

2. A fraud model is trained on transactions delayed by 48 hours and systematically misses fast-evolving attacks. Which dimension failed?

3. The "1-10-100" rule from data management implies that a defect:

2. Schema and Statistical Validation

Pre-Reading Check — Schema and Statistical Validation

1. Which framework is built on Apache Beam and ships with first-class drift / training-serving skew detection as a native operation?

2. Great Expectations organizes its data contracts as:

3. In a hybrid validation stack, where does pydantic most commonly fit?

Two complementary categories of tools dominate: schema-based validators (enforce types, presence, domains) and statistical anomaly detectors (compare distributions to a baseline). Modern frameworks blend both.

TFDV (TensorFlow Data Validation)

The TFX data-quality component, built on Apache Beam — scales to terabytes on Dataflow, Spark, or Flink. Three-step workflow: generate stats, infer a schema, validate every subsequent batch.

import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_csv('train.csv')
schema = tfdv.infer_schema(stats)
eval_stats = tfdv.generate_statistics_from_csv('batch.csv')
anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(anomalies)

Schemas are protobuf describing types, presence, domains, and nested structure. TFDV's killer feature is built-in training-serving skew and drift detection: hand it two stats artifacts and it emits per-feature comparisons.

Figure 3.1 — Validation pipeline

flowchart LR A[Raw Batch] --> B[Schema Check
types, presence, domains] B -->|pass| C[Statistical Check
ranges, distributions] B -->|fail| X[Anomaly Report] C -->|pass| D[Cleaned Dataset] C -->|fail| X X --> E[Gate / Alert]

Animation — Validation Pipeline with Reject Path

A raw batch flows through schema check, statistical check, and emerges cleaned. Failed rows divert to the anomaly report.
Raw Batch CSV / Parquet Schema Check types, domains Stat Check PSI, KS, ranges Cleaned model-ready Anomaly Report raw → schema → stats → cleaned
Tip: hover any stage to focus — the reject branch highlights when stats checks fail.

Great Expectations

GE inverts the philosophy: humans write Expectation Suites — assertions about the data — and every failed expectation is an anomaly.

import great_expectations as ge
import pandas as pd

df = pd.read_csv("train.csv")
ge_df = ge.from_pandas(df)
ge_df.expect_column_values_to_not_be_null("user_id")
ge_df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
ge_df.expect_column_mean_to_be_between("click_count", min_value=0, max_value=10)
result = ge_df.validate()

Suites can be auto-profiled then human-refined — like generating tests via coverage then editing by hand. Data Docs render as browsable HTML reports, doubling as living documentation. First-class Airflow integration via GreatExpectationsOperator.

Pandera and pydantic

Pandera validates Pandas/Polars DataFrames inline; pydantic validates request payloads at FastAPI service boundaries. Both are the last mile.

import pandera as pa
from pandera.typing import Series

class TransactionSchema(pa.DataFrameModel):
    user_id: Series[int] = pa.Field(ge=1)
    amount: Series[float] = pa.Field(ge=0, le=1_000_000)
    country: Series[str] = pa.Field(isin=["US", "CA", "UK", "DE"])

TransactionSchema.validate(df)

Expectations as Code

The unifying idea: version-controlled, PR-reviewed, CI-executed data contracts — the data-quality analog of infrastructure-as-code.

CapabilityTFDVGreat ExpectationsPanderapydantic
Primary artifactSchema protobufExpectation SuiteDataFrameModelBaseModel
Scale engineApache BeamPandas / Spark / SQLPandas / PolarsSingle row
Drift / skew built-inYesVia expectationsNoNo
Best forTFX feature validationWarehouse contractsDataFrame stepsAPI request bodies

The mature hybrid: GE upstream in the warehouse; TFDV downstream for ML-specific skew; pandera/pydantic at service boundaries.

Key Points

Post-Reading Check — Schema and Statistical Validation

1. Which framework is built on Apache Beam and ships with first-class drift / training-serving skew detection as a native operation?

2. Great Expectations organizes its data contracts as:

3. In a hybrid validation stack, where does pydantic most commonly fit?

3. Cleaning Strategies

Pre-Reading Check — Cleaning Strategies

1. A feature is missing more often when its underlying value is high (e.g., income missing for high earners). The mechanism is:

2. You auto-delete every outlier flagged by Isolation Forest from a fraud-detection training set. What is the most likely problem?

3. In confident learning (Cleanlab), the predicted probabilities passed to find_label_issues must be:

Validation finds problems; cleaning resolves them. The three perennial issues: missing values, outliers, and label noise.

Missing Values — Mechanism Matters

Figure 3.2 — Imputation decision tree

flowchart TD A[Missing values detected] --> B{Mechanism?} B -->|MCAR| C{Missing rate < 5%?} C -->|Yes| D[Drop rows] C -->|No| E[Mean/Median impute] B -->|MAR| F[Conditional impute
KNN or MICE] B -->|MNAR| G[Sentinel + missingness
indicator feature] F --> H{Feature critical?} G --> H E --> H H -->|Yes| I[Model-based imputation] H -->|No| J[Keep simple imputer]

Imputation strategies in increasing complexity: drop → mean/median/mode → constant+indicator → KNN → MICE → model-based. Non-negotiable rule: fit imputers only on training data, then apply the fitted artifact to validation, test, and production. Fitting on the full dataset leaks information.

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("scaler", StandardScaler())
])
preprocess = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
clf = Pipeline([("preprocess", preprocess), ("model", RandomForestClassifier(...))])
clf.fit(X_train, y_train)

The serialized pipeline applies identical imputation to both train and inference — eliminating an entire class of training-serving skew.

Outliers

Three flavors: data errors, rare-but-valid cases, distribution shifts. They require different responses.

Treatment: fix, winsorize (cap at 1st/99th percentile), remove (only when confident), down-weight, or use robust models (trees, Huber loss). Critical: in fraud or rare-disease prediction, the "outliers" are the positive class — never auto-delete.

Deduplication and Normalization

Exact dedup is trivial; near-dup (Jaro-Winkler, MinHash, embedding similarity) is harder. Duplicates inflate training-set size, bias the model toward the duplicated examples, and produce optimistic CV scores. Normalization (lowercase, ISO 8601 dates, canonical country codes) is one of the most common sources of train-serve skew.

Label Noise

The most damaging form of corruption — corrupts the learning signal directly. The dominant modern technique is confident learning (Cleanlab):

  1. Train a baseline model and obtain out-of-sample predicted probabilities via 5-fold CV.
  2. Pass labels and probabilities to find_label_issues; it estimates the joint distribution of noisy vs. true labels and flags confident disagreements.
  3. Route flagged examples for review, drop them, or use CleanLearning for noise-aware reweighting.
from cleanlab.filter import find_label_issues
probs = cross_val_predict(base_clf, X, y, cv=5, method="predict_proba")
label_issues = find_label_issues(labels=y, probs=probs)
X_clean, y_clean = X[~label_issues], y[~label_issues]

Complementary: label smoothing, co-teaching, per-example loss tracking. In production, the highest-leverage action is a human-in-the-loop review queue for high-confidence model-vs-label disagreements.

Key Points

Post-Reading Check — Cleaning Strategies

1. A feature is missing more often when its underlying value is high (e.g., income missing for high earners). The mechanism is:

2. You auto-delete every outlier flagged by Isolation Forest from a fraud-detection training set. What is the most likely problem?

3. In confident learning (Cleanlab), the predicted probabilities passed to find_label_issues must be:

4. Drift, Skew, and Anomalies

Pre-Reading Check — Drift, Skew, and Anomalies

1. A SQL feature query in training and a Python function in serving produce different values for the same row. This is:

2. The de-facto industry-standard binned divergence metric in credit risk, with the heuristic thresholds <0.1 / 0.1–0.25 / ≥0.25, is:

3. Which detector is best suited to multivariate drift on embedding or image-representation vectors?

A clean snapshot still degrades in production because the world moves. Drift detection compares live distributions against a training reference and alarms when divergence matters.

Training-Serving Skew

A systematic difference between training and serving inputs caused by an engineering issue, not natural shift. Examples: SQL vs. Python feature implementations, a categorical encoder fit on training silently mapping unseen serving categories to 0, normalization re-fit per batch. Fixable by engineering: share a single serialized preprocessing pipeline, validate serving against the training schema, continuously compare distributions. Feature stores exist largely to make this easy.

Animation — Training-Serving Skew

Two parallel pipelines transform the same raw source; subtly different encoders produce divergent features and a corrupted prediction.
Raw Source TRAINING PATH SQL feature query Fitted encoder Model training SERVING PATH Python feature fn Ad-hoc encoder Online inference ⚡ SKEW ⚡ Same input → different features → silent corruption
Tip: arrows reveal sequentially, then the SKEW marker pulses across the boundary — the fix is a shared serialized pipeline.

Concept Drift vs. Data Drift

KS, PSI, JS, and Other Divergence Measures

Animation — Drift Detection

A reference histogram (blue) is overlaid by a shifted live histogram (orange); the KS statistic rises and an alert fires red.
feature value drift stats KS D = 0.31 PSI = 0.42 JS = 0.18 p < 0.001 threshold: 0.25 ⚠ DRIFT ALERT training reference live (shifted)
Tip: PSI ≥ 0.25 = significant drift; combine with a significance test (KS p < 0.05) to gate alerts.
MethodTypeBest ForThreshold
KSNon-parametric testUnivariate continuousp < 0.05; D > 0.1–0.2
PSIBinned divergenceNumeric (binned), categorical<0.1 / 0.1–0.25 / ≥0.25
JSSymmetric divergenceCategorical / predictions~0 / 0.05–0.1 / >0.1–0.2
KLAsymmetric divergenceBinned featuresCalibrate vs. baseline
χ²Parametric testCategorical featuresp < 0.05 + effect size
WassersteinDistance metricUnivariate numericScale-dependent
MMDKernel two-sample testMultivariate, embeddings, imagesPermutation, α=0.05

PSI = Σ (qᵢ − pᵢ) · ln(qᵢ / pᵢ) — the de-facto credit-risk standard. JS = ½KL(P||M) + ½KL(Q||M), bounded and finite, safer than raw KL for dashboards. MMD is kernel-based in an RKHS — the heavy artillery for multivariate / embedding drift, as in Alibi Detect's MMDDrift.

Two practical heuristics:

  1. Statistical vs. practical significance: p-values answer "real?", effect sizes (PSI, Wasserstein) answer "big enough?" — use both.
  2. Calibrate thresholds on healthy data: compute each metric over historical no-drift windows; alert at the 95th or 99th percentile.

Alerting and SLAs

Drift detection without alerting is theater. A production monitor should:

  1. Run continuously, latest window vs. fixed training reference.
  2. Compute multiple metrics per feature (at least PSI + one significance test).
  3. Aggregate into a drift share across features.
  4. Route by severity: warn on moderate, page on significant.
  5. Tie alerts to operational playbooks: retrain, fallback, kill-switch.

Figure 3.4 — Drift detection and SLA monitoring loop

flowchart TD REF[Reference Distribution
training baseline] --> CMP[Drift Tests
PSI, KS, JS, MMD] LIVE[Live Distribution
rolling window] --> CMP CMP --> AGG[Drift Share
aggregate per-feature signals] AGG --> SEV{Severity?} SEV -->|None| LOG[Log metrics] SEV -->|Moderate| WARN[Warn + ticket
investigate] SEV -->|Significant| PAGE[Page on-call
fallback / retrain / kill-switch] LOG --> LIVE WARN --> LIVE PAGE --> LIVE

A data-quality SLA template:

DimensionMetricSLA
Missingness% null per critical feature< 1%
Schema validity% rows failing schema checks< 0.1%
Outlier rate% rows flagged by Isolation Forest< 2%
Feature driftPSI vs. training reference< 0.25 (any critical feature)
Prediction driftJS divergence on score distribution< 0.1
FreshnessLag from real-world event< 15 min (stream) / < 24 h (batch)
The point is that the response is automatic. Humans should be informed, not interrupted, when the system behaves as designed.

Key Points

Post-Reading Check — Drift, Skew, and Anomalies

1. A SQL feature query in training and a Python function in serving produce different values for the same row. This is:

2. The de-facto industry-standard binned divergence metric in credit risk, with the heuristic thresholds <0.1 / 0.1–0.25 / ≥0.25, is:

3. Which detector is best suited to multivariate drift on embedding or image-representation vectors?

Your Progress

Answer Explanations