Chapter 8: Experiment Tracking and Hyperparameter Tuning


1. Why Track Experiments?

Pre-Reading Quiz — Why Track?

1. What is the core failure mode that experiment tracking is designed to prevent?

   a. Slow training runs caused by inefficient code paths.
   b. A team being unable to reproduce, compare, or audit past model results because the inputs and context were never recorded.
   c. GPUs running out of memory during distributed training.
   d. Data drift in production causing silent accuracy degradation.

2. In a regulated industry, what does experiment tracking provide that ad-hoc notebooks do not?

   a. Lower training cost per epoch.
   b. An auditable trail linking each scored decision back to a specific model version, code commit, data, and approver.
   c. Automatic feature engineering for compliance-sensitive fields.
   d. Higher model accuracy on imbalanced datasets.

3. What is the relationship between an experiment tracker and a model registry?

   a. They are competing products; teams typically pick one or the other.
   b. The tracker records the noisy reality of many runs; the registry records the small subset promoted to standardize on, with provenance back to the source run.
   c. The registry stores raw data while the tracker stores models.
   d. The tracker handles online serving while the registry handles batch inference.

Experiment tracking is the discipline of recording, for every model training run, the inputs (data version, code commit, hyperparameters, environment), the outputs (metrics, artifacts, predictions, plots), and the context (who, when, why) so that any past result can be understood, compared, and reproduced.

The Lost Notebook Problem

A data scientist runs ten variants in a notebook, saves model_final_v3_actually_final.pkl on a workstation, and posts a screenshot of validation AUC to Slack. Three months later: "Which preprocessing did that model use? Which features? What seed? Was that min_samples_leaf=5 or 15?" Nobody knows. The cost is wasted compute, silent regressions, and blocked collaboration. Modern trackers prevent this by writing every run to a durable, queryable backend at the moment the run happens.

Reproducibility, Comparison, Audit

A tracked run is a row in a database tying together a git commit, a configuration object, a dataset hash, a Python environment, time-series metrics, and an artifact folder. Comparison views, parallel-coordinate plots, and metric-vs-step charts all rest on this machinery. In regulated domains (finance, healthcare, hiring), auditors need to answer "which version of which model produced the score that denied this loan?" — a question that is nearly trivial when each run is timestamped, signed, and linked to a registered model version.
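To make the "run as a row" idea concrete, here is a minimal standard-library sketch of such a record. It is not how any particular tracker stores runs; the `RunRecord` class, its field names, and the placeholder commit hash and artifact path are all assumptions made for illustration.

```python
import hashlib
import json
import platform
import time
import uuid
from dataclasses import asdict, dataclass, field


def dataset_hash(data: bytes) -> str:
    """Content hash that identifies the exact bytes of a dataset."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RunRecord:
    """One tracked run: inputs, outputs, and context in a single row."""
    git_commit: str                      # code version
    config: dict                         # hyperparameters
    data_sha256: str                     # dataset fingerprint
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    python_version: str = platform.python_version()
    metrics: dict = field(default_factory=dict)   # name -> [(step, value), ...]
    artifact_dir: str = ""

    def log_metric(self, name: str, value: float, step: int) -> None:
        """Append one point to the time series for `name`."""
        self.metrics.setdefault(name, []).append((step, value))

    def to_json(self) -> str:
        """Serialize the whole record so it can be stored and queried later."""
        return json.dumps(asdict(self), sort_keys=True)


record = RunRecord(
    git_commit="abc1234",                # placeholder commit hash
    config={"min_samples_leaf": 5, "seed": 42},
    data_sha256=dataset_hash(b"id,label\n1,0\n2,1\n"),
    artifact_dir="artifacts/demo-run",   # placeholder path
)
record.log_metric("val_auc", 0.81, step=1)
record.log_metric("val_auc", 0.84, step=2)
row = json.loads(record.to_json())
```

Because every field lands in one serializable row, the "which seed, which data, which commit?" questions become lookups rather than archaeology.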

Foundation for the Model Registry

Tracker and registry are two halves of one system. The tracker captures the dozens of daily runs (failures, sweeps, ablations); the registry captures the small subset promoted to standardize on. The bridge is provenance: every promoted model version points back to the exact run, code, data, and metrics that produced it.

Animation: Experiment Tracking Flow — Training Script to UI

A training script invokes the MLflow client to log params, metrics, and artifacts. The tracking server persists them and the UI surfaces the run.

[Animation panels: Training Script (mlflow.log_param, mlflow.log_metric) → MLflow Client (params / metrics, artifacts / tags) → Tracking Server (Postgres / MySQL, S3 / GCS artifacts) → UI & Registry (compare / promote)]
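The logging step in this flow might look roughly like the sketch below. It assumes the `mlflow` package is installed and a tracking server is reachable at the address shown; the server URL, experiment name, and the `log_training_run` and `best_value` helpers are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional


def best_value(history: List[float], higher_is_better: bool = True) -> float:
    """Summarize a per-step metric history by its best value."""
    return max(history) if higher_is_better else min(history)


def log_training_run(
    params: Dict[str, object],
    metric_history: Dict[str, List[float]],
    artifact_path: Optional[str] = None,
    tracking_uri: str = "http://localhost:5000",  # assumed server address
    experiment: str = "demo-experiment",          # hypothetical experiment name
) -> None:
    """Log one run's params, per-step metrics, and an optional artifact file."""
    import mlflow  # imported lazily so the helper above works without MLflow

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        mlflow.log_params(params)
        for name, values in metric_history.items():
            for step, value in enumerate(values):
                mlflow.log_metric(name, value, step=step)
            mlflow.log_metric(f"best_{name}", best_value(values))
        if artifact_path is not None:
            mlflow.log_artifact(artifact_path)


history = {"val_auc": [0.78, 0.83, 0.81]}
summary = best_value(history["val_auc"])
# With a tracking server running, the run could be logged with:
# log_training_run({"min_samples_leaf": 5, "seed": 42}, history)
```

Logging each metric with a `step` is what lets the UI draw metric-vs-step charts instead of showing only a final number.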

Figure 8.1: Experiment tracking flow from code to registry

flowchart LR
    A[Code Commit] --> B[Training Run]
    C[Data Version] --> B
    D[Hyperparameters] --> B
    B --> E[Log Params]
    B --> F[Log Metrics]
    B --> G[Log Artifacts]
    E --> H[(Tracking Backend)]
    F --> H
    G --> H
    H --> I[Compare and Select]
    I --> J[Model Registry]


2. Experiment Tracking Tools

Pre-Reading Quiz — Tools

1. Which tracker is the de facto open-source standard with a first-class model registry and full self-hosting?

   a. Weights & Biases
   b. MLflow
   c. Neptune
   d. Comet

2. A research team prioritizes interactive dashboards, GPU metrics, shareable Reports, and one-line PyTorch Lightning callbacks. Which tracker fits best?

   a. MLflow OSS, because it integrates with PyTorch.
   b. Weights & Biases, whose identity lives in best-in-class visualization, Reports, and native framework callbacks.
   c. A custom SQLite database queried with pandas.
   d. Plain TensorBoard with no remote backend.

3. What is the principal trade-off of SaaS-only trackers versus self-hosted ones?

   a. SaaS trackers cannot log artifacts; only self-hosted ones can.
   b. SaaS shifts operational burden to a vendor but raises data-residency and lock-in concerns vs self-hosted, which keeps metadata in your perimeter.
   c. Self-hosted trackers are always faster than SaaS at scale.
   d. SaaS trackers do not support hyperparameter sweeps.

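Four platforms dominate: MLflow, Weights & Biases (W&B), Neptune, and Comet. All four cover the core capabilities (logging params, metrics, and artifacts, then comparing runs); they differ in hosting model, registry maturity, and visualization depth, which are the axes that matter when a team commits to one.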

MLflow Tracking

The de facto OSS standard, organized into four pillars: Tracking, Projects, Models, and Registry. Architecturally it is a FastAPI server backed by a relational metadata DB (Postgres/MySQL/SQLite) and a pluggable artifact store (S3, GCS, Azure Blob, NFS). Regulated orgs run it fully inside their VPC and treat it as the system of record. Trade-off: you operate the DB, storage, server, and upgrades.

Weights & Biases (W&B)

SaaS-first, visualization-strongest. Interactive metric panels, system metrics, gradient histograms, custom panels, and shareable Reports. W&B Artifacts provide versioned lineage; W&B Sweeps orchestrates HPO (grid/random/Bayesian) with live search-space visualization. Self-hosting only on enterprise.

Neptune

Positions itself as a metadata store: structured logging, custom fields, fast search across many runs. Built for "50,000 experiments across 12 teams, queryable like a database." SaaS or self-hosted; its model registry is less central to the product than MLflow's.

Comet

Balanced hosted experience with a useful online/offline mode for spot or air-gapped runs that sync later. SaaS or self-hosted/VPC. Good for mid-size teams wanting hosted convenience plus a private path.

Comparison Table

Dimension | MLflow | W&B | Neptune | Comet
--- | --- | --- | --- | ---
Core focus | OSS lifecycle + registry | SaaS tracking + reports | Metadata store at scale | Tracking + hybrid hosting
Self-host / on-prem | Yes (full OSS) | Enterprise only | Yes (SaaS or VPC) | Yes (SaaS or VPC)
Registry maturity | First-class, stages | Built-in, integrated | Basic; less central | Versions + stages
Visualization | Functional, basic | Best-in-class | Structured browsing | Solid
Best fit | Regulated, infra-heavy | Research, fast iteration | Many teams, governance | Mid-size hosted + private

Figure 8.2: MLflow tracking server architecture

flowchart TD
    A[Training Client<br/>mlflow.log_*] --> B[MLflow Tracking Server<br/>FastAPI + UI]
    C[Notebook Client] --> B
    D[Pipeline Client] --> B
    B --> E[(Metadata DB<br/>Postgres / MySQL / SQLite)]
    B --> F[(Artifact Store<br/>S3 / GCS / Azure Blob / NFS)]
    E --> G[Run Metadata<br/>params, metrics, tags]
    F --> H[Models, Plots,<br/>Datasets, Logs]
    B --> I[Model Registry<br/>Versions and Stages]


3. Hyperparameter Tuning

Pre-Reading Quiz — HPO

1. Why does random search typically beat grid search at the same budget?

   a. Random search is parallelizable while grid search is sequential.
   b. Random search allocates samples non-redundantly across important dimensions instead of wasting trials on irrelevant grid axes.
   c. Grid search cannot use GPUs.
   d. Random search uses a Bayesian surrogate model internally.

2. What are the two components Bayesian optimization uses on each iteration?

   a. A neural network and a genetic algorithm.
   b. A surrogate model (GP/TPE/RF) of past evaluations plus an acquisition function (EI/UCB/PI) to choose the next point.
   c. A gradient estimator and a momentum buffer.
   d. A k-NN classifier and a logistic prior.

3. When does ASHA shine, and when does it fail?

   a. Shines when early performance predicts final performance and you have many parallel workers; fails when models learn late or have non-monotonic curves.
   b. Shines only on CPU-bound workloads; fails on GPU clusters.
   c. Shines on tabular data only; fails on images.
   d. It always outperforms Bayesian optimization regardless of regime.

Hyperparameters are knobs the optimizer cannot turn: learning rate, network depth and width, dropout, regularization strength, kernel choice, tree depth, and so on. Finding good ones is itself an optimization problem: a black-box search over a mixed-type space (continuous, integer, categorical) where each evaluation is expensive.

Grid and Random Search

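Grid search enumerates the Cartesian product of per-hyperparameter value sets. It is deterministic and trivially parallel, but the number of trials grows exponentially with the number of hyperparameters. Random search samples each hyperparameter independently from its own distribution. At the same budget it usually beats grid search, because every trial tests a new value along every dimension instead of repeating the same few values of the important hyperparameters while varying ones that do not matter. That makes it the standard baseline that any smarter method should beat. Neither method learns: trial 100 is sampled the same way as trial 1.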
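The difference in coverage can be seen in a short standard-library sketch. The search space, value ranges, and function names below are hypothetical; they only illustrate how the two samplers spend the same 27-trial budget.

```python
import itertools
import random

# Hypothetical search space: three values per hyperparameter for the grid.
grid_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "max_depth": [3, 6, 9],
    "dropout": [0.0, 0.2, 0.5],
}


def grid_configs(space):
    """Enumerate the Cartesian product of the per-hyperparameter value sets."""
    names = list(space)
    return [dict(zip(names, values)) for values in itertools.product(*space.values())]


def random_configs(n_trials, seed=0):
    """Draw each hyperparameter independently from its own distribution."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_trials):
        configs.append({
            # Log-uniform over [1e-4, 1e-2]: uniform in the exponent.
            "learning_rate": 10 ** rng.uniform(-4, -2),
            "max_depth": rng.randint(3, 9),
            "dropout": rng.uniform(0.0, 0.5),
        })
    return configs


grid = grid_configs(grid_space)
rand = random_configs(len(grid))

# Same budget, but the grid tests only 3 distinct learning rates while
# random search tests a different one in every trial.
grid_lrs = {c["learning_rate"] for c in grid}
rand_lrs = {c["learning_rate"] for c in rand}
```

If learning rate is the hyperparameter that matters most, the grid has effectively run three experiments nine times each, while random search has run 27 distinct ones.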

Bayesian Optimization (BO)

BO fits a surrogate model (Gaussian Process, random forest as in SMAC, or Tree-structured Parzen Estimator as in HyperOpt/Optuna) to past evaluations, then uses an acquisition function (EI, UCB, PI) to pick where to sample next, balancing exploration vs exploitation. Strength: sample efficiency when each run is expensive and the space is 20-30 dimensions or fewer. Weaknesses: high dimensions, categorical/conditional parameters, parallelism beyond ~10-20 workers, and ignorance of intermediate learning curves.
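As an illustration of the acquisition step, the sketch below computes Expected Improvement in closed form for a surrogate that returns a Gaussian prediction (mean and standard deviation) at each candidate. It assumes a minimization objective such as validation loss; the candidate names and predicted values are made up, and `xi` is an optional exploration margin added here for illustration.

```python
from statistics import NormalDist


def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """EI for a minimization problem, given the surrogate's Gaussian
    prediction (mean mu, std sigma) at a candidate point and the best
    (lowest) objective value observed so far."""
    if sigma <= 0.0:
        return max(best - mu - xi, 0.0)  # no uncertainty: plain improvement
    z = (best - mu - xi) / sigma
    std_normal = NormalDist()
    return (best - mu - xi) * std_normal.cdf(z) + sigma * std_normal.pdf(z)


# Hypothetical surrogate predictions (mean, std) at three candidate points;
# the best validation loss seen so far is 0.30.
candidates = {
    "near_best_confident": (0.29, 0.01),
    "worse_but_uncertain": (0.35, 0.10),
    "worse_and_confident": (0.35, 0.01),
}
scores = {
    name: expected_improvement(mu, sigma, best=0.30)
    for name, (mu, sigma) in candidates.items()
}
next_point = max(scores, key=scores.get)
```

With these numbers the uncertain candidate scores highest even though its predicted mean is worse than the incumbent, which is the exploration side of the trade-off; the candidate that is both worse and confidently predicted scores essentially zero.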

Animation: Bayesian Optimization Loop

A surrogate (blue) fits observed points (dots). The acquisition function (orange) peaks at the most informative next point; a new observation is added; the surrogate updates. Four iterations.

[Animation panels: surrogate of f(lambda) over hyperparameter space; acquisition function (Expected Improvement). Legend: observation, next sample, surrogate mean, true objective.]

Figure 8.3: Bayesian optimization loop

flowchart TD
    A[Observed Trials<br/>lambda, performance] --> B[Fit Surrogate Model<br/>GP / TPE / RF]
    B --> C[Evaluate Acquisition Function<br/>EI / UCB / PI]
    C --> D[Select Next Hyperparameter<br/>lambda*]
    D --> E[Train Model<br/>and Evaluate]
    E --> F[Record Performance]
    F --> A
    F --> G{Budget<br/>exhausted?}
    G -->|No| B
    G -->|Yes| H[Return Best Config]

Hyperband and ASHA

Instead of modeling the objective, multi-fidelity methods allocate compute adaptively. Successive Halving: launch many configs cheaply, evaluate, keep the top fraction (e.g., 1/eta), increase resource (epochs, time, data fraction), repeat. Hyperband wraps this in brackets with different (n, r) trade-offs. ASHA is the asynchronous parallel variant — trials are promoted or stopped at each rung as soon as they finish, scaling to thousands of workers with no global sync.

ASHA shines when early performance is predictive of final performance and parallel compute is abundant; it fails on models that learn late or have non-monotonic curves.
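The core loop can be sketched as plain synchronous successive halving, which is the building block rather than ASHA itself (ASHA removes the wait at each rung). The toy learning curve, the `log_lr` parameter, and the function names are assumptions made so the example runs without a real training job.

```python
import random


def successive_halving(configs, evaluate, min_resource=1, eta=3):
    """Synchronous successive halving: evaluate all survivors at the current
    resource level, keep the top 1/eta, multiply the resource by eta, repeat."""
    survivors = list(configs)
    resource = min_resource
    rung_sizes = []
    while True:
        rung_sizes.append(len(survivors))
        scored = sorted(survivors, key=lambda c: evaluate(c, resource), reverse=True)
        if len(scored) == 1:
            return scored[0], rung_sizes, resource
        survivors = scored[: max(1, len(scored) // eta)]
        resource *= eta


def toy_evaluate(config, epochs):
    """Hypothetical learning curve: score rises with epochs toward a ceiling
    set by how close the learning rate is to 1e-2 (higher is better)."""
    ceiling = 1.0 - abs(config["log_lr"] + 2.0) / 4.0
    return ceiling * (1.0 - 0.5 ** epochs)


rng = random.Random(0)
trials = [{"log_lr": rng.uniform(-5.0, -1.0)} for _ in range(27)]
best, rung_sizes, final_epochs = successive_halving(trials, toy_evaluate)
```

With 27 starting trials and eta = 3 the rungs hold 27, 9, 3, and 1 trials at 1, 3, 9, and 27 epochs. The toy curve is deliberately well behaved: early rankings match final rankings, which is exactly the assumption that breaks for models that learn late.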

Animation: ASHA Rung-Based Early Stopping

27 trials start at rung 0 with 1 epoch. The top 1/3 advance to rung 1 (3 epochs); the bottom fade. The top 1/3 of those advance to rung 2 (9 epochs), then to rung 3 (27 epochs, full budget). One winner emerges.

[Animation rows: Rung 0 (1 epoch); Rung 1 (3 epochs); Rung 2 (9 epochs); Rung 3 (27 epochs, full budget); best config.]

Figure 8.4: ASHA rung-based early stopping

flowchart TD
    A[Rung 0: 27 trials<br/>1 epoch each] --> B{Top 1/3<br/>by metric}
    B -->|Promote 9| C[Rung 1: 9 trials<br/>3 epochs each]
    B -->|Stop 18| X1[Pruned]
    C --> D{Top 1/3<br/>by metric}
    D -->|Promote 3| E[Rung 2: 3 trials<br/>9 epochs each]
    D -->|Stop 6| X2[Pruned]
    E --> F{Top 1/3<br/>by metric}
    F -->|Promote 1| G[Rung 3: 1 trial<br/>27 epochs - full budget]
    F -->|Stop 2| X3[Pruned]
    G --> H[Best Configuration]

Population-Based Training (PBT)

PBT optimizes hyperparameters and weights jointly over training time. A population of N models trains in parallel; at periodic exploit/explore steps, low performers copy weights and hparams from peers, then perturb the hparams. Output is a schedule, not a fixed vector. Excellent for deep RL and long-horizon supervised training; expensive in compute and infra.
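A single exploit/explore step might be sketched as follows. The quartile cutoff, the 0.8/1.2 perturbation factors, and the dictionary layout of each population member are illustrative assumptions, and the `weights` list merely stands in for a real checkpoint.

```python
import copy
import random


def exploit_and_explore(population, rng, fraction=0.25, factors=(0.8, 1.2)):
    """One PBT step. Members in the bottom `fraction` copy weights and
    hyperparameters from a random member of the top `fraction`, then
    perturb the copied learning rate by one of `factors`."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = rng.choice(top)
        member["weights"] = copy.deepcopy(donor["weights"])   # exploit
        member["hparams"] = dict(donor["hparams"])            # copy, not alias
        member["hparams"]["lr"] *= rng.choice(factors)        # explore
    return population


rng = random.Random(0)
# Hypothetical population of 8 workers; higher score is better.
population = [
    {"id": i, "weights": [float(i)], "hparams": {"lr": 10 ** -(1 + i % 4)}, "score": i / 10}
    for i in range(8)
]
top_lrs = {population[6]["hparams"]["lr"], population[7]["hparams"]["lr"]}
exploit_and_explore(population, rng)
```

Calling this function every K training steps, with real training and re-scoring in between, is what turns a fixed hyperparameter vector into a schedule that changes over the course of training.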

BOHB

BOHB = TPE-style sampling for which configurations to try, plus Hyperband for when to stop them. Often the strongest default for large DL tuning workloads.

Algorithm Cheat-Sheet

Method | Learns? | Early stop? | Parallel scaling | Best fit
--- | --- | --- | --- | ---
Grid | No | No | Good | Tiny spaces, sensitivity
Random | No | Optional | Excellent | Cheap baseline, high-dim
Bayesian | Yes | Not inherent | ~4-20 workers | Expensive runs, modest dim
Hyperband / ASHA | Partially | Yes (core) | Excellent | Large DL, early signals
BOHB | Yes | Yes | Excellent | Mixed regime, large DL
PBT | Population | Implicit | Excellent | Deep RL, long runs

Distributed HPO Tools

Figure 8.5: PBT exploit/explore cycle

sequenceDiagram
    participant W1 as Worker 1 (low perf)
    participant W2 as Worker 2 (top perf)
    participant Sched as PBT Scheduler
    participant Store as Checkpoint Store
    W1->>Sched: Report metric @ step T
    W2->>Sched: Report metric @ step T
    Sched->>Sched: Rank population
    Sched-->>W1: Exploit: copy from W2
    W2->>Store: Save weights + hparams
    Store-->>W1: Load W2 checkpoint
    Sched-->>W1: Explore: perturb hparams
    W1->>W1: Resume training with new schedule
    W2->>W2: Continue training
    Note over W1,W2: Repeat every K steps


4. From Experiment to Pipeline

Pre-Reading Quiz — To Pipeline

1. What is the recommended way to capture a winning hyperparameter configuration so it survives into a production pipeline?

   a. Save the values as inline literals in the training script for speed.
   b. Commit a structured config file (e.g., YAML) to version control under a path like configs/<model>/v3.yaml; the pipeline reads only from this file.
   c. Paste them into a Slack pinned message so the team can refer back to them.
   d. Encode them into the model artifact's filename.

2. Why is aggressive HPO statistically risky if you use the validation set as the final benchmark?

   a. Validation sets are slower to load than test sets.
   b. Searching thousands of configurations is multiple-comparisons testing against validation; the apparent winner can overstate true generalization. A held-out test set used once at promotion defends against this.
   c. Validation sets are biased toward the training distribution.
   d. HPO frameworks cannot log validation metrics correctly.

3. What must be pinned (versioned) to allow deterministic reproduction of a winning run?

   a. Only the random seed.
   b. Code commit, library versions / container image, data version, random seeds, and config file path.
   c. Only the final model artifact.
   d. Only the training metrics chart.

A good HPO sweep ends with a winning recipe that can be re-run reliably as part of a production pipeline — not just a screenshot of a leaderboard.

Codifying Winning Hyperparameters

Commit the winning configuration to version control as a structured config file (YAML, JSON, or Hydra/Pydantic) at a path like configs/fraud_classifier/v3.yaml. The training pipeline reads its hyperparameters from this file, never from inline literals, and logs the file's contents as MLflow parameters at the start of every run. Code review now meaningfully covers hyperparameter changes; git history records who changed learning_rate from 3e-4 to 1e-4 and when.
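A minimal sketch of the loading side follows. It uses JSON rather than YAML only so the example needs nothing beyond the standard library; the config keys, the temporary file standing in for a committed config, and the `flatten` and `load_run_config` helpers are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def flatten(config, prefix=""):
    """Flatten a nested config into dotted keys, the flat key/value shape
    that tracker parameter logging generally expects."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def load_run_config(path):
    """Read hyperparameters from the committed file; nothing is hard-coded."""
    config = json.loads(Path(path).read_text())
    params = flatten(config)
    params["config_path"] = str(path)  # record which file produced the run
    return params


# Stand-in for a committed config file; keys and values are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "v3.json"
    config_file.write_text(json.dumps({
        "model": {"max_depth": 6, "learning_rate": 1e-4},
        "training": {"epochs": 20, "seed": 42},
    }))
    params = load_run_config(config_file)
```

The resulting flat dictionary is the kind of object that could be handed to a tracker's parameter-logging call, so the logged values and the reviewed file cannot drift apart.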

Avoiding Overfit to the Validation Set

Aggressive HPO is multiple-comparisons testing against validation: run a thousand configurations and the best one will look better than its true generalization warrants. Three defenses:

  1. Hold out a true test set that no sweep ever sees, evaluated exactly once at promotion.
  2. Use nested cross-validation (or rolling-origin for time series): HPO in the inner loop, final eval in the outer loop.
  3. Prefer configurations near the top, not the literal best — a config that is robustly excellent across folds is more trustworthy than one that wins narrowly on one fold.
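The second defense, nested cross-validation, can be sketched with a toy model so the structure is visible without any ML library. The constant-prediction "model", the `shrink` hyperparameter, and the synthetic data are assumptions made purely for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n, k):
    """Split range(n) into k contiguous (train, test) index pairs."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    folds = []
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        folds.append((train, test))
    return folds


def fit(y_train, shrink):
    """Toy model: predict a constant, the training mean scaled by `shrink`."""
    return shrink * mean(y_train)


def mse(prediction, y_test):
    """Mean squared error of a constant prediction."""
    return mean((y - prediction) ** 2 for y in y_test)


def nested_cv(y, candidates, outer_k=5, inner_k=3):
    """Inner loop picks the hyperparameter; outer loop scores that choice
    on data the inner loop never saw."""
    outer_scores, chosen = [], []
    for outer_train, outer_test in k_fold_indices(len(y), outer_k):
        y_dev = [y[i] for i in outer_train]

        def inner_error(shrink):
            return mean(
                mse(fit([y_dev[i] for i in tr], shrink), [y_dev[i] for i in te])
                for tr, te in k_fold_indices(len(y_dev), inner_k)
            )

        best = min(candidates, key=inner_error)
        chosen.append(best)
        outer_scores.append(mse(fit(y_dev, best), [y[i] for i in outer_test]))
    return outer_scores, chosen


rng = random.Random(0)
y = [10.0 + rng.gauss(0.0, 1.0) for _ in range(60)]  # synthetic targets
outer_scores, chosen = nested_cv(y, candidates=[0.0, 0.5, 1.0])
```

The outer scores estimate how well the whole procedure (search plus refit) generalizes, and the `chosen` list shows whether the same configuration wins in every fold, which speaks to the robustness point in the third defense.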

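Reproducing Deterministically

To reproduce a winning run deterministically, pin everything that can change its outcome: the code commit, library versions or container image, data version, random seeds, and config file path.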
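A minimal sketch of what pinning might look like in code is shown below. It seeds only the standard-library random number generator and records a small manifest; a real pipeline would also seed its numerical and deep-learning libraries and record a container image or lockfile. The function names, placeholder commit hash, and dataset bytes are assumptions.

```python
import hashlib
import platform
import random


def seed_everything(seed: int) -> None:
    """Seed the stdlib RNG. A real pipeline would also seed NumPy and the
    deep-learning framework in use (an assumption about the stack)."""
    random.seed(seed)


def reproducibility_manifest(seed: int, data: bytes, code_commit: str, config_path: str) -> dict:
    """Collect the pins needed to re-run a training job."""
    return {
        "code_commit": code_commit,
        "config_path": config_path,
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "seed": seed,
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


def noisy_training_run(seed: int) -> list:
    """Stand-in for training: the output depends only on the seed."""
    seed_everything(seed)
    return [random.random() for _ in range(3)]


manifest = reproducibility_manifest(
    seed=42,
    data=b"id,label\n1,0\n2,1\n",        # placeholder dataset bytes
    code_commit="abc1234",               # placeholder commit hash
    config_path="configs/fraud_classifier/v3.yaml",
)
first = noisy_training_run(manifest["seed"])
second = noisy_training_run(manifest["seed"])
```

Re-running with the seed from the manifest gives identical output, while a different seed does not; the data hash plays the same role for the dataset, since a changed byte changes the fingerprint.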

Linking to the Model Registry

After a candidate retraining run, the pipeline registers the model with rich metadata (source run ID, git commit, data version, config path, eval metrics, validation reports). Promotions through stages (None → Staging → Production → Archived) are explicit, auditable, ideally gated by automated tests — not by a human clicking a button without checks. The registry-to-tracking link runs in both directions, providing the bidirectional traceability that underpins audit and debugging.
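An automated promotion gate could be sketched as below. This is a plain-Python illustration, not the API of any registry product; the `ModelVersion` class, metric names, thresholds, and identifiers are hypothetical.

```python
from dataclasses import dataclass, field

STAGES = ["None", "Staging", "Production", "Archived"]


@dataclass
class ModelVersion:
    """A registered model version with provenance back to its source run."""
    name: str
    version: int
    source_run_id: str
    git_commit: str
    data_version: str
    config_path: str
    metrics: dict
    stage: str = "None"
    history: list = field(default_factory=list)  # audit trail of transitions


def promote(model: ModelVersion, target: str, thresholds: dict, approver: str) -> bool:
    """Move `model` to `target` only if every gated metric meets its threshold.
    Every attempt, passed or failed, is appended to the audit history."""
    failures = {
        name: model.metrics.get(name)
        for name, minimum in thresholds.items()
        if model.metrics.get(name) is None or model.metrics[name] < minimum
    }
    passed = target in STAGES and not failures
    model.history.append({
        "from": model.stage, "to": target, "approver": approver,
        "passed": passed, "failures": failures,
    })
    if passed:
        model.stage = target
    return passed


candidate = ModelVersion(
    name="fraud_classifier", version=3,
    source_run_id="run-0001",            # placeholder identifiers
    git_commit="abc1234",
    data_version="2024-01-snapshot",
    config_path="configs/fraud_classifier/v3.yaml",
    metrics={"test_auc": 0.91, "test_recall": 0.72},
)
ok_staging = promote(candidate, "Staging", {"test_auc": 0.90}, approver="ci-pipeline")
ok_prod = promote(
    candidate, "Production", {"test_auc": 0.90, "test_recall": 0.80}, approver="ci-pipeline"
)
```

Here the candidate reaches Staging but is refused Production because recall misses its threshold, and both attempts remain in the history, so the refusal is as auditable as the promotion.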

Figure 8.6: Experiment-to-pipeline promotion

flowchart LR
    A[HPO Sweep<br/>Notebook] --> B[Winning Config]
    B --> C[Commit config YAML<br/>to git]
    C --> D[Training Pipeline<br/>reads config]
    D --> E[Tracked Run<br/>pinned data + env + seed]
    E --> F[Test-set Evaluation]
    F --> G{Pass<br/>thresholds?}
    G -->|Yes| H[Register Model<br/>with provenance]
    G -->|No| A
    H --> I[Staging]
    I --> J[Production]

