Chapter 8: Experiment Tracking and Hyperparameter Tuning
Learning Objectives
Explain why experiment tracking is the prerequisite for reproducibility, auditability, and a meaningful model registry.
Compare the four dominant tracking platforms (MLflow, W&B, Neptune, Comet) along self-hosting, registry, and visualization axes.
Choose an HPO algorithm (grid, random, Bayesian, ASHA/Hyperband, BOHB, PBT) appropriate to your model, budget, and parallelism.
Identify when to use Optuna, Ray Tune, or Kubeflow Katib for distributed hyperparameter optimization.
Codify a winning HPO result into a reproducible production pipeline tied to a model registry.
1. Why Track Experiments?
Pre-Reading Quiz — Why Track?
1. What is the core failure mode that experiment tracking is designed to prevent?
Slow training runs caused by inefficient code paths.
A team being unable to reproduce, compare, or audit past model results because the inputs and context were never recorded.
GPUs running out of memory during distributed training.
Data drift in production causing silent accuracy degradation.
2. In a regulated industry, what does experiment tracking provide that ad-hoc notebooks do not?
Lower training cost per epoch.
An auditable trail linking each scored decision back to a specific model version, code commit, data, and approver.
Automatic feature engineering for compliance-sensitive fields.
Higher model accuracy on imbalanced datasets.
3. What is the relationship between an experiment tracker and a model registry?
They are competing products; teams typically pick one or the other.
The tracker records the noisy reality of many runs; the registry records the small subset promoted to standardize on, with provenance back to the source run.
The registry stores raw data while the tracker stores models.
The tracker handles online serving while the registry handles batch inference.
Experiment tracking is the discipline of recording, for every model training run, the inputs (data version, code commit, hyperparameters, environment), the outputs (metrics, artifacts, predictions, plots), and the context (who, when, why) so that any past result can be understood, compared, and reproduced.
The Lost Notebook Problem
A data scientist runs ten variants in a notebook, saves model_final_v3_actually_final.pkl on a workstation, and posts a screenshot of validation AUC to Slack. Three months later: "Which preprocessing did that model use? Which features? What seed? Was that min_samples_leaf=5 or 15?" Nobody knows. The cost is wasted compute, silent regressions, and blocked collaboration. Modern trackers prevent this by writing every run to a durable, queryable backend at the moment the run happens.
Reproducibility, Comparison, Audit
A tracked run is a row in a database tying together a git commit, a configuration object, a dataset hash, a Python environment, time-series metrics, and an artifact folder. Comparison views, parallel-coordinate plots, and metric-vs-step charts all rest on this machinery. In regulated domains (finance, healthcare, hiring), auditors need to answer "which version of which model produced the score that denied this loan?" — a question that is nearly trivial when each run is timestamped, signed, and linked to a registered model version.
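The "row in a database" idea can be sketched with nothing but the standard library. This is a minimal illustration, not how any particular tracker stores its data: the table name, column names, run IDs, commit strings, and metric values below are all hypothetical.

```python
import json
import sqlite3

# Hypothetical schema: one row per training run.
SCHEMA = """
CREATE TABLE runs (
    run_id       TEXT PRIMARY KEY,
    git_commit   TEXT NOT NULL,
    dataset_hash TEXT NOT NULL,
    config_json  TEXT NOT NULL,
    val_auc      REAL
)
"""


def open_tracker(path=":memory:"):
    """Create the (toy) tracking backend."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_run(conn, run_id, git_commit, dataset_hash, config, val_auc):
    """Persist one run at the moment it happens."""
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
        (run_id, git_commit, dataset_hash, json.dumps(config, sort_keys=True), val_auc),
    )
    conn.commit()


def best_run(conn):
    """Comparison becomes a query instead of a memory exercise."""
    row = conn.execute(
        "SELECT run_id, git_commit, config_json, val_auc "
        "FROM runs ORDER BY val_auc DESC LIMIT 1"
    ).fetchone()
    return {
        "run_id": row[0],
        "git_commit": row[1],
        "config": json.loads(row[2]),
        "val_auc": row[3],
    }


conn = open_tracker()
log_run(conn, "run-001", "a1b2c3d", "sha256:1111", {"min_samples_leaf": 5}, 0.871)
log_run(conn, "run-002", "a1b2c3d", "sha256:1111", {"min_samples_leaf": 15}, 0.884)
winner = best_run(conn)
```

With the runs recorded this way, "was that min_samples_leaf=5 or 15?" is answered by reading winner["config"] rather than by searching chat history.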
Foundation for the Model Registry
Tracker and registry are two halves of one system. The tracker captures the dozens of daily runs (failures, sweeps, ablations); the registry captures the small subset promoted as the versions the team standardizes on. The bridge is provenance: every promoted model version points back to the exact run, code, data, and metrics that produced it.
Animation: Experiment Tracking Flow — Training Script to UI
A training script invokes the MLflow client to log params, metrics, and artifacts. The tracking server persists them and the UI surfaces the run.
Figure 8.1: Experiment tracking flow from code to registry
flowchart LR
A[Code Commit] --> B[Training Run]
C[Data Version] --> B
D[Hyperparameters] --> B
B --> E[Log Params]
B --> F[Log Metrics]
B --> G[Log Artifacts]
E --> H[(Tracking Backend)]
F --> H
G --> H
H --> I[Compare and Select]
I --> J[Model Registry]
Key Points
Tracking records the full context of a run: code commit, data version, hyperparameters, env, metrics, artifacts.
It exists primarily to defeat the "lost notebook" failure mode: half-remembered cells, untracked seeds, irreproducible AUC screenshots.
Reproducibility means same code + data + env + seeds yields the same numbers — trackers store all four.
In regulated domains, the audit trail (commit, user, metrics, stage transition) is mandatory, not a nice-to-have.
The registry is the formal "small subset" of tracked runs; provenance links every promoted version back to its source run.
Post-Reading Quiz — Why Track?
1. What is the core failure mode that experiment tracking is designed to prevent?
Slow training runs caused by inefficient code paths.
A team being unable to reproduce, compare, or audit past model results because the inputs and context were never recorded.
GPUs running out of memory during distributed training.
Data drift in production causing silent accuracy degradation.
2. In a regulated industry, what does experiment tracking provide that ad-hoc notebooks do not?
Lower training cost per epoch.
An auditable trail linking each scored decision back to a specific model version, code commit, data, and approver.
Automatic feature engineering for compliance-sensitive fields.
Higher model accuracy on imbalanced datasets.
3. What is the relationship between an experiment tracker and a model registry?
They are competing products; teams typically pick one or the other.
The tracker records the noisy reality of many runs; the registry records the small subset promoted to standardize on, with provenance back to the source run.
The registry stores raw data while the tracker stores models.
The tracker handles online serving while the registry handles batch inference.
2. Experiment Tracking Tools
Pre-Reading Quiz — Tools
1. Which tracker is the de facto open-source standard with a first-class model registry and full self-hosting?
Weights & Biases
MLflow
Neptune
Comet
2. A research team prioritizes interactive dashboards, GPU metrics, shareable Reports, and one-line PyTorch Lightning callbacks. Which tracker fits best?
MLflow OSS, because it integrates with PyTorch.
Weights & Biases, whose identity lives in best-in-class visualization, Reports, and native framework callbacks.
A custom SQLite database queried with pandas.
Plain TensorBoard with no remote backend.
3. What is the principal trade-off of SaaS-only trackers versus self-hosted ones?
SaaS trackers cannot log artifacts; only self-hosted ones can.
SaaS shifts operational burden to a vendor but raises data-residency and lock-in concerns vs self-hosted, which keeps metadata in your perimeter.
Self-hosted trackers are always faster than SaaS at scale.
SaaS trackers do not support hyperparameter sweeps.
Four platforms dominate: MLflow, Weights & Biases (W&B), Neptune, and Comet. All four cover the core capabilities; they differ on the axes that matter when a team commits.
MLflow Tracking
The de facto OSS standard, organized into four pillars: Tracking, Projects, Models, and Registry. Architecturally it is a FastAPI server backed by a relational metadata DB (Postgres/MySQL/SQLite) and a pluggable artifact store (S3, GCS, Azure Blob, NFS). Regulated orgs run it fully inside their VPC and treat it as the system of record. Trade-off: you operate the DB, storage, server, and upgrades.
Weights & Biases (W&B)
SaaS-first, visualization-strongest. Interactive metric panels, system metrics, gradient histograms, custom panels, and shareable Reports. W&B Artifacts provide versioned lineage; W&B Sweeps orchestrates HPO (grid/random/Bayesian) with live search-space visualization. Self-hosting only on enterprise.
Neptune
Positions itself as a metadata store: structured logging, custom fields, fast search across many runs. Built for "50,000 experiments across 12 teams, queryable like a database." SaaS or self-hosted; its model registry is less central to the product than MLflow's.
Comet
Balanced hosted experience with a useful online/offline mode for spot or air-gapped runs that sync later. SaaS or self-hosted/VPC. Good for mid-size teams wanting hosted convenience plus a private path.
Comparison Table
| Dimension | MLflow | W&B | Neptune | Comet |
|---|---|---|---|---|
| Core focus | OSS lifecycle + registry | SaaS tracking + reports | Metadata store at scale | Tracking + hybrid hosting |
| Self-host / on-prem | Yes (full OSS) | Enterprise only | Yes (SaaS or VPC) | Yes (SaaS or VPC) |
| Registry maturity | First-class, stages | Built-in, integrated | Basic; less central | Versions + stages |
| Visualization | Functional, basic | Best-in-class | Structured browsing | Solid |
| Best fit | Regulated, infra-heavy | Research, fast iteration | Many teams, governance | Mid-size hosted + private |
Figure 8.2: MLflow tracking server architecture
flowchart TD
A[Training Client mlflow.log_*] --> B[MLflow Tracking Server FastAPI + UI]
C[Notebook Client] --> B
D[Pipeline Client] --> B
B --> E[(Metadata DB Postgres / MySQL / SQLite)]
B --> F[(Artifact Store S3 / GCS / Azure Blob / NFS)]
E --> G[Run Metadata params, metrics, tags]
F --> H[Models, Plots, Datasets, Logs]
B --> I[Model Registry Versions and Stages]
Key Points
MLflow: open-source, strong registry, runs anywhere, you own the ops.
W&B: SaaS-first, best visualization and collaboration, per-seat pricing above free tier.
Neptune: structured-metadata search at scale; SaaS or VPC.
Comet: polished hosted experience with online/offline logging and a private-cloud option.
Choose on organizational axes (compliance, hosting, registry-first vs viz-first), not feature checklists.
Post-Reading Quiz — Tools
1. Which tracker is the de facto open-source standard with a first-class model registry and full self-hosting?
Weights & Biases
MLflow
Neptune
Comet
2. A research team prioritizes interactive dashboards, GPU metrics, shareable Reports, and one-line PyTorch Lightning callbacks. Which tracker fits best?
MLflow OSS, because it integrates with PyTorch.
Weights & Biases, whose identity lives in best-in-class visualization, Reports, and native framework callbacks.
A custom SQLite database queried with pandas.
Plain TensorBoard with no remote backend.
3. What is the principal trade-off of SaaS-only trackers versus self-hosted ones?
SaaS trackers cannot log artifacts; only self-hosted ones can.
SaaS shifts operational burden to a vendor but raises data-residency and lock-in concerns vs self-hosted, which keeps metadata in your perimeter.
Self-hosted trackers are always faster than SaaS at scale.
SaaS trackers do not support hyperparameter sweeps.
3. Hyperparameter Tuning
Pre-Reading Quiz — HPO
1. Why does random search typically beat grid search at the same budget?
Random search is parallelizable while grid search is sequential.
Random search allocates samples non-redundantly across important dimensions instead of wasting trials on irrelevant grid axes.
Grid search cannot use GPUs.
Random search uses a Bayesian surrogate model internally.
2. What are the two components Bayesian optimization uses on each iteration?
A neural network and a genetic algorithm.
A surrogate model (GP/TPE/RF) of past evaluations plus an acquisition function (EI/UCB/PI) to choose the next point.
A gradient estimator and a momentum buffer.
A k-NN classifier and a logistic prior.
3. When does ASHA shine, and when does it fail?
Shines when early performance predicts final performance and you have many parallel workers; fails when models learn late or have non-monotonic curves.
Shines only on CPU-bound workloads; fails on GPU clusters.
Shines on tabular data only; fails on images.
It always outperforms Bayesian optimization regardless of regime.
Hyperparameters are the knobs the training optimizer cannot turn, because they are fixed before or outside gradient-based learning: learning rate, network depth and width, dropout, regularization strength, kernel choice, tree depth, and so on. Finding good values is itself an optimization problem: a black-box search over a mixed-type space where each evaluation is expensive.
Grid and Random Search
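Grid search enumerates the Cartesian product of per-hyperparameter value sets: deterministic and trivially parallel, but exponential in the number of dimensions. Random search samples each hyperparameter from its own distribution and is the standard baseline that any smarter method should beat. With a budget of N trials over d hyperparameters, a grid tests only about N^(1/d) distinct values of each one, whereas random search tests N distinct values of every hyperparameter, so it wastes no trials repeating settings along dimensions that turn out not to matter. Neither method learns: trial 100 is sampled the same way as trial 1.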
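The coverage difference between the two methods is easy to see by counting distinct values per hyperparameter at an equal budget. The sketch below uses a hypothetical two-dimensional space; the names, ranges, and the choice of uniform sampling are illustrative assumptions (learning rates are usually sampled on a log scale in practice).

```python
import itertools
import random


def grid_trials(values_per_axis):
    """Cartesian product of per-hyperparameter value lists."""
    names = list(values_per_axis)
    return [
        dict(zip(names, combo))
        for combo in itertools.product(*values_per_axis.values())
    ]


def random_trials(ranges, n_trials, seed=0):
    """Independent uniform draw per hyperparameter per trial."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        for _ in range(n_trials)
    ]


# Same budget for both methods: 3 x 3 = 9 trials.
grid = grid_trials({"learning_rate": [0.001, 0.01, 0.1], "dropout": [0.1, 0.3, 0.5]})
rand = random_trials(
    {"learning_rate": (0.001, 0.1), "dropout": (0.1, 0.5)}, n_trials=len(grid)
)

# How many different learning rates did each method actually try?
distinct_grid_lr = len({t["learning_rate"] for t in grid})
distinct_rand_lr = len({t["learning_rate"] for t in rand})
```

If dropout turns out to be irrelevant, the grid has spent nine trials to learn about three learning rates, while the random sample has learned about nine.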
Bayesian Optimization (BO)
BO fits a surrogate model (Gaussian Process, random forest as in SMAC, or Tree-structured Parzen Estimator as in HyperOpt/Optuna) to past evaluations, then uses an acquisition function (EI, UCB, PI) to pick where to sample next, balancing exploration vs exploitation. Strength: sample efficiency when each run is expensive and the space is 20-30 dimensions or fewer. Weaknesses: high dimensions, categorical/conditional parameters, parallelism beyond ~10-20 workers, and ignorance of intermediate learning curves.
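To make the acquisition step concrete, here is a sketch of Expected Improvement (EI) for a minimization problem, assuming the surrogate returns a Gaussian predictive mean and standard deviation at each candidate. The candidate names and numbers are made up for illustration; fitting the surrogate itself is out of scope here.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best_so_far):
    """EI for minimization: E[max(best_so_far - Y, 0)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: improvement is deterministic.
        return max(best_so_far - mu, 0.0)
    z = (best_so_far - mu) / sigma
    return (best_so_far - mu) * normal_cdf(z) + sigma * normal_pdf(z)


def pick_next(candidates, best_so_far):
    """candidates: list of (name, mu, sigma); return the name with maximal EI."""
    return max(
        candidates, key=lambda c: expected_improvement(c[1], c[2], best_so_far)
    )[0]


# Hypothetical surrogate predictions of validation loss (lower is better).
candidates = [
    ("A", 0.29, 0.01),  # slightly better mean, very certain (exploitation)
    ("B", 0.35, 0.10),  # worse mean, very uncertain (exploration)
    ("C", 0.40, 0.01),  # worse mean, very certain
]
next_config = pick_next(candidates, best_so_far=0.30)
```

In this example the uncertain candidate B has a higher EI than the safe candidate A even though its predicted mean is worse, which is the exploration-versus-exploitation trade-off in action.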
Animation: Bayesian Optimization Loop
A surrogate (blue) fits observed points (dots). The acquisition function (orange) peaks at the most informative next point; a new observation is added; the surrogate updates. Four iterations.
Figure 8.3: Bayesian optimization loop
flowchart TD
A[Observed Trials lambda, performance] --> B[Fit Surrogate Model GP / TPE / RF]
B --> C[Evaluate Acquisition Function EI / UCB / PI]
C --> D[Select Next Hyperparameter lambda*]
D --> E[Train Model and Evaluate]
E --> F[Record Performance]
F --> A
F --> G{Budget exhausted?}
G -->|No| B
G -->|Yes| H[Return Best Config]
Hyperband and ASHA
Instead of modeling the objective, multi-fidelity methods allocate compute adaptively. Successive Halving: launch many configs cheaply, evaluate, keep the top fraction (e.g., 1/eta), increase resource (epochs, time, data fraction), repeat. Hyperband wraps this in brackets with different (n, r) trade-offs. ASHA is the asynchronous parallel variant — trials are promoted or stopped at each rung as soon as they finish, scaling to thousands of workers with no global sync.
ASHA shines when early performance is predictive of final performance and parallel compute is abundant; it fails on models that learn late or have non-monotonic curves.
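The synchronous Successive Halving loop described above can be sketched in a few lines. This is a toy version under stated assumptions: the objective toy_evaluate is invented so that early scores rank configurations the same way as late scores, and eta = 3 with 27 starting configurations is chosen only to keep the arithmetic clean.

```python
import math
import random


def successive_halving(configs, evaluate, min_resource=1, eta=3):
    """Return (winner, rung_log); evaluate(config, resource) -> score, higher is better."""
    survivors = list(configs)
    resource = min_resource
    rung_log = []  # (number of trials, resource per trial) for each rung
    while True:
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource), reverse=True)
        rung_log.append((len(ranked), resource))
        if len(ranked) == 1:
            return ranked[0], rung_log
        survivors = ranked[: max(1, len(ranked) // eta)]  # keep the top 1/eta
        resource *= eta  # survivors get eta times more budget


def toy_evaluate(config, resource):
    """Hypothetical objective: best at lr = 1e-2, improving with more resource."""
    quality = 1.0 - abs(math.log10(config["lr"]) + 2.0)
    return quality * (1.0 - 0.5 ** resource)


rng = random.Random(0)
configs = [{"lr": 10 ** rng.uniform(-4, 0)} for _ in range(27)]
winner, rung_log = successive_halving(configs, toy_evaluate)
# rung_log == [(27, 1), (9, 3), (3, 9), (1, 27)]
```

A real implementation would resume each survivor from its checkpoint rather than retraining from scratch at the larger budget; that detail is omitted here.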
Animation: ASHA Rung-Based Early Stopping
27 trials start at rung 0 with 1 epoch. The top 1/3 advance to rung 1 (3 epochs); the bottom fade. The top 1/3 of those advance to rung 2 (9 epochs), then to rung 3 (27 epochs, full budget). One winner emerges.
Figure 8.4: ASHA rung-based early stopping
flowchart TD
A[Rung 0: 27 trials 1 epoch each] --> B{Top 1/3 by metric}
B -->|Promote 9| C[Rung 1: 9 trials 3 epochs each]
B -->|Stop 18| X1[Pruned]
C --> D{Top 1/3 by metric}
D -->|Promote 3| E[Rung 2: 3 trials 9 epochs each]
D -->|Stop 6| X2[Pruned]
E --> F{Top 1/3 by metric}
F -->|Promote 1| G[Rung 3: 1 trial 27 epochs - full budget]
F -->|Stop 2| X3[Pruned]
G --> H[Best Configuration]
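The figure shows the rungs as if they completed in lockstep. The asynchronous part of ASHA is the decision a worker makes the moment it becomes free: promote a finished trial if one qualifies, otherwise start a new configuration at the bottom rung. The sketch below follows that rule as commonly described; the data layout (dicts of scores per rung, sets of promoted IDs) and the trial IDs are assumptions made for illustration.

```python
def asha_next_job(rungs, promoted, eta=3):
    """Decide what a free worker should do next.

    rungs[k] maps config_id -> score recorded at rung k (higher is better).
    promoted[k] is the set of ids already promoted out of rung k (mutated here).
    Returns ("promote", config_id, k + 1) or ("new", None, 0).
    """
    # Scan from the highest promotable rung downward; the top rung has no successor.
    for k in reversed(range(len(rungs) - 1)):
        finished = sorted(rungs[k], key=rungs[k].get, reverse=True)
        top = finished[: len(finished) // eta]  # top 1/eta of results seen so far
        for config_id in top:
            if config_id not in promoted[k]:
                promoted[k].add(config_id)
                return ("promote", config_id, k + 1)
    # Nothing qualifies yet: grow the bottom rung with a fresh configuration.
    return ("new", None, 0)


rungs = [{"a": 0.6, "b": 0.7, "c": 0.5}, {}, {}]
promoted = [set(), set(), set()]

first = asha_next_job(rungs, promoted)   # three results at rung 0, so the top one moves up
second = asha_next_job(rungs, promoted)  # no one else qualifies, so start a new trial

rungs[0].update({"d": 0.9, "e": 0.1, "f": 0.2})  # three more trials finish rung 0
third = asha_next_job(rungs, promoted)   # six results, top two qualify, "d" is unpromoted
```

No worker ever waits for a rung to fill, which is why the method needs no global synchronization. The cost is that a trial can be promoted on the basis of an incomplete rung.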
Population-Based Training (PBT)
PBT optimizes hyperparameters and weights jointly over training time. A population of N models trains in parallel; at periodic exploit/explore steps, low performers copy the weights and hyperparameters of top performers (exploit), then perturb the copied hyperparameters (explore). The output is a hyperparameter schedule, not a fixed vector. Excellent for deep RL and long-horizon supervised training; expensive in compute and infrastructure.
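One exploit/explore step can be sketched as follows. The truncation fraction (bottom 25% copy from top 25%) and the perturbation factors (0.8 and 1.2) are common choices but are assumptions here, as are the member names and values; weights are stand-in lists rather than real model state.

```python
import copy
import random


def pbt_step(population, rng, fraction=0.25, factors=(0.8, 1.2)):
    """Run one exploit/explore step in place on a population of at least two members.

    Each member is a dict with keys "score", "weights", and "hparams".
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: replace weights and hyperparameters with a top performer's.
        member["weights"] = copy.deepcopy(donor["weights"])
        member["hparams"] = dict(donor["hparams"])
        # Explore: perturb each copied hyperparameter.
        for name in member["hparams"]:
            member["hparams"][name] *= rng.choice(factors)
    return population


rng = random.Random(0)
population = [
    {"name": "w1", "score": 0.20, "weights": [0.1, 0.1], "hparams": {"lr": 0.1}},
    {"name": "w2", "score": 0.90, "weights": [0.5, 0.7], "hparams": {"lr": 0.01}},
    {"name": "w3", "score": 0.70, "weights": [0.4, 0.6], "hparams": {"lr": 0.02}},
    {"name": "w4", "score": 0.60, "weights": [0.3, 0.2], "hparams": {"lr": 0.05}},
]
pbt_step(population, rng)
```

After the step, w1 carries w2's weights and a learning rate of either 0.008 or 0.012. Repeating this every K training steps is what produces a schedule of hyperparameters over time.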
BOHB
BOHB = TPE-style sampling for which configurations to try, plus Hyperband for when to stop them. Often the strongest default for large DL tuning workloads.
Algorithm Cheat-Sheet
| Method | Learns? | Early stop? | Parallel scaling | Best fit |
|---|---|---|---|---|
| Grid | No | No | Good | Tiny spaces, sensitivity |
| Random | No | Optional | Excellent | Cheap baseline, high-dim |
| Bayesian | Yes | Not inherent | ~4-20 workers | Expensive runs, modest dim |
| Hyperband / ASHA | Partially | Yes (core) | Excellent | Large DL, early signals |
| BOHB | Yes | Yes | Excellent | Mixed regime, large DL |
| PBT | Population | Implicit | Excellent | Deep RL, long runs |
Distributed HPO Tools
Optuna: Python-native, Samplers (TPE/CMA-ES) and Pruners (median/ASHA), ask-and-tell API, gRPC storage proxy for thousands of workers.
Ray Tune: built on Ray; resource-aware scheduler with fractional GPUs; ASHA/PBT/BOHB; integrates with MLflow.
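Kubeflow Katib: Kubernetes-native and framework-agnostic; experiments and trials are Kubernetes resources, which makes it a natural fit for teams already running Kubeflow.
Google Vizier: the internal-then-open-sourced ancestor that inspired Katib; transfer learning across studies, large-scale parallel trial management.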
Figure 8.5: PBT exploit/explore cycle
sequenceDiagram
participant W1 as Worker 1 (low perf)
participant W2 as Worker 2 (top perf)
participant Sched as PBT Scheduler
participant Store as Checkpoint Store
W1->>Sched: Report metric @ step T
W2->>Sched: Report metric @ step T
Sched->>Sched: Rank population
Sched-->>W1: Exploit: copy from W2
W2->>Store: Save weights + hparams
Store-->>W1: Load W2 checkpoint
Sched-->>W1: Explore: perturb hparams
W1->>W1: Resume training with new schedule
W2->>W2: Continue training
Note over W1,W2: Repeat every K steps
Key Points
Random is the baseline; Bayesian wins on sample efficiency at modest dimension; ASHA wins on compute efficiency with parallel workers; BOHB hybridizes both; PBT wins when schedules matter.
BO = surrogate (GP/TPE/RF) + acquisition (EI/UCB/PI), trading exploration vs exploitation.
ASHA's resource axis is anything monotonically increasable: epochs, time, dataset fraction, image resolution.
PBT outputs a schedule of hyperparameters over training, not a fixed vector.
Tool choice: Optuna for Python flexibility, Ray Tune for Ray clusters, Katib for Kubernetes-native AutoML.
Post-Reading Quiz — HPO
1. Why does random search typically beat grid search at the same budget?
Random search is parallelizable while grid search is sequential.
Random search allocates samples non-redundantly across important dimensions instead of wasting trials on irrelevant grid axes.
Grid search cannot use GPUs.
Random search uses a Bayesian surrogate model internally.
2. What are the two components Bayesian optimization uses on each iteration?
A neural network and a genetic algorithm.
A surrogate model (GP/TPE/RF) of past evaluations plus an acquisition function (EI/UCB/PI) to choose the next point.
A gradient estimator and a momentum buffer.
A k-NN classifier and a logistic prior.
3. When does ASHA shine, and when does it fail?
Shines when early performance predicts final performance and you have many parallel workers; fails when models learn late or have non-monotonic curves.
Shines only on CPU-bound workloads; fails on GPU clusters.
Shines on tabular data only; fails on images.
It always outperforms Bayesian optimization regardless of regime.
4. From Experiment to Pipeline
Pre-Reading Quiz — To Pipeline
1. What is the recommended way to capture a winning hyperparameter configuration so it survives into a production pipeline?
Save the values as inline literals in the training script for speed.
Commit a structured config file (e.g., YAML) to version control under a path like configs/<model>/v3.yaml; the pipeline reads only from this file.
Paste them into a Slack pinned message so the team can refer back to them.
Encode them into the model artifact's filename.
2. Why is aggressive HPO statistically risky if you use the validation set as the final benchmark?
Validation sets are slower to load than test sets.
Searching thousands of configurations is multiple-comparisons testing against validation; the apparent winner can overstate true generalization. A held-out test set used once at promotion defends against this.
Validation sets are biased toward the training distribution.
HPO frameworks cannot log validation metrics correctly.
3. What must be pinned (versioned) to allow deterministic reproduction of a winning run?
Only the random seed.
Code commit, library versions / container image, data version, random seeds, and config file path.
Only the final model artifact.
Only the training metrics chart.
A good HPO sweep ends with a winning recipe that can be re-run reliably as part of a production pipeline — not just a screenshot of a leaderboard.
Codifying Winning Hyperparameters
Commit the winning configuration to version control as a structured config file (YAML, JSON, or Hydra/Pydantic) at a path like configs/fraud_classifier/v3.yaml. The training pipeline reads this file, never inline literals, and logs its values as MLflow parameters (and the file itself as an artifact) at the start of every run. Code review now meaningfully covers hyperparameter changes; git history records who changed learning_rate from 3e-4 to 1e-4 and when.
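A pipeline that reads hyperparameters only from a committed file might look like the sketch below. JSON is used so the example needs nothing beyond the standard library; a YAML file would work the same way with a YAML parser. The TrainConfig fields, the file name v3.json, and every value except learning_rate are hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical schema; frozen so code cannot silently override a value."""
    learning_rate: float
    max_depth: int
    n_estimators: int
    seed: int


def load_config(path):
    """Parse the committed file; unknown or missing keys raise TypeError."""
    raw = json.loads(Path(path).read_text())
    return TrainConfig(**raw)


def as_logged_params(config, config_path):
    """Flatten the config into string key/value pairs suitable for a tracker."""
    params = {key: str(value) for key, value in asdict(config).items()}
    params["config_path"] = str(config_path)
    return params


# Demo: a temporary file stands in for the version-controlled config.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "v3.json"
    config_path.write_text(
        json.dumps(
            {"learning_rate": 1e-4, "max_depth": 6, "n_estimators": 400, "seed": 42}
        )
    )
    config = load_config(config_path)
    params = as_logged_params(config, config_path)
```

The dictionary returned by as_logged_params is what the training script would hand to the tracker at the start of the run, so the logged values and the committed file cannot drift apart.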
Avoiding Overfit to the Validation Set
Aggressive HPO is multiple-comparisons testing against validation: run a thousand configurations and the best one will look better than its true generalization warrants. Three defenses:
Hold out a true test set that no sweep ever sees, evaluated exactly once at promotion.
Use nested cross-validation (or rolling-origin for time series): HPO in the inner loop, final eval in the outer loop.
Prefer configurations near the top, not the literal best — a config that is robustly excellent across folds is more trustworthy than one that wins narrowly on one fold.
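The multiple-comparisons effect can be demonstrated with a small simulation. Here every configuration has exactly the same true accuracy, so any difference between validation scores is pure sampling noise; the accuracy, the number of configurations, and the set sizes are arbitrary illustrative choices.

```python
import random


def estimate_accuracy(true_accuracy, n_examples, rng):
    """Simulate measuring accuracy on a finite evaluation set."""
    hits = sum(rng.random() < true_accuracy for _ in range(n_examples))
    return hits / n_examples


rng = random.Random(0)
TRUE_ACCURACY = 0.80  # identical for every configuration
N_CONFIGS = 200       # size of the sweep
N_VAL = 500           # validation examples per evaluation
N_TEST = 500          # held-out test examples

# "Sweep": score every configuration on the validation set.
val_scores = [estimate_accuracy(TRUE_ACCURACY, N_VAL, rng) for _ in range(N_CONFIGS)]
best_val = max(val_scores)
mean_val = sum(val_scores) / N_CONFIGS

# Evaluate the selected winner exactly once on data the sweep never saw.
test_of_winner = estimate_accuracy(TRUE_ACCURACY, N_TEST, rng)

print(f"mean validation accuracy: {mean_val:.3f}")
print(f"best validation accuracy: {best_val:.3f}")
print(f"winner on held-out test:  {test_of_winner:.3f}")
```

The best validation score sits above the true accuracy purely because it is the maximum of many noisy measurements, while the single test evaluation is an unbiased estimate. The larger the sweep, the larger that gap becomes.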
Reproducing Deterministically
Pin library versions in requirements.txt / poetry.lock / a container image, and log the image hash with the run.
Reference datasets by version (DVC tag, Delta Lake version, feature-store snapshot ID, S3 object hash), never a mutable path.
Set seeds centrally (Python random, NumPy, PyTorch/TF CPU and GPU); use torch.use_deterministic_algorithms(True) when bit-determinism matters.
Log the code commit hash, container image, data version, and config file path with the run.
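The checklist above can be collected into a single manifest that is logged with the run. The sketch below sticks to the standard library: seed_everything seeds only Python's random module (NumPy and PyTorch seeding would be added the same way when those libraries are in use), and the IMAGE_DIGEST environment variable is a hypothetical convention for passing in the container image hash.

```python
import hashlib
import os
import platform
import random
import subprocess
import sys
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Content hash of a dataset file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_commit():
    """Current commit hash, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def seed_everything(seed):
    """Central seeding point; extend with np.random.seed / torch.manual_seed as needed."""
    random.seed(seed)


def build_manifest(data_path, config_path, seed):
    return {
        "git_commit": git_commit(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "image_digest": os.environ.get("IMAGE_DIGEST", "unknown"),  # hypothetical variable
        "data_sha256": file_sha256(data_path),
        "config_path": str(config_path),
        "seed": seed,
    }


with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "train.csv"
    data_path.write_bytes(b"id,label\n1,0\n2,1\n")
    manifest = build_manifest(data_path, "configs/fraud_classifier/v3.yaml", seed=42)

seed_everything(42)
first_draws = [random.random() for _ in range(3)]
seed_everything(42)
second_draws = [random.random() for _ in range(3)]  # identical to first_draws
```

Hashing the file contents rather than recording its path is what protects against a dataset being overwritten in place between the original run and the reproduction.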
Linking to the Model Registry
After a candidate retraining run, the pipeline registers the model with rich metadata (source run ID, git commit, data version, config path, eval metrics, validation reports). Promotions through stages (None → Staging → Production → Archived) are explicit, auditable, ideally gated by automated tests — not by a human clicking a button without checks. The registry-to-tracking link runs in both directions, providing the bidirectional traceability that underpins audit and debugging.
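An automated promotion gate can be as simple as a function that refuses a stage transition unless provenance is complete and metrics clear their thresholds. This is a sketch of one possible policy, not a registry API: the field names, the strictly linear stage order, and the threshold values are all assumptions.

```python
STAGE_ORDER = ["None", "Staging", "Production", "Archived"]
REQUIRED_PROVENANCE = ("source_run_id", "git_commit", "data_version", "config_path")


def check_promotion(version, target_stage, thresholds):
    """Return (ok, reasons) for moving a model version to target_stage."""
    reasons = []

    # Assumed policy: a version may only move to the next stage in the sequence.
    current = STAGE_ORDER.index(version["stage"])
    if STAGE_ORDER.index(target_stage) != current + 1:
        reasons.append(f"illegal transition: {version['stage']} -> {target_stage}")

    # Every promoted version must point back to its source run, code, data, config.
    provenance = version.get("provenance", {})
    for field in REQUIRED_PROVENANCE:
        if not provenance.get(field):
            reasons.append(f"missing provenance field: {field}")

    # Metrics must meet the minimums agreed for this model.
    metrics = version.get("metrics", {})
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            reasons.append(f"metric {name}={value} below threshold {minimum}")

    return (not reasons, reasons)


candidate = {
    "stage": "None",
    "provenance": {
        "source_run_id": "run-002",
        "git_commit": "a1b2c3d",
        "data_version": "v2024-05",
        "config_path": "configs/fraud_classifier/v3.yaml",
    },
    "metrics": {"test_auc": 0.91},
}
ok, reasons = check_promotion(candidate, "Staging", {"test_auc": 0.90})

incomplete = {"stage": "None", "provenance": {"source_run_id": "run-003"}, "metrics": {}}
blocked, blocked_reasons = check_promotion(incomplete, "Production", {"test_auc": 0.90})
```

Returning the list of reasons, rather than a bare boolean, gives the audit trail a record of why a promotion was refused.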
Figure 8.6: Experiment-to-pipeline promotion
flowchart LR
A[HPO Sweep Notebook] --> B[Winning Config]
B --> C[Commit config YAML to git]
C --> D[Training Pipeline reads config]
D --> E[Tracked Run pinned data + env + seed]
E --> F[Test-set Evaluation]
F --> G{Pass thresholds?}
G -->|Yes| H[Register Model with provenance]
G -->|No| A
H --> I[Staging]
I --> J[Production]
Key Points
Commit winning hyperparameters as a versioned config file; never let them live in a notebook cell or Slack message.
HPO selection on validation alone overfits to validation — hold out a true test set used once at promotion.
Pin everything reproducible: code, libraries, container, data version, random seeds.
Log commit hash, image hash, data version, and config path with every tracked run.
Registry handoff with full provenance closes the loop and is the prerequisite for safe production promotion.
Post-Reading Quiz — To Pipeline
1. What is the recommended way to capture a winning hyperparameter configuration so it survives into a production pipeline?
Save the values as inline literals in the training script for speed.
Commit a structured config file (e.g., YAML) to version control under a path like configs/<model>/v3.yaml; the pipeline reads only from this file.
Paste them into a Slack pinned message so the team can refer back to them.
Encode them into the model artifact's filename.
2. Why is aggressive HPO statistically risky if you use the validation set as the final benchmark?
Validation sets are slower to load than test sets.
Searching thousands of configurations is multiple-comparisons testing against validation; the apparent winner can overstate true generalization. A held-out test set used once at promotion defends against this.
Validation sets are biased toward the training distribution.
HPO frameworks cannot log validation metrics correctly.
3. What must be pinned (versioned) to allow deterministic reproduction of a winning run?
Only the random seed.
Code commit, library versions / container image, data version, random seeds, and config file path.
Only the final model artifact.
Only the training metrics chart.