Chapter 8: Experiment Tracking and Hyperparameter Tuning

Learning Objectives

Explain why experiment tracking is the prerequisite for reproducibility, auditability, and a meaningful model registry.
Compare the four dominant tracking platforms (MLflow, W&B, Neptune, Comet) along self-hosting, registry, and visualization axes.
Choose an HPO algorithm (grid, random, Bayesian, ASHA/Hyperband, BOHB, PBT) appropriate to your model, budget, and parallelism.
Identify when to use Optuna, Ray Tune, or Kubeflow Katib for distributed hyperparameter optimization.
Codify a winning HPO result into a reproducible production pipeline tied to a model registry.

1. Why Track Experiments?

Pre-Reading Quiz — Why Track?

1. What is the core failure mode that experiment tracking is designed to prevent?

Slow training runs caused by inefficient code paths.
A team being unable to reproduce, compare, or audit past model results because the inputs and context were never recorded.
GPUs running out of memory during distributed training.
Data drift in production causing silent accuracy degradation.

2. In a regulated industry, what does experiment tracking provide that ad-hoc notebooks do not?

Lower training cost per epoch.
An auditable trail linking each scored decision back to a specific model version, code commit, data, and approver.
Automatic feature engineering for compliance-sensitive fields.
Higher model accuracy on imbalanced datasets.

3. What is the relationship between an experiment tracker and a model registry?

They are competing products; teams typically pick one or the other.
The tracker records the noisy reality of many runs; the registry records the small subset promoted for the team to standardize on, with provenance back to the source run.
The registry stores raw data while the tracker stores models.
The tracker handles online serving while the registry handles batch inference.

Experiment tracking is the discipline of recording, for every model training run, the inputs (data version, code commit, hyperparameters, environment), the outputs (metrics, artifacts, predictions, plots), and the context (who, when, why) so that any past result can be understood, compared, and reproduced.

The Lost Notebook Problem

A data scientist runs ten variants in a notebook, saves model_final_v3_actually_final.pkl on a workstation, and posts a screenshot of validation AUC to Slack. Three months later: "Which preprocessing did that model use? Which features? What seed? Was that min_samples_leaf=5 or 15?" Nobody knows. The cost is wasted compute, silent regressions, and blocked collaboration. Modern trackers prevent this by writing every run to a durable, queryable backend at the moment the run happens.

Reproducibility, Comparison, Audit

A tracked run is a row in a database tying together a git commit, a configuration object, a dataset hash, a Python environment, time-series metrics, and an artifact folder. Comparison views, parallel-coordinate plots, and metric-vs-step charts all rest on this machinery. In regulated domains (finance, healthcare, hiring), auditors need to answer "which version of which model produced the score that denied this loan?" — a question that is nearly trivial when each run is timestamped, signed, and linked to a registered model version.
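To make the "one run, one row" idea concrete, here is a minimal, library-free sketch of a run record. It is not how MLflow or any other tracker stores runs; the names `RunRecord`, `fingerprint`, and `load_runs`, and the JSON-lines file standing in for the tracking backend, are all assumptions made for illustration.

```python
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """Short content hash used here as a stand-in for a dataset version."""
    return hashlib.sha256(data).hexdigest()[:12]


@dataclass
class RunRecord:
    """Hypothetical run row: inputs, outputs, and context in one object."""
    git_commit: str
    dataset_hash: str
    config: dict
    environment: dict
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    metrics: list = field(default_factory=list)    # time series of metric points
    artifacts: list = field(default_factory=list)  # paths or URIs of saved files

    def log_metric(self, name: str, value: float, step: int) -> None:
        self.metrics.append({"name": name, "step": step, "value": value})

    def save(self, backend: Path) -> None:
        """Append the run as a single JSON line in the backend file."""
        backend.parent.mkdir(parents=True, exist_ok=True)
        with backend.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(self), sort_keys=True) + "\n")


def load_runs(backend: Path) -> list:
    """Read every stored run back as a dict so runs can be compared."""
    with backend.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With records like these, the question from the lost-notebook story ("was that min_samples_leaf=5 or 15?") becomes a lookup of `config["min_samples_leaf"]` on the stored row rather than an archaeology exercise.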

Foundation for the Model Registry

Tracker and registry are two halves of one system. The tracker captures the dozens of daily runs (failures, sweeps, ablations); the registry captures the small subset of models promoted for the team to standardize on. The bridge is provenance: every promoted model version points back to the exact run, code, data, and metrics that produced it.

Animation: Experiment Tracking Flow — Training Script to UI

A training script invokes the MLflow client to log params, metrics, and artifacts. The tracking server persists them and the UI surfaces the run.

Figure 8.1: Experiment tracking flow from code to registry

flowchart LR
    A[Code Commit] --> B[Training Run]
    C[Data Version] --> B
    D[Hyperparameters] --> B
    B --> E[Log Params]
    B --> F[Log Metrics]
    B --> G[Log Artifacts]
    E --> H[(Tracking Backend)]
    F --> H
    G --> H
    H --> I[Compare and Select]
    I --> J[Model Registry]

Key Points

Tracking records the full context of a run: code commit, data version, hyperparameters, environment, metrics, artifacts.
It exists primarily to defeat the "lost notebook" failure mode: half-remembered cells, untracked seeds, irreproducible AUC screenshots.
Reproducibility means that the same code, data, environment, and seeds yield the same numbers; a tracker records all four, or a pointer to each.
In regulated domains, the audit trail (commit, user, metrics, stage transition) is mandatory, not a nice-to-have.
The registry is the formal "small subset" of tracked runs; provenance links every promoted version back to its source run.

Post-Reading Quiz — Why Track?

1. What is the core failure mode that experiment tracking is designed to prevent?

2. In a regulated industry, what does experiment tracking provide that ad-hoc notebooks do not?

3. What is the relationship between an experiment tracker and a model registry?

2. Experiment Tracking Tools

Pre-Reading Quiz — Tools

1. Which tracker is the de facto open-source standard with a first-class model registry and full self-hosting?

Weights & Biases
MLflow
Neptune
Comet

2. A research team prioritizes interactive dashboards, GPU metrics, shareable Reports, and one-line PyTorch Lightning callbacks. Which tracker fits best?

MLflow OSS, because it integrates with PyTorch.
Weights & Biases, whose identity lives in best-in-class visualization, Reports, and native framework callbacks.
A custom SQLite database queried with pandas.
Plain TensorBoard with no remote backend.

3. What is the principal trade-off of SaaS-only trackers versus self-hosted ones?

SaaS trackers cannot log artifacts; only self-hosted ones can.
SaaS shifts operational burden to a vendor but raises data-residency and lock-in concerns vs self-hosted, which keeps metadata in your perimeter.
Self-hosted trackers are always faster than SaaS at scale.
SaaS trackers do not support hyperparameter sweeps.

Four platforms dominate: MLflow, Weights & Biases (W&B), Neptune, and Comet. All four cover the core capabilities; they differ on the axes that matter when a team commits.

MLflow Tracking

The de facto OSS standard, organized into four pillars: Tracking, Projects, Models, and Registry. Architecturally it is a FastAPI server backed by a relational metadata DB (Postgres/MySQL/SQLite) and a pluggable artifact store (S3, GCS, Azure Blob, NFS). Regulated orgs run it fully inside their VPC and treat it as the system of record. Trade-off: you operate the DB, storage, server, and upgrades.

Weights & Biases (W&B)

SaaS-first, visualization-strongest. Interactive metric panels, system metrics, gradient histograms, custom panels, and shareable Reports. W&B Artifacts provide versioned lineage; W&B Sweeps orchestrates HPO (grid/random/Bayesian) with live search-space visualization. Self-hosting only on enterprise.

Neptune

Positions itself as a metadata store: structured logging, custom fields, fast search across many runs. Built for "50,000 experiments across 12 teams, queryable like a database." SaaS or self-hosted; its registry is less central to the product than MLflow's.

Comet

Balanced hosted experience with a useful online/offline mode for spot or air-gapped runs that sync later. SaaS or self-hosted/VPC. Good for mid-size teams wanting hosted convenience plus a private path.

Comparison Table

Dimension	MLflow	W&B	Neptune	Comet
Core focus	OSS lifecycle + registry	SaaS tracking + reports	Metadata store at scale	Tracking + hybrid hosting
Self-host / on-prem	Yes (full OSS)	Enterprise only	Yes (SaaS or VPC)	Yes (SaaS or VPC)
Registry maturity	First-class, stages	Built-in, integrated	Basic; less central	Versions + stages
Visualization	Functional, basic	Best-in-class	Structured browsing	Solid
Best fit	Regulated, infra-heavy	Research, fast iteration	Many teams, governance	Mid-size hosted + private

Figure 8.2: MLflow tracking server architecture

flowchart TD
    A["Training Client<br/>mlflow.log_*"] --> B["MLflow Tracking Server<br/>FastAPI + UI"]
    C[Notebook Client] --> B
    D[Pipeline Client] --> B
    B --> E[("Metadata DB<br/>Postgres / MySQL / SQLite")]
    B --> F[("Artifact Store<br/>S3 / GCS / Azure Blob / NFS")]
    E --> G["Run Metadata<br/>params, metrics, tags"]
    F --> H["Models, Plots,<br/>Datasets, Logs"]
    B --> I["Model Registry<br/>Versions and Stages"]

Key Points

MLflow: open-source, strong registry, runs anywhere, you own the ops.
W&B: SaaS-first, best visualization and collaboration, per-seat pricing above free tier.
Neptune: structured-metadata search at scale; SaaS or VPC.
Comet: polished hosted experience with online/offline logging and a private-cloud option.
Choose on organizational axes (compliance, hosting, registry-first vs viz-first), not feature checklists.

Post-Reading Quiz — Tools

1. Which tracker is the de facto open-source standard with a first-class model registry and full self-hosting?

Weights & Biases
MLflow
Neptune
Comet

2. A research team prioritizes interactive dashboards, GPU metrics, shareable Reports, and one-line PyTorch Lightning callbacks. Which tracker fits best?

3. What is the principal trade-off of SaaS-only trackers versus self-hosted ones?

3. Hyperparameter Tuning

Pre-Reading Quiz — HPO

1. Why does random search typically beat grid search at the same budget?

Random search is parallelizable while grid search is sequential.
Random search allocates samples non-redundantly across important dimensions instead of wasting trials on irrelevant grid axes.
Grid search cannot use GPUs.
Random search uses a Bayesian surrogate model internally.

2. What are the two components Bayesian optimization uses on each iteration?

A neural network and a genetic algorithm.
A surrogate model (GP/TPE/RF) of past evaluations plus an acquisition function (EI/UCB/PI) to choose the next point.
A gradient estimator and a momentum buffer.
A k-NN classifier and a logistic prior.

3. When does ASHA shine, and when does it fail?

Shines when early performance predicts final performance and you have many parallel workers; fails when models learn late or have non-monotonic curves.
Shines only on CPU-bound workloads; fails on GPU clusters.
Shines on tabular data only; fails on images.
It always outperforms Bayesian optimization regardless of regime.

Hyperparameters are the knobs the training optimizer cannot turn: learning rate, network depth and width, dropout, regularization strength, kernel choice, tree depth, and so on. Finding good values is itself an optimization problem: a black-box search over a mixed-type space where each evaluation is expensive.

Grid and Random Search

Grid search enumerates the Cartesian product of per-hyperparameter value sets — deterministic and trivially parallel but exponential in dimension. Random search samples from per-hyperparameter distributions and is the standard baseline that any smarter method should beat, because it does not waste trials on dimensions that do not matter. Neither method learns: trial 100 is sampled the same way as trial 1.
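The argument for random search can be seen with a small count. The sketch below assumes a hypothetical two-dimensional space on the unit square in which only the first hyperparameter matters, and compares how many distinct values of that hyperparameter each method tries at the same budget of nine trials. All function names are illustrative.

```python
import itertools
import random


def grid_trials(values_per_axis: int) -> list:
    """Cartesian grid over the unit square: values_per_axis ** 2 trials."""
    axis = [i / (values_per_axis - 1) for i in range(values_per_axis)]
    return list(itertools.product(axis, axis))


def random_trials(n: int, seed: int = 0) -> list:
    """n independent uniform samples from the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]


def distinct_values(trials: list, dim: int) -> int:
    """Number of different settings of one hyperparameter that were tried."""
    return len({trial[dim] for trial in trials})


grid = grid_trials(3)     # 3 x 3 = 9 trials
rand = random_trials(9)   # same budget of 9 trials

grid_distinct = distinct_values(grid, dim=0)   # the grid repeats each value 3 times
rand_distinct = distinct_values(rand, dim=0)   # every random sample is a new value
```

If the second hyperparameter is irrelevant, the grid has effectively run three experiments three times each, while random search has explored nine settings of the dimension that matters. The gap widens as dimensions are added, since a grid with k values per axis needs k to the power d trials in d dimensions.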

Bayesian Optimization (BO)

BO fits a surrogate model (Gaussian Process, random forest as in SMAC, or Tree-structured Parzen Estimator as in HyperOpt/Optuna) to past evaluations, then uses an acquisition function (EI, UCB, PI) to pick where to sample next, balancing exploration vs exploitation. Strength: sample efficiency when each run is expensive and the space is 20-30 dimensions or fewer. Weaknesses: high dimensions, categorical/conditional parameters, parallelism beyond ~10-20 workers, and ignorance of intermediate learning curves.
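The surrogate-plus-acquisition loop can be sketched in one dimension with a heavily simplified TPE-style sampler. This is not Optuna's or HyperOpt's implementation: the objective, the fixed bandwidth, the `gamma` split, and every function name are assumptions chosen to keep the example short. The surrogate is a pair of kernel density estimates, l(x) over the best trials and g(x) over the rest, and the acquisition step picks the candidate with the highest l(x)/g(x) ratio, which in the TPE formulation plays the role of maximizing expected improvement.

```python
import math
import random


def objective(x: float) -> float:
    """Hypothetical expensive black box to minimize; optimum at x = 2."""
    return (x - 2.0) ** 2


def kde(x: float, centers: list, bandwidth: float) -> float:
    """Gaussian kernel density estimate at x, floored to avoid division by zero."""
    if not centers:
        return 1e-12
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return total / (len(centers) * norm) + 1e-12


def suggest(history: list, rng: random.Random, low: float, high: float,
            gamma: float = 0.25, n_candidates: int = 24,
            bandwidth: float = 0.5) -> float:
    """Fit the two densities to past trials and return the best-ratio candidate."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    # Draw candidates near good points (exploitation with some spread), clipped to bounds.
    candidates = [
        min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
        for _ in range(n_candidates)
    ]
    return max(candidates,
               key=lambda x: kde(x, good, bandwidth) / kde(x, bad, bandwidth))


def optimize(n_init: int = 10, n_iter: int = 30, low: float = -5.0,
             high: float = 5.0, seed: int = 0) -> list:
    """Random warm-up trials followed by model-guided trials."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_init):
        x = rng.uniform(low, high)
        history.append((x, objective(x)))
    for _ in range(n_iter):
        x = suggest(history, rng, low, high)
        history.append((x, objective(x)))
    return history


history = optimize()
best_x, best_loss = min(history, key=lambda pair: pair[1])
```

Unlike grid or random search, trial 40 here is chosen using everything learned from trials 1 through 39. The cost is that each suggestion depends on completed results, which is one reason this family of methods parallelizes less gracefully than random search.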

Animation: Bayesian Optimization Loop

A surrogate (blue) fits observed points (dots). The acquisition function (orange) peaks at the most informative next point; a new observation is added; the surrogate updates. Four iterations.

Figure 8.3: Bayesian optimization loop

flowchart TD
    A["Observed Trials<br/>lambda, performance"] --> B["Fit Surrogate Model<br/>GP / TPE / RF"]
    B --> C["Evaluate Acquisition Function<br/>EI / UCB / PI"]
    C --> D["Select Next Hyperparameter<br/>lambda*"]
    D --> E["Train Model<br/>and Evaluate"]
    E --> F[Record Performance]
    F --> A
    F --> G{"Budget<br/>exhausted?"}
    G -->|No| B
    G -->|Yes| H[Return Best Config]

Hyperband and ASHA

Instead of modeling the objective, multi-fidelity methods allocate compute adaptively. Successive Halving: launch many configs cheaply, evaluate, keep the top fraction (e.g., 1/eta), increase resource (epochs, time, data fraction), repeat. Hyperband wraps this in brackets with different (n, r) trade-offs. ASHA is the asynchronous parallel variant — trials are promoted or stopped at each rung as soon as they finish, scaling to thousands of workers with no global sync.

ASHA shines when early performance is predictive of final performance and parallel compute is abundant; it fails on models that learn late or have non-monotonic curves.
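The rung arithmetic is easiest to see in the synchronous case. The sketch below implements plain Successive Halving, not the asynchronous ASHA scheduler, with eta = 3 and 27 starting configurations. The search space (a single learning rate) and the `toy_evaluate` scoring function are hypothetical stand-ins for a real training run.

```python
import math
import random


def toy_evaluate(cfg: dict, epochs: int) -> float:
    """Hypothetical validation score: best near lr = 1e-2, improving with epochs."""
    closeness = -abs(math.log10(cfg["lr"]) + 2.0)
    return closeness + 0.1 * math.log(epochs)


def successive_halving(n_configs: int, min_epochs: int, max_epochs: int,
                       eta: int, evaluate, seed: int = 0):
    """Keep the top 1/eta of configs at each rung and give them eta times the epochs."""
    rng = random.Random(seed)
    survivors = [{"lr": 10 ** rng.uniform(-4, -1)} for _ in range(n_configs)]
    epochs = min_epochs
    rungs = []  # (number of trials, epochs per trial) for each rung
    while True:
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs), reverse=True)
        rungs.append((len(scored), epochs))
        if epochs >= max_epochs or len(scored) == 1:
            return scored[0], rungs
        survivors = scored[: max(1, len(scored) // eta)]
        epochs = min(max_epochs, epochs * eta)


best, rungs = successive_halving(27, 1, 27, 3, toy_evaluate)

# Compute spent if every rung retrains from scratch, versus training all 27 configs fully.
epochs_spent = sum(n_trials * n_epochs for n_trials, n_epochs in rungs)
epochs_exhaustive = 27 * 27
```

Under these assumptions the rungs are 27, 9, 3, and 1 trials at 1, 3, 9, and 27 epochs, for 108 epochs in total against 729 for training every configuration to the full budget. Resuming survivors from checkpoints instead of retraining would lower the total further. The toy scorer ranks configurations identically at every epoch count, which is exactly the "early performance is predictive" condition; a model that learns late would be pruned at rung 0.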

Animation: ASHA Rung-Based Early Stopping

27 trials start at rung 0 with 1 epoch. The top 1/3 advance to rung 1 (3 epochs); the bottom fade. The top 1/3 of those advance to rung 2 (9 epochs), then to rung 3 (27 epochs, full budget). One winner emerges.

Figure 8.4: ASHA rung-based early stopping

flowchart TD
    A["Rung 0: 27 trials<br/>1 epoch each"] --> B{"Top 1/3<br/>by metric"}
    B -->|Promote 9| C["Rung 1: 9 trials<br/>3 epochs each"]
    B -->|Stop 18| X1[Pruned]
    C --> D{"Top 1/3<br/>by metric"}
    D -->|Promote 3| E["Rung 2: 3 trials<br/>9 epochs each"]
    D -->|Stop 6| X2[Pruned]
    E --> F{"Top 1/3<br/>by metric"}
    F -->|Promote 1| G["Rung 3: 1 trial<br/>27 epochs - full budget"]
    F -->|Stop 2| X3[Pruned]
    G --> H[Best Configuration]

Population-Based Training (PBT)

PBT optimizes hyperparameters and weights jointly over training time. A population of N models trains in parallel; at periodic exploit/explore steps, low performers copy weights and hyperparameters from top-performing peers (exploit), then perturb the copied hyperparameters (explore). The output is a hyperparameter schedule, not a fixed vector. Excellent for deep RL and long-horizon supervised training; expensive in compute and infrastructure.
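A single exploit/explore step can be sketched on an in-memory population. The dictionary layout, the bottom/top quantile of 25%, and the perturbation factors 0.8 and 1.2 are assumptions for illustration, not the settings of any particular PBT implementation; the function assumes at least two members.

```python
import copy
import random


def exploit_and_explore(population: list, rng: random.Random,
                        quantile: float = 0.25, factors=(0.8, 1.2)) -> list:
    """One PBT step: the bottom quantile clones a top member, then perturbs hparams.

    Returns the names of the members that were replaced.
    """
    ranked = sorted(population, key=lambda member: member["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * quantile))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    replaced = []
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: copy the donor's weights (deep copy so later training stays independent).
        member["weights"] = copy.deepcopy(donor["weights"])
        # Explore: start from the donor's hyperparameters and nudge each one.
        member["hparams"] = {
            name: value * rng.choice(factors)
            for name, value in donor["hparams"].items()
        }
        replaced.append(member["name"])
    return replaced
```

Calling this every K training steps and recording each member's hyperparameters over time yields the schedule that PBT produces: the learning rate a surviving lineage used early in training may differ from the one it ends with.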

BOHB

BOHB = TPE-style sampling for which configurations to try, plus Hyperband for when to stop them. Often the strongest default for large DL tuning workloads.

Algorithm Cheat-Sheet

Method	Learns?	Early stop?	Parallel scaling	Best fit
Grid	No	No	Good	Tiny spaces, sensitivity analysis
Random	No	Optional	Excellent	Cheap baseline, high-dim
Bayesian	Yes	Not inherent	Limited (~10-20 workers)	Expensive runs, modest dim
Hyperband / ASHA	Partially	Yes (core)	Excellent	Large DL, early signals
BOHB	Yes	Yes	Excellent	Mixed regime, large DL
PBT	Population	Implicit	Excellent	Deep RL, long runs

Distributed HPO Tools

Optuna: Python-native, Samplers (TPE/CMA-ES) and Pruners (median/ASHA), ask-and-tell API, gRPC storage proxy for thousands of workers.
Ray Tune: built on Ray; resource-aware scheduler with fractional GPUs; ASHA/PBT/BOHB; integrates with MLflow.
Kubeflow Katib: Kubernetes CRDs (Experiment / Suggestion / Trial); framework-agnostic container Trials; algorithm services as gRPC images.
Google Vizier: the ancestor of these tools, first internal to Google and later open-sourced; transfer learning across studies, large-scale parallel trial management.

Figure 8.5: PBT exploit/explore cycle

sequenceDiagram
    participant W1 as Worker 1 (low perf)
    participant W2 as Worker 2 (top perf)
    participant Sched as PBT Scheduler
    participant Store as Checkpoint Store
    W1->>Sched: Report metric @ step T
    W2->>Sched: Report metric @ step T
    Sched->>Sched: Rank population
    Sched-->>W1: Exploit: copy from W2
    W2->>Store: Save weights + hparams
    Store-->>W1: Load W2 checkpoint
    Sched-->>W1: Explore: perturb hparams
    W1->>W1: Resume training with new schedule
    W2->>W2: Continue training
    Note over W1,W2: Repeat every K steps

Key Points

Random is the baseline; Bayesian wins on sample efficiency at modest dimension; ASHA wins on compute efficiency with parallel workers; BOHB hybridizes both; PBT wins when schedules matter.
BO = surrogate (GP/TPE/RF) + acquisition (EI/UCB/PI), trading exploration vs exploitation.
ASHA's resource axis is anything monotonically increasable: epochs, time, dataset fraction, image resolution.
PBT outputs a schedule of hyperparameters over training, not a fixed vector.
Tool choice: Optuna for Python flexibility, Ray Tune for Ray clusters, Katib for Kubernetes-native AutoML.

Post-Reading Quiz — HPO

1. Why does random search typically beat grid search at the same budget?

2. What are the two components Bayesian optimization uses on each iteration?

3. When does ASHA shine, and when does it fail?

4. From Experiment to Pipeline

Pre-Reading Quiz — To Pipeline

1. What is the recommended way to capture a winning hyperparameter configuration so it survives into a production pipeline?

Save the values as inline literals in the training script for speed.
Commit a structured config file (e.g., YAML) to version control under a path like configs/<model>/v3.yaml; the pipeline reads only from this file.
Paste them into a Slack pinned message so the team can refer back to them.
Encode them into the model artifact's filename.

2. Why is aggressive HPO statistically risky if you use the validation set as the final benchmark?

Validation sets are slower to load than test sets.
Searching thousands of configurations is multiple-comparisons testing against validation; the apparent winner can overstate true generalization. A held-out test set used once at promotion defends against this.
Validation sets are biased toward the training distribution.
HPO frameworks cannot log validation metrics correctly.

3. What must be pinned (versioned) to allow deterministic reproduction of a winning run?

Only the random seed.
Code commit, library versions / container image, data version, random seeds, and config file path.
Only the final model artifact.
Only the training metrics chart.

A good HPO sweep ends with a winning recipe that can be re-run reliably as part of a production pipeline — not just a screenshot of a leaderboard.

Codifying Winning Hyperparameters

Commit the winning configuration to version control as a structured config file (YAML, JSON, or Hydra/Pydantic) at a path like configs/fraud_classifier/v3.yaml. The training pipeline reads hyperparameters only from this file, never from inline literals, and logs the file's values as MLflow parameters at the start of every run. Code review then meaningfully covers hyperparameter changes, and git history records who changed learning_rate from 3e-4 to 1e-4 and when.
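A config-driven pipeline entry point might look like the sketch below. It uses JSON rather than YAML only so that the example needs nothing beyond the standard library; the `TrainConfig` fields other than learning_rate, and the helper names, are hypothetical. The returned dictionary is what a pipeline would hand to its tracker's parameter-logging call.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical schema for a committed hyperparameter file."""
    learning_rate: float
    max_depth: int
    seed: int


def load_config(path: Path) -> TrainConfig:
    """Parse the committed file; unknown or missing keys raise TypeError."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    return TrainConfig(**raw)


def config_digest(cfg: TrainConfig) -> str:
    """Short hash of the canonical config, useful for spotting silent edits."""
    canonical = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def params_to_log(path: Path) -> dict:
    """Everything the run should record about its configuration."""
    cfg = load_config(path)
    return {**asdict(cfg), "config_path": str(path), "config_digest": config_digest(cfg)}
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled hyperparameter name fails the run immediately instead of silently falling back to a default.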

Avoiding Overfit to the Validation Set

Aggressive HPO is multiple-comparisons testing against validation: run a thousand configurations and the best one will look better than its true generalization warrants. Three defenses:

Hold out a true test set that no sweep ever sees, evaluated exactly once at promotion.
Use nested cross-validation (or rolling-origin for time series): HPO in the inner loop, final eval in the outer loop.
Prefer configurations near the top, not the literal best — a config that is robustly excellent across folds is more trustworthy than one that wins narrowly on one fold.
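The multiple-comparisons effect is easy to reproduce in simulation. The sketch below assumes, purely for illustration, that every one of 1,000 configurations has the same true accuracy of 0.80 and that validation and test scores differ from it only by independent Gaussian noise with standard deviation 0.02. Any apparent winner is therefore luck by construction.

```python
import random


def simulate_sweep(n_configs: int = 1000, true_accuracy: float = 0.80,
                   noise_sd: float = 0.02, seed: int = 0):
    """Return (validation score of the sweep winner, its score on a fresh test set)."""
    rng = random.Random(seed)
    # Noisy validation score for each configuration; the true accuracy is identical.
    val_scores = [true_accuracy + rng.gauss(0.0, noise_sd) for _ in range(n_configs)]
    winner = max(range(n_configs), key=val_scores.__getitem__)
    # The held-out test set is looked at once, only for the winner.
    test_score = true_accuracy + rng.gauss(0.0, noise_sd)
    return val_scores[winner], test_score


best_val, winner_test = simulate_sweep()
optimism = best_val - 0.80   # how much the leaderboard overstates the truth
```

With a thousand draws the maximum typically lands around three standard deviations above the mean, so the leaderboard's top entry looks several points better than 0.80 while its test score is just another draw centered on 0.80. That gap is the optimism a held-out test set is meant to expose.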

Reproducing Deterministically

Pin library versions in requirements.txt / poetry.lock / a container image, and log the image hash with the run.
Reference datasets by version (DVC tag, Delta Lake version, feature-store snapshot ID, S3 object hash), never a mutable path.
Set seeds centrally (Python random, NumPy, PyTorch/TF CPU and GPU); use torch.use_deterministic_algorithms(True) when bit-determinism matters.
Log the code commit hash, container image, data version, and config file path with the run.
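Central seeding can be as small as one function called at the top of the pipeline. The sketch below seeds Python's random module unconditionally and seeds NumPy and PyTorch only if they are importable; the function name `seed_everything` and the returned report are assumptions for illustration. Note that `torch.use_deterministic_algorithms(True)` makes PyTorch raise an error on operations that have no deterministic implementation, so enabling it is a trade-off rather than a free switch.

```python
import os
import random


def seed_everything(seed: int) -> dict:
    """Seed every available RNG and report which libraries were seeded."""
    seeded = {"python": True, "numpy": False, "torch": False}
    random.seed(seed)
    # Affects only child processes started later; this interpreter's hash seed is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded["numpy"] = True
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)   # safe to call when no GPU is present
        torch.use_deterministic_algorithms(True)
        seeded["torch"] = True
    except ImportError:
        pass
    return seeded
```

Logging the returned dictionary alongside the seed gives the run a record of which sources of randomness were actually controlled.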

Linking to the Model Registry

After a candidate retraining run, the pipeline registers the model with rich metadata (source run ID, git commit, data version, config path, eval metrics, validation reports). Promotions through stages (None → Staging → Production → Archived) are explicit, auditable, ideally gated by automated tests — not by a human clicking a button without checks. The registry-to-tracking link runs in both directions, providing the bidirectional traceability that underpins audit and debugging.
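Gated, auditable promotion can be sketched as a small state machine. This is an in-memory illustration, not the MLflow registry API: the `ModelVersion` fields mirror the metadata listed above, while the `ALLOWED` transition table, the gate format (metric name mapped to a minimum value), and the `promote` function are assumptions.

```python
from dataclasses import dataclass, field

# Permitted stage transitions, following None -> Staging -> Production -> Archived.
ALLOWED = {
    "None": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


@dataclass
class ModelVersion:
    """Hypothetical registry entry carrying provenance back to its source run."""
    name: str
    version: int
    source_run_id: str
    git_commit: str
    data_version: str
    config_path: str
    metrics: dict
    stage: str = "None"
    history: list = field(default_factory=list)


def promote(mv: ModelVersion, target: str, gates: dict, approver: str) -> None:
    """Move a version to the target stage only if the transition and all gates pass."""
    if target not in ALLOWED[mv.stage]:
        raise ValueError(f"transition {mv.stage} -> {target} is not allowed")
    failed = [
        metric for metric, minimum in gates.items()
        if mv.metrics.get(metric, float("-inf")) < minimum
    ]
    if failed:
        raise ValueError(f"gates failed: {failed}")
    # Record the audit trail before changing state.
    mv.history.append({"from": mv.stage, "to": target, "approver": approver})
    mv.stage = target
```

Because `source_run_id` travels with the version and every transition lands in `history`, the trail can be followed in both directions: from a production model back to its run, and from a run forward to whatever was promoted from it.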

Figure 8.6: Experiment-to-pipeline promotion

flowchart LR
    A["HPO Sweep<br/>Notebook"] --> B[Winning Config]
    B --> C["Commit config YAML<br/>to git"]
    C --> D["Training Pipeline<br/>reads config"]
    D --> E["Tracked Run<br/>pinned data + env + seed"]
    E --> F[Test-set Evaluation]
    F --> G{"Pass<br/>thresholds?"}
    G -->|Yes| H["Register Model<br/>with provenance"]
    G -->|No| A
    H --> I[Staging]
    I --> J[Production]

Key Points

Commit winning hyperparameters as a versioned config file; never let them live in a notebook cell or Slack message.
HPO selection on validation alone overfits to validation — hold out a true test set used once at promotion.
Pin everything reproducible: code, libraries, container, data version, random seeds.
Log commit hash, image hash, data version, and config path with every tracked run.
Registry handoff with full provenance closes the loop and is the prerequisite for safe production promotion.

Post-Reading Quiz — To Pipeline

1. What is the recommended way to capture a winning hyperparameter configuration so it survives into a production pipeline?

2. Why is aggressive HPO statistically risky if you use the validation set as the final benchmark?

3. What must be pinned (versioned) to allow deterministic reproduction of a winning run?

Only the random seed.
Code commit, library versions / container image, data version, random seeds, and config file path.
Only the final model artifact.
Only the training metrics chart.
