Chapter 10: Model Packaging, Registry, and Versioning
Learning Objectives
Compare model packaging formats (pickle, ONNX, TorchScript, SavedModel, safetensors, GGUF) and pick one to match a deployment target.
Explain how a model registry binds artifacts, metadata, and lineage, and contrast the MLflow, Vertex AI, and SageMaker registries.
Build secure, small, fast-starting serving containers using multi-stage Docker builds, vendor base images, and GPU runtime hygiene.
Design a versioning strategy that links model artifact, code SHA, dataset version, and container image with checklist-driven promotion and alias-based rollback.
Section 1: Model Packaging Formats
Pre-Reading Quiz — Packaging Formats
1. Why is loading a `pickle` or `joblib` file from an untrusted source considered a security risk?
a) It silently downgrades the model to lower precision.
b) Pickle is Turing-complete and can execute arbitrary Python code on load.
c) It always strips the model's weights to zero.
d) It requires root privileges to read.
2. ONNX is best described as which of the following?
a) A PyTorch-only binary format.
b) A TensorFlow directory layout.
c) A framework-independent protobuf graph plus versioned opset, runnable across many inference runtimes.
d) A wire protocol for streaming training gradients.
3. Which format is the safer, mmap-friendly replacement for Hugging Face's `pytorch_model.bin` pickles?
a) GGUF
b) TorchScript
c) SavedModel
d) safetensors
A trained model in a notebook is like a finished symphony recorded only as a memory: vivid, complete, and useless to anyone else. Packaging converts that memory into a portable artifact whose contract is data, not code. The first instinct of `pickle.dump(model, f)` works, but unpickling can execute arbitrary Python code, so anything crossing a security or version boundary belongs in ONNX, TorchScript, SavedModel, safetensors, or GGUF instead.
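To make the "pickle is code" point concrete, here is a minimal sketch using only the standard library. The `Payload` class is hypothetical, and `os.getcwd` is a deliberately harmless stand-in for whatever callable an attacker would choose.

```python
import os
import pickle


class Payload:
    """Hypothetical object whose pickle is an instruction, not stored state."""

    def __reduce__(self):
        # Tells pickle: "at load time, call os.getcwd() and use its result".
        # An attacker would substitute a far less friendly callable here.
        return (os.getcwd, ())


blob = pickle.dumps(Payload())

# Loading runs the callable; no Payload instance is ever reconstructed.
loaded = pickle.loads(blob)
print(type(loaded).__name__, loaded)
```

The loader never asked to run anything from `os`, yet the call happened as a side effect of reading the file. That is the property that makes pickles unsuitable across a trust boundary.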
ONNX is a protobuf-described computation graph plus a versioned opset. Like a PDF, it is the print-ready exchange format any reader can render: a PyTorch model exported via `torch.onnx.export` can be loaded by ONNX Runtime in C++, TensorRT on an NVIDIA GPU, or OpenVINO on an Intel CPU without those runtimes knowing PyTorch exists. Common failure modes are unsupported operators, control flow tied to Python values, and undeclared dynamic shapes.
TorchScript is PyTorch's deployable JIT graph (via `torch.jit.script` or `torch.jit.trace`), runnable by LibTorch in C++ without the Python interpreter. SavedModel is TensorFlow's directory-based format carrying MetaGraphs, variables, assets, and named signatures like `serving_default`, consumed by TF Serving, TFX, and Vertex AI.
For LLMs specifically, safetensors is a flat tensor container that loads as data, not code, enabling zero-copy mmap loads of huge checkpoints. GGUF is the all-in-one llama.cpp cartridge bundling weights, tokenizer, metadata, and chat templates with quantization built in for CPU and edge inference.
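The "loads as data, not code" claim is easier to see from the file layout. The sketch below is a simplified, standard-library imitation of the safetensors layout as publicly documented: an 8-byte little-endian header length, a JSON header describing each tensor, then raw bytes. The helper names `write_tensors` and `read_tensors` are hypothetical, only float32 is handled, and the real library adds validation and memory mapping that are omitted here.

```python
import json
import struct


def write_tensors(tensors):
    """Serialize {name: (shape, [floats])} into a safetensors-style byte string."""
    header, payload = {}, b""
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            # Offsets are relative to the start of the raw data section.
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_tensors(blob):
    """Parse the blob using only JSON decoding and byte slicing; nothing is executed."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    data = memoryview(blob)[8 + header_len:]
    out = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form string metadata
            continue
        begin, end = meta["data_offsets"]
        count = (end - begin) // 4  # 4 bytes per float32
        out[name] = (meta["shape"], list(struct.unpack(f"<{count}f", data[begin:end])))
    return out


blob = write_tensors({"w": ((2, 2), [1.0, 2.0, 0.5, -3.0]), "b": ((2,), [0.25, 4.0])})
restored = read_tensors(blob)
print(restored)
```

Because every tensor is addressed by byte offsets, a reader can map the file into memory and slice out one tensor without touching the rest, which is what makes zero-copy loading of large checkpoints possible.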
Animation: ONNX as a cross-framework intermediate representation
Multiple training frameworks export to one ONNX IR, which fans out to multiple inference runtimes.
Figure 10.1: ONNX as the cross-framework intermediate representation
Key Points
Pickle = code, not data. Loading a pickle executes arbitrary Python; fine for experiments, dangerous across security/version boundaries.
ONNX is the cross-framework PDF. A single protobuf graph + opset runs on ONNX Runtime, TensorRT, OpenVINO, and Triton.
Framework-native formats exist for a reason. TorchScript ships PyTorch models without Python; SavedModel is the TF-Serving canonical directory.
LLM-era formats. safetensors loads as data only (no exec); GGUF is the all-in-one quantization-aware cartridge for llama.cpp.
Trend in 2024-2025. PyTorch is investing in `torch.export`/`torch.compile`; cross-framework bets increasingly go through ONNX.
Post-Reading Quiz — Packaging Formats
1. Why is loading a `pickle` or `joblib` file from an untrusted source considered a security risk?
a) It silently downgrades the model to lower precision.
b) Pickle is Turing-complete and can execute arbitrary Python code on load.
c) It always strips the model's weights to zero.
d) It requires root privileges to read.
2. ONNX is best described as which of the following?
a) A PyTorch-only binary format.
b) A TensorFlow directory layout.
c) A framework-independent protobuf graph plus versioned opset, runnable across many inference runtimes.
d) A wire protocol for streaming training gradients.
3. Which format is the safer, mmap-friendly replacement for Hugging Face's `pytorch_model.bin` pickles?
a) GGUF
b) TorchScript
c) SavedModel
d) safetensors
Section 2: The Model Registry
Pre-Reading Quiz — Model Registry
1. In the MLflow registry, what does `archive_existing_versions=True` accomplish during a promotion to Production?
a) It deletes the previous artifacts from object storage.
b) It prevents two versions from claiming the Production stage simultaneously.
c) It forces a retraining of the previous version.
d) It disables alias-based serving.
2. Which registry exposes a first-class `ModelApprovalStatus` field (PendingManualApproval / Approved / Rejected)?
a) MLflow Model Registry
b) Vertex AI Model Registry
c) SageMaker Model Registry
d) Hugging Face Hub
3. What is the operational benefit of using a mutable alias like `@prod` to point to a model version?
a) It encrypts the artifact at rest.
b) It guarantees backward compatibility of input schemas.
c) Serving code can keep loading `models:/name@prod` and pick up new versions with no redeployment.
d) It eliminates the need for any RBAC controls.
A registry is to a model what a card catalog is to a book: not the content itself, but the index that makes it findable, attributable, and governable. A registry entry bundles three things: the artifact (or pointer to object storage), the metadata (metrics, hyperparameters, schema), and the lineage (which run, which dataset, which commit). The question "which training run produced the model serving 95% of production traffic" should be one click away, not Slack archaeology.
MLflow's classic lifecycle defines four explicit stages: None, Staging, Production, Archived. A single API call moves a version forward and, with `archive_existing_versions=True`, archives any incumbent in the same call so that no two versions claim Production at once. Newer MLflow versions (2.3 and later) add aliases such as `@prod` or `@champion`: mutable pointers that let serving code load `models:/churn_model@prod` while you re-point the alias to v8 with no redeployment. Recent MLflow releases mark stages as deprecated in favor of aliases and tags, so new projects should lean on aliases.
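The mechanics of stages and aliases can be sketched with a toy in-memory registry. `ToyRegistry` and its methods are hypothetical and are not the MLflow API; they only mirror the behavior described above: a promotion that archives the incumbent, and an alias that serving code resolves at load time.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical stand-in for a registry; not the MLflow client."""

    stages: dict = field(default_factory=dict)   # version -> stage name
    aliases: dict = field(default_factory=dict)  # alias -> version

    def register(self, version):
        self.stages[version] = "None"

    def transition(self, version, stage, archive_existing=False):
        if archive_existing:
            # Archive every other version currently holding the target stage.
            for other, current in list(self.stages.items()):
                if other != version and current == stage:
                    self.stages[other] = "Archived"
        self.stages[version] = stage

    def set_alias(self, alias, version):
        if version not in self.stages:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version

    def resolve(self, uri):
        # Accepts URIs shaped like "models:/churn_model@prod".
        alias = uri.rsplit("@", 1)[1]
        return self.aliases[alias]


registry = ToyRegistry()
for v in (7, 8):
    registry.register(v)

registry.transition(7, "Production")
registry.set_alias("prod", 7)
registry.transition(8, "Production", archive_existing=True)  # v7 becomes Archived
registry.set_alias("prod", 8)
after_promotion = registry.resolve("models:/churn_model@prod")

registry.set_alias("prod", 7)  # rollback: one alias flip, no retraining
after_rollback = registry.resolve("models:/churn_model@prod")
print(registry.stages, after_promotion, after_rollback)
```

The serving code only ever sees the URI string, which is why the promotion and the rollback require no redeployment.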
The three dominant registries diverge sharply. MLflow is OSS and vendor-neutral; governance is bring-your-own (tags, PR-driven alias flips). Vertex AI is GCP-native with deep end-to-end lineage and alias/label/endpoint-driven promotion (no fixed enum). SageMaker leans hardest on governance: every Model Package has an explicit ModelApprovalStatus, audited via CloudTrail and gated by IAM. Across all three, the rule is universal: promotion requires stronger permissions than read access.
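The rule that promotion needs stronger permissions than read access can be illustrated with a small gate. This is a sketch under stated assumptions: the role names, the `package` dictionary, and `set_status` are hypothetical, and only the three status values come from the SageMaker field named above. Real registries enforce this through IAM or equivalent, not application code.

```python
from enum import Enum


class ApprovalStatus(Enum):
    PENDING = "PendingManualApproval"
    APPROVED = "Approved"
    REJECTED = "Rejected"


# Hypothetical roles: reading is cheap, approving is privileged.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "approver": {"read", "approve"},
}


def set_status(role, package, new_status):
    """Change the approval status only if the role holds the 'approve' permission."""
    if "approve" not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role!r} may not change approval status")
    package["status"] = new_status
    package["audit"].append((role, new_status.value))  # who changed what


package = {"name": "churn_model", "version": 7,
           "status": ApprovalStatus.PENDING, "audit": []}

try:
    set_status("viewer", package, ApprovalStatus.APPROVED)
    viewer_blocked = False
except PermissionError:
    viewer_blocked = True

set_status("approver", package, ApprovalStatus.APPROVED)
print(viewer_blocked, package["status"].value, package["audit"])
```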
Animation: Model registry lifecycle (None → Staging → Production → Archived)
States illuminate in sequence; the production version glows, and the amber rollback arrow shows that archived versions stay one alias-flip from live.
Figure 10.2: MLflow model registry lifecycle
stateDiagram-v2
[*] --> None: register_model()
None --> Staging: transition(Staging)
Staging --> Production: transition(Production, archive_existing=True)
Staging --> Archived: superseded
Production --> Archived: new version promoted
Production --> Staging: rollback for re-eval
Archived --> Staging: re-promote for rollback
Archived --> [*]
Key Points
A registry binds artifact + metadata + lineage. Without all three, a model is just a `.bin` in a bucket.
Stages give a defined lifecycle. MLflow's None → Staging → Production → Archived prevents ambiguity about what is live.
Aliases decouple serving from version numbers. `models:/name@prod` keeps working when `prod` re-points to v8.
SageMaker = strongest governance. Explicit `ModelApprovalStatus` and Model Cards make audit trivial.
RBAC is non-negotiable. Promotion requires stronger permissions than read access — otherwise the registry is a sticky note.
Post-Reading Quiz — Model Registry
1. In the MLflow registry, what does `archive_existing_versions=True` accomplish during a promotion to Production?
a) It deletes the previous artifacts from object storage.
b) It prevents two versions from claiming the Production stage simultaneously.
c) It forces a retraining of the previous version.
d) It disables alias-based serving.
2. Which registry exposes a first-class `ModelApprovalStatus` field (PendingManualApproval / Approved / Rejected)?
a) MLflow Model Registry
b) Vertex AI Model Registry
c) SageMaker Model Registry
d) Hugging Face Hub
3. What is the operational benefit of using a mutable alias like `@prod` to point to a model version?
a) It encrypts the artifact at rest.
b) It guarantees backward compatibility of input schemas.
c) Serving code can keep loading `models:/name@prod` and pick up new versions with no redeployment.
d) It eliminates the need for any RBAC controls.
Section 3: Containerization for Serving
Pre-Reading Quiz — Containerization
1. What is the primary advantage of a multi-stage Docker build for a model-serving image?
a) It encrypts model weights at rest.
b) It strictly separates heavy build-time tooling from a lean runtime, shrinking the final image and attack surface.
c) It guarantees GPU access without any host driver.
d) It auto-versions images using semantic versioning.
2. Why is image size considered a cold-start concern, not just a storage concern?
a) Larger images consume more GPU memory at inference time.
b) Each extra gigabyte must be pulled to a fresh node before the first prediction can be served.
c) Cold starts are measured in image size, not seconds.
d) Smaller images are required by the ONNX standard.
3. Which of the following is the recommended pattern for GPU containers?
a) Bake the NVIDIA kernel driver into the image for portability.
b) Run privileged containers with no version pinning.
c) Rely on the host's NVIDIA kernel driver, ship only the CUDA runtime in the container, and pin every version.
d) Use the `latest` tag for the base image to always get security fixes.
A serving image is a contract: "given an HTTP/gRPC request, I will load this exact artifact in this exact runtime and return a prediction." The cleanest way to honor that contract is a multi-stage Dockerfile. A builder stage based on a CUDA `-devel` image installs compilers, wheels, and any custom CUDA kernels. A runtime stage based on `nvcr.io/nvidia/tritonserver:<version>-py3` then uses `COPY --from=builder` to bring over only the produced artifacts. The result contains the model and serving binary but no compiler, no header files, and no `apt` cache.
Vendor base images save weeks of CUDA debugging. NVIDIA Triton is the multi-backend powerhouse, hosting ONNX, TorchScript, SavedModel, TensorRT, and Python backends from one process — the reason many teams adopt a "default ONNX, native exceptions" pattern. TorchServe starts from `pytorch/torchserve` and requires aligning the PyTorch CUDA build (`+cu121`) with the base image. TF Serving publishes lean C++ CPU and GPU variants.
Image size is cold-start latency: every extra gigabyte must be pulled to a fresh Kubernetes node before the first prediction. The levers are multi-stage builds, runtime-only CUDA base images, `pip install --no-cache-dir`, no shells or editors in the runtime stage, stripped debug symbols, and model repositories mounted from object storage rather than baked into the image. GPU container hygiene means: do not bake driver components into the image; run with the NVIDIA Container Runtime; pin every version (no `latest`); rebuild regularly with `docker build --pull`; run as a non-root user (a `USER` directive naming the image's unprivileged account); set `readOnlyRootFilesystem: true`; drop unneeded capabilities; expose only the required ports (Triton: 8000 for HTTP, 8001 for gRPC, 8002 for metrics); and scan every build with Trivy or Grype.
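A back-of-the-envelope calculation shows why image size matters for cold starts. The sketch assumes the pull is limited only by network bandwidth and ignores layer caching, decompression, and registry throttling, so it is a lower bound. The image sizes and link speed are hypothetical numbers chosen for illustration.

```python
def pull_seconds(image_gb, bandwidth_gbit_s):
    """Lower-bound pull time: gigabytes * 8 bits per byte / link speed in Gbit/s."""
    return image_gb * 8 / bandwidth_gbit_s


# Hypothetical sizes: a single-stage image with the full toolchain
# versus a multi-stage image that keeps only runtime artifacts.
fat = pull_seconds(image_gb=12, bandwidth_gbit_s=2)
lean = pull_seconds(image_gb=3, bandwidth_gbit_s=2)

print(f"fat image: {fat:.0f} s, lean image: {lean:.0f} s, saved: {fat - lean:.0f} s")
```

Under these assumptions every gigabyte removed saves four seconds on each fresh node, and that cost recurs every time the autoscaler adds capacity.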
Animation: Multi-stage Docker build — only artifacts cross into the lean runtime image
The builder is wide and contains the heavy toolchain; only the artifacts (wheels, model files) cross into the lean, hardened runtime image.
Key Points
Multi-stage = small + secure. Heavy builder, lean runtime; only artifacts cross via `COPY --from=builder`.
Triton is the multi-backend default. One image serves ONNX, TorchScript, SavedModel, TensorRT, and Python backends.
Cold start = image size. Every GB is pulled to fresh nodes before the first prediction; mount big artifacts from object storage.
GPU hygiene. Host owns the kernel driver; container owns the CUDA runtime; never `--privileged`; pin everything.
Harden + scan. Non-root user, `readOnlyRootFilesystem`, dropped caps, Trivy/Grype in CI failing on critical CVEs.
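Two of the hygiene rules above, pinned base images and a non-root final stage, are simple enough to check mechanically. The following is a minimal sketch of such a check, not a replacement for a real linter or scanner. The function names and the example image names are hypothetical, and the parsing ignores line continuations and build arguments.

```python
def is_pinned(image):
    """True if the image reference has a digest or a tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    name = image.rsplit("/", 1)[-1]  # skip any registry host:port prefix
    return ":" in name and name.rsplit(":", 1)[1] != "latest"


def lint_dockerfile(text):
    """Return a list of findings for unpinned bases and a root final stage."""
    findings, stage_names, final_user = [], set(), None
    for raw in text.splitlines():
        tokens = raw.strip().split()
        if not tokens or tokens[0].startswith("#"):
            continue
        instruction = tokens[0].upper()
        if instruction == "FROM":
            final_user = None  # each stage starts without a USER override
            args = [t for t in tokens[1:] if not t.startswith("--")]
            if not args:
                continue
            image = args[0]
            if image not in stage_names and not is_pinned(image):
                findings.append(f"unpinned base image: {image}")
            if len(args) >= 3 and args[1].upper() == "AS":
                stage_names.add(args[2])
        elif instruction == "USER" and len(tokens) > 1:
            final_user = tokens[1]
    if final_user is None or final_user.split(":")[0] in ("root", "0"):
        findings.append("final stage runs as root")
    return findings


bad = "FROM python:latest\nRUN pip install foo\n"
good = (
    "FROM example.com/build:1.2.3 AS builder\n"
    "RUN make\n"
    "FROM example.com/runtime:4.5.6\n"
    "COPY --from=builder /out /app\n"
    "USER 1000\n"
)
bad_findings = lint_dockerfile(bad)
good_findings = lint_dockerfile(good)
print(bad_findings, good_findings)
```

A check like this belongs in CI next to the vulnerability scan, so that an unpinned base image fails the build before it reaches a registry.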
Post-Reading Quiz — Containerization
1. What is the primary advantage of a multi-stage Docker build for a model-serving image?
a) It encrypts model weights at rest.
b) It strictly separates heavy build-time tooling from a lean runtime, shrinking the final image and attack surface.
c) It guarantees GPU access without any host driver.
d) It auto-versions images using semantic versioning.
2. Why is image size considered a cold-start concern, not just a storage concern?
a) Larger images consume more GPU memory at inference time.
b) Each extra gigabyte must be pulled to a fresh node before the first prediction can be served.
c) Cold starts are measured in image size, not seconds.
d) Smaller images are required by the ONNX standard.
3. Which of the following is the recommended pattern for GPU containers?
a) Bake the NVIDIA kernel driver into the image for portability.
b) Run privileged containers with no version pinning.
c) Rely on the host's NVIDIA kernel driver, ship only the CUDA runtime in the container, and pin every version.
d) Use the `latest` tag for the base image to always get security fixes.
Section 4: Versioning Strategy
Pre-Reading Quiz — Versioning Strategy
1. In the model-adapted reading of semantic versioning, which change warrants a MAJOR bump?
a) A retrain on fresh data producing the same input/output schema.
b) A bug fix in preprocessing that does not measurably change predictions.
c) A new input feature or a different output schema that forces consumers to update integration code.
d) A reduction in latency from a new ONNX export.
2. To make a model fully reproducible, the registry version must link to which combination?
a) Only the model artifact.
b) Model artifact + Git commit SHA + dataset version/hash.
c) Only the training metrics.
d) Only the container image digest.
3. What is the "non-negotiable property" of a good rollback strategy described in the chapter?
a) Rollback must require human approval over the weekend.
b) Rollback must trigger a full retraining run.
c) Rollback must NOT require retraining; previous versions stay available and one alias-flip away.
d) Rollback must delete the previous failing version permanently.
Semantic versioning (`MAJOR.MINOR.PATCH`) adapts to models if you read the numbers as a contract with consumers of predictions. MAJOR = breaking interface change (new feature, new output schema, re-encoded labels). MINOR = compatible behavior change (retrain on fresh data; same shape, similar quality). PATCH = transparent fix (preprocessing bug fix, optimization preserving outputs). That discipline is what lets a downstream team pin `churn_model >=2.3,<3.0`.
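The bump rules can be written down as a small decision function. This is a sketch under simplifying assumptions: a "schema" is just a dictionary of field names to types, any schema difference counts as breaking, and the function names are hypothetical.

```python
def classify_change(old_schema, new_schema, retrained):
    """Map a change to the SemVer level described above."""
    if old_schema != new_schema:
        return "MAJOR"  # consumers must update integration code
    return "MINOR" if retrained else "PATCH"


def bump(version, level):
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "MAJOR":
        return f"{major + 1}.0.0"
    if level == "MINOR":
        return f"{major}.{minor + 1}.0"
    if level == "PATCH":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level {level!r}")


def satisfies(version, lower, upper):
    """True when lower <= version < upper, comparing numeric tuples."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(lower) <= as_tuple(version) < as_tuple(upper)


schema = {"tenure_months": "int", "plan": "str"}
with_new_feature = {**schema, "support_tickets": "int"}

retrain = bump("2.3.1", classify_change(schema, schema, retrained=True))
breaking = bump("2.3.1", classify_change(schema, with_new_feature, retrained=True))
print(retrain, breaking)

# The downstream pin from the text: churn_model >=2.3,<3.0
print(satisfies(retrain, "2.3", "3.0"), satisfies(breaking, "2.3", "3.0"))
```

The retrain lands inside the consumer's pin and can roll out silently, while the new-feature model falls outside it and forces an explicit opt-in.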
A model version is meaningful only as a triple: the artifact, the exact code SHA that trained it, and the exact dataset version that fed it. Miss any one and reproducibility is gone. The practical pattern stitches Git (commit SHA) + DVC/LakeFS/Delta time-travel (dataset hash) + registry (MLflow version, Vertex Model, SageMaker Model Package). MLflow can record the Git commit on the run (as the `mlflow.source.git.commit` tag) when training is launched from a Git checkout, so drilling from "prod v7" back to "trained by run abc123, SHA 7f3a9c1, dataset sha256:e8b..." is a single query.
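A lineage record that refuses to be incomplete is one way to enforce the triple. The sketch below uses only the standard library. The `Lineage` class, the helper names, the artifact URI, and the sample rows are hypothetical, and a real pipeline would take the dataset hash from DVC, LakeFS, or a table version instead of hashing rows by hand.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Lineage:
    model_version: str
    artifact_uri: str
    code_sha: str
    dataset_hash: str


def dataset_fingerprint(rows):
    """Content hash of the training rows, in order; same data gives the same hash."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return "sha256:" + digest.hexdigest()


def missing_links(record):
    """Names of lineage fields that are empty; an empty list means reproducible."""
    return [name for name, value in asdict(record).items() if not value]


rows = ["id,tenure,churned", "1,12,0", "2,3,1"]  # stand-in for a real dataset
complete = Lineage("v7", "s3://example-bucket/churn_model/v7",
                   "7f3a9c1", dataset_fingerprint(rows))
incomplete = Lineage("v8", "s3://example-bucket/churn_model/v8",
                     "", dataset_fingerprint(rows))

print(missing_links(complete), missing_links(incomplete))
```

Running `missing_links` as a registration precondition turns "miss any one and reproducibility is gone" from a guideline into a check that fails loudly.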
Promotion is a checklist, not a hunch. A defensible gate covers: offline quality vs. current prod, subgroup fairness, p95 latency under SLA, shadow/canary results, attached Model Card and approval ticket, and complete lineage. Rollback is the dual of promotion and must be just as cheap. Three patterns dominate: re-point the alias (MLflow/Vertex), re-approve a previous Model Package (SageMaker), or shift traffic in a blue/green split. The non-negotiable property: rollback must never require retraining.
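The checklist above translates naturally into a function that returns the list of failed gates. This is a minimal sketch: the metric (AUC), the field names, the thresholds, and the ticket identifier are all hypothetical choices, and a real gate would pull these values from the registry and the monitoring stack.

```python
def promotion_failures(candidate, prod_auc, p95_sla_ms, max_subgroup_gap):
    """Return the names of failed gates; an empty list means the candidate may promote."""
    subgroup_scores = list(candidate["subgroup_auc"].values())
    lineage = candidate["lineage"]
    checks = {
        "offline quality vs. current prod": candidate["auc"] >= prod_auc,
        "subgroup fairness": max(subgroup_scores) - min(subgroup_scores) <= max_subgroup_gap,
        "p95 latency under SLA": candidate["p95_ms"] <= p95_sla_ms,
        "shadow/canary results": candidate["canary_passed"],
        "Model Card and approval ticket": bool(candidate["model_card"]
                                               and candidate["approval_ticket"]),
        "complete lineage": all(lineage.get(key)
                                for key in ("code_sha", "dataset_hash", "run_id")),
    }
    return [name for name, passed in checks.items() if not passed]


candidate = {
    "auc": 0.91,
    "subgroup_auc": {"segment_a": 0.90, "segment_b": 0.88},
    "p95_ms": 42,
    "canary_passed": True,
    "model_card": True,
    "approval_ticket": "TICKET-123",
    "lineage": {"code_sha": "7f3a9c1", "dataset_hash": "sha256:e8b...", "run_id": "abc123"},
}

clean = promotion_failures(candidate, prod_auc=0.89, p95_sla_ms=50, max_subgroup_gap=0.05)
slow = promotion_failures({**candidate, "p95_ms": 80},
                          prod_auc=0.89, p95_sla_ms=50, max_subgroup_gap=0.05)
print(clean, slow)
```

Returning the failed gate names, instead of a single boolean, gives the approver something concrete to attach to the ticket.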
Figure 10.4: Linkage between a model version and its inputs
graph LR
MV[Model Version v7] --> CODE[Git Commit SHA 7f3a9c1]
MV --> DATA[Dataset Hash sha256:e8b...]
MV --> RUN[Training Run run_id abc123]
MV --> IMG[Container Image tag + digest]
RUN --> METRICS[Metrics & Hyperparams]
DATA --> DVC[DVC / LakeFS / Delta time-travel]
CODE --> REPO[Git Repository]
IMG --> REG[(Container Registry)]
Figure 10.5: Promotion workflow
sequenceDiagram
participant DS as Data Scientist
participant CI as CI Pipeline
participant REG as Model Registry
participant SRE as SRE / Approver
participant PROD as Prod Endpoint
DS->>CI: Push training code + config
CI->>CI: Train, evaluate, log run
CI->>REG: Register version v7 (None)
CI->>REG: Run quality / latency / fairness gates
REG->>SRE: Notify PendingApproval
SRE->>REG: Review Model Card + lineage
SRE->>REG: Approve / move @prod alias
REG->>PROD: Serving container reloads model
PROD-->>DS: Live predictions on v7
Key Points
SemVer for models. MAJOR = breaking interface, MINOR = compatible behavior, PATCH = transparent fix.
The reproducibility triple. Model + code SHA + dataset version; missing one means unreproducible.
Promotion = checklist. Quality, fairness, latency, shadow/canary, Model Card, full lineage — not a hunch.
Rollback is one alias flip. If undoing a bad model requires retraining, your registry is failing at its core job.
Layered versioning. Code, data, model, container image digest, and deployment endpoint config all carry versions that lock together.
Post-Reading Quiz — Versioning Strategy
1. In the model-adapted reading of semantic versioning, which change warrants a MAJOR bump?
a) A retrain on fresh data producing the same input/output schema.
b) A bug fix in preprocessing that does not measurably change predictions.
c) A new input feature or a different output schema that forces consumers to update integration code.
d) A reduction in latency from a new ONNX export.
2. To make a model fully reproducible, the registry version must link to which combination?
a) Only the model artifact.
b) Model artifact + Git commit SHA + dataset version/hash.
c) Only the training metrics.
d) Only the container image digest.
3. What is the "non-negotiable property" of a good rollback strategy described in the chapter?
a) Rollback must require human approval over the weekend.
b) Rollback must trigger a full retraining run.
c) Rollback must NOT require retraining; previous versions stay available and one alias-flip away.
d) Rollback must delete the previous failing version permanently.