Chapter 10: Model Packaging, Registry, and Versioning

Section 1: Model Packaging Formats

Pre-Reading Quiz — Packaging Formats

1. Why is loading a `pickle` or `joblib` file from an untrusted source considered a security risk?

a) It silently downgrades the model to lower precision.
b) Pickle is Turing-complete and can execute arbitrary Python code on load.
c) It always strips the model's weights to zero.
d) It requires root privileges to read.

2. ONNX is best described as which of the following?

a) A PyTorch-only binary format.
b) A TensorFlow directory layout.
c) A framework-independent protobuf graph plus versioned opset, runnable across many inference runtimes.
d) A wire protocol for streaming training gradients.

3. Which format is the safer, mmap-friendly replacement for Hugging Face's `pytorch_model.bin` pickles?

a) GGUF
b) TorchScript
c) SavedModel
d) safetensors

A trained model in a notebook is like a finished symphony recorded only as a memory: vivid, complete, and useless to anyone else. Packaging converts that memory into a portable artifact whose contract is data, not code. The first instinct, `pickle.dump(model, f)`, works, but unpickling the result can execute arbitrary code chosen by whoever wrote the file. Anything crossing a security or version boundary therefore belongs in ONNX, TorchScript, SavedModel, safetensors, or GGUF instead.
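To make the pickle risk concrete, here is a minimal sketch of why unpickling is code execution rather than data loading. The `Payload` class is hypothetical, and the harmless `os.getcwd` stands in for whatever callable an attacker would actually choose.

```python
import os
import pickle


class Payload:
    """Hypothetical object whose pickled form is "call this function"."""

    def __reduce__(self):
        # pickle stores (callable, args) and invokes callable(*args) on load.
        # os.getcwd is a harmless stand-in for something like os.system.
        return (os.getcwd, ())


blob = pickle.dumps(Payload())

# The loader never rebuilds a Payload: it calls os.getcwd() and returns
# whatever that call produced.
loaded = pickle.loads(blob)
print(type(loaded).__name__, loaded)
```

The caller asked for a model and got the return value of a function the file's author picked. Formats that store only tensors and metadata do not have this escape hatch.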

ONNX is a protobuf-described computation graph plus a versioned opset. Like a PDF, it is the print-ready exchange format any reader can render: a PyTorch model exported via `torch.onnx.export` can be loaded by ONNX Runtime in C++, TensorRT on an NVIDIA GPU, or OpenVINO on an Intel CPU without those runtimes knowing PyTorch exists. Common failure modes are unsupported operators, control flow tied to Python values, and undeclared dynamic shapes.

TorchScript is PyTorch's deployable JIT graph (via `torch.jit.script` or `torch.jit.trace`), runnable by LibTorch in C++ without the Python interpreter. SavedModel is TensorFlow's directory-based format carrying MetaGraphs, variables, assets, and named signatures like `serving_default`, consumed by TF Serving, TFX, and Vertex AI.

For LLMs specifically, safetensors is a flat tensor container that loads as data, not code, enabling zero-copy mmap loads of huge checkpoints. GGUF is the all-in-one llama.cpp cartridge bundling weights, tokenizer, metadata, and chat templates with quantization built in for CPU and edge inference.
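The phrase "loads as data, not code" can be illustrated with a toy container that follows the general layout safetensors documents: an 8-byte little-endian header length, a JSON header describing each tensor, then one raw byte buffer. This is a simplified sketch using only the standard library, not the safetensors library itself; the function names are hypothetical and alignment, validation, and metadata handling are omitted.

```python
import json
import struct


def write_tensor_file(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into one bytes object."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # Length prefix, then the JSON header, then the raw tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_tensor(blob, name):
    """Return (dtype, shape, view) for one tensor without copying the others."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    entry = header[name]
    start, end = entry["data_offsets"]
    data_start = 8 + header_len
    # A memoryview slice does not copy; mmap-based loaders rely on the same idea.
    view = memoryview(blob)[data_start + start : data_start + end]
    return entry["dtype"], entry["shape"], view


weights = struct.pack("<4f", 0.5, -1.0, 2.0, 0.25)
bias = struct.pack("<2f", 0.1, 0.2)
blob = write_tensor_file({
    "linear.weight": ("F32", [2, 2], weights),
    "linear.bias": ("F32", [2], bias),
})

dtype, shape, view = read_tensor(blob, "linear.bias")
values = struct.unpack("<2f", view)
print(dtype, shape, values)
```

Reading involves only `struct.unpack`, `json.loads`, and slicing, so there is no point at which the file can ask the loader to call a function.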

Animation: ONNX as a cross-framework intermediate representation. PyTorch (`torch.onnx.export`) and TensorFlow (`tf2onnx`) export to one ONNX graph plus opset, a framework-neutral IR, which fans out to the Triton ONNX backend, ONNX Runtime (CPU / GPU), and TensorRT (NVIDIA GPU).

Figure 10.1: ONNX as the cross-framework intermediate representation

flowchart LR
    PT[PyTorch Model] -->|torch.onnx.export| ONNX[("ONNX Graph<br/>+ Opset")]
    TF[TensorFlow Model] -->|tf2onnx| ONNX
    SK[scikit-learn] -->|skl2onnx| ONNX
    ONNX --> ORT["ONNX Runtime<br/>CPU/GPU"]
    ONNX --> TRT["TensorRT<br/>NVIDIA GPU"]
    ONNX --> OV["OpenVINO<br/>Intel CPU/NPU"]
    ONNX --> TRI["Triton<br/>ONNX Backend"]

Section 2: The Model Registry

Pre-Reading Quiz — Model Registry

1. In the MLflow registry, what does `archive_existing_versions=True` accomplish during a promotion to Production?

a) It deletes the previous artifacts from object storage.
b) It prevents two versions from claiming the Production stage simultaneously.
c) It forces a retraining of the previous version.
d) It disables alias-based serving.

2. Which registry exposes a first-class `ModelApprovalStatus` field (PendingManualApproval / Approved / Rejected)?

a) MLflow Model Registry
b) Vertex AI Model Registry
c) SageMaker Model Registry
d) Hugging Face Hub

3. What is the operational benefit of using a mutable alias like `@prod` to point to a model version?

a) It encrypts the artifact at rest.
b) It guarantees backward compatibility of input schemas.
c) Serving code can keep loading `models:/name@prod` and pick up new versions with no redeployment.
d) It eliminates the need for any RBAC controls.

A registry is to a model what a card catalog is to a book: not the content itself, but the index that makes it findable, attributable, and governable. A registry entry bundles three things: the artifact (or pointer to object storage), the metadata (metrics, hyperparameters, schema), and the lineage (which run, which dataset, which commit). The question "which training run produced the model serving 95% of production traffic" should be one click away, not Slack archaeology.

MLflow's classic lifecycle defines four explicit stages: None, Staging, Production, Archived. A single API call with `archive_existing_versions=True` moves a version forward and, in the same call, archives any incumbent so that no two versions claim Production at once. MLflow 2.3 and later add aliases like `@prod` or `@champion`: mutable pointers that let serving code load `models:/churn_model@prod` while you re-point the alias to v8 with no redeployment. Since MLflow 2.9 the stage API is deprecated in favor of aliases, so new projects should treat aliases as the primary promotion mechanism.
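The two promotion mechanisms can be compared side by side in a toy, in-memory registry. `ToyRegistry` and its methods are hypothetical and only mimic the behavior described above; this is not the MLflow client API, although the `archive_existing_versions` parameter name is borrowed from it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """In-memory stand-in for a model registry (illustrative only)."""

    stages: dict = field(default_factory=dict)   # version -> stage name
    aliases: dict = field(default_factory=dict)  # alias -> version

    def register(self, version):
        self.stages[version] = "None"

    def transition(self, version, stage, archive_existing_versions=False):
        if archive_existing_versions:
            # Archive whichever other versions currently hold the target stage.
            for other, current in self.stages.items():
                if current == stage and other != version:
                    self.stages[other] = "Archived"
        self.stages[version] = stage

    def set_alias(self, alias, version):
        if version not in self.stages:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version

    def resolve(self, uri):
        """Resolve a 'models:/name@alias' style URI to a version number."""
        alias = uri.rsplit("@", 1)[1]
        return self.aliases[alias]


registry = ToyRegistry()
for version in (7, 8):
    registry.register(version)

registry.transition(7, "Production")
registry.set_alias("prod", 7)

# Promote v8: the incumbent is archived and the alias is re-pointed.
registry.transition(8, "Production", archive_existing_versions=True)
registry.set_alias("prod", 8)

served = registry.resolve("models:/churn_model@prod")
print(registry.stages, served)
```

The serving side only ever calls `resolve` with the same URI, which is why an alias flip needs no redeployment.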

The three dominant registries diverge sharply. MLflow is OSS and vendor-neutral; governance is bring-your-own (tags, PR-driven alias flips). Vertex AI is GCP-native with deep end-to-end lineage and alias/label/endpoint-driven promotion (no fixed enum). SageMaker leans hardest on governance: every Model Package has an explicit ModelApprovalStatus, audited via CloudTrail and gated by IAM. Across all three, the rule is universal: promotion requires stronger permissions than read access.

Animation: Model registry lifecycle (None → Staging → Production → Archived). States illuminate in sequence: `register_model()` creates a version in None, integration tests move it to Staging, promotion moves it to Production behind the `@prod` alias, and a superseded version is retained in Archived. The production version glows, and the amber rollback arrow shows that archived versions stay one alias-flip from live.

Figure 10.2: MLflow model registry lifecycle

stateDiagram-v2
    [*] --> None: register_model()
    None --> Staging: transition(Staging)
    Staging --> Production: transition(Production, archive_existing_versions=True)
    Staging --> Archived: superseded
    Production --> Archived: new version promoted
    Production --> Staging: rollback for re-eval
    Archived --> Staging: re-promote for rollback
    Archived --> [*]

Section 3: Containerization for Serving

Pre-Reading Quiz — Containerization

1. What is the primary advantage of a multi-stage Docker build for a model-serving image?

a) It encrypts model weights at rest.
b) It strictly separates heavy build-time tooling from a lean runtime, shrinking the final image and attack surface.
c) It guarantees GPU access without any host driver.
d) It auto-versions images using semantic versioning.

2. Why is image size considered a cold-start concern, not just a storage concern?

a) Larger images consume more GPU memory at inference time.
b) Each extra gigabyte must be pulled to a fresh node before the first prediction can be served.
c) Cold starts are measured in image size, not seconds.
d) Smaller images are required by the ONNX standard.

3. Which of the following is the recommended pattern for GPU containers?

a) Bake the NVIDIA kernel driver into the image for portability.
b) Run privileged containers with no version pinning.
c) Rely on the host's NVIDIA kernel driver, ship only the CUDA runtime in the container, and pin every version.
d) Use the `latest` tag for the base image to always get security fixes.

A serving image is a contract: "given an HTTP/gRPC request, I will load this exact artifact in this exact runtime and return a prediction." The cleanest way to honor that contract is a multi-stage Dockerfile. A builder stage based on a CUDA `-devel` image installs compilers, wheels, and any custom CUDA kernels. A runtime stage based on `nvcr.io/nvidia/tritonserver:<version>-py3` then uses `COPY --from=builder` to bring across only the produced artifacts. The result contains the model and serving binary but no compiler, no header files, and no `apt` cache.

Vendor base images save weeks of CUDA debugging. NVIDIA Triton is the multi-backend powerhouse, hosting ONNX, TorchScript, SavedModel, TensorRT, and Python backends from one process — the reason many teams adopt a "default ONNX, native exceptions" pattern. TorchServe starts from `pytorch/torchserve` and requires aligning the PyTorch CUDA build (`+cu121`) with the base image. TF Serving publishes lean C++ CPU and GPU variants.

Image size is cold-start latency: every extra gigabyte must be pulled to a fresh Kubernetes node before the first prediction can be served. The levers for shrinking an image are multi-stage builds, runtime-only CUDA base images, `pip install --no-cache-dir`, no shells or editors in the runtime stage, stripped debug symbols, and model repositories mounted from object storage rather than baked into the image. GPU container hygiene means the following: do not bake driver components into the image; run with the NVIDIA Container Runtime; pin every version (no `latest`); rebuild regularly with `docker build --pull`; run as a non-root user (for example `USER triton-server`, the account the Triton images create); set `readOnlyRootFilesystem: true`; drop unneeded capabilities; expose only the required ports (for Triton, 8000 for HTTP, 8001 for gRPC, and 8002 for metrics); and scan every build with Trivy or Grype.
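The link between image size and cold start is simple arithmetic. The sketch below uses the approximate builder and runtime sizes quoted for this chapter's multi-stage animation (about 6.4 GB and 1.7 GB) and an assumed node bandwidth of 1 Gbit/s; that bandwidth is a hypothetical figure, and the estimate ignores decompression, registry throttling, and layer caching.

```python
def pull_seconds(image_gb, bandwidth_gbit_per_s):
    """Lower bound on pull time: image size divided by network throughput."""
    if bandwidth_gbit_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    image_gbit = image_gb * 8  # gigabytes -> gigabits
    return image_gbit / bandwidth_gbit_per_s


ASSUMED_GBIT_PER_S = 1.0  # hypothetical node bandwidth, not a measurement

builder_s = pull_seconds(6.4, ASSUMED_GBIT_PER_S)
runtime_s = pull_seconds(1.7, ASSUMED_GBIT_PER_S)
saved_s = builder_s - runtime_s

print(f"builder-sized image: {builder_s:.1f} s")
print(f"runtime image:       {runtime_s:.1f} s")
print(f"saved per cold node: {saved_s:.1f} s")
```

Under these assumptions the lean image reaches a fresh node more than half a minute sooner, and that saving repeats on every scale-out event.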

Animation: Multi-stage Docker build, in which only artifacts cross into the lean runtime image. Stage 1 (builder, `FROM nvidia/cuda:12.1-devel`, about 6.4 GB) holds build-essential, cmake, protoc, CUDA `-devel` headers, build-time pip wheels, the apt cache, test and debug tools, and the source tree. Only `/wheels/*.whl` and `/models/*.onnx` cross via `COPY --from=builder` into Stage 2 (runtime, `FROM nvcr.io/nvidia/tritonserver:24.05-py3`, non-root `USER`, about 1.7 GB), which holds tritonserver, the CUDA runtime libraries, the installed wheels, and the model files, with no compilers, no apt cache, and no source tree. The payoff is a faster cold start, a smaller attack surface, and a smaller pull.

Figure 10.3: Multi-stage container build pipeline

flowchart LR
    SRC["Source + Dockerfile<br/>+ requirements.txt"] --> BUILD["Builder Stage<br/>CUDA -devel<br/>compilers, wheels"]
    BUILD -->|"COPY --from=builder"| RT["Runtime Stage<br/>tritonserver -py3<br/>non-root user"]
    RT --> IMG[Image Layers]
    IMG --> SCAN["Vulnerability Scan<br/>Trivy / Grype / Scout"]
    SCAN -->|pass| REG[("Container Registry<br/>tagged + digest")]
    SCAN -->|critical CVE| FAIL[Fail CI]
    REG --> K8S["Kubernetes /<br/>Serving Cluster"]

Section 4: Versioning Strategy

Pre-Reading Quiz — Versioning Strategy

1. In the model-adapted reading of semantic versioning, which change warrants a MAJOR bump?

a) A retrain on fresh data producing the same input/output schema.
b) A bug fix in preprocessing that does not measurably change predictions.
c) A new input feature or a different output schema that forces consumers to update integration code.
d) A reduction in latency from a new ONNX export.

2. To make a model fully reproducible, the registry version must link to which combination?

a) Only the model artifact.
b) Model artifact + Git commit SHA + dataset version/hash.
c) Only the training metrics.
d) Only the container image digest.

3. What is the "non-negotiable property" of a good rollback strategy described in the chapter?

a) Rollback must require human approval over the weekend.
b) Rollback must trigger a full retraining run.
c) Rollback must NOT require retraining; previous versions stay available and one alias-flip away.
d) Rollback must delete the previous failing version permanently.

Semantic versioning (`MAJOR.MINOR.PATCH`) adapts to models if you read the numbers as a contract with consumers of predictions. MAJOR = breaking interface change (new input feature, new output schema, re-encoded labels) that forces consumers to update integration code. MINOR = compatible behavior change (retrain on fresh data; same shape, similar quality). PATCH = transparent fix (preprocessing bug fix, optimization preserving outputs). That discipline is what lets a downstream team pin `churn_model >=2.3,<3.0` and accept every retrain and fix without risking a schema break.
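The bump rules and the consumer-side pin can be sketched in a few lines. The change labels (`breaking_interface`, `compatible_retrain`, `transparent_fix`) and both function names are hypothetical and chosen only for illustration.

```python
def next_version(current, change):
    """Apply the model-adapted semver rule for one kind of change."""
    major, minor, patch = (int(part) for part in current.split("."))
    if change == "breaking_interface":   # new input feature, new output schema
        return f"{major + 1}.0.0"
    if change == "compatible_retrain":   # fresh data, same schema
        return f"{major}.{minor + 1}.0"
    if change == "transparent_fix":      # outputs preserved
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


def satisfies(version, lower, upper):
    """True if lower <= version < upper, comparing components numerically."""
    def parse(text):
        return tuple(int(part) for part in text.split("."))
    return parse(lower) <= parse(version) < parse(upper)


retrained = next_version("2.3.1", "compatible_retrain")
breaking = next_version("2.3.1", "breaking_interface")

# A consumer pinned to churn_model >=2.3,<3.0 accepts the retrain only.
print(retrained, satisfies(retrained, "2.3", "3.0"))
print(breaking, satisfies(breaking, "2.3", "3.0"))
```

Comparing integer tuples rather than strings matters here: as strings, "2.10.0" would sort before "2.3.0".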

A model version is meaningful only as a triple: the artifact, the exact code SHA that trained it, and the exact dataset version that fed it. Miss any one and reproducibility is gone. The practical pattern stitches Git (commit SHA) + DVC/LakeFS/Delta time-travel (dataset hash) + registry (MLflow version, Vertex Model, SageMaker Model Package). MLflow records the Git commit as a run tag (`mlflow.source.git.commit`) when a run is launched from a Git checkout, so drilling from "prod v7" back to "trained by run abc123, SHA 7f3a9c1, dataset sha256:e8b..." is a single query, provided the dataset hash was logged alongside it.
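One way to picture the triple is as a small immutable record plus a content hash for the dataset. The `Lineage` class and `dataset_fingerprint` helper below are hypothetical; real pipelines would take the hash from DVC, LakeFS, or a table version rather than hashing rows in memory, and the three-row dataset is made up for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    """Links a model version to the code, data, and run that produced it."""

    model_version: str
    git_sha: str
    dataset_hash: str
    run_id: str

    def is_reproducible(self):
        # All three legs of the triple must be present.
        return all([self.model_version, self.git_sha, self.dataset_hash])


def dataset_fingerprint(rows):
    """Hash dataset content so any change yields a different identifier."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return "sha256:" + digest.hexdigest()


rows = ["customer_id,churned", "1,0", "2,1"]  # illustrative data only
record = Lineage("v7", "7f3a9c1", dataset_fingerprint(rows), "abc123")
incomplete = Lineage("v7", "", dataset_fingerprint(rows), "abc123")

print(record.is_reproducible(), incomplete.is_reproducible())
```

Freezing the dataclass reflects the intent that lineage is written once at registration time and never edited afterwards.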

Promotion is a checklist, not a hunch. A defensible gate covers: offline quality vs. current prod, subgroup fairness, p95 latency under SLA, shadow/canary results, attached Model Card and approval ticket, and complete lineage. Rollback is the dual of promotion and must be just as cheap. Three patterns dominate: re-point the alias (MLflow/Vertex), re-approve a previous Model Package (SageMaker), or shift traffic in a blue/green split. The non-negotiable property: rollback must never require retraining.
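The checklist and the alias-based rollback can be sketched as follows. Every field name, threshold, and value here is hypothetical (for example the 0.05 fairness gap and the 50 ms SLA); the point is only that the gate is an explicit function and that rollback is a pointer move.

```python
def failed_gates(candidate, production, sla_p95_ms):
    """Return the names of failed gates; an empty list means promote."""
    checks = {
        "offline_quality": candidate["auc"] >= production["auc"],
        "subgroup_fairness": candidate["worst_subgroup_auc_gap"] <= 0.05,
        "p95_latency": candidate["p95_ms"] <= sla_p95_ms,
        "canary": candidate["canary_passed"],
        "model_card": bool(candidate.get("model_card")),
        "lineage": all(
            candidate.get(key) for key in ("git_sha", "dataset_hash", "run_id")
        ),
    }
    return [name for name, ok in checks.items() if not ok]


production = {"auc": 0.89}
candidate = {
    "auc": 0.91,
    "worst_subgroup_auc_gap": 0.03,
    "p95_ms": 42,
    "canary_passed": True,
    "model_card": "cards/churn_model_v8.md",
    "git_sha": "0a1b2c3",
    "dataset_hash": "sha256:placeholder",
    "run_id": "run-v8",
}

aliases = {"prod": 7}
failures = failed_gates(candidate, production, sla_p95_ms=50)
if not failures:
    previous = aliases["prod"]
    aliases["prod"] = 8          # promotion: move the alias

# Rollback is the same operation in reverse; nothing is retrained.
aliases["prod"] = previous
print(failures, aliases)
```

Because the previous version number is kept, rolling back costs exactly as much as promoting did.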

Figure 10.4: Linkage between a model version and its inputs

graph LR
    MV["Model Version<br/>v7"] --> CODE["Git Commit SHA<br/>7f3a9c1"]
    MV --> DATA["Dataset Hash<br/>sha256:e8b..."]
    MV --> RUN["Training Run<br/>run_id abc123"]
    MV --> IMG["Container Image<br/>tag + digest"]
    RUN --> METRICS["Metrics &<br/>Hyperparams"]
    DATA --> DVC["DVC / LakeFS /<br/>Delta time-travel"]
    CODE --> REPO[Git Repository]
    IMG --> REG[("Container<br/>Registry")]

Figure 10.5: Promotion workflow

sequenceDiagram
    participant DS as Data Scientist
    participant CI as CI Pipeline
    participant REG as Model Registry
    participant SRE as SRE / Approver
    participant PROD as Prod Endpoint
    DS->>CI: Push training code + config
    CI->>CI: Train, evaluate, log run
    CI->>REG: Register version v7 (None)
    CI->>REG: Run quality / latency / fairness gates
    REG->>SRE: Notify PendingApproval
    SRE->>REG: Review Model Card + lineage
    SRE->>REG: Approve / move @prod alias
    REG->>PROD: Serving container reloads model
    PROD-->>DS: Live predictions on v7
