Study Guide: Model Packaging, Registry, and Versioning

Learning Objectives

Compare model packaging formats (pickle, ONNX, TorchScript, SavedModel, safetensors, GGUF) and pick one to match a deployment target.
Explain how a model registry binds artifacts, metadata, and lineage, and contrast the MLflow, Vertex AI, and SageMaker registries.
Build secure, small, fast-starting serving containers using multi-stage Docker builds, vendor base images, and GPU runtime hygiene.
Design a versioning strategy that links model artifact, code SHA, dataset version, and container image with checklist-driven promotion and alias-based rollback.

Section 1: Model Packaging Formats

Pre-Reading Quiz — Packaging Formats

1. Why is loading a `pickle` or `joblib` file from an untrusted source considered a security risk?

It silently downgrades the model to lower precision.
Unpickling can execute arbitrary Python code on load.
It always strips the model's weights to zero.
It requires root privileges to read.

2. ONNX is best described as which of the following?

A PyTorch-only binary format.
A TensorFlow directory layout.
A framework-independent protobuf graph plus versioned opset, runnable across many inference runtimes.
A wire protocol for streaming training gradients.

3. Which format is the safer, mmap-friendly replacement for Hugging Face's `pytorch_model.bin` pickles?

GGUF
TorchScript
SavedModel
safetensors

A trained model in a notebook is like a finished symphony recorded only as a memory: vivid, complete, and useless to anyone else. Packaging converts that memory into a portable artifact whose contract is data, not code. The first instinct, `pickle.dump(model, f)`, works, but unpickling the resulting file can execute arbitrary Python code, so anything crossing a security or version boundary belongs in ONNX, TorchScript, SavedModel, safetensors, or GGUF instead.
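To make the "pickle is code, not data" point concrete, here is a minimal and deliberately harmless sketch. The `Payload` class is hypothetical; it uses `__reduce__` to tell pickle to call `eval` on load, where a real attacker would substitute something destructive.

```python
import pickle


class Payload:
    """Stand-in for a malicious object hidden inside a 'model' file."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of any object state.
        # An attacker could name any importable callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs the embedded call; no Payload instance is ever rebuilt.
loaded = pickle.loads(blob)
print(type(loaded).__name__, loaded)  # prints "int 42"
```

The file never contained a model at all, yet `pickle.loads` ran the expression without complaint. That is why pickles from outside your trust boundary should not be loaded.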

ONNX is a protobuf-described computation graph plus a versioned opset. Like a PDF, it is the print-ready exchange format any reader can render: a PyTorch model exported via `torch.onnx.export` can be loaded by ONNX Runtime in C++, TensorRT on an NVIDIA GPU, or OpenVINO on an Intel CPU without those runtimes knowing PyTorch exists. Common failure modes are unsupported operators, control flow tied to Python values, and undeclared dynamic shapes.
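The unsupported-operator failure mode can be pictured as a simple set difference. The sketch below is a toy, not the ONNX or ONNX Runtime API: `ToyGraph`, `preflight`, and the operator names are all hypothetical stand-ins for the operator types and opset number a real exported graph carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGraph:
    """Hypothetical summary of an exported graph: its opset and operator types."""
    opset: int
    op_types: tuple


def preflight(graph, runtime_max_opset, runtime_ops):
    """Return a list of problems; an empty list means the graph looks loadable."""
    problems = []
    if graph.opset > runtime_max_opset:
        problems.append(f"opset {graph.opset} > runtime max {runtime_max_opset}")
    # Any operator the runtime does not implement blocks loading.
    for op in sorted(set(graph.op_types) - set(runtime_ops)):
        problems.append(f"unsupported operator: {op}")
    return problems


graph = ToyGraph(opset=17, op_types=("MatMul", "Relu", "MyCustomAttention"))
problems = preflight(graph, runtime_max_opset=17,
                     runtime_ops={"MatMul", "Relu", "Softmax"})
print(problems)
```

Running a check of this shape in CI, against the runtime you actually deploy, surfaces export problems before a serving container does.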

TorchScript is PyTorch's deployable JIT graph (via `torch.jit.script` or `torch.jit.trace`), runnable by LibTorch in C++ without the Python interpreter. SavedModel is TensorFlow's directory-based format carrying MetaGraphs, variables, assets, and named signatures like `serving_default`, consumed by TF Serving, TFX, and Vertex AI.

For LLMs specifically, safetensors is a flat tensor container that loads as data, not code, enabling zero-copy mmap loads of huge checkpoints. GGUF is the all-in-one llama.cpp cartridge bundling weights, tokenizer, metadata, and chat templates with quantization built in for CPU and edge inference.
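Why safetensors "loads as data" is easiest to see from its layout: an 8-byte little-endian header length, a JSON header describing each tensor, then raw tensor bytes. The sketch below imitates that layout with the standard library only. It is a simplified illustration (float32 only, no metadata entry), and `write_container` and `read_tensor` are hypothetical helpers, not the `safetensors` library API.

```python
import json
import struct


def write_container(tensors):
    """tensors: {name: (shape, [floats])} -> bytes in a safetensors-style layout."""
    header, payload = {}, b""
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    # Layout: 8-byte header length, JSON header, then the raw tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_tensor(blob, name):
    """Parse the header as JSON and slice out one tensor; nothing is executed."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    start, end = header[name]["data_offsets"]  # relative to the data section
    base = 8 + header_len
    data = memoryview(blob)[base + start:base + end]  # a view, not a copy
    return header[name]["shape"], list(struct.unpack(f"<{len(data) // 4}f", data))


blob = write_container({"layer0.weight": ((2, 2), [1.0, 2.0, 3.0, 4.5])})
shape, values = read_tensor(blob, "layer0.weight")
print(shape, values)
```

Because each tensor is just an offset range inside one flat buffer, a loader can memory-map the file and hand out views without copying, which is what makes huge checkpoints quick to open.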

Figure 10.1: ONNX as the cross-framework intermediate representation

Key Points

Pickle = code, not data. Loading a pickle executes arbitrary Python; fine for experiments, dangerous across security/version boundaries.
ONNX is the cross-framework PDF. A single protobuf graph + opset runs on ONNX Runtime, TensorRT, OpenVINO, and Triton.
Framework-native formats exist for a reason. TorchScript ships PyTorch models without Python; SavedModel is the TF-Serving canonical directory.
LLM-era formats. safetensors loads as data only (no exec); GGUF is the all-in-one quantization-aware cartridge for llama.cpp.
Trend in 2024-2025. PyTorch is investing in `torch.export`/`torch.compile`; cross-framework bets increasingly go through ONNX.

Section 2: The Model Registry

Pre-Reading Quiz — Model Registry

1. In the MLflow registry, what does `archive_existing_versions=True` accomplish during a promotion to Production?

It deletes the previous artifacts from object storage.
It prevents two versions from claiming the Production stage simultaneously.
It forces a retraining of the previous version.
It disables alias-based serving.

2. Which registry exposes a first-class `ModelApprovalStatus` field (PendingManualApproval / Approved / Rejected)?

MLflow Model Registry
Vertex AI Model Registry
SageMaker Model Registry
Hugging Face Hub

3. What is the operational benefit of using a mutable alias like `@prod` to point to a model version?

It encrypts the artifact at rest.
It guarantees backward compatibility of input schemas.
Serving code can keep loading `models:/name@prod` and pick up new versions with no redeployment.
It eliminates the need for any RBAC controls.

A registry is to a model what a card catalog is to a book: not the content itself, but the index that makes it findable, attributable, and governable. A registry entry bundles three things: the artifact (or pointer to object storage), the metadata (metrics, hyperparameters, schema), and the lineage (which run, which dataset, which commit). The question "which training run produced the model serving 95% of production traffic" should be one click away, not Slack archaeology.

MLflow's lifecycle defines four explicit stages: None, Staging, Production, Archived. A single API call moves a version forward and, when called with `archive_existing_versions=True`, archives any incumbent so that no two versions claim Production at once. MLflow 2.3 and later add aliases such as `@prod` or `@champion`: mutable pointers that let serving code load `models:/churn_model@prod` while you re-point the alias to v8 with no redeployment. Stages have been deprecated since MLflow 2.9 in favor of aliases and tags, so new projects should treat aliases as the primary promotion mechanism.
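The two mechanisms in this paragraph, stage transitions that archive the incumbent and aliases that re-point, can be sketched with a toy in-memory registry. `ToyRegistry` and its methods are hypothetical and only mimic the behavior described here; they are not the MLflow client API.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry; not the MLflow client."""

    def __init__(self):
        self.stages = {}   # version number -> stage name
        self.aliases = {}  # alias -> version number

    def register(self, version):
        self.stages[version] = "None"

    def transition(self, version, stage, archive_existing_versions=False):
        if archive_existing_versions:
            # Demote whichever other version currently holds the target stage.
            for other, current in self.stages.items():
                if other != version and current == stage:
                    self.stages[other] = "Archived"
        self.stages[version] = stage

    def set_alias(self, alias, version):
        self.aliases[alias] = version

    def resolve(self, uri):
        # Accepts URIs shaped like "models:/<name>@<alias>"; the name is ignored.
        alias = uri.rsplit("@", 1)[1]
        return self.aliases[alias]


reg = ToyRegistry()
reg.register(7)
reg.register(8)
reg.transition(7, "Production")
reg.set_alias("prod", 7)

# Promote v8: the incumbent is archived and the alias is re-pointed.
reg.transition(8, "Production", archive_existing_versions=True)
reg.set_alias("prod", 8)
print(reg.stages, reg.resolve("models:/churn_model@prod"))
```

The serving side only ever asks for `models:/churn_model@prod`, so the promotion changes what it loads without any change to serving code.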

The three dominant registries diverge sharply. MLflow is OSS and vendor-neutral; governance is bring-your-own (tags, PR-driven alias flips). Vertex AI is GCP-native with deep end-to-end lineage and alias/label/endpoint-driven promotion (no fixed enum). SageMaker leans hardest on governance: every Model Package has an explicit ModelApprovalStatus, audited via CloudTrail and gated by IAM. Across all three, the rule is universal: promotion requires stronger permissions than read access.

Figure 10.2: MLflow model registry lifecycle

Key Points

A registry binds artifact + metadata + lineage. Without all three, a model is just a `.bin` in a bucket.
Stages give a defined lifecycle. MLflow's None → Staging → Production → Archived prevents ambiguity about what is live.
Aliases decouple serving from version numbers. `models:/name@prod` keeps working when `prod` re-points to v8.
SageMaker = strongest governance. Explicit `ModelApprovalStatus` and Model Cards make audits straightforward.
RBAC is non-negotiable. Promotion requires stronger permissions than read access — otherwise the registry is a sticky note.

Section 3: Containerization for Serving

Pre-Reading Quiz — Containerization

1. What is the primary advantage of a multi-stage Docker build for a model-serving image?

It encrypts model weights at rest.
It strictly separates heavy build-time tooling from a lean runtime, shrinking the final image and attack surface.
It guarantees GPU access without any host driver.
It auto-versions images using semantic versioning.

2. Why is image size considered a cold-start concern, not just a storage concern?

Larger images consume more GPU memory at inference time.
Each extra gigabyte must be pulled to a fresh node before the first prediction can be served.
Cold starts are measured in image size, not seconds.
Smaller images are required by the ONNX standard.

3. Which of the following is the recommended pattern for GPU containers?

Bake the NVIDIA kernel driver into the image for portability.
Run privileged containers with no version pinning.
Rely on the host's NVIDIA kernel driver, ship only the CUDA runtime in the container, and pin every version.
Use the `latest` tag for the base image to always get security fixes.

A serving image is a contract: "given an HTTP/gRPC request, I will load this exact artifact in this exact runtime and return a prediction." The cleanest way to honor that contract is a multi-stage Dockerfile. A builder stage based on a CUDA `-devel` image installs compilers, wheels, and any custom CUDA kernels. A runtime stage based on `nvcr.io/nvidia/tritonserver:<version>-py3` then uses `COPY --from=builder` to bring over only the produced artifacts. The result contains the model and serving binary but no compiler, no header files, and no `apt` cache.

Vendor base images save weeks of CUDA debugging. NVIDIA Triton is the multi-backend powerhouse, hosting ONNX, TorchScript, SavedModel, TensorRT, and Python backends from one process — the reason many teams adopt a "default ONNX, native exceptions" pattern. TorchServe starts from `pytorch/torchserve` and requires aligning the PyTorch CUDA build (`+cu121`) with the base image. TF Serving publishes lean C++ CPU and GPU variants.

Image size is cold-start latency: every extra gigabyte must be pulled to a fresh Kubernetes node before the first prediction. The levers are multi-stage builds, runtime-only CUDA images, `pip install --no-cache-dir`, no shells or editors in the runtime stage, stripped debug symbols, and model repositories mounted from object storage rather than baked into the image.

GPU container hygiene means: do not bake driver components into the image; run with the NVIDIA Container Runtime; pin every version (no `latest`); rebuild regularly with `docker build --pull`; run as a non-root user (`USER triton`); set `readOnlyRootFilesystem: true`; drop unneeded capabilities; expose only required ports (Triton: 8000/8001/8002); and scan every build with Trivy or Grype.
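Two of these rules are easy to turn into CI checks. The sketch below is a rough illustration under stated assumptions: `pin_level` and `pull_seconds` are hypothetical helpers, the image names are made up, and the pull estimate ignores compression, layer caching, and registry throttling.

```python
def pin_level(image_ref):
    """Classify an image reference as 'digest', 'tag', or 'floating'."""
    if "@sha256:" in image_ref:
        return "digest"  # content-addressed, so it cannot change underneath you
    # Look only at the last path component so a registry port is not
    # mistaken for a tag (e.g. "localhost:5000/serving").
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return "floating"  # no tag resolves to the default tag, i.e. latest
    tag = last_component.split(":", 1)[1]
    return "floating" if tag == "latest" else "tag"


def pull_seconds(image_gb, link_gbit_per_s):
    """Lower bound on pull time: gigabytes * 8 bits per byte / link speed."""
    return image_gb * 8 / link_gbit_per_s


refs = [
    "nvcr.io/nvidia/tritonserver:latest",
    "localhost:5000/serving",
    "example.com/team/serving:1.4.2",
    "example.com/team/serving@sha256:" + "0" * 64,
]
levels = [pin_level(r) for r in refs]
print(levels)
print(pull_seconds(6, 1))  # a 6 GB image over a 1 Gbit/s link
```

A build could fail on any `floating` reference, and the pull estimate gives a quick sense of how much cold-start time each extra gigabyte costs on a given network.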

Figure 10.3: Multi-stage container build pipeline

Key Points

Multi-stage = small + secure. Heavy builder, lean runtime; only artifacts cross via `COPY --from=builder`.
Triton is the multi-backend default. One image serves ONNX, TorchScript, SavedModel, TensorRT, and Python backends.
Cold start = image size. Every GB is pulled to fresh nodes before the first prediction; mount big artifacts from object storage.
GPU hygiene. Host owns the kernel driver; container owns the CUDA runtime; never `--privileged`; pin everything.
Harden + scan. Non-root user, `readOnlyRootFilesystem`, dropped caps, Trivy/Grype in CI failing on critical CVEs.

Section 4: Versioning Strategy

Pre-Reading Quiz — Versioning Strategy

1. In the model-adapted reading of semantic versioning, which change warrants a MAJOR bump?

A retrain on fresh data producing the same input/output schema.
A bug fix in preprocessing that does not measurably change predictions.
A new input feature or a different output schema that forces consumers to update integration code.
A reduction in latency from a new ONNX export.

2. To make a model fully reproducible, the registry version must link to which combination?

Only the model artifact.
Model artifact + Git commit SHA + dataset version/hash.
Only the training metrics.
Only the container image digest.

3. What is the "non-negotiable property" of a good rollback strategy described in the chapter?

Rollback must require human approval over the weekend.
Rollback must trigger a full retraining run.
Rollback must NOT require retraining; previous versions stay available and one alias-flip away.
Rollback must delete the previous failing version permanently.

Semantic versioning (`MAJOR.MINOR.PATCH`) adapts to models if you read the numbers as a contract with consumers of predictions. MAJOR = breaking interface change (new feature, new output schema, re-encoded labels). MINOR = compatible behavior change (retrain on fresh data; same shape, similar quality). PATCH = transparent fix (preprocessing bug fix, optimization preserving outputs). That discipline is what lets a downstream team pin `churn_model >=2.3,<3.0`.
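The bump rules and the consumer-side pin can be written down in a few lines. This is a minimal sketch: `next_version` and `satisfies` are hypothetical helpers, and deciding whether the interface or the behavior changed is left to the caller.

```python
def _parse(version):
    """'2.3.1' -> (2, 3, 1); shorter forms such as '2.3' are allowed."""
    return tuple(int(part) for part in version.split("."))


def next_version(current, *, interface_changed, behavior_changed):
    """Apply the model-adapted SemVer rules to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = _parse(current)
    if interface_changed:   # new feature, new output schema, re-encoded labels
        return f"{major + 1}.0.0"
    if behavior_changed:    # retrain on fresh data with the same schema
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # transparent fix


def satisfies(version, lower, upper):
    """True when lower <= version < upper, as in a pin like '>=2.3,<3.0'."""
    return _parse(lower) <= _parse(version) < _parse(upper)


print(next_version("2.3.1", interface_changed=False, behavior_changed=True))
print(satisfies("2.7.1", "2.3", "3.0"), satisfies("3.0.0", "2.3", "3.0"))
```

A consumer pinned to `>=2.3,<3.0` keeps receiving retrains and fixes automatically, and is shielded from the schema change that a 3.0.0 release signals.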

A model version is meaningful only as a triple: the artifact, the exact code SHA that trained it, and the exact dataset version that fed it. Miss any one and reproducibility is gone. The practical pattern stitches Git (commit SHA) + DVC/LakeFS/Delta time-travel (dataset hash) + registry (MLflow version, Vertex Model, SageMaker Model Package). MLflow records the Git commit on a run when training is launched from a Git checkout, so drilling from "prod v7" back to "trained by run abc123, SHA 7f3a9c1, dataset sha256:e8b..." is a single query, provided the dataset version was logged alongside it.
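One way to enforce the triple is to refuse registration when any link is missing. The sketch below assumes a hypothetical `LineageRecord` shape and uses placeholder bytes in place of real weights; in practice the hashes would come from the artifact store and the data versioning tool.

```python
import hashlib
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical registry entry linking a version to its three inputs."""
    model_version: str
    artifact_sha256: Optional[str] = None
    code_sha: Optional[str] = None
    dataset_sha256: Optional[str] = None


def sha256_hex(data: bytes) -> str:
    """Content hash used to identify an artifact or dataset snapshot."""
    return hashlib.sha256(data).hexdigest()


def missing_links(record):
    """Names of lineage fields that were never filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


record = LineageRecord(
    model_version="v7",
    artifact_sha256=sha256_hex(b"placeholder weights"),
    code_sha="7f3a9c1",
    # dataset_sha256 deliberately omitted to show the check firing
)
gaps = missing_links(record)
print(gaps)
```

A registration hook that rejects any record with a non-empty `gaps` list makes the reproducibility triple a property of the registry, not a habit people have to remember.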

Promotion is a checklist, not a hunch. A defensible gate covers: offline quality vs. current prod, subgroup fairness, p95 latency under SLA, shadow/canary results, attached Model Card and approval ticket, and complete lineage. Rollback is the dual of promotion and must be just as cheap. Three patterns dominate: re-point the alias (MLflow/Vertex), re-approve a previous Model Package (SageMaker), or shift traffic in a blue/green split. The non-negotiable property: rollback must never require retraining.
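The checklist and the alias-flip rollback can be sketched together. Everything here is illustrative: the metric names, the thresholds, the use of worst-subgroup AUC as the fairness check, and the `previous` alias are assumptions, not a prescribed gate.

```python
def failed_gates(candidate, prod, *, p95_sla_ms):
    """Return the names of every promotion check the candidate fails."""
    checks = {
        "offline quality >= prod": candidate["auc"] >= prod["auc"],
        "worst subgroup >= prod":
            candidate["worst_subgroup_auc"] >= prod["worst_subgroup_auc"],
        "p95 latency under SLA": candidate["p95_ms"] <= p95_sla_ms,
        "canary passed": candidate["canary_ok"],
        "model card attached": bool(candidate["model_card"]),
        "lineage complete": all(candidate["lineage"].values()),
    }
    return [name for name, ok in checks.items() if not ok]


def promote_or_hold(aliases, candidate, prod, *, p95_sla_ms):
    """Move the prod alias only when every gate passes."""
    failures = failed_gates(candidate, prod, p95_sla_ms=p95_sla_ms)
    if not failures:
        aliases["previous"] = aliases.get("prod")  # remember the rollback target
        aliases["prod"] = candidate["version"]
    return failures


def rollback(aliases):
    """Swap prod back to the previous version; no retraining involved."""
    aliases["prod"], aliases["previous"] = aliases["previous"], aliases["prod"]


prod = {"version": 7, "auc": 0.91, "worst_subgroup_auc": 0.88}
candidate = {
    "version": 8, "auc": 0.93, "worst_subgroup_auc": 0.89, "p95_ms": 40,
    "canary_ok": True, "model_card": "cards/v8.md",
    "lineage": {"code_sha": "7f3a9c1", "dataset": "dataset-hash-placeholder"},
}
aliases = {"prod": 7}
failures = promote_or_hold(aliases, candidate, prod, p95_sla_ms=50)
after_promotion = dict(aliases)
rollback(aliases)
after_rollback = dict(aliases)
print(failures, after_promotion, after_rollback)
```

Promotion and rollback both reduce to a dictionary update, which is the property the paragraph asks for: undoing a release costs no more than making one.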

Figure 10.4: Linkage between a model version and its inputs

Figure 10.5: Promotion workflow

sequenceDiagram
    participant DS as Data Scientist
    participant CI as CI Pipeline
    participant REG as Model Registry
    participant SRE as SRE / Approver
    participant PROD as Prod Endpoint
    DS->>CI: Push training code + config
    CI->>CI: Train, evaluate, log run
    CI->>REG: Register version v7 (None)
    CI->>REG: Run quality / latency / fairness gates
    REG->>SRE: Notify PendingApproval
    SRE->>REG: Review Model Card + lineage
    SRE->>REG: Approve / move @prod alias
    REG->>PROD: Serving container reloads model
    PROD-->>DS: Live predictions on v7

Key Points

SemVer for models. MAJOR = breaking interface, MINOR = compatible behavior, PATCH = transparent fix.
The reproducibility triple. Model + code SHA + dataset version; missing one means unreproducible.
Promotion = checklist. Quality, fairness, latency, shadow/canary, Model Card, full lineage — not a hunch.
Rollback is one alias flip. If undoing a bad model requires retraining, your registry is failing at its job of keeping earlier versions intact and ready to serve.
Layered versioning. Code, data, model, container image digest, and deployment endpoint config all carry versions that lock together.
