Study Guide: Model Deployment Patterns: Batch, Online, and Edge

1. Inference Patterns

Pre-Reading Check — Inference Patterns

1. A retailer needs to score nightly customer lifetime value for a CRM team that consumes scores the next morning. Which inference pattern fits best?

A) Online synchronous inference behind a REST API B) Batch offline inference scheduled by Airflow C) Streaming inference on a Kafka topic D) Embedded inference on the device

2. Why does online inference cost the most per prediction of any pattern?

A) Online models always run on GPUs B) Capacity must be provisioned for peak QPS with strict latency SLOs, leaving headroom idle C) Online models cannot use spot instances D) Online predictions are always written to a database

3. Which property best distinguishes streaming inference from batch and online?

A) It runs only on Kubernetes B) Inputs arrive as a continuous flow of events with near-real-time latency C) Predictions are always returned synchronously to a user D) It eliminates the need for a feature store

A trained model creates no value until it reaches a prediction surface. The four core inference patterns — batch, online, streaming, and embedded — trade off latency, throughput, cost, and complexity. Picking the wrong pattern shows up as operational pain: a fraud system that runs nightly when it should run per-transaction, or a recommender that reloads embeddings every request.

Batch Offline Inference

Batch runs the model on large groups of inputs at scheduled intervals (hourly, nightly), triggered by a scheduler such as Airflow, Argo, Prefect, or cron. The classic stack uses Spark, Beam, or Dask to write predictions to BigQuery, Snowflake, S3, or Redis. Latency is minutes to hours, throughput is enormous (billions of records), and cost per prediction is the lowest of any pattern. Batch is the printing press: expensive setup, but each page is nearly free. It fits wherever downstream consumers tolerate stale predictions: nightly CLV, precomputed top-N recommendations, and historical backfills when a new model ships.
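
The batch pattern can be sketched in plain Python as a chunked scoring loop. This is a minimal illustration, not a Spark or Beam job: `run_batch_job`, `toy_clv_model`, the row fields, and the in-memory `sink` are all hypothetical stand-ins for a real model, table, and writer.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_batch_job(rows: Iterable[dict],
                  predict_batch: Callable[[List[dict]], List[float]],
                  write: Callable[[List[dict]], None],
                  chunk_size: int = 1000) -> int:
    """Score every row chunk by chunk and hand each scored chunk to `write`."""
    total = 0
    for chunk in chunked(rows, chunk_size):
        scores = predict_batch(chunk)  # one model call per chunk, not per row
        write([{"customer_id": r["customer_id"], "clv": s}
               for r, s in zip(chunk, scores)])
        total += len(chunk)
    return total


def toy_clv_model(chunk: List[dict]) -> List[float]:
    """Hypothetical model: CLV = average order value * expected orders."""
    return [r["avg_order"] * r["expected_orders"] for r in chunk]


sink: List[dict] = []  # stand-in for a warehouse table or key-value store
customers = [{"customer_id": i, "avg_order": 20.0, "expected_orders": i}
             for i in range(5)]
scored = run_batch_job(customers, toy_clv_model, sink.extend, chunk_size=2)
```

In a real deployment the scheduler would invoke a job like this on its interval, and the chunk size would be tuned to the model's preferred batch size.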

Synchronous Online Inference

Online runs per request behind REST or gRPC, fetching features and returning a response in the same round trip. Typical latency budgets are 1–200ms at p95. The stack is a serving runtime (FastAPI, TF Serving, TorchServe, BentoML, Triton) plus a feature store (Feast, Tecton) or low-latency cache. Because the model sits in the critical path, capacity must cover peak QPS, GPUs may sit idle to guarantee p99 latency, and cost per prediction is the highest. Online is unavoidable when a human is waiting — fraud during card authorization, search ranking, autocomplete, chatbot replies.
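
The cost argument above is simple arithmetic, sketched here under stated assumptions: the fleet is sized for peak QPS plus headroom, but the bill is spread over the predictions actually served at average QPS. The function names, the 30% headroom, and the traffic and price figures are illustrative, not taken from any real system.

```python
import math


def replicas_for_peak(peak_qps: float, per_replica_qps: float,
                      headroom: float = 0.3) -> int:
    """Replicas needed so peak traffic uses at most (1 - headroom) of capacity."""
    usable_qps = per_replica_qps * (1.0 - headroom)
    return math.ceil(peak_qps / usable_qps)


def cost_per_1k_predictions(replicas: int, hourly_cost_per_replica: float,
                            avg_qps: float) -> float:
    """Hourly fleet cost divided by thousands of predictions served per hour."""
    predictions_per_hour = avg_qps * 3600
    return replicas * hourly_cost_per_replica / (predictions_per_hour / 1000)


# Hypothetical service: peaks at 2000 QPS, averages 400 QPS.
replicas = replicas_for_peak(peak_qps=2000, per_replica_qps=100)
cost_at_average = cost_per_1k_predictions(replicas, 1.0, avg_qps=400)
cost_if_always_at_peak = cost_per_1k_predictions(replicas, 1.0, avg_qps=2000)
```

The gap between `cost_at_average` and `cost_if_always_at_peak` is the price of idle headroom that a batch job, which can run its hardware near full utilization, does not pay.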

Streaming and Async Inference

Streaming sits between batch and online: the model runs continuously on Kafka, Kinesis, or Pub/Sub event flows processed by Flink, Spark Structured Streaming, Kafka Streams, or Beam, and its predictions are emitted as another stream. Latency is sub-second to a few seconds, throughput reaches millions of events per second, and operational complexity is the highest because checkpointing, exactly-once semantics, backpressure, and stateful windowing all matter. Streaming shines for trending content, fraud with rolling 5-minute features, and live moderation. A lighter cousin, async inference, enqueues requests for background workers and delivers results via polling or callbacks, which is useful for slow models such as document summarization.
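
A rolling 5-minute feature of the kind mentioned above can be sketched with a deque. This is a single-process illustration that assumes events arrive in timestamp order; a real stream processor would add checkpointing and late-event handling. The class name and feature names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class RollingWindow:
    """Per-key rolling window over (timestamp_seconds, amount) events."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, float]] = deque()
        self.total = 0.0

    def add(self, ts: float, amount: float) -> None:
        """Record a new event, then drop events that fell out of the window."""
        self.events.append((ts, amount))
        self.total += amount
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Events at or before (now - window) are no longer in the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_amount = self.events.popleft()
            self.total -= old_amount

    def features(self) -> Dict[str, float]:
        """Features a fraud model might consume for the current event."""
        return {"txn_count_5m": len(self.events), "txn_sum_5m": self.total}


window = RollingWindow(window_seconds=300)
window.add(ts=0, amount=10.0)
window.add(ts=100, amount=20.0)
window.add(ts=301, amount=5.0)  # the event at ts=0 is now outside the window
current = window.features()
```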

Embedded and On-Device Inference

Embedded runs on the device generating the data: phone, camera, car, sensor. There is no network call. It unlocks three properties no server-side pattern matches: privacy (raw input never leaves the device), single-digit-ms latency (no round trip), and reliability in poor connectivity. The trade-offs are equally real: model size, RAM, battery, thermal limits, OTA-update engineering, and lost server-side observability.

Dimension	Batch	Online	Streaming	Embedded/Edge
Trigger	Schedule	Per request	Event flow	Local app
Latency	Minutes-hours	1-200ms p95	Seconds or sub-second	Single-digit ms
Cost/pred.	Low	High	Medium-High	No server cost at runtime
Complexity	Low-Medium	Medium-High	High	High (compression + OTA)
Use case	Nightly CLV	Fraud auth	Trending content	AR filters

Most production systems combine several patterns: a recommender precomputes candidates nightly (batch), freshens "recently viewed" features via streaming, and runs final ranking online at page load.

2. Safe Rollout Strategies

Pre-Reading Check — Safe Rollout Strategies

1. What is the key property of shadow (mirror) deployment that distinguishes it from canary?

A) Shadow sends 5% of traffic to the new model B) Shadow mirrors production traffic to the candidate but its predictions never reach users C) Shadow is a 50/50 randomized statistical experiment D) Shadow swaps entire environments at once

2. A team wants to confirm a new ranking model improves NDCG and conversion. Which strategy is appropriate?

A) Canary with auto-rollback only B) Shadow mode indefinitely C) A/B test with sticky per-user assignment and pre-registered metrics D) Blue-green flip with no measurement window

3. Why must blue-green deployments handle feature-store schemas carefully?

A) Schemas are unrelated to deployment B) Green may depend on user_features_v2; the rollback path needs v1 still computable C) Feature stores must be replaced before any blue-green D) Blue-green only works without feature stores

Shipping a new model is riskier than shipping new code. Models degrade silently — no 5xx errors, no stack traces, just worse predictions. Labels arrive with delay, so you may not know a rollout is bad for hours or days. The four rollout strategies — shadow, canary, A/B, and blue-green — exist to manage this risk. Mature teams use them in sequence.

Shadow and Mirror Traffic

Shadow mode tests a candidate under real production traffic without exposing users. Production continues to serve; the candidate receives a mirrored copy, runs inference, and logs predictions for offline comparison — outputs never reach users. Istio's VirtualService mirror directive, Seldon's shadow predictor, and KServe with mesh-level mirroring all implement this. The hard problem is suppressing side effects: if the candidate writes to a DB or calls external APIs, those must be disabled or rerouted.
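
At the application level, the shadow contract can be sketched as below. The sketch is an assumption-laden simplification: a service mesh usually mirrors traffic asynchronously at the network layer, whereas here the candidate is called inline. `serve_with_shadow` and the log record fields are hypothetical names.

```python
from typing import Any, Callable, Dict, List


def serve_with_shadow(request: Dict[str, Any],
                      primary: Callable[[Dict[str, Any]], Any],
                      candidate: Callable[[Dict[str, Any]], Any],
                      shadow_log: List[Dict[str, Any]]) -> Any:
    """Return the primary prediction; record the candidate's for offline comparison."""
    response = primary(request)
    try:
        shadow_prediction = candidate(request)
        shadow_log.append({"request": request,
                           "primary": response,
                           "candidate": shadow_prediction})
    except Exception as exc:
        # A failing candidate must never affect what the user receives.
        shadow_log.append({"request": request,
                           "primary": response,
                           "candidate_error": repr(exc)})
    return response


def broken_candidate(request: Dict[str, Any]) -> float:
    raise ValueError("missing feature")


log: List[Dict[str, Any]] = []
served = serve_with_shadow({"x": 2}, lambda r: r["x"] * 10,
                           lambda r: r["x"] * 11, log)
served_despite_error = serve_with_shadow({"x": 3}, lambda r: r["x"] * 10,
                                         broken_candidate, log)
```

The candidate passed in here should be a side-effect-free build of the new model, in line with the suppression concern noted above.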

Canary and Progressive Rollout

A canary sends a small slice of traffic, typically 1 to 5%, to the new model and watches metrics in near real time. If signals are healthy, traffic ramps to 10%, 25%, 50%, then 100%; if anything degrades, traffic reverts immediately. Canary is operational risk mitigation, not a statistical experiment: its job is to catch catastrophic failures before they reach everyone. Istio supports weighted routing via VirtualService; KServe exposes a canaryTrafficPercent field; Seldon and tools like Argo Rollouts and Flagger automate the ramp.

Monitoring during canary spans three classes: system metrics (p50/p95/p99 latency, error rates), model quality (CTR, conversion, AUC once labels arrive), and data/drift (feature distribution shifts, training-serving skew). Because labels often lag, early canary decisions lean on proxy metrics. Ensure canary traffic is representative (don't accidentally route only one region), and use sticky assignment so users don't see flipping behavior.
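
The ramp-and-revert logic can be sketched as a loop over traffic percentages with a health gate at each step. Only system metrics are checked here; the thresholds, metric names, and the `observe` callback are hypothetical, and tools such as Argo Rollouts or Flagger implement far richer versions of this idea.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
RAMP_STEPS = [5, 10, 25, 50, 100]  # percent of traffic on the new model


def healthy(canary: Metrics, baseline: Metrics,
            max_latency_ratio: float = 1.2,
            max_error_delta: float = 0.005) -> bool:
    """Canary passes if p95 latency and error rate stay close to baseline."""
    return (canary["p95_ms"] <= baseline["p95_ms"] * max_latency_ratio
            and canary["error_rate"] <= baseline["error_rate"] + max_error_delta)


def run_ramp(observe: Callable[[int], Tuple[Metrics, Metrics]],
             steps: List[int] = RAMP_STEPS) -> Tuple[int, List[int]]:
    """Walk the ramp; return (final traffic percent, steps that passed)."""
    passed: List[int] = []
    for pct in steps:
        canary, baseline = observe(pct)
        if not healthy(canary, baseline):
            return 0, passed  # revert all traffic to the old model
        passed.append(pct)
    return steps[-1], passed


def observe_degrading(pct: int) -> Tuple[Metrics, Metrics]:
    """Hypothetical metrics source: latency blows up once load reaches 25%."""
    baseline = {"p95_ms": 80.0, "error_rate": 0.001}
    canary = {"p95_ms": 85.0 if pct < 25 else 200.0, "error_rate": 0.001}
    return canary, baseline


final_pct, passed_steps = run_ramp(observe_degrading)
```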

A/B Tests and Multi-Armed Bandits

A/B testing is a randomized statistical experiment, not a rollout. A canary asks "is the new model breaking?"; an A/B asks "is the new model genuinely better on business metrics?" Typical setup is 50/50 with sticky per-user assignment, run for a predetermined window with pre-registered primary and secondary metrics. Implementations separate assignment (consistent hashing over user_id) from routing (mesh routes on a variant header). For ranking, evaluate at list level (NDCG, MAP). Watch for leakage via shared caches. For exploration policies, multi-armed bandits (epsilon-greedy, Thompson sampling, UCB) shift traffic dynamically — more sample-efficient than fixed-allocation A/B.
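
Sticky per-user assignment via hashing can be sketched as follows. The function name, the experiment label, and the choice of SHA-256 are illustrative assumptions; the point is that the variant is a pure function of the experiment and the user_id, so no assignment table is needed and a user never flips between arms.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    # Salting with the experiment name decorrelates assignments across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # approximately uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


first = assign_variant("user-42", "ranker-v2")
second = assign_variant("user-42", "ranker-v2")  # same user, same answer
treated = sum(assign_variant(f"user-{i}", "ranker-v2") == "treatment"
              for i in range(10_000))
```

The routing layer can then read the returned variant from a request header, which keeps assignment separate from routing as described above.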

Blue-Green and Rollback

Blue-green maintains two complete environments — blue (current production) and green (new version) — and switches all traffic at once via a single config change (an Istio DestinationRule flip or load balancer retarget). Rollback is the same change in reverse. Blue-green fits major bundled changes: new architecture, new feature store schema, redesigned ranking. ML challenges are state and schemas: online-updating models diverge between blue/green, and a new schema (user_features_v2) requires v1 still computable for the rollback path — usually via versioned feature views and dual-write windows.
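
The flip, the reverse flip, and the schema precondition can be sketched with a small in-process router. This is a toy model of what a load balancer or mesh config change does; the class, the environment dictionaries, and the feature view names other than user_features_v2 are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Optional


class BlueGreenRouter:
    """All traffic goes to exactly one environment at a time."""

    def __init__(self, environments: Dict[str, Dict[str, Any]],
                 active: str = "blue") -> None:
        # name -> {"model": callable, "feature_view": str}
        self.environments = environments
        self.active = active
        self.previous: Optional[str] = None

    def switch(self, target: str, available_feature_views: Iterable[str]) -> None:
        """Flip to `target` only if both sides' feature views stay computable."""
        needed = {self.environments[target]["feature_view"],
                  self.environments[self.active]["feature_view"]}
        missing = needed - set(available_feature_views)
        if missing:
            raise RuntimeError(f"feature views not computable: {sorted(missing)}")
        self.previous, self.active = self.active, target

    def rollback(self) -> None:
        """The same change in reverse."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.active, self.previous = self.previous, self.active

    def predict(self, features: Dict[str, float]) -> Any:
        model: Callable[[Dict[str, float]], Any] = self.environments[self.active]["model"]
        return model(features)


router = BlueGreenRouter({
    "blue": {"model": lambda f: "blue-score", "feature_view": "user_features_v1"},
    "green": {"model": lambda f: "green-score", "feature_view": "user_features_v2"},
})
router.switch("green", ["user_features_v1", "user_features_v2"])
after_switch = router.predict({})
router.rollback()
after_rollback = router.predict({})
```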

Strategy	User Impact	Traffic Split	Best For	Rollback
Shadow	None	100% mirrored, 0% served	Validating safety on real data	Stop mirroring
Canary	Small slice	1-5% ramping to 100%	Catching regressions early	Revert weights
A/B Test	Half users	50/50 sticky assignment	Measuring true uplift	Terminate experiment
Blue-Green	All-or-nothing	0% or 100%	Major bundled changes	Flip routing back

3. Edge and Mobile Deployment

Pre-Reading Check — Edge and Mobile Deployment

1. Which edge framework is the native runtime for Apple's Neural Engine?

A) TFLite B) Core ML C) ONNX Runtime D) PyTorch Mobile

2. When should you escalate from PTQ to QAT?

A) Always — QAT is universally better B) When PTQ produces an unacceptable accuracy gap on task-specific metrics C) Only when the model has no quantization tooling available D) Only for float16 conversions

3. Why does magnitude-based (unstructured) pruning rarely speed up mobile inference?

A) It always increases model size B) Most edge runtimes lack optimized sparse kernels to exploit zeroed weights C) Mobile devices reject pruned models D) Unstructured pruning always destroys accuracy

Edge deployment trades server-side flexibility for privacy, latency, and offline operation. The constraint set flips: instead of optimizing for cluster throughput, you optimize for milliseconds and milliwatts on a phone CPU or microcontroller. Three compression techniques — quantization, pruning, distillation — and four frameworks — TFLite, Core ML, ONNX Runtime, PyTorch Mobile — form the working vocabulary.

Edge Frameworks

TFLite dominates Android and microcontrollers, with mature PTQ/QAT, NNAPI/GPU delegates, and TFLM for kilobyte-RAM devices. PyTorch models must round-trip through ONNX or custom converters. Core ML is Apple's native runtime, the only way to fully exploit the Neural Engine across iPhone/iPad/Mac/Watch; coremltools auto-partitions work across NE/GPU/CPU. ONNX Runtime is the cross-platform option — one ONNX model runs on Android (NNAPI/XNNPACK), iOS (Core ML EP), desktop, and server — at the cost of opset version discipline. PyTorch Mobile and ExecuTorch minimize conversion friction for PyTorch-first teams; production teams often still convert to TFLite or ONNX for final optimization on tiny devices.

INT8 and INT4 Quantization plus Pruning

Quantization reduces numeric precision (float32 → int8), giving roughly 4× smaller models and faster inference on NNAPI, the Neural Engine, Edge TPUs, and XNNPACK. Post-Training Quantization (PTQ) uses a small calibration set, takes minutes, and needs no retraining, but it is weaker on small models, transformers, and highly nonlinear architectures. Quantization-Aware Training (QAT) inserts quantization stubs during training so the weights learn robustness to int8 arithmetic; it preserves accuracy at the cost of a training pipeline. INT4 with mixed precision (GPTQ, AWQ) extends this to roughly 8× shrinkage for LLMs on phones.
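
The core float32 → int8 mapping can be sketched with affine (scale and zero-point) quantization in plain Python. This is a per-tensor illustration under simple min/max calibration, not the exact scheme of any particular framework; the function names and sample weights are made up.

```python
from typing import List, Tuple


def quant_params(values: List[float], qmin: int = -128,
                 qmax: int = 127) -> Tuple[float, int]:
    """Pick scale and zero point so [min, max] (widened to include 0) maps to int8."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> List[int]:
    """Round to the nearest integer step and clamp to the int8 range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = quant_params(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of four, which is where the roughly 4× size reduction comes from; `max_error` shows the precision given up in exchange.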

Pruning is complementary. Magnitude-based (unstructured) pruning zeroes small weights; it shrinks the stored model but rarely speeds up inference, because most edge runtimes lack optimized sparse kernels. Structured pruning removes whole channels, attention heads, or transformer blocks; it directly shrinks the graph and delivers latency wins, at higher accuracy risk.
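
The difference between the two pruning styles shows up in the shape of the result, sketched here on a weight matrix stored as a list of rows. Treating each row as one output channel and ranking channels by L1 norm are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def magnitude_prune(matrix: Matrix, sparsity: float) -> Matrix:
    """Zero the smallest-magnitude fraction of weights; the shape is unchanged."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def structured_prune(matrix: Matrix, keep: int) -> Matrix:
    """Keep only the `keep` rows (output channels) with the largest L1 norm."""
    ranked = sorted(range(len(matrix)),
                    key=lambda i: sum(abs(w) for w in matrix[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original channel order
    return [matrix[i][:] for i in kept]


W = [[0.9, -0.1, 0.05],
     [0.02, 0.01, -0.03],
     [-0.8, 0.7, 0.2]]
sparse_W = magnitude_prune(W, sparsity=0.5)  # still 3x3, with zeros inside
small_W = structured_prune(W, keep=2)        # now 2x3, a genuinely smaller layer
```

A dense kernel multiplies through the zeros in `sparse_W` at full cost, while `small_W` simply has fewer rows to compute.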

Knowledge Distillation

Distillation trains a small student to mimic a large teacher using soft logits (with a temperature) plus hard labels. The student is designed from scratch for the device budget rather than retrofitted from a too-big architecture, and often beats direct compression. It is dominant for shipping transformers to mobile: TinyBERT, DistilBERT, MobileBERT. You can pile pruning and quantization on top.
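
The combined objective can be sketched for a single example in plain Python. The weighting `alpha`, the temperature value, and the T² scaling of the soft term follow the common formulation of distillation and are assumptions here rather than something this guide specifies.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      label: int, temperature: float = 2.0,
                      alpha: float = 0.5) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: match the teacher's softened distribution.
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)
    # Hard term: ordinary cross-entropy against the true label at T = 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


teacher = [3.0, 1.0, 0.2]
loss_matching = distillation_loss(teacher, teacher, label=0)
loss_untrained = distillation_loss([0.1, 0.2, 0.3], teacher, label=0)
```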

OTA Model Updates

Models age fast: drift, new content, new attacks, bug fixes. OTA updates decouple the model artifact from the app binary: download from a CDN or model server, verify signatures, atomically swap, and fall back if health checks fail. Best practices include staged rollouts (a small percentage of devices first), differential updates, on-device A/B between versions, and telemetry on latency, prediction distributions, and energy use reported back to detect regressions in the wild.

Recommended compression order: mobile-architected baseline (MobileNet, EfficientNet-Lite, distilled transformer) → structured pruning → distill if needed → PTQ first, QAT if accuracy demands. Always profile on the actual target device with the final framework, because desktop benchmarks systematically mislead.
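
The OTA install flow described in this section (verify, stage, health-check, swap) can be sketched with the standard library. Two simplifications are assumed: a SHA-256 checksum stands in for real signature verification, and the health check runs on the staged file before the swap so that a failed update never replaces the active model. `install_model` and `health_check` are hypothetical names.

```python
import hashlib
import os
import tempfile
from typing import Callable


def install_model(new_bytes: bytes, expected_sha256: str, active_path: str,
                  health_check: Callable[[str], bool]) -> bool:
    """Install a downloaded model; keep the current one on any failure."""
    # 1. Integrity check (stand-in for verifying a publisher signature).
    if hashlib.sha256(new_bytes).hexdigest() != expected_sha256:
        return False

    # 2. Stage next to the active file so the final rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(active_path))
    fd, staged_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as staged:
        staged.write(new_bytes)

    # 3. Health-check the staged artifact, e.g. load it and run a smoke input.
    if not health_check(staged_path):
        os.remove(staged_path)
        return False

    # 4. Atomic swap: readers see either the old file or the new one, never a mix.
    os.replace(staged_path, active_path)
    return True
```

A caller would pass the bytes fetched from the CDN together with the checksum published alongside them; the previous model stays in place whenever the function returns False.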

Framework	Primary Target	Strengths	Trade-offs
TFLite	Android, MCUs	Mature PTQ/QAT, NNAPI/GPU delegates, TFLM	Best for TF; PyTorch needs conversion
Core ML	Apple	Neural Engine, low power, auto partition	Apple platforms only; custom layers for novel ops
ONNX Runtime	Cross-platform	One model across Android/iOS/desktop	Opset version care, glue code
PyTorch Mobile	PyTorch teams	TorchScript alignment, no conversion	Less lean than TFLite for tiny devices

4. Serving Platforms

Pre-Reading Check — Serving Platforms

1. Which self-hosted platform's CRD natively treats shadow predictors, A/B routing, ensembles, and outlier detectors as first-class concepts?

A) KServe B) BentoML C) Seldon Core D) Cloud Run

2. A startup has low, spiky inference traffic and a small model (~300 MB). Which serving option is most likely cheapest?

A) Always-on KServe cluster with GPUs B) Serverless (Lambda / Cloud Run) with scale-to-zero C) Self-hosted Seldon with peak-QPS provisioning D) Multi-region SageMaker MMEs

3. Why might a team adopt multi-model serving like SageMaker MME or Triton model repository?

A) To shrink any model's accuracy gap B) To pack many model variants into shared pods when one pod per model is uneconomic C) To eliminate the need for autoscaling D) To bypass Kubernetes entirely

The serving platform is the substrate that turns a model artifact into a live prediction service: autoscaling, traffic splitting, monitoring, multi-model packing, and the glue between registry and user. The landscape splits into four categories.

Self-Hosted: KServe, BentoML, Seldon

KServe (formerly KFServing) is Kubernetes-native and serverless-style. Its InferenceService CRD packages predictor, transformer, and explainer components, speaks the Open Inference Protocol across many backends (TF Serving, TorchServe, Triton, scikit-learn), and integrates with Knative for scale-to-zero and Istio for routing. The canaryTrafficPercent field makes a percentage-based canary a one-line change. BentoML centers on developer experience: declare a service with Python decorators, package it as a reproducible "bento," and deploy to Docker, Kubernetes, or BentoML Cloud. The result is a fast notebook-to-production path with less granular Kubernetes control. Seldon Core is the most ML-feature-rich: its SeldonDeployment CRD natively understands predictors, shadow predictors, A/B routing, ensembles, explainers, and outlier detectors, and it is tightly integrated with Istio.

Managed: SageMaker, Vertex AI, Azure ML

Upload a model artifact, declare endpoint config, and the cloud handles autoscaling, load balancing, health checks, and (often) canary. SageMaker offers multi-model endpoints, serverless inference, and asynchronous inference modes. Vertex AI integrates tightly with Google's data stack. Azure ML's managed online endpoints have built-in blue-green and traffic-split semantics. Trade-offs: vendor lock-in, opaque pricing at high QPS, limits on customizing the request path. Often the right starting point for small teams that want to ship quickly.

Serverless: Lambda, Cloud Run, Functions

Serverless provisions compute per request, charges by execution time, and scales to zero. For low-QPS or spiky workloads it is dramatically cheaper than always-on serving. Constraints: package size caps (250 MB unzipped for Lambda zip deployments, up to 10 GB for Lambda container images), cold-start latency (seconds for large model loads), no native GPU on most platforms (Cloud Run supports GPUs in limited regions), and a request-response model that suits neither batched nor streaming inference. Cloud Run is the most ML-friendly because it supports multi-GB container images and its per-instance concurrency amortizes model load cost.
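
Cold starts are usually softened by loading the model once per container instance and reusing it across invocations, as in this sketch. The handler signature, the event shape, and the toy model are hypothetical and not tied to any specific platform's API.

```python
from typing import Any, Callable, Dict, List, Optional

_MODEL: Optional[Callable[[List[float]], float]] = None  # survives warm invocations
LOAD_COUNT = 0  # instrumentation for this sketch only


def _load_model() -> Callable[[List[float]], float]:
    """Stand-in for the slow step: reading weights from disk or object storage."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda features: sum(features)


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, float]:
    """Per-request entry point: pays the load cost only on a cold start."""
    global _MODEL
    if _MODEL is None:  # cold start: first request on this instance
        _MODEL = _load_model()
    return {"prediction": _MODEL(event["features"])}


cold = handler({"features": [1.0, 2.0]})
warm = handler({"features": [3.0, 4.0]})
```

Only the first call triggers `_load_model`; later requests on the same instance reuse the cached object, which is also why higher per-instance concurrency spreads the load cost over more requests.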

Multi-Model and Multi-Tenant Serving

Teams that ship 200 personalized variants — one per merchant or experiment — cannot afford one pod per variant. Multi-model serving packs many models into one serving process with on-demand loading: SageMaker Multi-Model Endpoints, Triton's model repository, BentoML's multi-runner. Trade-offs: cache management (which models stay warm?), cold-load memory pressure, noisy neighbors. Multi-tenant serving generalizes the idea across users/organizations with isolation, quotas, authentication — first-class concerns for SaaS ML products.
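
The cache-management trade-off can be sketched as an LRU cache of loaded models. This is a toy version of on-demand loading; the class name, the `loader` callback, and the per-merchant naming scheme are hypothetical, and a real server would evict by memory footprint rather than by model count.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """Keep at most `capacity` models warm; evict the least recently used."""

    def __init__(self, loader: Callable[[str], Any], capacity: int) -> None:
        self.loader = loader
        self.capacity = capacity
        self._models: "OrderedDict[str, Any]" = OrderedDict()
        self.loads = 0  # counts cold loads, the expensive path

    def get(self, name: str) -> Any:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        model = self.loader(name)  # cold load on first use or after eviction
        self.loads += 1
        self._models[name] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)  # drop the least recently used
        return model

    def warm_models(self) -> list:
        return list(self._models)


cache = ModelCache(loader=lambda name: f"weights-for-{name}", capacity=2)
cache.get("merchant-a")
cache.get("merchant-b")
cache.get("merchant-a")  # warm hit, no load
cache.get("merchant-c")  # evicts merchant-b, the least recently used
cache.get("merchant-b")  # cold again: this is the cost of a small cache
```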

Category	Examples	Strengths	Trade-offs
Self-hosted	KServe, BentoML, Seldon	Full control, ML features, no lock-in	Kubernetes operational burden
Managed	SageMaker, Vertex AI, Azure ML	Quick start, managed scaling/rollout	Lock-in, opaque cost at scale
Serverless	Lambda, Cloud Run, Functions	Cheap for low/spiky QPS, scale-to-zero	Size limits, cold starts, weak GPU
Multi-model	SM MME, Triton, BentoML	Many models per pod, cost efficient	Cache complexity, noisy neighbors

Key Points

KServe: k8s-native, Open Inference Protocol, Knative scale-to-zero, one-line canaryTrafficPercent. Default for k8s+Knative teams.
BentoML: Python-first DX, reproducible "bentos," fast notebook-to-prod. Less granular k8s control.
Seldon Core: richest ML-aware Kubernetes CRDs — shadow predictors, A/B, ensembles, explainers, outlier detectors as first-class.
Managed: SageMaker / Vertex AI / Azure ML trade flexibility for operational simplification. Good first step; watch vendor lock and high-QPS cost.
Serverless: cheapest for low/spiky QPS; capped by size, cold starts, weak GPU. Cloud Run leads ML support.
Multi-model: pack many variants per pod when dedicated capacity is uneconomic; budget for cache management and noisy neighbors.

Post-Reading Quiz

Now that you have studied the chapter, retake the same questions to measure improvement. Answers are revealed at the end.

Post-Reading — Inference Patterns

1. A retailer needs to score nightly customer lifetime value for a CRM team that consumes scores the next morning. Which inference pattern fits best?

A) Online synchronous inference behind a REST API B) Batch offline inference scheduled by Airflow C) Streaming inference on a Kafka topic D) Embedded inference on the device

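2. Why does online inference cost the most per prediction of any pattern?

A) Online models always run on GPUs B) Capacity must be provisioned for peak QPS with strict latency SLOs, leaving headroom idle C) Online models cannot use spot instances D) Online predictions are always written to a database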

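3. Which property best distinguishes streaming inference from batch and online?

A) It runs only on Kubernetes B) Inputs arrive as a continuous flow of events with near-real-time latency C) Predictions are always returned synchronously to a user D) It eliminates the need for a feature store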

Post-Reading — Safe Rollout Strategies

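1. What is the key property of shadow (mirror) deployment that distinguishes it from canary?

A) Shadow sends 5% of traffic to the new model B) Shadow mirrors production traffic to the candidate but its predictions never reach users C) Shadow is a 50/50 randomized statistical experiment D) Shadow swaps entire environments at once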

2. A team wants to confirm a new ranking model improves NDCG and conversion. Which strategy is appropriate?

A) Canary with auto-rollback only B) Shadow mode indefinitely C) A/B test with sticky per-user assignment and pre-registered metrics D) Blue-green flip with no measurement window

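3. Why must blue-green deployments handle feature-store schemas carefully?

A) Schemas are unrelated to deployment B) Green may depend on user_features_v2; the rollback path needs v1 still computable C) Feature stores must be replaced before any blue-green D) Blue-green only works without feature stores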

Post-Reading — Edge and Mobile Deployment

1. Which edge framework is the native runtime for Apple's Neural Engine?

A) TFLite B) Core ML C) ONNX Runtime D) PyTorch Mobile

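2. When should you escalate from PTQ to QAT?

A) Always; QAT is universally better B) When PTQ produces an unacceptable accuracy gap on task-specific metrics C) Only when the model has no quantization tooling available D) Only for float16 conversions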

3. Why does magnitude-based (unstructured) pruning rarely speed up mobile inference?

A) It always increases model size B) Most edge runtimes lack optimized sparse kernels to exploit zeroed weights C) Mobile devices reject pruned models D) Unstructured pruning always destroys accuracy

Post-Reading — Serving Platforms

1. Which self-hosted platform's CRD natively treats shadow predictors, A/B routing, ensembles, and outlier detectors as first-class concepts?

A) KServe B) BentoML C) Seldon Core D) Cloud Run

2. A startup has low, spiky inference traffic and a small model (~300 MB). Which serving option is most likely cheapest?

A) Always-on KServe cluster with GPUs B) Serverless (Lambda / Cloud Run) with scale-to-zero C) Self-hosted Seldon with peak-QPS provisioning D) Multi-region SageMaker MMEs

3. Why might a team adopt multi-model serving like SageMaker MME or Triton model repository?

A) To shrink any model's accuracy gap B) To pack many model variants into shared pods when one pod per model is uneconomic C) To eliminate the need for autoscaling D) To bypass Kubernetes entirely

Chapter 11: Model Deployment Patterns: Batch, Online, and Edge
