Chapter 11: Model Deployment Patterns: Batch, Online, and Edge
Learning Objectives
Differentiate batch, online, streaming, and embedded inference along latency, throughput, cost, and complexity dimensions.
Apply safe rollout strategies — shadow, canary, A/B testing, and blue-green — and know when to compose them.
Design edge and mobile deployments using quantization, pruning, distillation, and OTA updates.
Select the right serving platform — self-hosted, managed, serverless, or multi-model — given team scale and workload.
1. Inference Patterns
Pre-Reading Check — Inference Patterns
1. A retailer needs to score nightly customer lifetime value for a CRM team that consumes scores the next morning. Which inference pattern fits best?
2. Why does online inference cost the most per prediction of any pattern?
3. Which property best distinguishes streaming inference from batch and online?
A trained model creates no value until it reaches a prediction surface. The four core inference patterns — batch, online, streaming, and embedded — trade off latency, throughput, cost, and complexity. Picking the wrong pattern shows up as operational pain: a fraud system that runs nightly when it should run per-transaction, or a recommender that reloads embeddings every request.
Batch Offline Inference
Batch runs the model on large groups of inputs at scheduled intervals (hourly, nightly), triggered by a scheduler (Airflow, Argo, Prefect, cron). The classic stack uses Spark, Beam, or Dask to write predictions to BigQuery, Snowflake, S3, or Redis. Latency is minutes to hours, throughput is enormous (billions of records), and cost per prediction is the lowest of any pattern. Batch is the printing press: expensive setup, but each page is nearly free. It fits wherever downstream consumers tolerate stale predictions: nightly CLV, precomputed top-N recommendations, and historical backfills when a new model ships.
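The batch loop described above can be sketched in a few lines. This is a minimal illustration, not a Spark or Beam job: `run_batch_job`, `toy_clv_model`, and the in-memory `scores` dict (standing in for a warehouse table or Redis) are hypothetical names introduced here.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, List[float]]  # (customer_id, feature vector)


def chunked(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_batch_job(
    records: Iterable[Record],
    predict_batch: Callable[[List[List[float]]], List[float]],
    sink: Dict[str, float],
    chunk_size: int = 1000,
) -> int:
    """Score every record chunk by chunk and write each score to `sink`."""
    written = 0
    for chunk in chunked(records, chunk_size):
        ids = [customer_id for customer_id, _ in chunk]
        scores = predict_batch([features for _, features in chunk])
        for customer_id, score in zip(ids, scores):
            sink[customer_id] = score
            written += 1
    return written


def toy_clv_model(batch: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for a trained model: sums the features."""
    return [sum(features) for features in batch]


customers = [(f"c{i}", [float(i), 1.0]) for i in range(5)]
scores: Dict[str, float] = {}
count = run_batch_job(customers, toy_clv_model, scores, chunk_size=2)
```

Scoring in chunks is what makes the per-prediction cost low: the model is loaded once and each call amortizes overhead across many rows.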
Synchronous Online Inference
Online runs per request behind REST or gRPC, fetching features and returning a response in the same round trip. Typical latency budgets are 1 to 200 ms at p95. The stack is a serving runtime (FastAPI, TF Serving, TorchServe, BentoML, Triton) plus a feature store (Feast, Tecton) or low-latency cache. Because the model sits in the critical path, capacity must cover peak QPS, GPUs may sit idle to guarantee p99 latency, and cost per prediction is the highest of any pattern. Online is unavoidable when a user or a transaction is blocked on the answer: fraud checks during card authorization, search ranking, autocomplete, chatbot replies.
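To see why provisioning for peak QPS drives cost, here is a rough sizing sketch based on Little's law (average in-flight requests = arrival rate × time in system). The function name, the utilization headroom, and the example numbers are illustrative assumptions, not figures from this chapter.

```python
import math


def replicas_for_peak(
    peak_qps: float,
    service_time_ms: float,
    concurrency_per_replica: int,
    target_utilization: float = 0.6,
) -> int:
    """Estimate replicas needed so peak traffic fits within a utilization target."""
    # Little's law: requests in flight = arrival rate * time each request spends in service.
    in_flight = peak_qps * service_time_ms / 1000
    # Keep headroom below 100% utilization so tail latency stays within budget.
    usable_slots = concurrency_per_replica * target_utilization
    return math.ceil(in_flight / usable_slots)


# Hypothetical workload: 2000 QPS at peak, 50 ms per request, 8 concurrent requests per replica.
needed = replicas_for_peak(
    peak_qps=2000, service_time_ms=50, concurrency_per_replica=8, target_utilization=0.5
)
```

Those replicas must exist whenever peak traffic might arrive, even if average traffic is a fraction of it. That idle headroom is the cost batch avoids.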
Streaming and Async Inference
Streaming sits between batch and online: the model runs continuously on Kafka, Kinesis, or Pub/Sub event flows processed by Flink, Spark Structured Streaming, Kafka Streams, or Beam, and the predictions are emitted as another stream. Latency is sub-second to a few seconds, throughput reaches millions of events per second, and operational complexity is the highest of any pattern, because checkpointing, exactly-once semantics, backpressure, and stateful windowing all matter. Streaming shines for trending content, fraud with rolling 5-minute features, and live moderation. A lighter cousin, async inference, enqueues requests for a pool of workers and returns results via polling or callbacks, which is useful for slow models such as document summarization.
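A rolling 5-minute feature is the kind of state a streaming job maintains. The sketch below keeps a per-key sliding-window sum in plain Python. It assumes events arrive in timestamp order and ignores checkpointing and late data, which engines such as Flink handle for you. `RollingWindowSum` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Tuple


class RollingWindowSum:
    """Sum of amounts seen in the last `window_seconds` (in-order events assumed)."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, amount)
        self._total = 0.0

    def add(self, timestamp: float, amount: float) -> float:
        """Record an event and return the window sum as of `timestamp`."""
        self._events.append((timestamp, amount))
        self._total += amount
        self._evict(timestamp)
        return self._total

    def _evict(self, now: float) -> None:
        # Drop events that are a full window or more older than `now`.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_amount = self._events.popleft()
            self._total -= old_amount


card_spend = RollingWindowSum(window_seconds=300.0)
s1 = card_spend.add(0.0, 10.0)
s2 = card_spend.add(100.0, 20.0)
s3 = card_spend.add(310.0, 5.0)   # the event at t=0 has left the window
s4 = card_spend.add(400.0, 1.0)   # the event at t=100 has left the window
```

In a real pipeline one such object would exist per key (for example per card), and its state would need to survive restarts, which is where the operational complexity comes from.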
Embedded and On-Device Inference
Embedded runs on the device generating the data: phone, camera, car, sensor. There is no network call. It unlocks three properties no server-side pattern matches: privacy (raw input never leaves the device), single-digit-ms latency (no round trip), and reliability in poor connectivity. The trade-offs are equally real: model size, RAM, battery, thermal limits, OTA-update engineering, and lost server-side observability.
| Dimension | Batch | Online | Streaming | Embedded/Edge |
|---|---|---|---|---|
| Trigger | Schedule | Per request | Event flow | Local app |
| Latency | Minutes to hours | 1-200 ms p95 | Seconds or sub-second | Single-digit ms |
| Cost/pred. | Low | High | Medium-High | None at runtime |
| Complexity | Low-Medium | Medium-High | High | High (compression + OTA) |
| Use case | Nightly CLV | Fraud auth | Trending content | AR filters |
Most production systems combine several of these patterns: a recommender precomputes candidates nightly (batch), freshens "recently viewed" features via streaming, and runs final ranking online at page load.
graph TD
A[Inference Patterns] --> B[Batch Offline]
A --> C[Online Synchronous]
A --> D[Streaming]
A --> E[Embedded/Edge]
B --> B1[Trigger: Scheduler, Latency: Minutes-Hours, Cost: Low]
B --> B2[Use: Nightly CLV, Precomputed Recs, Backfills]
C --> C1[Trigger: Per Request, Latency: 1-200ms p95, Cost: High]
C --> C2[Use: Fraud Auth, Search Ranking, Autocomplete]
D --> D1[Trigger: Event Flow, Latency: Sub-second, Cost: Medium-High]
D --> D2[Use: Trending Content, Live Moderation, Rolling Features]
E --> E1[Trigger: Local App, Latency: Single-digit ms, Cost: None at runtime]
E --> E2[Use: AR Filters, Offline Voice, On-device Vision]
Key Points
Embedded = on-device: privacy + offline + single-digit ms, but capped by RAM, battery, OTA engineering.
Hybrid wins: production systems compose batch + streaming + online (in the style of a Lambda architecture) to maximize freshness while minimizing cost.
2. Safe Rollout Strategies
Pre-Reading Check — Safe Rollout Strategies
1. What is the key property of shadow (mirror) deployment that distinguishes it from canary?
2. A team wants to confirm a new ranking model improves NDCG and conversion. Which strategy is appropriate?
3. Why must blue-green deployments handle feature-store schemas carefully?
Shipping a new model is riskier than shipping new code. Models degrade silently — no 5xx errors, no stack traces, just worse predictions. Labels arrive with delay, so you may not know a rollout is bad for hours or days. The four rollout strategies — shadow, canary, A/B, and blue-green — exist to manage this risk. Mature teams use them in sequence.
Shadow and Mirror Traffic
Shadow mode tests a candidate under real production traffic without exposing users. Production continues to serve; the candidate receives a mirrored copy, runs inference, and logs predictions for offline comparison — outputs never reach users. Istio's VirtualService mirror directive, Seldon's shadow predictor, and KServe with mesh-level mirroring all implement this. The hard problem is suppressing side effects: if the candidate writes to a DB or calls external APIs, those must be disabled or rerouted.
Animation 1: Shadow Deployment — mirrored traffic, no user impact
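The core contract of shadow mode (the user always gets the production answer, and a candidate failure is invisible) can be sketched at the application level. In practice the mirroring usually happens asynchronously in the mesh, as noted above. `serve_with_shadow` and the log structure are hypothetical.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("shadow")


def serve_with_shadow(
    request: Dict[str, Any],
    production: Callable[[Dict[str, Any]], Any],
    candidate: Callable[[Dict[str, Any]], Any],
    comparison_log: List[Dict[str, Any]],
) -> Any:
    """Return the production prediction; run the candidate only for logging."""
    response = production(request)
    try:
        shadow_output = candidate(request)
        comparison_log.append(
            {"request": request, "production": response, "candidate": shadow_output}
        )
    except Exception:
        # A broken candidate must never affect what the user receives.
        logger.exception("shadow candidate failed; user response unaffected")
    return response


def broken_candidate(request: Dict[str, Any]) -> Any:
    raise RuntimeError("candidate crashed")


log: List[Dict[str, Any]] = []
ok = serve_with_shadow({"x": 2}, lambda r: r["x"] * 2, lambda r: r["x"] * 3, log)
still_ok = serve_with_shadow({"x": 5}, lambda r: r["x"] * 2, broken_candidate, log)
```

This synchronous version adds the candidate's latency to every request, so it is only a teaching sketch. A production setup would run the candidate off the request path.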
Canary and Progressive Rollout
A canary sends a small slice of traffic, typically 1 to 5%, to the new model and watches metrics in near real time. If signals are healthy, traffic ramps to 10%, 25%, 50%, and then 100%; if anything degrades, traffic reverts to the stable model immediately. Canary is operational risk mitigation, not a statistical experiment: its job is to catch catastrophic failures before they reach everyone. Istio supports weighted routing via VirtualService; KServe exposes a canaryTrafficPercent field on the predictor; Seldon and tools like Argo Rollouts and Flagger automate the ramp.
Monitoring during canary spans three classes: system metrics (p50/p95/p99 latency, error rates), model quality (CTR, conversion, AUC once labels arrive), and data/drift (feature distribution shifts, training-serving skew). Because labels often lag, early canary decisions lean on proxy metrics. Ensure canary traffic is representative (don't accidentally route only one region), and use sticky assignment so users don't see flipping behavior.
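Sticky assignment and a stepwise ramp can be sketched with a salted hash of the user ID. The salt, the function names, and the exact ramp list are illustrative assumptions. Tools such as Argo Rollouts or Flagger implement the ramp and the health analysis for you.

```python
import hashlib

RAMP = [1, 5, 10, 25, 50, 100]  # canary traffic percentages, smallest first


def bucket(user_id: str, salt: str = "canary-demo") -> int:
    """Map a user to a stable bucket in [0, 100) so routing is sticky."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Users whose bucket falls below the current percentage see the canary."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


def next_canary_percent(current: int, healthy: bool) -> int:
    """Advance one ramp step when healthy; drop to 0 (full rollback) otherwise."""
    if not healthy:
        return 0
    for step in RAMP:
        if step > current:
            return step
    return 100


users = [f"user-{i}" for i in range(1000)]
in_canary_at_5 = {u for u in users if route(u, 5) == "canary"}
in_canary_at_10 = {u for u in users if route(u, 10) == "canary"}
```

Because the bucket never changes, a user admitted at 5% stays in the canary as the percentage grows, so nobody flips back and forth between models during the ramp.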
A/B Testing
A/B testing is a randomized statistical experiment, not a rollout mechanism. A canary asks "is the new model breaking?"; an A/B test asks "is the new model genuinely better on business metrics?" The typical setup is a 50/50 split with sticky per-user assignment, run for a predetermined window with pre-registered primary and secondary metrics. Implementations separate assignment (consistent hashing over user_id) from routing (the mesh routes on a variant header). For ranking models, evaluate at the list level (NDCG, MAP). Watch for leakage between arms via shared caches. When the goal is to keep exploring while limiting exposure to the weaker variant, multi-armed bandits (epsilon-greedy, Thompson sampling, UCB) shift traffic dynamically and are more sample-efficient than fixed-allocation A/B.
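For a conversion-style primary metric, one common way to read out a fixed-allocation A/B test is a two-proportion z-test. The sketch below is one option among several (the chapter does not prescribe a test), and the counts are made up for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conversions_a: int, n_a: int, conversions_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical readout: control converts 1000/10000, treatment 1100/10000.
z_score, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
z_null, p_null = two_proportion_z_test(500, 10_000, 500, 10_000)
```

The test is only valid if the window and the metric were fixed in advance. Checking the p-value repeatedly and stopping at the first significant result inflates false positives.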
Blue-Green and Rollback
Blue-green maintains two complete environments, blue (current production) and green (new version), and switches all traffic at once via a single config change (an Istio VirtualService route change or a load balancer retarget). Rollback is the same change in reverse. Blue-green fits major bundled changes: a new architecture, a new feature store schema, a redesigned ranking stack. The ML-specific challenges are state and schemas: models that update online diverge between blue and green, and a new schema (user_features_v2) requires that v1 features remain computable for the rollback path, usually via versioned feature views and dual-write windows.
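The single-flip cutover and its mirror-image rollback reduce to swapping one pointer. The sketch below models that in-process. `BlueGreenRouter` and the health check are hypothetical, and real deployments flip a mesh or load balancer target rather than a Python attribute.

```python
from typing import Any, Callable, Dict

Model = Callable[[Any], Any]


class BlueGreenRouter:
    """Sends 100% of traffic to whichever environment is currently live."""

    def __init__(self, blue: Model, green: Model) -> None:
        self._envs: Dict[str, Model] = {"blue": blue, "green": green}
        self.live = "blue"

    def _idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def predict(self, request: Any) -> Any:
        return self._envs[self.live](request)

    def cut_over(self, health_check: Callable[[Model], bool]) -> bool:
        """Flip all traffic to the idle environment only if it passes the check."""
        target = self._idle()
        if not health_check(self._envs[target]):
            return False
        self.live = target
        return True

    def rollback(self) -> None:
        """Rollback is the same flip in reverse; the old environment stayed hot."""
        self.live = self._idle()


router = BlueGreenRouter(blue=lambda x: f"v1:{x}", green=lambda x: f"v2:{x}")
before = router.predict("req")
flipped = router.cut_over(lambda model: model("ping") == "v2:ping")
after = router.predict("req")
router.rollback()
restored = router.predict("req")
```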
Key Points
Blue-green: single-flip cutover for bundled changes (new schema, new architecture). Keep v1 hot for instant flip-back.
Sequence: offline evaluation → shadow → small canary → A/B 50/50 → blue-green cutover. Compose rather than choose.
3. Edge and Mobile Deployment
Pre-Reading Check — Edge and Mobile Deployment
1. Which edge framework is the native runtime for Apple's Neural Engine?
2. When should you escalate from PTQ to QAT?
3. Why does magnitude-based (unstructured) pruning rarely speed up mobile inference?
Edge deployment trades server-side flexibility for privacy, latency, and offline operation. The constraint set flips: instead of optimizing for cluster throughput, you optimize for milliseconds and milliwatts on a phone CPU or microcontroller. Three compression techniques — quantization, pruning, distillation — and four frameworks — TFLite, Core ML, ONNX Runtime, PyTorch Mobile — form the working vocabulary.
Edge Frameworks
TFLite dominates Android and microcontrollers, with mature PTQ/QAT, NNAPI and GPU delegates, and TFLM (TensorFlow Lite for Microcontrollers) for kilobyte-RAM devices; PyTorch models must first be converted, typically through ONNX or a custom converter. Core ML is Apple's native runtime and the only way to fully exploit the Neural Engine across iPhone, iPad, Mac, and Watch; models are converted with coremltools, and the Core ML runtime partitions work across the Neural Engine, GPU, and CPU. ONNX Runtime is the cross-platform option: one ONNX model runs on Android (NNAPI/XNNPACK), iOS (Core ML execution provider), desktop, and server, at the cost of opset version discipline. PyTorch Mobile and ExecuTorch minimize conversion friction for PyTorch-first teams, though production teams often still convert to TFLite or ONNX for final optimization on tiny devices.
INT8 and INT4 Quantization plus Pruning
Quantization reduces numeric precision (float32 → int8), which shrinks the model roughly 4× and accelerates inference on NNAPI, the Neural Engine, Edge TPUs, and XNNPACK. Post-Training Quantization (PTQ) uses a small calibration set, takes minutes, and needs no retraining, but it tends to lose accuracy on small models, transformers, and highly nonlinear architectures. Quantization-Aware Training (QAT) inserts simulated (fake) quantization ops during training so the weights learn robustness to int8 arithmetic; it preserves accuracy at the cost of a training pipeline. INT4 and mixed-precision schemes (for example GPTQ and AWQ) extend this to roughly 8× shrinkage for LLMs on phones.
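The arithmetic behind int8 quantization is an affine mapping from a float range onto 256 integer levels. The sketch below shows per-tensor quantization in plain Python as one simple scheme. Real toolchains add per-channel scales and calibration, and the function names here are illustrative.

```python
from typing import List, Tuple


def quantize_params(values: List[float], qmin: int = -128, qmax: int = 127) -> Tuple[float, int]:
    """Choose a scale and zero point so [min, max] maps onto [qmin, qmax]."""
    low = min(min(values), 0.0)   # the range must contain 0 so that 0.0 is representable
    high = max(max(values), 0.0)
    scale = (high - low) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - low / scale)
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> List[int]:
    """Round each value to the nearest integer level and clamp to the int8 range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer levels back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = quantize_params(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, which is where the roughly 4× size reduction comes from. The reconstruction error per weight is bounded by about one quantization step (`scale`).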
Pruning is complementary. Magnitude-based (unstructured) pruning zeroes individual small weights; it shrinks the compressed model but rarely speeds up inference unless the runtime has sparse kernels. Structured pruning removes whole channels, attention heads, or transformer blocks; it directly shrinks the graph and delivers latency wins, at higher accuracy risk.
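The difference between the two pruning styles shows up in the shape of the result. In this sketch a weight matrix is a list of rows, each row standing for one output channel. The function names and the L1-norm ranking are illustrative choices, not the only criteria in use.

```python
from typing import List

Matrix = List[List[float]]


def magnitude_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Unstructured: zero the smallest-magnitude fraction; the shape is unchanged."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]  # ties at the threshold are zeroed too
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune(weights: Matrix, keep: int) -> Matrix:
    """Structured: keep only the `keep` rows (channels) with the largest L1 norm."""
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(w) for w in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original channel order
    return [weights[i][:] for i in kept]


W = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
    [0.1, -0.8, 0.3],
]
sparse_W = magnitude_prune(W, sparsity=0.5)
small_W = structured_prune(W, keep=2)
zero_count = sum(1 for row in sparse_W for w in row if w == 0.0)
```

`sparse_W` is still 4×3, so a dense kernel does the same amount of work. `small_W` is 2×3, so the matrix multiply itself gets smaller.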
Knowledge Distillation
Distillation trains a small student to mimic a large teacher using soft logits (with a temperature) plus hard labels. The student is designed from scratch for the device budget rather than retrofitted from a too-big architecture, and often beats direct compression. It is dominant for shipping transformers to mobile: TinyBERT, DistilBERT, MobileBERT. You can pile pruning and quantization on top.
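The loss that combines soft teacher targets with hard labels can be written out directly. This sketch follows the common formulation with a temperature-scaled KL term weighted by T². The weighting `alpha` and the default temperature are illustrative assumptions, and real training would use a framework's tensor ops.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy on the label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


teacher = [2.0, 1.0, 0.1]
matched = distillation_loss(student_logits=teacher, teacher_logits=teacher, label=0)
mismatched = distillation_loss(student_logits=[0.1, 1.0, 2.0], teacher_logits=teacher, label=0)
```

When the student reproduces the teacher's logits the KL term vanishes and only the hard-label term remains. A student that ranks the classes differently pays on both terms.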
OTA Model Updates
Models age fast: drift, new content, new attacks, bug fixes. OTA updates decouple the model artifact from the app binary: the app downloads the model from a CDN or model server, verifies its signature, swaps it in atomically, and falls back to the previous version if health checks fail. Best practices are staged rollouts (a small percentage of devices first), differential updates, on-device A/B between versions, and telemetry on latency, prediction distributions, and energy reported back to detect regressions in the wild.
Recommended compression order: mobile-architected baseline (MobileNet, EfficientNet-Lite, distilled transformer) → structured pruning → distill if needed → PTQ first, QAT if accuracy demands. Always profile on the actual target device with the final framework, because desktop benchmarks systematically mislead.
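The verify, swap, and fall-back steps of an OTA update can be sketched with the standard library. Two simplifications are assumed here: an HMAC stands in for a real asymmetric signature check, and the health check runs on the staged file before the swap, so the previous model is simply left in place on failure. All names are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile
from typing import Callable


def verify(payload: bytes, signature_hex: str, key: bytes) -> bool:
    """Check the artifact against its signature (HMAC-SHA256 as a stand-in)."""
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def install_model(
    payload: bytes,
    signature_hex: str,
    key: bytes,
    live_path: str,
    health_check: Callable[[str], bool],
) -> bool:
    """Stage the new model, validate it, then swap it in atomically."""
    if not verify(payload, signature_hex, key):
        return False  # reject tampered or corrupted downloads
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, staged_path = tempfile.mkstemp(dir=directory)  # same filesystem as live_path
    with os.fdopen(fd, "wb") as staged:
        staged.write(payload)
    if not health_check(staged_path):
        os.remove(staged_path)  # fall back: the previous model stays live
        return False
    os.replace(staged_path, live_path)  # atomic rename; readers never see a partial file
    return True


def sign(payload: bytes, key: bytes) -> str:
    """Hypothetical server-side signing helper for the demo."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()
```

Staging in the same directory matters because an atomic rename is only guaranteed within one filesystem.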
| Framework | Primary Target | Strengths | Trade-offs |
|---|---|---|---|
| TFLite | Android, MCUs | Mature PTQ/QAT, NNAPI/GPU delegates, TFLM | Best for TF; PyTorch needs conversion |
| Core ML | Apple | Neural Engine, low power, auto partition | Apple platforms only; custom layers for novel ops |
| ONNX Runtime | Cross-platform | One model across Android/iOS/desktop | Opset version care, glue code |
| PyTorch Mobile | PyTorch teams | TorchScript alignment, no conversion | Less lean than TFLite for tiny devices |
flowchart LR
T[Trained Float32 Model] --> D[Knowledge Distillation: Teacher to Student]
D --> P[Structured Pruning: Remove Channels/Heads]
P --> Q{Quantization}
Q -->|PTQ first| Q1[INT8 / INT4 Weights]
Q -->|QAT if accuracy gap| Q2[Quantization-Aware Trained]
Q1 --> CV[Framework Conversion]
Q2 --> CV
CV --> A[TFLite: Android/MCU]
CV --> B[Core ML: Apple Neural Engine]
CV --> C[ONNX Runtime: Cross-platform]
A --> CDN[(Signed Model CDN)]
B --> CDN
C --> CDN
CDN --> OTA[OTA Staged Rollout: 1% to 100%]
OTA --> DEV[On-Device Atomic Swap + Fallback]
DEV --> TEL[Telemetry: latency, distributions, energy]
TEL -.->|Drift signal| T
Key Points
Framework map: TFLite → Android/MCU; Core ML → Apple Neural Engine; ONNX → cross-platform; PyTorch Mobile/ExecuTorch → PyTorch-first.
Quantization: PTQ first (minutes, no retraining); escalate to QAT only if accuracy gap on real task metrics is unacceptable. INT4 + GPTQ/AWQ for LLMs on phones.
Pruning: structured (whole channels/heads) gives real mobile latency wins; unstructured needs sparse kernels most edge runtimes lack.
Distillation: dominant transformer-to-mobile path; student designed for device budget often beats compressed teacher.
OTA: signed artifacts, staged rollouts, atomic swap with fallback, telemetry on latency + distributions + energy. Always profile on the actual device.
4. Serving Platforms
Pre-Reading Check — Serving Platforms
1. Which self-hosted platform's CRD natively treats shadow predictors, A/B routing, ensembles, and outlier detectors as first-class concepts?
2. A startup has low, spiky inference traffic and a small model (~300 MB). Which serving option is most likely cheapest?
3. Why might a team adopt multi-model serving like SageMaker MME or Triton model repository?
The serving platform is the substrate that turns a model artifact into a live prediction service: autoscaling, traffic splitting, monitoring, multi-model packing, and the glue between registry and user. The landscape splits into four categories.
Self-Hosted: KServe, BentoML, Seldon
KServe (formerly KFServing) is Kubernetes-native and serverless-style. Its InferenceService CRD packages predictor, transformer, and explainer, speaks the Open Inference Protocol across many backends (TF Serving, TorchServe, Triton, scikit-learn), and integrates with Knative for scale-to-zero and Istio for routing. Setting canaryTrafficPercent on the predictor makes a percentage-based canary a one-line change. BentoML centers on developer experience: declare a service with Python decorators, package it as a reproducible "bento," and deploy to Docker, Kubernetes, or BentoML Cloud. The result is a fast notebook-to-production path with less granular Kubernetes control. Seldon Core is the most ML-feature-rich: its SeldonDeployment resource natively understands predictors, shadow predictors, A/B routing, ensembles, explainers, and outlier detectors, and is tightly integrated with Istio.
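To make the "one-line canary" concrete, here is a sketch that builds a KServe InferenceService manifest as a Python dict and serializes it to JSON, which kubectl accepts alongside YAML. The field names follow the v1beta1 API as commonly documented and should be checked against the installed KServe version. The service name and storage URI are hypothetical.

```python
import json
from typing import Any, Dict


def inference_service(name: str, storage_uri: str, canary_percent: int) -> Dict[str, Any]:
    """Build a minimal InferenceService manifest with a canary traffic split."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # Share of traffic sent to the newest revision; the rest stays
                # on the previously rolled-out revision.
                "canaryTrafficPercent": canary_percent,
                "model": {
                    "modelFormat": {"name": "sklearn"},
                    "storageUri": storage_uri,
                },
            }
        },
    }


# Hypothetical service name and bucket path, for illustration only.
manifest = inference_service(
    name="ranker",
    storage_uri="gs://example-bucket/models/ranker/v2",
    canary_percent=10,
)
manifest_json = json.dumps(manifest, indent=2)
```

Ramping the canary then means re-applying the same manifest with a larger `canary_percent`. Nothing else in the spec changes.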
Managed: SageMaker, Vertex AI, Azure ML
Upload a model artifact, declare endpoint config, and the cloud handles autoscaling, load balancing, health checks, and (often) canary. SageMaker offers multi-model endpoints, serverless inference, and asynchronous inference modes. Vertex AI integrates tightly with Google's data stack. Azure ML's managed online endpoints have built-in blue-green and traffic-split semantics. Trade-offs: vendor lock-in, opaque pricing at high QPS, limits on customizing the request path. Often the right starting point for small teams that want to ship quickly.
Serverless: Lambda, Cloud Run, Functions
Serverless provisions compute per request, charges by execution time, and scales to zero. For low-QPS or spiky workloads it is dramatically cheaper than always-on serving. Constraints: artifact size caps (Lambda zip packages are limited to 250 MB unzipped and container images to 10 GB), cold-start latency (seconds for large model loads), no native GPU on most platforms (Cloud Run supports GPUs in limited regions), and a request-response model that suits neither batched nor streaming inference. Cloud Run is the most ML-friendly because it supports containers of several GB and its per-instance concurrency amortizes the model load cost across requests.
Multi-Model and Multi-Tenant Serving
Teams that ship 200 personalized variants — one per merchant or experiment — cannot afford one pod per variant. Multi-model serving packs many models into one serving process with on-demand loading: SageMaker Multi-Model Endpoints, Triton's model repository, BentoML's multi-runner. Trade-offs: cache management (which models stay warm?), cold-load memory pressure, noisy neighbors. Multi-tenant serving generalizes the idea across users/organizations with isolation, quotas, authentication — first-class concerns for SaaS ML products.
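The cache-management question (which models stay warm?) is often answered with least-recently-used eviction. The sketch below is a minimal in-process version. `ModelCache` and the loader are hypothetical, and it does not reproduce how SageMaker MME or Triton manage memory internally.

```python
from collections import OrderedDict
from typing import Any, Callable, List

Model = Callable[[Any], Any]


class ModelCache:
    """Keeps at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], Model], capacity: int) -> None:
        self._loader = loader
        self._capacity = capacity
        self._models: "OrderedDict[str, Model]" = OrderedDict()
        self.loads = 0  # counts cold loads, a proxy for cache misses

    def get(self, model_id: str) -> Model:
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
        else:
            if len(self._models) >= self._capacity:
                self._models.popitem(last=False)  # evict the least recently used model
            self._models[model_id] = self._loader(model_id)
            self.loads += 1
        return self._models[model_id]

    def predict(self, model_id: str, x: Any) -> Any:
        return self.get(model_id)(x)

    def warm_models(self) -> List[str]:
        """Loaded model IDs, least recently used first."""
        return list(self._models)


def load_variant(model_id: str) -> Model:
    """Hypothetical loader: returns a toy per-merchant model."""
    return lambda x: f"{model_id}:{x}"


cache = ModelCache(load_variant, capacity=2)
outputs = [cache.predict(m, 1) for m in ["m1", "m2", "m1", "m3", "m2"]]
```

In this trace `m2` is evicted when `m3` arrives and must be reloaded on its next request. That reload is the cold-load latency and memory pressure listed among the trade-offs.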
| Category | Examples | Strengths | Trade-offs |
|---|---|---|---|
| Self-hosted | KServe, BentoML, Seldon | Full control, ML features, no lock-in | Kubernetes operational burden |
| Managed | SageMaker, Vertex AI, Azure ML | Quick start, managed scaling/rollout | Lock-in, opaque cost at scale |
| Serverless | Lambda, Cloud Run, Functions | Cheap for low/spiky QPS, scale-to-zero | Size limits, cold starts, weak GPU |
| Multi-model | SageMaker MME, Triton, BentoML | Many models per pod, cost efficient | Cache complexity, noisy neighbors |
Key Points
KServe: k8s-native, Open Inference Protocol, Knative scale-to-zero, one-line canaryTrafficPercent. Default for k8s+Knative teams.
BentoML: Python-first DX, reproducible "bentos," fast notebook-to-prod. Less granular k8s control.