Chapter 11: Model Deployment Patterns: Batch, Online, and Edge

Learning Objectives

1. Inference Patterns

Pre-Reading Check — Inference Patterns

1. A retailer needs to score nightly customer lifetime value for a CRM team that consumes scores the next morning. Which inference pattern fits best?

2. Why does online inference cost the most per prediction of any pattern?

3. Which property best distinguishes streaming inference from batch and online?

A trained model creates no value until it reaches a prediction surface. The four core inference patterns — batch, online, streaming, and embedded — trade off latency, throughput, cost, and complexity. Picking the wrong pattern shows up as operational pain: a fraud system that runs nightly when it should run per-transaction, or a recommender that reloads embeddings every request.

Batch Offline Inference

Batch runs the model on large groups of inputs at scheduled intervals (hourly, nightly), triggered by a scheduler (Airflow, Argo, Prefect, cron). The classic stack uses Spark, Beam, or Dask writing predictions to BigQuery, Snowflake, S3, or Redis. Latency is minutes to hours, throughput is enormous (billions of records), and cost per prediction is the lowest of any pattern. Batch is the printing press: expensive setup, but each page is nearly free. It fits wherever downstream consumers tolerate stale predictions: nightly CLV, precomputed top-N recommendations, and historical backfills when a new model ships.
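The scheduled, chunk-at-a-time shape of a batch job can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: `toy_clv_model`, the `customers` records, and the chunk size are hypothetical stand-ins, and a real job would read from and write to a warehouse instead of in-memory lists.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` rows."""
    batch: List = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def score_batch(rows: Iterable[dict], predict: Callable, chunk_size: int = 1000) -> List[dict]:
    """Score every row, one model call per chunk, and return output records."""
    out: List[dict] = []
    for chunk in chunked(rows, chunk_size):
        scores = predict(chunk)  # one call per chunk amortizes per-call overhead
        out.extend(
            {"customer_id": r["customer_id"], "clv": s} for r, s in zip(chunk, scores)
        )
    return out


def toy_clv_model(chunk: List[dict]) -> List[float]:
    """Hypothetical stand-in for a trained CLV model."""
    return [round(r["orders"] * r["avg_order_value"] * 1.5, 2) for r in chunk]


# Hypothetical input table; a scheduler would trigger this nightly.
customers = [
    {"customer_id": i, "orders": i % 5, "avg_order_value": 20.0} for i in range(2500)
]
predictions = score_batch(customers, toy_clv_model, chunk_size=1000)
```

The chunk size is the main tuning knob here: larger chunks mean fewer model calls but more memory per call.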

Synchronous Online Inference

Online runs per request behind REST or gRPC, fetching features and returning a response in the same round trip. Typical latency budgets are 1–200ms at p95. The stack is a serving runtime (FastAPI, TF Serving, TorchServe, BentoML, Triton) plus a feature store (Feast, Tecton) or low-latency cache. Because the model sits in the critical path, capacity must cover peak QPS, GPUs may sit idle to guarantee p99 latency, and cost per prediction is the highest. Online is unavoidable when a human is waiting — fraud during card authorization, search ranking, autocomplete, chatbot replies.
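A rough sketch of the online request path, assuming an in-memory dictionary in place of a real feature store and a hand-written rule in place of a real model. `FEATURE_CACHE`, `toy_fraud_model`, and the default feature values are all hypothetical; the point is that feature lookup and scoring happen inside the same timed round trip, and that p95 is computed over many such timings.

```python
import time
from statistics import quantiles

# Hypothetical low-latency feature cache keyed by user id.
FEATURE_CACHE = {"user_1": {"txn_count_24h": 3, "avg_amount": 42.0}}
DEFAULT_FEATURES = {"txn_count_24h": 0, "avg_amount": 0.0}


def toy_fraud_model(features: dict, amount: float) -> float:
    """Hypothetical scoring rule standing in for a real model."""
    ratio = amount / max(features["avg_amount"], 1.0)
    return min(1.0, 0.1 * ratio)


def handle_request(user_id: str, amount: float) -> dict:
    """Fetch features and score within one synchronous round trip."""
    start = time.perf_counter()
    features = FEATURE_CACHE.get(user_id, DEFAULT_FEATURES)
    score = toy_fraud_model(features, amount)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"score": score, "latency_ms": latency_ms}


def p95(latencies_ms: list) -> float:
    """95th percentile of observed latencies (needs at least two samples)."""
    return quantiles(latencies_ms, n=100)[94]
```

In practice the latencies fed to `p95` would come from serving metrics rather than from inside the handler, but the budget check is the same comparison.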

Streaming and Async Inference

Streaming sits between batch and online: the model runs continuously on Kafka, Kinesis, or Pub/Sub flows processed by Flink, Spark Structured Streaming, Kafka Streams, or Beam. Predictions are emitted as another stream. Latency is sub-second to a few seconds, throughput reaches millions of events per second, and operational complexity is the highest of the four patterns: checkpointing, exactly-once semantics, backpressure, and stateful windowing all matter. Streaming shines for trending content, fraud with rolling 5-minute features, and live moderation. A lighter cousin, async inference, enqueues requests for a pool of workers and returns results through polling or callbacks, which is useful for slow models such as document summarization.
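The "rolling 5-minute features" mentioned above are an instance of stateful windowing. The sketch below keeps a per-key sliding window in memory, assuming events arrive in timestamp order; `RollingWindow` and its feature names are hypothetical, and a real stream processor would add checkpointing and late-event handling on top of this idea.

```python
from collections import deque


class RollingWindow:
    """Sliding-window count and sum over the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp: float, amount: float) -> None:
        """Record an event, then drop everything that fell out of the window."""
        self.events.append((timestamp, amount))
        self.total += amount
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Assumes in-order timestamps, so expired events are always at the left.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_amount = self.events.popleft()
            self.total -= old_amount

    def features(self) -> dict:
        """Features a fraud model could consume for this key."""
        return {"count_5m": len(self.events), "sum_5m": self.total}


window = RollingWindow(window_seconds=300)
window.add(0, 10.0)
window.add(100, 20.0)
window.add(301, 5.0)  # the event at t=0 is now older than 300s and is evicted
```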

Embedded and On-Device Inference

Embedded runs on the device generating the data: phone, camera, car, sensor. There is no network call. It unlocks three properties no server-side pattern matches: privacy (raw input never leaves the device), single-digit-ms latency (no round trip), and reliability in poor connectivity. The trade-offs are equally real: model size, RAM, battery, thermal limits, OTA-update engineering, and lost server-side observability.

Dimension  | Batch         | Online      | Streaming             | Embedded/Edge
Trigger    | Schedule      | Per request | Event flow            | Local app
Latency    | Minutes-hours | 1-200ms p95 | Seconds or sub-second | Single-digit ms
Cost/pred. | Low           | High        | Medium-High           | None at runtime
Complexity | Low-Medium    | Medium-High | High                  | High (compression + OTA)
Use case   | Nightly CLV   | Fraud auth  | Trending content      | AR filters

Most production systems combine several patterns rather than choosing one: a recommender precomputes candidates nightly (batch), freshens "recently viewed" features via streaming, and runs final ranking online at page load.

graph TD
    A[Inference Patterns] --> B[Batch Offline]
    A --> C[Online Synchronous]
    A --> D[Streaming]
    A --> E[Embedded/Edge]
    B --> B1["Trigger: Scheduler<br/>Latency: Minutes-Hours<br/>Cost: Low"]
    B --> B2["Use: Nightly CLV<br/>Precomputed Recs<br/>Backfills"]
    C --> C1["Trigger: Per Request<br/>Latency: 1-200ms p95<br/>Cost: High"]
    C --> C2["Use: Fraud Auth<br/>Search Ranking<br/>Autocomplete"]
    D --> D1["Trigger: Event Flow<br/>Latency: Sub-second<br/>Cost: Medium-High"]
    D --> D2["Use: Trending Content<br/>Live Moderation<br/>Rolling Features"]
    E --> E1["Trigger: Local App<br/>Latency: Single-digit ms<br/>Cost: None at runtime"]
    E --> E2["Use: AR Filters<br/>Offline Voice<br/>On-device Vision"]

Key Points

2. Safe Rollout Strategies

Pre-Reading Check — Safe Rollout Strategies

1. What is the key property of shadow (mirror) deployment that distinguishes it from canary?

2. A team wants to confirm a new ranking model improves NDCG and conversion. Which strategy is appropriate?

3. Why must blue-green deployments handle feature-store schemas carefully?

Shipping a new model is riskier than shipping new code. Models degrade silently — no 5xx errors, no stack traces, just worse predictions. Labels arrive with delay, so you may not know a rollout is bad for hours or days. The four rollout strategies — shadow, canary, A/B, and blue-green — exist to manage this risk. Mature teams use them in sequence.

Shadow and Mirror Traffic

Shadow mode tests a candidate under real production traffic without exposing users. Production continues to serve; the candidate receives a mirrored copy, runs inference, and logs predictions for offline comparison — outputs never reach users. Istio's VirtualService mirror directive, Seldon's shadow predictor, and KServe with mesh-level mirroring all implement this. The hard problem is suppressing side effects: if the candidate writes to a DB or calls external APIs, those must be disabled or rerouted.
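The core contract of shadow mode (the user only ever sees the live prediction, and a failing candidate cannot affect the response) can be sketched as a wrapper. This is a simplified in-process illustration: `serve_with_shadow` and `shadow_log` are hypothetical names, and a mesh-level mirror would run the candidate asynchronously in a separate service rather than inline as done here.

```python
import logging

logger = logging.getLogger("shadow")
shadow_log = []  # hypothetical stand-in for a prediction log table


def serve_with_shadow(request, live_model, shadow_model):
    """Return the live prediction; record the shadow prediction for offline comparison."""
    live_pred = live_model(request)
    try:
        # The candidate sees the same input, but its output is only logged.
        shadow_pred = shadow_model(request)
        shadow_log.append(
            {"request": request, "live": live_pred, "shadow": shadow_pred}
        )
    except Exception:
        # A broken candidate must never change what the user receives.
        logger.exception("shadow model failed; user response unaffected")
    return live_pred
```

Running the candidate inline, as in this sketch, adds its latency to the request; that is one reason production setups mirror traffic at the gateway instead.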

Animation 1: Shadow Deployment (mirrored traffic, no user impact)
The API gateway mirrors each request (mirror: true). Live Model v1 returns the response to the user; Shadow v2 only writes to the prediction log for offline comparison (example readout: score delta +0.014, drift KS 0.03, p95 latency 42ms). Side effects are disabled on the mirrored path.

Canary and Progressive Rollout

A canary sends a small slice, typically 1 to 5%, to the new model and watches metrics in near real time. If signals are healthy, traffic ramps to 10%, 25%, 50%, 100%; if anything degrades, traffic reverts to the old model immediately. Canary is operational risk mitigation, not a statistical experiment: its job is to catch catastrophic failures before they reach everyone. Istio supports weighted routing via VirtualService; KServe exposes a canaryTrafficPercent field; Seldon and tools like Argo Rollouts/Flagger automate the ramp.

Monitoring during canary spans three classes: system metrics (p50/p95/p99 latency, error rates), model quality (CTR, conversion, AUC once labels arrive), and data/drift (feature distribution shifts, training-serving skew). Because labels often lag, early canary decisions lean on proxy metrics. Ensure canary traffic is representative (don't accidentally route only one region), and use sticky assignment so users don't see flipping behavior.
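Sticky assignment and the ramp-or-rollback decision can be sketched as follows. The hashing scheme, the salt, and the `RAMP` schedule are assumptions chosen for illustration (the schedule mirrors the percentages in the text); in a real deployment the weights would live in the mesh or rollout controller.

```python
import hashlib

RAMP = [5, 10, 25, 50, 100]  # percent of traffic sent to v2 at each stage


def bucket(user_id: str, salt: str = "canary-v2") -> float:
    """Map a user to a stable value in [0, 100) so assignment is sticky."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100


def route(user_id: str, canary_percent: float) -> str:
    """Users whose bucket falls below the current percentage see v2."""
    return "v2" if bucket(user_id) < canary_percent else "v1"


def next_percent(current: int, healthy: bool) -> int:
    """Advance one ramp stage when healthy; otherwise send all traffic back to v1."""
    if not healthy:
        return 0
    later = [p for p in RAMP if p > current]
    return later[0] if later else current
```

Because each user's bucket is fixed, anyone routed to v2 at 5% stays on v2 as the percentage grows, which avoids the flipping behavior described above.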

Animation 2: Canary Rollout (5% → 25% → 50% → 100% with abort path)
Traffic to v2 over time: 5% at t=0 (SLO check), 25% at t+30m (drift check), 50% at t+2h (CTR proxy), 100% at t+24h (promoted). An SLO breach such as an error spike triggers instant rollback to v1; failure at any stage flips traffic on v2 back to 0%.

A/B Tests and Multi-Armed Bandits

A/B testing is a randomized statistical experiment, not a rollout. A canary asks "is the new model breaking?"; an A/B asks "is the new model genuinely better on business metrics?" Typical setup is 50/50 with sticky per-user assignment, run for a predetermined window with pre-registered primary and secondary metrics. Implementations separate assignment (consistent hashing over user_id) from routing (mesh routes on a variant header). For ranking, evaluate at list level (NDCG, MAP). Watch for leakage via shared caches. For exploration policies, multi-armed bandits (epsilon-greedy, Thompson sampling, UCB) shift traffic dynamically — more sample-efficient than fixed-allocation A/B.
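To make the bandit idea concrete, here is a minimal Thompson sampling sketch for binary rewards (for example, click or no click). The class name, the Beta(1, 1) prior, and the simulated conversion rates are illustrative assumptions, not a recommendation for any particular experiment.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of model variants."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # [alpha, beta] per arm, starting from a uniform Beta(1, 1) prior.
        self.stats = {arm: [1, 1] for arm in arms}

    def choose(self):
        """Sample a plausible success rate per arm and pick the best draw."""
        draws = {
            arm: self.rng.betavariate(a, b) for arm, (a, b) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        """Increment alpha on success and beta on failure."""
        self.stats[arm][0 if reward else 1] += 1


def simulate(true_rates, rounds=5000, seed=0):
    """Run the sampler against hypothetical true conversion rates."""
    rng = random.Random(seed)
    sampler = ThompsonSampler(list(true_rates), seed=seed + 1)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = sampler.choose()
        pulls[arm] += 1
        sampler.update(arm, rng.random() < true_rates[arm])
    return pulls
```

Calling `simulate({"v1": 0.05, "v2": 0.15})` should show most traffic drifting to the better variant, which is the dynamic reallocation a fixed 50/50 split lacks.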

Blue-Green and Rollback

Blue-green maintains two complete environments, blue (current production) and green (new version), and switches all traffic at once via a single config change (an Istio route flipped between DestinationRule subsets, or a load balancer retarget). Rollback is the same change in reverse. Blue-green fits major bundled changes: new architecture, new feature store schema, redesigned ranking. The ML-specific challenges are state and schemas: online-updating models diverge between blue and green, and a new schema (user_features_v2) requires that v1 features remain computable for the rollback path, usually via versioned feature views and dual-write windows.
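The all-at-once switch and its mirror-image rollback can be sketched as a single pointer flip. `BlueGreenRouter` is a hypothetical in-process stand-in for what a mesh or load balancer does; each environment is assumed to bundle its own model and feature view so that rolling back restores both together.

```python
class BlueGreenRouter:
    """Routes 100% of traffic to one of two fully provisioned environments."""

    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}
        self.active = "blue"
        self.previous = None

    def switch(self, target: str) -> None:
        """Flip all traffic to `target` in one step."""
        if target not in self.envs:
            raise ValueError(f"unknown environment: {target}")
        self.previous, self.active = self.active, target

    def rollback(self) -> None:
        """Undo the last switch by flipping back to the previous environment."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.switch(self.previous)

    def predict(self, request):
        return self.envs[self.active](request)


# Hypothetical environments: each pairs a model with its feature view version.
blue_env = lambda req: {"model": "v1", "features": "user_features_v1", "input": req}
green_env = lambda req: {"model": "v2", "features": "user_features_v2", "input": req}
router = BlueGreenRouter(blue_env, green_env)
```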

Animation 3: Blue-Green Deployment (deploy → switch → rollback)
A DestinationRule subset selector points at the active environment. BLUE: Model v1 + features v1 (predictor, feature view v1), capacity 100% ready. GREEN: Model v2 + features v2 (predictor, feature view v2), capacity 100%, on standby until activated. Switch: subset = green. Rollback: flip subset back to blue.

stateDiagram-v2
    [*] --> Shadow0: Deploy candidate
    Shadow0 --> Canary1: Pass shadow checks
    Canary1 --> Canary5: Healthy at 1%
    Canary5 --> Canary25: Healthy at 5%
    Canary25 --> Canary50: Healthy at 25%
    Canary50 --> Promoted100: Healthy at 50%
    Promoted100 --> [*]: Old version retired
    Canary1 --> Rollback: SLO breach
    Canary5 --> Rollback: SLO breach
    Canary25 --> Rollback: Drift / metric drop
    Canary50 --> Rollback: Drift / metric drop
    Rollback --> [*]: Traffic reverts to v1

Strategy   | User Impact    | Traffic Split            | Best For                       | Rollback
Shadow     | None           | 100% mirrored, 0% served | Validating safety on real data | Stop mirroring
Canary     | Small slice    | 1-5% ramping to 100%     | Catching regressions early     | Revert weights
A/B Test   | Half users     | 50/50 sticky assignment  | Measuring true uplift          | Terminate experiment
Blue-Green | All-or-nothing | 0% or 100%               | Major bundled changes          | Flip routing back

Key Points

3. Edge and Mobile Deployment

Pre-Reading Check — Edge and Mobile Deployment

1. Which edge framework is the native runtime for Apple's Neural Engine?

2. When should you escalate from PTQ to QAT?

3. Why does magnitude-based (unstructured) pruning rarely speed up mobile inference?

Edge deployment trades server-side flexibility for privacy, latency, and offline operation. The constraint set flips: instead of optimizing for cluster throughput, you optimize for milliseconds and milliwatts on a phone CPU or microcontroller. Three compression techniques — quantization, pruning, distillation — and four frameworks — TFLite, Core ML, ONNX Runtime, PyTorch Mobile — form the working vocabulary.

Edge Frameworks

TFLite dominates Android and microcontrollers, with mature PTQ/QAT, NNAPI/GPU delegates, and TFLM for kilobyte-RAM devices; to target it, PyTorch models must round-trip through ONNX or custom converters. Core ML is Apple's native runtime and the only way to fully exploit the Neural Engine across iPhone/iPad/Mac/Watch; models are converted with coremltools, and the runtime partitions work across NE/GPU/CPU. ONNX Runtime is the cross-platform option: one ONNX model runs on Android (NNAPI/XNNPACK), iOS (Core ML EP), desktop, and server, at the cost of opset version discipline. PyTorch Mobile and ExecuTorch minimize conversion friction for PyTorch-first teams; production teams often still convert to TFLite or ONNX for final optimization on tiny devices.

INT8 and INT4 Quantization plus Pruning

Quantization reduces numeric precision (float32 → int8), giving roughly 4× smaller models and faster inference on NNAPI, the Neural Engine, Edge TPUs, and XNNPACK. Post-Training Quantization (PTQ) uses a small calibration set, takes minutes, and needs no retraining, but it tends to lose accuracy on small models, transformers, and highly nonlinear architectures. Quantization-Aware Training (QAT) inserts quantization stubs during training so weights learn robustness to int8 arithmetic; it preserves accuracy at the cost of a training pipeline. INT4 and mixed precision (GPTQ, AWQ) extend this to roughly 8× shrinkage for LLMs on phones.
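The float32-to-int8 mapping can be illustrated with a small affine (scale and zero-point) quantizer. This is a per-tensor, pure-Python sketch of the general idea and not the exact scheme of any particular framework; the function names are hypothetical, and the min/max range here plays the role that a calibration set plays in PTQ.

```python
def quantize_int8(values):
    """Affine-quantize floats to int8; returns (quantized, scale, zero_point)."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)
# Each weight now needs 1 byte instead of 4, which is the ~4x size reduction.
```

The reconstruction error per weight is bounded by roughly one quantization step (`scale`), which is why wide value ranges with outliers hurt accuracy more than narrow ones.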

Pruning is complementary. Magnitude-based (unstructured) zeroes small weights — shrinks size but rarely speeds inference without sparse kernels. Structured removes whole channels, attention heads, or transformer blocks — directly shrinks the graph and delivers latency wins, at higher accuracy risk.
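The difference between the two pruning styles shows up directly in the tensor shape. In this sketch a weight matrix is a list of rows, with each row treated as one output channel; the function names and the L1-norm ranking are illustrative assumptions, since real pipelines use framework-specific tooling and fine-tune after pruning.

```python
def magnitude_prune(matrix, sparsity):
    """Zero the smallest-magnitude fraction of weights; the shape is unchanged."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]  # ties at the threshold are zeroed too
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def structured_prune(matrix, keep):
    """Keep only the `keep` rows (channels) with the largest L1 norm."""
    ranked = sorted(
        range(len(matrix)),
        key=lambda i: sum(abs(w) for w in matrix[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original channel order
    return [matrix[i][:] for i in kept]


W = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
    [0.01, 0.3, -0.8],
]
sparse = magnitude_prune(W, sparsity=0.5)  # still 4x3, half the entries are zero
small = structured_prune(W, keep=2)        # now 2x3, a genuinely smaller matmul
```

The sparse matrix still costs a full 4x3 multiply on dense kernels, whereas the structurally pruned one does less arithmetic on any hardware.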

Knowledge Distillation

Distillation trains a small student to mimic a large teacher using soft logits (with a temperature) plus hard labels. The student is designed from scratch for the device budget rather than retrofitted from a too-big architecture, and often beats direct compression. It is dominant for shipping transformers to mobile: TinyBERT, DistilBERT, MobileBERT. You can pile pruning and quantization on top.
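The "soft logits with a temperature plus hard labels" objective can be written out for a single example. This sketch follows the commonly used formulation (KL divergence between temperature-softened distributions, scaled by T², blended with cross-entropy on the true label); the temperature and `alpha` defaults are arbitrary illustrative values, not settings taken from the models named above.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend the soft-target KL term with hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    soft = sum(
        pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0
    )
    # Standard cross-entropy against the ground-truth class at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard
```

A higher temperature flattens the teacher's distribution, exposing how it ranks the wrong classes; that ranking is the extra signal the student cannot get from hard labels alone.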

OTA Model Updates

Models age fast: drift, new content, new attacks, bug fixes. OTA updates decouple the model artifact from the app binary: download from a CDN or model server, verify signatures, atomically swap, and fall back if health checks fail. Best practices include staged rollouts (a small percentage of devices first), differential updates, on-device A/B between versions, and telemetry on latency, prediction distributions, and energy use reported back to the server to detect regressions in the wild.

Recommended compression order: mobile-architected baseline (MobileNet, EfficientNet-Lite, distilled transformer) → structured pruning → distill if needed → PTQ first, QAT if accuracy demands. Always profile on the actual target device with the final framework; desktop benchmarks systematically mislead.
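The verify, stage, health-check, and atomic-swap sequence might look like the sketch below. Two simplifications are assumptions made for brevity: an HMAC with a shared key stands in for a real asymmetric signature check, and the health check runs on the staged file before the swap, so the previous model simply stays in place as the fallback. `install_model` and its parameters are hypothetical names.

```python
import hashlib
import hmac
import os
import tempfile


def verify(payload: bytes, signature_hex: str, key: bytes) -> bool:
    """Check the payload's HMAC-SHA256 tag (stand-in for a real signature)."""
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def install_model(payload, signature_hex, key, active_path, health_check) -> bool:
    """Install a downloaded model only if it verifies and passes a health check."""
    if not verify(payload, signature_hex, key):
        return False  # tampered or corrupted download: keep the current model
    directory = os.path.dirname(os.path.abspath(active_path))
    # Stage next to the active file so the final rename stays on one filesystem.
    fd, staged_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as staged:
        staged.write(payload)
    if not health_check(staged_path):
        os.remove(staged_path)  # fallback: the old model was never touched
        return False
    os.replace(staged_path, active_path)  # atomic swap
    return True
```

A reader of `active_path` sees either the old file or the new one, never a partially written model, because `os.replace` performs the swap as a single rename.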

Framework      | Primary Target | Strengths                                 | Trade-offs
TFLite         | Android, MCUs  | Mature PTQ/QAT, NNAPI/GPU delegates, TFLM | Best for TF; PyTorch needs conversion
Core ML        | Apple          | Neural Engine, low power, auto partition  | Apple platforms only; custom layers for novel ops
ONNX Runtime   | Cross-platform | One model across Android/iOS/desktop      | Opset version care, glue code
PyTorch Mobile | PyTorch teams  | TorchScript alignment, no conversion      | Less lean than TFLite for tiny devices

flowchart LR
    T[Trained Float32 Model] --> D["Knowledge Distillation<br/>Teacher to Student"]
    D --> P["Structured Pruning<br/>Remove Channels/Heads"]
    P --> Q{Quantization}
    Q -->|PTQ first| Q1[INT8 / INT4 Weights]
    Q -->|QAT if accuracy gap| Q2[Quantization-Aware Trained]
    Q1 --> CV[Framework Conversion]
    Q2 --> CV
    CV --> A["TFLite<br/>Android/MCU"]
    CV --> B["Core ML<br/>Apple Neural Engine"]
    CV --> C["ONNX Runtime<br/>Cross-platform"]
    A --> CDN[(Signed Model CDN)]
    B --> CDN
    C --> CDN
    CDN --> OTA["OTA Staged Rollout<br/>1% to 100%"]
    OTA --> DEV["On-Device<br/>Atomic Swap + Fallback"]
    DEV --> TEL["Telemetry: latency,<br/>distributions, energy"]
    TEL -.->|Drift signal| T

Key Points

4. Serving Platforms

Pre-Reading Check — Serving Platforms

1. Which self-hosted platform's CRD natively treats shadow predictors, A/B routing, ensembles, and outlier detectors as first-class concepts?

2. A startup has low, spiky inference traffic and a small model (~300 MB). Which serving option is most likely cheapest?

3. Why might a team adopt multi-model serving like SageMaker MME or Triton model repository?

The serving platform is the substrate that turns a model artifact into a live prediction service: autoscaling, traffic splitting, monitoring, multi-model packing, and the glue between registry and user. The landscape splits into four categories.

Self-Hosted: KServe, BentoML, Seldon

KServe (formerly KFServing) is Kubernetes-native and serverless-style. Its InferenceService CRD packages predictor + transformer + explainer, speaks the Open Inference Protocol across many backends (TF Serving, TorchServe, Triton, scikit-learn), and integrates with Knative for scale-to-zero and Istio for routing. Setting canaryTrafficPercent makes a percentage-based canary a one-line change. BentoML centers on developer experience: declare a service with Python decorators, package it as a reproducible "bento," and deploy to Docker, Kubernetes, or BentoML Cloud, giving a fast notebook-to-production path with less granular Kubernetes control. Seldon Core is the most ML-feature-rich: SeldonDeployment natively understands predictors, shadow predictors, A/B routing, ensembles, explainers, and outlier detectors, and is tightly integrated with Istio.

Managed: SageMaker, Vertex AI, Azure ML

Upload a model artifact, declare endpoint config, and the cloud handles autoscaling, load balancing, health checks, and (often) canary. SageMaker offers multi-model endpoints, serverless inference, and asynchronous inference modes. Vertex AI integrates tightly with Google's data stack. Azure ML's managed online endpoints have built-in blue-green and traffic-split semantics. Trade-offs: vendor lock-in, opaque pricing at high QPS, limits on customizing the request path. Often the right starting point for small teams that want to ship quickly.

Serverless: Lambda, Cloud Run, Functions

Serverless provisions compute per request, charges by execution time, and scales to zero. For low-QPS or spiky workloads it is dramatically cheaper than always-on serving. Constraints: model size caps (about 250 MB unzipped for Lambda zip packages, up to 10 GB for Lambda container images), cold-start latency (seconds for large model loads), no native GPU on most platforms (Cloud Run supports GPUs in limited regions), and a request-response model that suits neither batched nor streaming inference. Cloud Run is the most ML-friendly of these because it supports multi-GB containers and its per-instance concurrency amortizes model load cost.
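Cold-start cost is usually managed by loading the model once per instance rather than once per request. The sketch below shows that pattern with a generic `handler(event, context)` shape; the function names, the event format, and the sleep that simulates loading weights are all hypothetical.

```python
import time

_MODEL = None  # module-level cache: survives across requests while the instance is warm


def load_model():
    """Hypothetical loader; the sleep stands in for reading weights from storage."""
    time.sleep(0.05)
    return lambda x: 2 * x


def handler(event, context=None):
    """Serve one request, paying the model load cost only on a cold start."""
    global _MODEL
    cold = _MODEL is None
    if cold:
        _MODEL = load_model()
    return {"prediction": _MODEL(event["x"]), "cold_start": cold}
```

Only the first request on a fresh instance pays the load time; once the platform scales to zero and back, the next request pays it again, which is the latency spike the constraints above describe.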

Multi-Model and Multi-Tenant Serving

Teams that ship 200 personalized variants — one per merchant or experiment — cannot afford one pod per variant. Multi-model serving packs many models into one serving process with on-demand loading: SageMaker Multi-Model Endpoints, Triton's model repository, BentoML's multi-runner. Trade-offs: cache management (which models stay warm?), cold-load memory pressure, noisy neighbors. Multi-tenant serving generalizes the idea across users/organizations with isolation, quotas, authentication — first-class concerns for SaaS ML products.
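On-demand loading with a bounded warm set is essentially a least-recently-used cache keyed by model id. The sketch below illustrates the cache-management trade-off in general terms and is not how any of the named platforms implement it; `ModelCache`, its `loader` callback, and the capacity are hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps at most `capacity` models warm, evicting the least recently used."""

    def __init__(self, loader, capacity):
        self.loader = loader          # callable: model_id -> loaded model
        self.capacity = capacity
        self.models = OrderedDict()   # ordered from coldest to warmest
        self.loads = 0                # counts cold loads, a useful cost signal

    def get(self, model_id):
        if model_id in self.models:
            self.models.move_to_end(model_id)  # mark as most recently used
        else:
            if len(self.models) >= self.capacity:
                self.models.popitem(last=False)  # evict the coldest model
            self.models[model_id] = self.loader(model_id)
            self.loads += 1
        return self.models[model_id]

    def predict(self, model_id, x):
        return self.get(model_id)(x)


# Hypothetical per-merchant models: each "model" just tags its output.
cache = ModelCache(loader=lambda mid: (lambda x: f"{mid}:{x}"), capacity=2)
for merchant in ["a", "b", "a", "c"]:
    cache.predict(merchant, 1)
# "b" was least recently used when "c" arrived, so it was evicted.
```

Tracking `loads` against total requests gives a cache miss rate, which is the number to watch when deciding how many models one serving process can hold.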

Category    | Examples                      | Strengths                             | Trade-offs
Self-hosted | KServe, BentoML, Seldon       | Full control, ML features, no lock-in | Kubernetes operational burden
Managed     | SageMaker, Vertex AI, Azure ML | Quick start, managed scaling/rollout  | Lock-in, opaque cost at scale
Serverless  | Lambda, Cloud Run, Functions  | Cheap for low/spiky QPS, scale-to-zero | Size limits, cold starts, weak GPU
Multi-model | SM MME, Triton, BentoML       | Many models per pod, cost efficient   | Cache complexity, noisy neighbors

Key Points

Post-Reading Quiz

Now that you have studied the chapter, retake the same questions to measure improvement. Answers are revealed at the end.

Post-Reading — Inference Patterns

1. A retailer needs to score nightly customer lifetime value for a CRM team that consumes scores the next morning. Which inference pattern fits best?

2. Why does online inference cost the most per prediction of any pattern?

3. Which property best distinguishes streaming inference from batch and online?

Post-Reading — Safe Rollout Strategies

1. What is the key property of shadow (mirror) deployment that distinguishes it from canary?

2. A team wants to confirm a new ranking model improves NDCG and conversion. Which strategy is appropriate?

3. Why must blue-green deployments handle feature-store schemas carefully?

Post-Reading — Edge and Mobile Deployment

1. Which edge framework is the native runtime for Apple's Neural Engine?

2. When should you escalate from PTQ to QAT?

3. Why does magnitude-based (unstructured) pruning rarely speed up mobile inference?

Post-Reading — Serving Platforms

1. Which self-hosted platform's CRD natively treats shadow predictors, A/B routing, ensembles, and outlier detectors as first-class concepts?

2. A startup has low, spiky inference traffic and a small model (~300 MB). Which serving option is most likely cheapest?

3. Why might a team adopt multi-model serving like SageMaker MME or Triton model repository?

Answer Explanations