Chapter 12 — Serving Infrastructure: Latency, Throughput, and Scalability
Learning Objectives
Reason about serving SLOs in terms of tail percentiles (p95, p99) rather than averages and identify the sources of variance that produce tail latency.
Apply optimization techniques — dynamic batching, TensorRT/ONNX Runtime kernel fusion, FP16/INT8 quantization, and caching — and understand how they stack multiplicatively.
Design horizontal-scaling strategies that combine HPA, KEDA, MIG, and DCGM metrics for GPU-aware autoscaling.
Compare advanced serving topologies (cascades, Triton ensembles, vLLM, TGI, Triton + TensorRT-LLM) and select the right engine per workload.
Section 1: Latency and Throughput Fundamentals
Pre-Reading Check
1. Why are serving SLOs expressed in terms of p99 latency rather than average latency?
Averages are easier to compute on streaming metrics pipelines.
Averages mask bad experiences that are rare per request but frequent at scale, and those experiences define user perception.
p99 is required by Kubernetes HPA to function correctly.
Average latency cannot be measured on GPUs.
2. According to the operational heuristic in the chapter, sustained GPU utilization should typically be kept under what threshold to protect p99 against bursts?
10–20%
40–50%
60–70%
95–100%
3. What is the canonical mitigation for cold-start latency taxes on GPU pods?
Scale-to-zero with aggressive KEDA polling intervals.
A warm pool with minReplicaCount > 0 plus startup warmup hooks before readiness.
Disabling JIT compilation in PyTorch.
Running CPU-only inference on the first request.
Production serving moves the conversation from "does the model work?" to "does it deliver predictions fast enough, often enough, and cheaply enough to satisfy an SLO?" The single most important habit a serving engineer must develop is to stop reasoning about average latency. Averages hide catastrophes — if a model returns in 20 ms for 99 requests and 5,000 ms for the hundredth, the average looks like a healthy 70 ms, but one in every hundred users just watched a five-second spinner.
Real serving systems are evaluated on percentiles: p50 (median), p95 (bad-day), and p99 (worst case that still happens hundreds of times per hour at scale). An SLO like "p99 < 200 ms for the recommendation endpoint" is the contract between the platform and downstream consumers; when it breaks, error budgets burn and the on-call pages. Tail latency arises from sources averages cannot see: GC pauses, cold caches, head-of-line blocking, scheduler jitter, and rare input shapes outside an optimization profile. Engineering for p99 is fundamentally about reducing variance.
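The gap between the average and the tail can be checked with a few lines of arithmetic. The sketch below reuses the 99-fast, 1-slow example from the text, then adds a hypothetical 1,000-request sample (the 1.5% slow-path rate is an assumption chosen for illustration) and reads percentiles with the nearest-rank method.

```python
from statistics import mean, median

def nearest_rank_percentile(samples, p):
    """Return the p-th percentile (integer p in 1..100) by the nearest-rank method."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Example from the text: 99 fast requests and one 5,000 ms outlier.
latencies_ms = [20] * 99 + [5000]
avg_ms = mean(latencies_ms)      # looks healthy
p50_ms = median(latencies_ms)    # the typical request
worst_ms = max(latencies_ms)     # what the unlucky user saw

# Hypothetical larger sample: 1.5% of 1,000 requests hit the slow path.
sample_ms = [20] * 985 + [5000] * 15
p95_ms = nearest_rank_percentile(sample_ms, 95)   # still on the fast path
p99_ms = nearest_rank_percentile(sample_ms, 99)   # lands in the slow tail

print(f"text example: mean={avg_ms:.1f} ms, p50={p50_ms} ms, max={worst_ms} ms")
print(f"larger sample: mean={mean(sample_ms):.1f} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

In the larger sample the mean and p95 both look acceptable, while p99 exposes the five-second path. That is why the SLO is written against the tail.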
Throughput (QPS or tokens/s) and latency are coupled but distinct, governed by queueing theory: as utilization approaches 100%, queue depth grows nonlinearly and latency explodes. A widely used heuristic is to hold GPU SM utilization below 60–70% in steady state.
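The nonlinear blow-up can be illustrated with the simplest queueing model. The sketch below assumes an M/M/1 queue (one server, Poisson arrivals, exponential service times), which real GPU servers only approximate; the 10 ms service time is a hypothetical value.

```python
def mm1_mean_latency_ms(service_time_ms, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

SERVICE_TIME_MS = 10.0  # hypothetical per-request service time

for rho in (0.5, 0.7, 0.9, 0.99):
    latency = mm1_mean_latency_ms(SERVICE_TIME_MS, rho)
    print(f"utilization={rho:.2f} -> mean latency ~ {latency:.0f} ms")
```

Under these assumptions, moving from 70% to 90% utilization triples mean latency, and the last few percent before saturation cost another order of magnitude. The 60–70% heuristic leaves headroom before that knee.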
| Metric | Measures | Unit | Watch For |
| --- | --- | --- | --- |
| p50 | Median latency | ms | Baseline experience |
| p95 | 95th percentile | ms | Bad-day experience |
| p99 | 99th percentile | ms | SLO contract metric |
| Throughput | Sustained request rate | QPS, tok/s | Capacity ceiling |
| SM util | SMs busy | % | Stay under 60–70% |
| Queue depth | Pending requests | count | Leading indicator of p99 |
Fresh replicas pay "first-time" costs — weight loading, CUDA context creation, JIT specialization, page-cache warming — making their first dozen requests 5–10× slower than steady state. The mitigation is the warm pool: minReplicaCount > 0, plus startup warmup hooks that issue synthetic requests across all major input shapes before readiness passes. Triton supports this natively via its model-warmup config.
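The warmup-before-readiness pattern can be sketched in plain Python. Everything here is hypothetical scaffolding (the `WarmableModel` class, `fake_infer`, and the shape list are illustrative names, not part of Triton or any serving framework); in a real deployment the `ready` flag would back the readiness probe.

```python
import time

class WarmableModel:
    """Hypothetical stand-in for a model server that gates readiness on warmup."""

    def __init__(self, infer_fn, warmup_shapes):
        self._infer = infer_fn
        self._warmup_shapes = warmup_shapes
        self.ready = False        # what a readiness probe would report
        self.warmup_ms = {}       # time spent warming each input shape

    def warm_up(self, rounds=3):
        # Issue synthetic requests for every major input shape before going ready.
        for shape in self._warmup_shapes:
            start = time.perf_counter()
            for _ in range(rounds):
                self._infer([0] * shape)
            self.warmup_ms[shape] = (time.perf_counter() - start) * 1000
        self.ready = True

    def handle(self, tokens):
        if not self.ready:
            raise RuntimeError("replica not ready: warmup has not completed")
        return self._infer(tokens)

def fake_infer(tokens):
    # Placeholder for the real forward pass.
    return sum(tokens)

model = WarmableModel(fake_infer, warmup_shapes=(16, 32, 64, 128))
model.warm_up()
print(model.ready, sorted(model.warmup_ms))
```

The design choice is that first-time costs are paid by synthetic traffic, so no user request lands on a replica that has not yet exercised every shape it will serve.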
Profiling is non-negotiable. A common finding is that the GPU kernel itself takes 8 ms while Python preprocessing eats 40 ms and JSON serialization 30 ms — meaning kernel work buys almost nothing until the surrounding pipeline is fixed. Nsight Systems, Triton's perf_analyzer, and the PyTorch profiler show where the milliseconds actually live.
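The point about kernel work buying almost nothing is Amdahl's law applied to a request. The sketch below uses the three stage timings quoted above and a hypothetical 4× speedup applied to one stage at a time.

```python
# Stage timings from the example in the text (milliseconds).
stages_ms = {"preprocess": 40.0, "gpu_kernel": 8.0, "serialize": 30.0}

def end_to_end_speedup(stages, stage, stage_speedup):
    """Amdahl-style speedup of the whole request when one stage gets faster."""
    before = sum(stages.values())
    after = before - stages[stage] + stages[stage] / stage_speedup
    return before / after

total_ms = sum(stages_ms.values())
kernel_share = stages_ms["gpu_kernel"] / total_ms

# Hypothetical 4x improvement applied to one stage at a time.
kernel_4x = end_to_end_speedup(stages_ms, "gpu_kernel", 4.0)
preprocess_4x = end_to_end_speedup(stages_ms, "preprocess", 4.0)

print(f"total={total_ms:.0f} ms, kernel share={kernel_share:.0%}")
print(f"4x faster kernel     -> {kernel_4x:.2f}x end to end")
print(f"4x faster preprocess -> {preprocess_4x:.2f}x end to end")
```

With these numbers a 4× faster kernel improves the request by under 10%, while the same effort spent on preprocessing yields more than 1.6×.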
Figure 12.1: Latency Budget Across a Serving Pipeline
flowchart LR
A[Client Request] -->|Network 5 ms| B[API Gateway]
B -->|Routing 2 ms| C[Queue]
C -->|Wait 3 ms| D[Preprocessing Tokenize/Decode]
D -->|CPU 10 ms| E[H2D Copy 1 ms]
E --> F[GPU Kernel 8 ms]
F -->|D2H 1 ms| G[Postprocessing JSON serialize]
G -->|6 ms| H[Response]
H -->|Network 5 ms| I[Client]
style F fill:#1f6feb,color:#fff
style D fill:#d29922,color:#fff
style G fill:#d29922,color:#fff
Key Points
Average latency lies; the operational truth lives in p95 and p99 percentiles.
SLOs ("p99 < 200 ms") are contracts that govern error budgets and paging.
Throughput and latency interact via queueing theory; keep SM utilization <60–70%.
Cold starts can run 5–10× slower; warm pools + warmup hooks are the mitigation.
Profile every stage — gateway, queue, preprocess, kernel, postprocess — before optimizing.
Post-Reading Check
1. Why are serving SLOs expressed in terms of p99 latency rather than average latency?
Averages are easier to compute on streaming metrics pipelines.
Averages mask bad experiences that are rare per request but frequent at scale, and those experiences define user perception.
p99 is required by Kubernetes HPA to function correctly.
Average latency cannot be measured on GPUs.
2. According to the operational heuristic in the chapter, sustained GPU utilization should typically be kept under what threshold to protect p99 against bursts?
10–20%
40–50%
60–70%
95–100%
3. What is the canonical mitigation for cold-start latency taxes on GPU pods?
Scale-to-zero with aggressive KEDA polling intervals.
A warm pool with minReplicaCount > 0 plus startup warmup hooks before readiness.
Disabling JIT compilation in PyTorch.
Running CPU-only inference on the first request.
Section 2: Optimization Techniques
Pre-Reading Check
4. What single tuning knob most directly determines the latency/throughput tradeoff in Triton's dynamic batching?
instance_group.count
max_queue_delay_microseconds
preferred_batch_size only
model_repository path
5. Which statement about INT8 quantization is correct?
INT8 PTQ delivers 2–4× speedup over FP32 but may lose 1–2 accuracy points; QAT typically recovers most of that gap.
INT8 only works on CPUs; GPUs require FP32 weights.
Quantization increases tail latency because of additional dequantization overhead.
6. What architectural benefit does TensorRT provide over PyTorch eager execution?
It eliminates the need for a GPU entirely.
It runs Python preprocessing on a separate machine.
It compresses a graph via layer fusion, constant folding, and tactic selection, typically 3–10× lower latency.
It guarantees ONNX-Runtime compatibility for all CPU EPs.
Reducing per-request work is the highest-leverage optimization available; every saved millisecond is one less millisecond in the queue, which compounds into lower p99 and higher sustainable QPS. Four families of techniques dominate.
Dynamic Batching and Request Bucketing
GPUs are massively parallel; sending them a single 1-row matrix is like hiring a thousand cooks for one omelette. Dynamic batching collects requests that arrive within a short window (typically 1–2 ms) into a single batch dispatch, amortizing per-call overhead. Triton exposes this through the dynamic_batching settings in a model's config.pbtxt.
The max_queue_delay_microseconds knob is the central tuning dial: too large and unlucky requests sit too long; too small and most batches are size 1. The sweet spot keeps delay well below compute time (e.g., 1–2 ms of delay for a 15–50 ms model). Done correctly, dynamic batching delivers 2–10× throughput on transformer/CV workloads with minimal latency cost.
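The effect of the queue-delay knob can be seen in a small simulation. This is a simplified model rather than Triton's scheduler: it assumes a batch is dispatched when it is full or when its oldest request has waited the maximum delay, and it ignores whether the GPU is busy. The function name and arrival times are illustrative.

```python
def form_batches(arrivals_ms, max_queue_delay_ms, max_batch_size):
    """Group arrival times into batches.

    A batch is dispatched when it is full or when its oldest request has
    waited max_queue_delay_ms, whichever comes first.
    Returns a list of (dispatch_time_ms, [arrival_times]).
    """
    batches, current = [], []
    for t in sorted(arrivals_ms):
        # Flush the open batch if its deadline passed before this arrival.
        if current and t > current[0] + max_queue_delay_ms:
            batches.append((current[0] + max_queue_delay_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))   # a full batch dispatches immediately
            current = []
    if current:
        batches.append((current[0] + max_queue_delay_ms, current))
    return batches

arrivals = [0.0, 0.4, 1.1, 1.8]   # milliseconds, illustrative

for delay in (2.0, 0.5, 0.1):
    sizes = [len(batch) for _, batch in form_batches(arrivals, delay, max_batch_size=8)]
    print(f"delay={delay} ms -> batch sizes {sizes}")
```

With a 2 ms delay all four requests share one dispatch; at 0.1 ms every batch has size 1, which is the "too small" failure mode described above.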
Request bucketing pads variable inputs to fixed sizes (16, 32, 64, 128) so TensorRT can build optimization profiles per bucket; rare lengths no longer spike p99.
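A minimal sketch of bucketing, assuming token-ID lists as input and right-padding; the `PAD_ID` value and function names are hypothetical.

```python
from bisect import bisect_left

BUCKETS = (16, 32, 64, 128)   # bucket sizes named in the text
PAD_ID = 0                    # hypothetical padding token id

def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that fits `length`."""
    i = bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]

def pad_to_bucket(token_ids, buckets=BUCKETS, pad_id=PAD_ID):
    """Right-pad a token list to its bucket size."""
    size = pick_bucket(len(token_ids), buckets)
    return token_ids + [pad_id] * (size - len(token_ids))

padded = pad_to_bucket(list(range(1, 41)))   # 40 tokens fall into the 64 bucket
print(len(padded))
```

Inputs longer than the largest bucket are rejected here; a production system would need an explicit policy for them, such as truncation or a fallback path.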
Animation A1: Dynamic Batching — Requests Accumulate, GPU Fires Once
Individual requests arrive within the queue-delay window, accumulate into the batch buffer, dispatch to the GPU as one call, and responses fan out.
TensorRT and ONNX Runtime
TensorRT takes a frozen graph and produces a hardware-specialized engine through layer fusion (Conv+Bias+ReLU → 1 kernel), constant folding, layout transformation, and tactic selection (benchmarking multiple GEMM implementations). A 150-node BERT graph compresses to ~20–30 fused ops — a 3–10× latency reduction over PyTorch eager. ONNX Runtime is the portable alternative: an Execution Provider abstraction lets the same model target CPU, CUDA, TensorRT, or DirectML, with constant folding, node fusion, and shape inference at load time.
Quantization
FP16 uses Tensor Cores for 1.5–2× throughput with negligible accuracy loss. INT8 yields 2–4× over FP32. PTQ (post-training, with calibration data) is quick but may lose 1–2 points; QAT (quantization-aware training, with fake-quant nodes in the forward pass) typically recovers most of that gap.
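The arithmetic behind INT8 can be sketched with symmetric per-tensor quantization and max-abs calibration. This is one simple scheme among several, shown in pure Python for clarity; real toolchains also calibrate activations and often quantize per channel. The weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization with max-abs calibration."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate real values."""
    return [x * scale for x in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.0]   # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q, f"scale={scale:.4f}", f"max_err={max_err:.5f}")
```

The rounding error per value is bounded by half a quantization step. Accuracy loss comes from how those small errors accumulate through the network, which is the part QAT trains the model to tolerate.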
| Optimization | Speedup | Accuracy | When |
| --- | --- | --- | --- |
| Dynamic batching | 2–10× throughput | None | Always for transformer/CV |
| TensorRT kernel fusion | 3–10× latency | None | Stable NVIDIA prod |
| FP16 | 1.5–2× throughput | Negligible | Default for modern GPUs |
| INT8 PTQ | 2–4× over FP32 | 0–2 pts loss | Calibration data available |
| INT8 QAT | 2–4× over FP32 | Near-zero loss | When PTQ accuracy too low |
| Caching | ∞ on hits | None | Repeated inputs |
Caching
The fastest inference is the one you never run. A two-tier cache (in-process LRU + Redis) can absorb a large fraction of traffic before it ever reaches the GPU. Cache-key design is the engineering judgment — content hash for embeddings, (user_id, candidate_id, model_version) tuples for ranking.
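A minimal sketch of the two-tier pattern follows. A plain dict stands in for Redis so the example is self-contained, and the class and function names are hypothetical. The two key builders follow the key designs suggested above.

```python
import hashlib
from collections import OrderedDict

class TwoTierCache:
    """In-process LRU in front of a shared store (a dict stands in for Redis here)."""

    def __init__(self, capacity, remote=None):
        self.capacity = capacity
        self.local = OrderedDict()
        self.remote = {} if remote is None else remote

    def get_or_compute(self, key, compute):
        if key in self.local:                  # tier 1 hit
            self.local.move_to_end(key)
            return self.local[key]
        if key in self.remote:                 # tier 2 hit
            value = self.remote[key]
        else:                                  # miss: run inference once
            value = compute()
            self.remote[key] = value
        self.local[key] = value
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)     # evict the least recently used entry
        return value

def embedding_key(text, model_version):
    """Content-hash key for embedding requests."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model_version}:{digest}"

def ranking_key(user_id, candidate_id, model_version):
    """Tuple key for ranking requests."""
    return (user_id, candidate_id, model_version)

calls = []

def fake_embed():
    calls.append(1)                            # count how often "inference" runs
    return [0.1, 0.2]

cache = TwoTierCache(capacity=2)
key = embedding_key("hello", "v1")
cache.get_or_compute(key, fake_embed)
cache.get_or_compute(key, fake_embed)
print(f"inference ran {len(calls)} time(s) for 2 requests")
```

Including the model version in every key matters: without it, a model rollout would keep serving results computed by the previous version.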
Figure 12.2: Dynamic Batching Sequence
sequenceDiagram
participant R1 as Request 1
participant R2 as Request 2
participant R3 as Request 3
participant R4 as Request 4
participant B as Batch Accumulator
participant G as GPU
R1->>B: arrive t=0
R2->>B: arrive t=0.4
R3->>B: arrive t=1.1
R4->>B: arrive t=1.8
Note over B: queue_delay=2 ms; batch=4
B->>G: dispatch batch[1..4]
G-->>B: results (15 ms)
B-->>R1: response
B-->>R2: response
B-->>R3: response
B-->>R4: response
Key Points
Dynamic batching trades 1–2 ms of queue delay for 2–10× throughput.
Request bucketing eliminates rare-shape p99 spikes by padding to fixed sizes.
TensorRT fuses, folds, and selects tactics — 3–10× over PyTorch eager.
FP16 ≈ free 1.5–2×; INT8 (PTQ or QAT) ≈ 2–4× over FP32.
Optimizations stack multiplicatively; combined 20–100× over naive Python+FP32 baselines.
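As a rough check on the multiplicative claim, the sketch below multiplies one illustrative value from each range quoted in this section. The specific picks are assumptions, and treating throughput and latency gains as independent factors is an upper-bound simplification; measured gains usually overlap.

```python
from math import prod

# Illustrative picks from inside the ranges quoted in this section (assumptions).
picked = {"dynamic_batching": 4.0, "tensorrt_fusion": 3.0, "fp16": 2.0}

combined = prod(picked.values())
print(f"combined speedup if the gains were independent: {combined:.0f}x")
```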
Post-Reading Check
4. What single tuning knob most directly determines the latency/throughput tradeoff in Triton's dynamic batching?
instance_group.count
max_queue_delay_microseconds
preferred_batch_size only
model_repository path
5. Which statement about INT8 quantization is correct?
INT8 PTQ delivers 2–4× speedup over FP32 but may lose 1–2 accuracy points; QAT typically recovers most of that gap.
INT8 only works on CPUs; GPUs require FP32 weights.
Quantization increases tail latency because of additional dequantization overhead.
6. What architectural benefit does TensorRT provide over PyTorch eager execution?
It eliminates the need for a GPU entirely.
It runs Python preprocessing on a separate machine.
It compresses a graph via layer fusion, constant folding, and tactic selection, typically 3–10× lower latency.
It guarantees ONNX-Runtime compatibility for all CPU EPs.
Section 3: Scaling Strategies
Pre-Reading Check
7. What is the principal advantage of KEDA over vanilla HPA for GPU serving?
KEDA replaces all of Kubernetes scheduling.
KEDA adds event-source awareness (Kafka, SQS, Prometheus) and first-class scale-to-zero on top of HPA-like behavior.
KEDA forces all pods to run on CPU instances.
KEDA disables stabilization windows.
8. Why are short scale-up and long scale-down stabilization windows recommended?
Kubernetes requires symmetric windows.
To respond quickly to spikes while avoiding thrashing — cold starts make tearing down GPU pods expensive if they're needed again soon.
Long scale-down reduces cluster cost during off-peak hours by removing pods faster.
Short scale-up is mandated by the DCGM Exporter.
9. What does NVIDIA Multi-Instance GPU (MIG) provide that GPU time-slicing does not?
Higher peak FLOPs per slice.
Hardware-level isolation with dedicated compute, memory bandwidth, and L2 cache per instance.
Automatic quantization to INT8.
Free network bandwidth between slices.
Once a single replica is tuned, the next problem is replicating it. Horizontal scaling expands capacity by adding pods; vertical scaling expands by using bigger GPUs or partitioning existing ones.
Horizontal Autoscaling: HPA + KEDA
Kubernetes' HPA scales replicas by metrics — CPU/memory by default, GPU and application metrics in practice. KEDA extends this with event-source-aware scaling (Kafka, SQS, Redis, Prometheus, dozens of others) and first-class scale-to-zero. The best-practice pattern combines them: KEDA drives event-driven scaling from queue depth and Prometheus signals; HPA-style behavior reacts to GPU and latency metrics.
Raw GPU utilization alone is a poor signal — combine it with QPS per pod, queue depth, and SLO breach rate. Metrics flow from NVIDIA's DCGM Exporter (DCGM_FI_PROF_GR_ENGINE_ACTIVE, SM/memory utilization per GPU or MIG slice) into Prometheus, then into HPA via the Prometheus Adapter or into KEDA via its Prometheus scaler.
Stabilization windows matter enormously: scale-up fast (30–60 s) so the system responds to spikes; scale-down slow (300–600 s) so it does not thrash by tearing down GPU pods needed two minutes later.
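The asymmetry can be sketched with the HPA replica formula plus a scale-down window. This is a simplified model, not the controller's implementation: it applies scale-up immediately and, for scale-down, keeps the highest recommendation seen inside the window. The metric values and the `ScaleDownStabilizer` name are illustrative.

```python
import math
from collections import deque

def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA core formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

class ScaleDownStabilizer:
    """Keep the highest recommendation seen in the window, so scale-down is slow."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.history = deque()    # (timestamp_s, recommendation)

    def recommend(self, now_s, recommendation):
        self.history.append((now_s, recommendation))
        # Forget recommendations that are older than the window.
        while self.history[0][0] < now_s - self.window_s:
            self.history.popleft()
        return max(r for _, r in self.history)

TARGET_QUEUE_DEPTH = 10           # hypothetical per-pod target
stabilizer = ScaleDownStabilizer(window_s=300)

# t=0 s: burst, 2 pods each seeing queue depth 25 -> scale up at once.
at_burst = stabilizer.recommend(0, desired_replicas(2, 25, TARGET_QUEUE_DEPTH))
# t=60 s: load has dropped, but the window still remembers the burst.
at_60s = stabilizer.recommend(60, desired_replicas(5, 4, TARGET_QUEUE_DEPTH))
# t=400 s: the burst has aged out of the window, so scale-down proceeds.
at_400s = stabilizer.recommend(400, desired_replicas(5, 4, TARGET_QUEUE_DEPTH))

print(at_burst, at_60s, at_400s)
```

Holding five replicas through the quiet minute costs some idle GPU time, but it avoids paying cold-start latency if the burst returns.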
CPU / queue-depth gauge climbs; HPA scales the deployment from one pod up to four. Load subsides; HPA scales back down after the stabilization window.
Figure 12.3: HPA + KEDA Signal Flow
flowchart TD
DCGM[DCGM Exporter SM, mem util] --> PA[Prometheus Adapter]
PROM[Prometheus p95, QPS] --> KEDA[KEDA Operator]
KAFKA[Kafka depth] --> KEDA
PA --> HPA[HPA]
KEDA --> HPA
HPA -->|up 30-60s, down 300-600s| DEP[Inference Deployment]
DEP --> P1[Pod 1 GPU]
DEP --> P2[Pod 2 GPU]
DEP --> P3[Pod N GPU]
P1 -.metrics.-> DCGM
P2 -.metrics.-> DCGM
P3 -.metrics.-> DCGM
style HPA fill:#1f6feb,color:#fff
style KEDA fill:#238636,color:#fff
style DCGM fill:#76b900,color:#000
MIG and GPU Sharing
A full A100/H100 is overkill for a 7B quantized model using 12 GB. MIG partitions one physical GPU into multiple isolated instances, each with dedicated compute, memory bandwidth, and L2 cache. An A100 40 GB can be split into 7× 1g.5gb, or into mixed sizes such as 2× 1g.5gb + 1× 2g.10gb + 1× 3g.20gb (the layout in Figure 12.4); the chosen profiles must fit within the GPU's seven compute slices and its memory. On Kubernetes with the NVIDIA GPU Operator, MIG is applied via node labels (nvidia.com/mig.config=all-1g.5gb) and consumed via resource requests (nvidia.com/mig-1g.5gb: 1). Time-slicing shares one GPU without hardware isolation; it is simpler to configure but offers no QoS, so MIG is the right choice when SLOs matter.
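A coarse sanity check for a proposed MIG layout can be written as a budget sum. The sketch assumes an A100 40 GB exposes 7 compute slices and 40 GB of memory. Real MIG also enforces placement rules that this check ignores, so the driver's list of supported configurations remains the authority. The function names are hypothetical.

```python
import re

# Assumed budget for an A100 40 GB: 7 compute slices and 40 GB of memory.
A100_40GB = {"compute_slices": 7, "memory_gb": 40}

def parse_profile(name):
    """Parse a MIG profile name such as '2g.10gb' into (compute_slices, memory_gb)."""
    m = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if not m:
        raise ValueError(f"unrecognized MIG profile: {name}")
    return int(m.group(1)), int(m.group(2))

def fits(layout, gpu=A100_40GB):
    """Coarse check: do the profiles fit the GPU's slice and memory budget?"""
    compute = sum(parse_profile(p)[0] for p in layout)
    memory = sum(parse_profile(p)[1] for p in layout)
    return compute <= gpu["compute_slices"] and memory <= gpu["memory_gb"]

uniform = ["1g.5gb"] * 7
mixed = ["1g.5gb", "1g.5gb", "2g.10gb", "3g.20gb"]      # layout drawn in Figure 12.4
oversubscribed = ["3g.20gb", "3g.20gb", "2g.10gb"]      # 8 compute slices, 50 GB

print(fits(uniform), fits(mixed), fits(oversubscribed))
```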
Load Balancing and Multi-Region
Round-robin is the lazy default and usually wrong. Use least-loaded routing, session affinity for streaming LLM responses, and shadow routing for offline candidate comparison. Multi-region deployments place replicas in multiple cloud regions with DNS or Anycast routing — at the cost of cross-region weight replication and feature-store consistency.
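Least-loaded routing with session affinity can be sketched as follows. The router tracks in-flight requests per replica and pins a session to the replica that served its first request. The class and replica names are hypothetical, and a production balancer would also handle replica failure and session expiry.

```python
class LeastLoadedRouter:
    """Route to the replica with the fewest in-flight requests; pin sessions."""

    def __init__(self, replicas):
        self.in_flight = {name: 0 for name in replicas}
        self.sessions = {}    # session_id -> replica name

    def pick(self, session_id=None):
        if session_id is not None and session_id in self.sessions:
            name = self.sessions[session_id]           # affinity wins over load
        else:
            name = min(self.in_flight, key=self.in_flight.get)
            if session_id is not None:
                self.sessions[session_id] = name
        self.in_flight[name] += 1
        return name

    def done(self, name):
        """Call when a request finishes so the replica's load drops."""
        self.in_flight[name] -= 1

router = LeastLoadedRouter(["pod-a", "pod-b", "pod-c"])
first_three = [router.pick() for _ in range(3)]
router.done("pod-b")                   # pod-b finishes early
next_pick = router.pick()              # goes to the now least-loaded replica
print(first_three, next_pick)
```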
Figure 12.4: MIG Partitioning of an A100
graph TD
A[Physical A100 40 GB HBM2, 108 SMs]
A --> S1[1g.5gb 5 GB]
A --> S2[1g.5gb 5 GB]
A --> S3[2g.10gb 10 GB]
A --> S4[3g.20gb 20 GB]
S1 --> POD1[Pod A small CV]
S2 --> POD2[Pod B small CV]
S3 --> POD3[Pod C medium NLP]
S4 --> POD4[Pod D 7B quantized LLM]
style A fill:#76b900,color:#000
style S1 fill:#1f6feb,color:#fff
style S2 fill:#1f6feb,color:#fff
style S3 fill:#1f6feb,color:#fff
style S4 fill:#1f6feb,color:#fff
Key Points
Combine KEDA (event-driven, scale-to-zero) with HPA (steady-state metric scaling).
DCGM Exporter is the canonical source of GPU metrics into Prometheus.
Scale up fast (30–60 s), scale down slow (300–600 s) to avoid thrash.
MIG converts one A100/H100 into right-sized, isolated, schedulable slices.
Use model-aware routing (least-loaded, session affinity, shadow); multi-region for global SLOs.
Post-Reading Check
7. What is the principal advantage of KEDA over vanilla HPA for GPU serving?
KEDA replaces all of Kubernetes scheduling.
KEDA adds event-source awareness (Kafka, SQS, Prometheus) and first-class scale-to-zero on top of HPA-like behavior.
KEDA forces all pods to run on CPU instances.
KEDA disables stabilization windows.
8. Why are short scale-up and long scale-down stabilization windows recommended?
Kubernetes requires symmetric windows.
To respond quickly to spikes while avoiding thrashing — cold starts make tearing down GPU pods expensive if they're needed again soon.
Long scale-down reduces cluster cost during off-peak hours by removing pods faster.
Short scale-up is mandated by the DCGM Exporter.
9. What does NVIDIA Multi-Instance GPU (MIG) provide that GPU time-slicing does not?
Higher peak FLOPs per slice.
Hardware-level isolation with dedicated compute, memory bandwidth, and L2 cache per instance.
Automatic quantization to INT8.
Free network bandwidth between slices.
Section 4: Advanced Serving Topologies
Pre-Reading Check
10. Why is a cascade often an effective p99 optimization?
It runs every model on every input in parallel.
A cheap fast model handles the easy majority; only difficult inputs escalate to the expensive model, dramatically lowering average compute per request.
It eliminates the GPU entirely.
Cascades always have higher accuracy than a single model.
11. What two breakthrough techniques distinguish vLLM as an LLM serving engine?
Static batching plus FP32 weights.
PagedAttention (paged KV-cache) and continuous batching (per-token-step scheduling).
Single-GPU dispatch and OS process isolation.
DAG ensembles and model warmup hooks.
12. Which engine is the best choice when you need maximum throughput across many concurrent LLM users with minimal vendor lock-in?
Hugging Face TGI
Triton + TensorRT-LLM (compiled engines)
vLLM
Stock PyTorch eager mode
One-model-per-endpoint is increasingly rare. Modern topologies chain models, route through stages, and run specialized engines for specialized workloads — especially LLMs.
Ensembles, Cascades, Sidecars
Ensembles combine predictions (average, vote, stack). Cascades chain models sequentially: a 2 ms keyword filter handles obvious cases, a 15 ms CNN handles ambiguous images, only the hardest cases escalate to a 100 ms multimodal foundation model — slashing average compute without losing accuracy on hard inputs.
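A cascade reduces to a loop that stops at the first confident stage. In the sketch below the stage costs match the 2 ms, 15 ms, and 100 ms figures above. The predictor functions, the 0.9 confidence threshold, and the traffic fractions are hypothetical placeholders.

```python
def run_cascade(x, stages, threshold=0.9):
    """Run stages cheapest-first; stop at the first confident answer."""
    spent_ms = 0.0
    for name, cost_ms, predict in stages:
        spent_ms += cost_ms
        label, confidence = predict(x)
        if confidence >= threshold:
            break
    return label, name, spent_ms

def keyword_filter(x):        # hypothetical 2 ms stage
    return ("blocked", 0.99) if "forbidden" in x["text"] else ("unsure", 0.0)

def small_cnn(x):             # hypothetical 15 ms stage
    return ("ok", x["cnn_confidence"])

def foundation_model(x):      # hypothetical 100 ms stage, always answers
    return ("ok", 1.0)

STAGES = [
    ("keyword", 2.0, keyword_filter),
    ("cnn", 15.0, small_cnn),
    ("foundation", 100.0, foundation_model),
]

# Assumed fraction of traffic that reaches each stage.
reach = [1.0, 0.30, 0.05]
expected_ms = sum(cost * frac for (_, cost, _), frac in zip(STAGES, reach))

print(run_cascade({"text": "hello", "cnn_confidence": 0.95}, STAGES))
print(f"expected compute per request: {expected_ms:.1f} ms")
```

Under the assumed traffic split, the expected compute per request is about 11.5 ms, compared with 100 ms if every input went straight to the largest model. The hardest inputs pay the cost of all three stages.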
Triton ensembles are first-class: a config.pbtxt DAG chains tokenizer (Python backend) → BERT encoder (TensorRT) → classification head (ORT) → label mapper (Python). All four run inside Triton with no network hops between stages.
Embedding sidecars in ranking/search systems produce embeddings for new entities; cached results in Redis or a vector store mean the hot path rarely runs the sidecar, but when it does, latency is predictable.
Figure 12.5: Triton Ensemble Pipeline
flowchart LR
REQ[Client Request raw text] --> PRE
subgraph TRITON[Triton]
PRE[Tokenizer Python backend] --> ENC[BERT Encoder TensorRT]
ENC --> HEAD[Classifier Head ORT]
HEAD --> POST[Label Mapper Python]
end
POST --> RESP[Response label + score]
style PRE fill:#d29922,color:#fff
style ENC fill:#1f6feb,color:#fff
style HEAD fill:#1f6feb,color:#fff
style POST fill:#d29922,color:#fff
LLM Serving: vLLM, TGI, Triton + TensorRT-LLM
LLM serving is different: generation is autoregressive, sequence lengths vary wildly, and the KV cache dominates GPU memory. vLLM introduced two breakthroughs:
PagedAttention: treat the KV cache like an OS virtual-memory paged system — fixed-size pages, reusable across requests, finished sequences freeing their pages. Reduces fragmentation 19–27% versus contiguous layouts.
Continuous batching: new requests join the active batch at every decoding step; completed sequences free resources immediately. Keeps GPU utilization at 85–92% even under heterogeneous load.
The result: on LLaMA-2-7B at 100 concurrent requests, vLLM hits ~15,243 tok/s vs TGI's ~4,156 tok/s — roughly 3.7× higher throughput, widening to ~24× at extreme concurrency.
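The scheduling idea behind continuous batching can be illustrated by counting decode steps. This toy model ignores prefill cost and memory limits, and it assumes every decode step costs the same regardless of batch occupancy; the sequence lengths and slot count are hypothetical.

```python
def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """At every decode step, finished sequences leave and waiting ones join."""
    waiting = list(lengths)
    active = []
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.pop(0))              # fill any free slot
        steps += 1                                     # one decode step for the batch
        active = [n - 1 for n in active if n > 1]      # drop sequences that just finished
    return steps

# Hypothetical output lengths (tokens): a few long generations among short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]

static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)
```

With static batching the short sequences wait for the long one in their batch. With per-step scheduling the second long sequence starts as soon as a slot frees, so the same work finishes in roughly half the steps.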
TGI wins on time-to-first-token (1.3–2× lower than vLLM at low concurrency) and on Hugging Face ecosystem integration (safetensors, HF Hub, auth, observability). Triton + TensorRT-LLM compiles weights/graphs into a hardware-specific engine, yielding slightly higher peak throughput than vLLM on H100 — at the cost of tens of minutes of cold-start compile time versus vLLM's roughly one minute.
The KV cache is laid out as a grid of fixed-size pages. Requests A, B, C claim pages dynamically as their sequences grow — no contiguous-allocation fragmentation, no wasted memory between sequences.
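The page-grid idea can be sketched as a small block allocator. This is a toy model of the bookkeeping only (no tensors, no attention kernel), and the class name, page count, and page size are hypothetical rather than vLLM's actual data structures.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}    # seq_id -> list of page ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        pages = self.page_table.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(pages) * self.page_size:    # last page is full, or none yet
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's pages to the free list."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(17):
    cache.append_token("A")     # 17 tokens need two pages
cache.append_token("B")         # 1 token needs one page
pages_before = len(cache.free_pages)
cache.release("A")              # A finishes; its pages become reusable at once
pages_after = len(cache.free_pages)
print(pages_before, pages_after)
```

A sequence wastes at most one partially filled page, and pages do not need to be adjacent, which is what removes the fragmentation of contiguous per-sequence allocation.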
| Engine | Best For | Key Technique | Throughput (LLaMA-2-7B @100) | TTFT | Cold Start |
| --- | --- | --- | --- | --- | --- |
| vLLM | Multi-tenant high-concurrency, batch gen | PagedAttention + continuous batching | ~15,243 tok/s | Baseline | ~1 min |
| TGI | HF-ecosystem chatbots, low concurrency | Dynamic batching + safetensors | ~4,156 tok/s (~3.7× lower) | 1.3–2× lower than vLLM | ~1 min |
| Triton + TensorRT-LLM | Enterprise fleets, fixed long-lived models | Compiled engines + ensembles | Slightly higher peak than vLLM | Variable | Tens of minutes |
Selection rule: vLLM for max throughput / minimal lock-in; TGI for low TTFT and HF integration; Triton + TensorRT-LLM for unified NVIDIA platform with dozens of model types under one control plane.
Key Points
Cascades exploit the easy-vs-hard distribution; fast model first, expensive model only on escalation.
Triton ensembles are a config-driven DAG of model steps — no network hops between stages.
Continuous batching keeps GPU utilization at 85–92% even under heterogeneous load.
~3.7× throughput gap on LLaMA-2-7B (vLLM vs TGI) often dictates 1 GPU vs 4 in capacity planning.
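The "1 GPU vs 4" figure follows from a ceiling division. The sketch assumes the quoted throughput numbers are per GPU and that capacity scales linearly with GPU count, neither of which is stated in the benchmark; the 15,000 tok/s demand is a hypothetical target.

```python
import math

# Throughput figures quoted in the text (LLaMA-2-7B, 100 concurrent requests),
# treated here as per-GPU rates, which is an assumption.
TOKENS_PER_S = {"vLLM": 15243, "TGI": 4156}

def gpus_needed(target_tok_s, per_gpu_tok_s):
    """Smallest GPU count whose combined rate meets the target."""
    return math.ceil(target_tok_s / per_gpu_tok_s)

TARGET_TOK_S = 15000   # hypothetical fleet-wide demand

plan = {engine: gpus_needed(TARGET_TOK_S, rate) for engine, rate in TOKENS_PER_S.items()}
print(plan)
```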
Post-Reading Check
10. Why is a cascade often an effective p99 optimization?
It runs every model on every input in parallel.
A cheap fast model handles the easy majority; only difficult inputs escalate to the expensive model, dramatically lowering average compute per request.
It eliminates the GPU entirely.
Cascades always have higher accuracy than a single model.
11. What two breakthrough techniques distinguish vLLM as an LLM serving engine?
Static batching plus FP32 weights.
PagedAttention (paged KV-cache) and continuous batching (per-token-step scheduling).
Single-GPU dispatch and OS process isolation.
DAG ensembles and model warmup hooks.
12. Which engine is the best choice when you need maximum throughput across many concurrent LLM users with minimal vendor lock-in?