Chapter 12 — Serving Infrastructure: Latency, Throughput, and Scalability

Learning Objectives

Section 1: Latency and Throughput Fundamentals

Pre-Reading Check

1. Why are serving SLOs expressed in terms of p99 latency rather than average latency?

Averages are easier to compute on streaming metrics pipelines.
Averages mask bad experiences that are rare per request but frequent at scale, and those experiences define user perception.
p99 is required by Kubernetes HPA to function correctly.
Average latency cannot be measured on GPUs.

2. According to the operational heuristic in the chapter, sustained GPU utilization should typically be kept under what threshold to protect p99 against bursts?

10–20%
40–50%
60–70%
95–100%

3. What is the canonical mitigation for cold-start latency taxes on GPU pods?

Scale-to-zero with aggressive KEDA polling intervals.
A warm pool with minReplicaCount > 0 plus startup warmup hooks before readiness.
Disabling JIT compilation in PyTorch.
Running CPU-only inference on the first request.

Production serving moves the conversation from "does the model work?" to "does it deliver predictions fast enough, often enough, and cheaply enough to satisfy an SLO?" The single most important habit a serving engineer must develop is to stop reasoning about average latency. Averages hide catastrophes — if a model returns in 20 ms for 99 requests and 5,000 ms for the hundredth, the average looks like a healthy 70 ms, but one in every hundred users just watched a five-second spinner.

Real serving systems are evaluated on percentiles: p50 (median), p95 (bad-day), and p99 (the tail, which at scale still occurs hundreds of times per hour). An SLO like "p99 < 200 ms for the recommendation endpoint" is the contract between the platform and downstream consumers; when it breaks, error budgets burn and the on-call engineer is paged. Tail latency arises from sources averages cannot see: GC pauses, cold caches, head-of-line blocking, scheduler jitter, and rare input shapes outside an optimization profile. Engineering for p99 is fundamentally about reducing variance.
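To make the gap between the mean and the tail concrete, here is a minimal sketch that computes nearest-rank percentiles over a hypothetical latency sample. The sample (980 requests at 20 ms and 20 at 5,000 ms) and the helper name `percentile` are illustrative assumptions, not data from a real system.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 2% of requests hit a 5-second stall.
latencies_ms = [20] * 980 + [5000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms p50={p50_ms} ms p95={p95_ms} ms p99={p99_ms} ms")
```

In this sample the mean and p95 both look tolerable while p99 sits at the full stall time, which is why the SLO is written against the percentile.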

Throughput (QPS or tokens/s) and latency are coupled but distinct, governed by queueing theory: as utilization approaches 100%, queue depth grows nonlinearly and latency explodes. A widely used heuristic is to hold GPU SM utilization below 60–70% in steady state.
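The nonlinear blow-up can be illustrated with the simplest queueing model. The sketch below assumes an M/M/1 queue (Poisson arrivals, exponential service times, one server), which is a textbook simplification and not a model of any particular GPU server; under that assumption, mean time in system is the service time divided by (1 − utilization).

```python
def mm1_latency_ms(service_ms, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1 - utilization)

# Assumed 10 ms mean service time; only the utilization changes.
for rho in (0.5, 0.7, 0.9, 0.99):
    print(f"utilization={rho:.2f} -> mean latency {mm1_latency_ms(10, rho):.1f} ms")
```

Under these assumptions, moving from 70% to 90% utilization triples mean latency, and the last few points of utilization cost far more than the first seventy, which is the intuition behind leaving headroom.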

| Metric | Measures | Unit | Watch For |
|---|---|---|---|
| p50 | Median latency | ms | Baseline experience |
| p95 | 95th percentile | ms | Bad-day experience |
| p99 | 99th percentile | ms | SLO contract metric |
| Throughput | Sustained request rate | QPS, tok/s | Capacity ceiling |
| SM util | SMs busy | % | Stay under 60–70% |
| Queue depth | Pending requests | count | Leading indicator of p99 |

Fresh replicas pay "first-time" costs — weight loading, CUDA context creation, JIT specialization, page-cache warming — making their first dozen requests 5–10× slower than steady state. The mitigation is the warm pool: minReplicaCount > 0, plus startup warmup hooks that issue synthetic requests across all major input shapes before readiness passes. Triton supports this natively via its model-warmup config.
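The warmup-before-readiness pattern can be sketched at the application level as follows. This is a generic illustration, not Triton's model-warmup mechanism: the `infer` callable, the `WarmupGate` class, and the list of input lengths are all hypothetical stand-ins for whatever the serving process actually exposes.

```python
import time

class WarmupGate:
    """Runs synthetic requests for each input shape, then reports ready."""

    def __init__(self, infer, input_lengths, rounds=3):
        self.infer = infer                  # hypothetical inference callable
        self.input_lengths = input_lengths  # major input shapes to exercise
        self.rounds = rounds
        self.ready = False                  # what a readiness probe would read

    def run(self):
        timings = {}
        for length in self.input_lengths:
            synthetic = [0] * length        # dummy payload of the right shape
            per_round = []
            for _ in range(self.rounds):
                start = time.perf_counter()
                self.infer(synthetic)
                per_round.append(time.perf_counter() - start)
            timings[length] = per_round
        self.ready = True                   # flip only after every shape ran
        return timings

# Usage with a stand-in model that just sums its input.
gate = WarmupGate(infer=sum, input_lengths=[16, 32, 64, 128])
warmup_timings = gate.run()
```

A readiness endpoint would return failure until `gate.ready` is true, so the load balancer never sends live traffic to a replica still paying first-time costs.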

Profiling is non-negotiable. A common finding is that the GPU kernel itself takes 8 ms while Python preprocessing eats 40 ms and JSON serialization 30 ms — meaning kernel work buys almost nothing until the surrounding pipeline is fixed. Nsight Systems, Triton's perf_analyzer, and the PyTorch profiler show where the milliseconds actually live.
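Before reaching for the dedicated profilers, a coarse per-stage timer is often enough to show where the time goes. The sketch below is an illustrative helper (`StageTimer` and `shares` are hypothetical names); it measures wall-clock time on the host, so asynchronous GPU work would need an explicit synchronization point to be attributed correctly. The final lines apply the share calculation to the breakdown quoted above.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock milliseconds per named pipeline stage."""

    def __init__(self):
        self.totals_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.totals_ms[name] = self.totals_ms.get(name, 0.0) + elapsed_ms

def shares(totals_ms):
    """Fraction of total time spent in each stage."""
    total = sum(totals_ms.values())
    return {name: ms / total for name, ms in totals_ms.items()}

# The breakdown quoted above: 40 ms preprocessing, 8 ms kernel, 30 ms JSON.
quoted = {"preprocess": 40.0, "gpu_kernel": 8.0, "json_serialize": 30.0}
quoted_shares = shares(quoted)
print({name: f"{frac:.0%}" for name, frac in quoted_shares.items()})
```

With those numbers the kernel is roughly a tenth of the request, so even an infinitely fast kernel would remove only about 8 of 78 ms.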

Figure 12.1: Latency Budget Across a Serving Pipeline

flowchart LR
    A[Client Request] -->|Network 5 ms| B[API Gateway]
    B -->|Routing 2 ms| C[Queue]
    C -->|Wait 3 ms| D[Preprocessing<br/>Tokenize/Decode]
    D -->|CPU 10 ms| E[H2D Copy 1 ms]
    E --> F[GPU Kernel 8 ms]
    F -->|D2H 1 ms| G[Postprocessing<br/>JSON serialize]
    G -->|6 ms| H[Response]
    H -->|Network 5 ms| I[Client]
    style F fill:#1f6feb,color:#fff
    style D fill:#d29922,color:#fff
    style G fill:#d29922,color:#fff

Key Points

Post-Reading Check

1. Why are serving SLOs expressed in terms of p99 latency rather than average latency?

Averages are easier to compute on streaming metrics pipelines.
Averages mask bad experiences that are rare per request but frequent at scale, and those experiences define user perception.
p99 is required by Kubernetes HPA to function correctly.
Average latency cannot be measured on GPUs.

2. According to the operational heuristic in the chapter, sustained GPU utilization should typically be kept under what threshold to protect p99 against bursts?

10–20%
40–50%
60–70%
95–100%

3. What is the canonical mitigation for cold-start latency taxes on GPU pods?

Scale-to-zero with aggressive KEDA polling intervals.
A warm pool with minReplicaCount > 0 plus startup warmup hooks before readiness.
Disabling JIT compilation in PyTorch.
Running CPU-only inference on the first request.

Section 2: Optimization Techniques

Pre-Reading Check

4. What single tuning knob most directly determines the latency/throughput tradeoff in Triton's dynamic batching?

instance_group.count
max_queue_delay_microseconds
preferred_batch_size only
model_repository path

5. Which statement about INT8 quantization is correct?

Post-training quantization always matches QAT accuracy.
INT8 PTQ delivers 2–4× speedup over FP32 but may lose 1–2 accuracy points; QAT typically recovers most of that gap.
INT8 only works on CPUs; GPUs require FP32 weights.
Quantization increases tail latency because of additional dequantization overhead.

6. What architectural benefit does TensorRT provide over PyTorch eager execution?

It eliminates the need for a GPU entirely.
It runs Python preprocessing on a separate machine.
It compresses a graph via layer fusion, constant folding, and tactic selection, typically 3–10× lower latency.
It guarantees ONNX-Runtime compatibility for all CPU EPs.

Reducing per-request work is the highest-leverage optimization available; every saved millisecond is one less millisecond in the queue, which compounds into lower p99 and higher sustainable QPS. Four families of techniques dominate.

Dynamic Batching and Request Bucketing

GPUs are massively parallel; sending them a single 1-row matrix is like hiring a thousand cooks for one omelette. Dynamic batching collects multiple requests arriving within a 1–2 ms window into a single batch dispatch, amortizing per-call overhead. Triton exposes this via config.pbtxt:

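dynamic_batching {
  # Batch sizes the scheduler tries to form first.
  preferred_batch_size: [4, 8, 16]
  # Longest a request may wait for batch-mates: 2000 µs = 2 ms.
  max_queue_delay_microseconds: 2000
}
# Two execution instances of the model per GPU.
instance_group { kind: KIND_GPU count: 2 }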

The max_queue_delay_microseconds knob is the central tuning dial: too large and unlucky requests sit too long; too small and most batches are size 1. The sweet spot keeps delay well below compute time (e.g., 1–2 ms of delay for a 15–50 ms model). Done correctly, dynamic batching delivers 2–10× throughput on transformer/CV workloads with minimal latency cost.
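The effect of the delay knob can be seen in a toy model of the batch accumulator. This is a simplified sketch, not Triton's scheduler: it assumes a batch closes when the oldest queued request has waited longer than the delay or when the batch is full, and it ignores whether the GPU is busy. The arrival times are the ones used in Figure 12.2; the function name `form_batches` is hypothetical.

```python
def form_batches(arrivals_ms, max_delay_ms, max_batch):
    """Group arrival timestamps into batches under a queue-delay limit."""
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        waited_too_long = bool(current) and (t - current[0] > max_delay_ms)
        full = len(current) == max_batch
        if waited_too_long or full:
            batches.append(current)   # dispatch what has accumulated
            current = []
        current.append(t)
    if current:
        batches.append(current)       # flush the final partial batch
    return batches

arrivals = [0.0, 0.4, 1.1, 1.8]       # ms, from Figure 12.2

wide_window = form_batches(arrivals, max_delay_ms=2.0, max_batch=16)
narrow_window = form_batches(arrivals, max_delay_ms=0.3, max_batch=16)

print("2.0 ms delay:", [len(b) for b in wide_window])
print("0.3 ms delay:", [len(b) for b in narrow_window])
```

With a 2 ms delay all four requests share one dispatch; with 0.3 ms every request goes out alone, which is the "most batches are size 1" failure mode.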

Request bucketing pads variable inputs to fixed sizes (16, 32, 64, 128) so TensorRT can build optimization profiles per bucket; rare lengths no longer spike p99.
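A minimal sketch of the padding step, using the bucket sizes named above. The pad token id of 0 and the helper names are illustrative assumptions; inputs longer than the largest bucket are rejected here, though a real service might truncate or route them elsewhere.

```python
import bisect

BUCKETS = (16, 32, 64, 128)

def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that can hold `length` tokens."""
    index = bisect.bisect_left(buckets, length)
    if index == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[index]

def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Right-pad a token list to its bucket size (pad_id is an assumed value)."""
    size = pick_bucket(len(token_ids), buckets)
    return token_ids + [pad_id] * (size - len(token_ids))

padded = pad_to_bucket(list(range(1, 21)))   # 20 real tokens
print(len(padded))                            # padded up to the 32 bucket
```

Every request now arrives in one of four shapes, so each shape can be compiled and warmed ahead of time.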

Animation A1: Dynamic Batching — Requests Accumulate, GPU Fires Once

Individual requests arrive within the queue-delay window, accumulate into the batch buffer, dispatch to the GPU as one call, and responses fan out.

[Animation: 4 arrivals within a 2 ms window → 1 batch → 1 GPU kernel call (~15 ms) → 4 responses]

TensorRT and ONNX Runtime

TensorRT takes a frozen graph and produces a hardware-specialized engine through layer fusion (Conv+Bias+ReLU → 1 kernel), constant folding, layout transformation, and tactic selection (benchmarking multiple GEMM implementations). A 150-node BERT graph compresses to ~20–30 fused ops — a 3–10× latency reduction over PyTorch eager. ONNX Runtime is the portable alternative: an Execution Provider abstraction lets the same model target CPU, CUDA, TensorRT, or DirectML, with constant folding, node fusion, and shape inference at load time.

Quantization

FP16 uses Tensor Cores for 1.5–2× throughput with negligible accuracy loss. INT8 yields 2–4× over FP32. PTQ (post-training, with calibration data) is quick but may lose 1–2 points; QAT (quantization-aware training, with fake-quant nodes in the forward pass) typically recovers most of that gap.
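The core arithmetic of post-training quantization can be sketched in a few lines. This example assumes symmetric per-tensor INT8 quantization with max-abs calibration, which is one common scheme among several; production toolchains use more elaborate calibrators. The weight values and function names are made up for illustration.

```python
def calibrate_scale(calibration_values):
    """Max-abs calibration: map the largest magnitude seen to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127 if max_abs else 1.0

def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in quantized]

weights = [-1.27, -0.5, 0.0, 0.25, 1.27]      # illustrative values
scale = calibrate_scale(weights)
codes = quantize(weights, scale)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, f"max round-trip error = {max_error:.6f}")
```

For values inside the calibrated range, the round-trip error is bounded by half a quantization step; the accuracy loss mentioned above comes from that rounding plus clipping of values outside the calibration range.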

| Optimization | Speedup | Accuracy | When |
|---|---|---|---|
| Dynamic batching | 2–10× throughput | None | Always for transformer/CV |
| TensorRT kernel fusion | 3–10× latency | None | Stable NVIDIA prod |
| FP16 | 1.5–2× throughput | Negligible | Default for modern GPUs |
| INT8 PTQ | 2–4× over FP32 | 0–2 pts loss | Calibration data available |
| INT8 QAT | 2–4× over FP32 | Near-zero loss | When PTQ accuracy too low |
| Caching | ∞ on hits | None | Repeated inputs |

Caching

The fastest inference is the one you never run. A two-tier cache (in-process LRU + Redis) can absorb a large fraction of traffic before it ever reaches the GPU. Cache-key design is the engineering judgment — content hash for embeddings, (user_id, candidate_id, model_version) tuples for ranking.
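A minimal sketch of the two-tier lookup, with a plain dictionary standing in for Redis so the example runs without a server. The class name, key prefixes, and capacity are illustrative assumptions; a real deployment would also need TTLs and serialization, and this sketch cannot cache a result of `None`.

```python
import hashlib
from collections import OrderedDict

class TwoTierCache:
    """In-process LRU in front of a shared dict-like store (stand-in for Redis)."""

    def __init__(self, remote, local_capacity=1024):
        self.local = OrderedDict()
        self.remote = remote
        self.local_capacity = local_capacity

    def get_or_compute(self, key, compute):
        if key in self.local:                  # tier 1: process memory
            self.local.move_to_end(key)
            return self.local[key]
        value = self.remote.get(key)           # tier 2: shared store
        if value is None:
            value = compute()                  # miss on both tiers: run the model
            self.remote[key] = value
        self.local[key] = value
        while len(self.local) > self.local_capacity:
            self.local.popitem(last=False)     # evict least recently used
        return value

def embedding_key(content: bytes) -> str:
    """Content-hash key for embedding requests."""
    return "emb:" + hashlib.sha256(content).hexdigest()

def ranking_key(user_id, candidate_id, model_version) -> str:
    """Tuple key for ranking requests."""
    return f"rank:{model_version}:{user_id}:{candidate_id}"

shared_store = {}
cache = TwoTierCache(shared_store, local_capacity=2)
model_calls = []

def fake_model():
    model_calls.append(1)                      # count how often the model runs
    return [0.1, 0.2, 0.3]

key = embedding_key(b"hello world")
first = cache.get_or_compute(key, fake_model)
second = cache.get_or_compute(key, fake_model)
print(len(model_calls))
```

Including the model version in the ranking key means a new deployment naturally misses the cache instead of serving scores from the previous model.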

Figure 12.2: Dynamic Batching Sequence

sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant R3 as Request 3
    participant R4 as Request 4
    participant B as Batch Accumulator
    participant G as GPU
    R1->>B: arrive t=0
    R2->>B: arrive t=0.4
    R3->>B: arrive t=1.1
    R4->>B: arrive t=1.8
    Note over B: queue_delay=2 ms; batch=4
    B->>G: dispatch batch[1..4]
    G-->>B: results (15 ms)
    B-->>R1: response
    B-->>R2: response
    B-->>R3: response
    B-->>R4: response

Key Points

Post-Reading Check

4. What single tuning knob most directly determines the latency/throughput tradeoff in Triton's dynamic batching?

instance_group.count
max_queue_delay_microseconds
preferred_batch_size only
model_repository path

5. Which statement about INT8 quantization is correct?

Post-training quantization always matches QAT accuracy.
INT8 PTQ delivers 2–4× speedup over FP32 but may lose 1–2 accuracy points; QAT typically recovers most of that gap.
INT8 only works on CPUs; GPUs require FP32 weights.
Quantization increases tail latency because of additional dequantization overhead.

6. What architectural benefit does TensorRT provide over PyTorch eager execution?

It eliminates the need for a GPU entirely.
It runs Python preprocessing on a separate machine.
It compresses a graph via layer fusion, constant folding, and tactic selection, typically 3–10× lower latency.
It guarantees ONNX-Runtime compatibility for all CPU EPs.

Section 3: Scaling Strategies

Pre-Reading Check

7. What is the principal advantage of KEDA over vanilla HPA for GPU serving?

KEDA replaces all of Kubernetes scheduling.
KEDA adds event-source awareness (Kafka, SQS, Prometheus) and first-class scale-to-zero on top of HPA-like behavior.
KEDA forces all pods to run on CPU instances.
KEDA disables stabilization windows.

8. Why are short scale-up and long scale-down stabilization windows recommended?

Kubernetes requires symmetric windows.
To respond quickly to spikes while avoiding thrashing — cold starts make tearing down GPU pods expensive if they're needed again soon.
Long scale-down reduces cluster cost during off-peak hours by removing pods faster.
Short scale-up is mandated by the DCGM Exporter.

9. What does NVIDIA Multi-Instance GPU (MIG) provide that GPU time-slicing does not?

Higher peak FLOPs per slice.
Hardware-level isolation with dedicated compute, memory bandwidth, and L2 cache per instance.
Automatic quantization to INT8.
Free network bandwidth between slices.

Once a single replica is tuned, the next problem is replicating it. Horizontal scaling expands capacity by adding pods; vertical scaling expands by using bigger GPUs or partitioning existing ones.

Horizontal Autoscaling: HPA + KEDA

Kubernetes' HPA scales replicas by metrics — CPU/memory by default, GPU and application metrics in practice. KEDA extends this with event-source-aware scaling (Kafka, SQS, Redis, Prometheus, dozens of others) and first-class scale-to-zero. The best-practice pattern combines them: KEDA drives event-driven scaling from queue depth and Prometheus signals; HPA-style behavior reacts to GPU and latency metrics.

Raw GPU utilization alone is a poor signal — combine it with QPS per pod, queue depth, and SLO breach rate. Metrics flow from NVIDIA's DCGM Exporter (DCGM_FI_PROF_GR_ENGINE_ACTIVE, SM/memory utilization per GPU or MIG slice) into Prometheus, then into HPA via the Prometheus Adapter or into KEDA via its Prometheus scaler.

Stabilization windows matter enormously: scale-up fast (30–60 s) so the system responds to spikes; scale-down slow (300–600 s) so it does not thrash by tearing down GPU pods needed two minutes later.
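The asymmetric windows can be illustrated with a small model. The replica formula is the one documented for HPA, ceil(currentReplicas × currentMetric / targetMetric); the `Stabilizer` class is a simplified sketch of windowed stabilization (scale up only to the lowest recommendation seen in the up window, scale down only to the highest seen in the down window) and is not the controller's actual code. The window lengths and metric values are illustrative.

```python
import math

def desired_replicas(current_replicas, metric_value, target_value):
    """HPA-style proportional rule, never below one replica."""
    return max(1, math.ceil(current_replicas * metric_value / target_value))

class Stabilizer:
    """Simplified model of scale-up / scale-down stabilization windows."""

    def __init__(self, up_window_s=30, down_window_s=300):
        self.up_window_s = up_window_s
        self.down_window_s = down_window_s
        self.history = []   # (timestamp_s, recommended_replicas)

    def decide(self, now_s, current_replicas, recommendation):
        self.history.append((now_s, recommendation))
        horizon = max(self.up_window_s, self.down_window_s)
        self.history = [(t, r) for t, r in self.history if now_s - t <= horizon]
        up_floor = min(r for t, r in self.history if now_s - t <= self.up_window_s)
        down_ceiling = max(r for t, r in self.history if now_s - t <= self.down_window_s)
        if up_floor > current_replicas:
            return up_floor          # every recent sample agrees: scale up
        if down_ceiling < current_replicas:
            return down_ceiling      # load has stayed low for the whole window
        return current_replicas

stabilizer = Stabilizer(up_window_s=30, down_window_s=300)
spike = stabilizer.decide(0, 2, desired_replicas(2, metric_value=120, target_value=60))
lull = stabilizer.decide(60, spike, 1)      # load drops one minute later
later = stabilizer.decide(400, lull, 1)     # still low after the down window
print(spike, lull, later)
```

In this run the spike doubles the deployment at once, the brief lull is ignored, and pods are removed only after the low recommendation has outlasted the scale-down window.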

Animation A2: HPA Autoscaling — Gauge Rises, Pods Spawn; Gauge Falls, Pods Drain

CPU / queue-depth gauge climbs; HPA scales the deployment from one pod up to four. Load subsides; HPA scales back down after the stabilization window.

[Animation: load signal (queue depth / SM util) high → HPA spawns Pods 1–4 (scale-up window 30–60 s); signal low → pods drain (300–600 s window)]

Figure 12.3: HPA + KEDA Signal Flow

flowchart TD
    DCGM[DCGM Exporter<br/>SM, mem util] --> PA[Prometheus Adapter]
    PROM[Prometheus<br/>p95, QPS] --> KEDA[KEDA Operator]
    KAFKA[Kafka depth] --> KEDA
    PA --> HPA[HPA]
    KEDA --> HPA
    HPA -->|up 30-60s, down 300-600s| DEP[Inference Deployment]
    DEP --> P1[Pod 1 GPU]
    DEP --> P2[Pod 2 GPU]
    DEP --> P3[Pod N GPU]
    P1 -.metrics.-> DCGM
    P2 -.metrics.-> DCGM
    P3 -.metrics.-> DCGM
    style HPA fill:#1f6feb,color:#fff
    style KEDA fill:#238636,color:#fff
    style DCGM fill:#76b900,color:#000

MIG and GPU Sharing

A full A100/H100 is overkill for a 7B quantized model using 12 GB. MIG partitions one physical GPU into multiple isolated instances, each with dedicated compute, memory bandwidth, and L2 cache. A 40 GB A100 exposes seven compute slices, so it can be split into 7× 1g.5gb, or into mixed sizes such as 2× 1g.5gb + 1× 2g.10gb + 1× 3g.20gb (the layout shown in Figure 12.4). On Kubernetes with the NVIDIA GPU Operator, MIG is applied via node labels (nvidia.com/mig.config=all-1g.5gb) and consumed via resource requests (nvidia.com/mig-1g.5gb: 1). Time-slicing shares one GPU without hardware isolation; it is simpler to configure but offers no QoS. MIG is the right choice when SLOs matter.
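A quick sanity check for a proposed MIG layout is to add up the slices. The sketch below parses profile names of the form `<compute>g.<memory>gb` and checks them against a 40 GB A100 with seven compute slices. It is only a necessary condition: real MIG also enforces placement rules that this sum does not capture, and the helper names are hypothetical.

```python
A100_40GB = {"compute_slices": 7, "memory_gb": 40}

def parse_profile(name):
    """'2g.10gb' -> (2 compute slices, 10 GB)."""
    compute_part, memory_part = name.split(".")
    return int(compute_part.rstrip("g")), int(memory_part[:-2])

def fits(profiles, gpu=A100_40GB):
    """True if the summed compute slices and memory stay within the GPU."""
    compute = sum(parse_profile(p)[0] for p in profiles)
    memory = sum(parse_profile(p)[1] for p in profiles)
    return compute <= gpu["compute_slices"] and memory <= gpu["memory_gb"]

seven_small = ["1g.5gb"] * 7
figure_layout = ["1g.5gb", "1g.5gb", "2g.10gb", "3g.20gb"]   # as in Figure 12.4
oversubscribed = ["3g.20gb", "3g.20gb", "2g.10gb"]           # 8 slices, 50 GB

for layout in (seven_small, figure_layout, oversubscribed):
    print(layout, "fits" if fits(layout) else "does not fit")
```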

Load Balancing and Multi-Region

Round-robin is the lazy default and usually wrong. Use least-loaded routing, session affinity for streaming LLM responses, and shadow routing for offline candidate comparison. Multi-region deployments place replicas in multiple cloud regions with DNS or Anycast routing — at the cost of cross-region weight replication and feature-store consistency.
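Least-loaded routing can be sketched as "send the request to the replica with the fewest requests in flight". The class below is an illustrative, single-threaded model with hypothetical replica names; a production router would need thread safety and health checks.

```python
class LeastLoadedRouter:
    """Routes each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self.in_flight = {name: 0 for name in replicas}

    def acquire(self):
        # min() returns the first replica among ties, in insertion order.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name):
        self.in_flight[name] -= 1

router = LeastLoadedRouter(["pod-a", "pod-b", "pod-c"])
first_three = [router.acquire() for _ in range(3)]   # one request each
router.release("pod-b")                              # pod-b finishes early
next_target = router.acquire()                       # goes back to pod-b
print(first_three, next_target)
```

Round-robin would have sent the fourth request to pod-a regardless of load; with variable-length inference requests that difference is what keeps one slow request from stacking a queue behind it.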

Figure 12.4: MIG Partitioning of an A100

graph TD
    A[Physical A100<br/>40 GB HBM2, 108 SMs]
    A --> S1[1g.5gb<br/>5 GB]
    A --> S2[1g.5gb<br/>5 GB]
    A --> S3[2g.10gb<br/>10 GB]
    A --> S4[3g.20gb<br/>20 GB]
    S1 --> POD1[Pod A<br/>small CV]
    S2 --> POD2[Pod B<br/>small CV]
    S3 --> POD3[Pod C<br/>medium NLP]
    S4 --> POD4[Pod D<br/>7B quantized LLM]
    style A fill:#76b900,color:#000
    style S1 fill:#1f6feb,color:#fff
    style S2 fill:#1f6feb,color:#fff
    style S3 fill:#1f6feb,color:#fff
    style S4 fill:#1f6feb,color:#fff

Key Points

Post-Reading Check

7. What is the principal advantage of KEDA over vanilla HPA for GPU serving?

KEDA replaces all of Kubernetes scheduling.
KEDA adds event-source awareness (Kafka, SQS, Prometheus) and first-class scale-to-zero on top of HPA-like behavior.
KEDA forces all pods to run on CPU instances.
KEDA disables stabilization windows.

8. Why are short scale-up and long scale-down stabilization windows recommended?

Kubernetes requires symmetric windows.
To respond quickly to spikes while avoiding thrashing — cold starts make tearing down GPU pods expensive if they're needed again soon.
Long scale-down reduces cluster cost during off-peak hours by removing pods faster.
Short scale-up is mandated by the DCGM Exporter.

9. What does NVIDIA Multi-Instance GPU (MIG) provide that GPU time-slicing does not?

Higher peak FLOPs per slice.
Hardware-level isolation with dedicated compute, memory bandwidth, and L2 cache per instance.
Automatic quantization to INT8.
Free network bandwidth between slices.

Section 4: Advanced Serving Topologies

Pre-Reading Check

10. Why is a cascade often an effective p99 optimization?

It runs every model on every input in parallel.
A cheap fast model handles the easy majority; only difficult inputs escalate to the expensive model, dramatically lowering average compute per request.
It eliminates the GPU entirely.
Cascades always have higher accuracy than a single model.

11. What two breakthrough techniques distinguish vLLM as an LLM serving engine?

Static batching plus FP32 weights.
PagedAttention (paged KV-cache) and continuous batching (per-token-step scheduling).
Single-GPU dispatch and OS process isolation.
DAG ensembles and model warmup hooks.

12. Which engine is the best choice when you need maximum throughput across many concurrent LLM users with minimal vendor lock-in?

Hugging Face TGI
Triton + TensorRT-LLM (compiled engines)
vLLM
Stock PyTorch eager mode

One-model-per-endpoint is increasingly rare. Modern topologies chain models, route through stages, and run specialized engines for specialized workloads — especially LLMs.

Ensembles, Cascades, Sidecars

Ensembles combine predictions (average, vote, stack). Cascades chain models sequentially: a 2 ms keyword filter handles obvious cases, a 15 ms CNN handles ambiguous images, only the hardest cases escalate to a 100 ms multimodal foundation model — slashing average compute without losing accuracy on hard inputs.
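The cascade logic and its latency arithmetic can be sketched as follows. Stage latencies (2, 15, 100 ms) come from the example above; the confidence thresholds, the stand-in predictors, and the 80/15/5 traffic split are hypothetical. Because stages run sequentially, an escalated request pays for every stage it passed through.

```python
def cascade(x, stages):
    """stages: list of (name, predict, threshold); predict returns (label, confidence).

    Returns the first stage whose confidence clears its threshold;
    the final stage always answers.
    """
    for name, predict, threshold in stages[:-1]:
        label, confidence = predict(x)
        if confidence >= threshold:
            return name, label
    name, predict, _ = stages[-1]
    label, _ = predict(x)
    return name, label

def expected_latency_ms(stage_ms, resolve_fraction):
    """Mean latency when resolve_fraction[i] of traffic stops at stage i."""
    total, cumulative = 0.0, 0.0
    for ms, fraction in zip(stage_ms, resolve_fraction):
        cumulative += ms                 # escalated requests pay earlier stages too
        total += fraction * cumulative
    return total

# Stand-in predictors: confidence depends on a made-up "difficulty" score.
stages = [
    ("keyword_filter", lambda x: ("spam", 0.99 if x < 0.2 else 0.1), 0.9),
    ("cnn", lambda x: ("spam", 0.95 if x < 0.7 else 0.3), 0.9),
    ("foundation_model", lambda x: ("spam", 1.0), 0.0),
]

print(cascade(0.1, stages), cascade(0.5, stages), cascade(0.9, stages))
print(expected_latency_ms([2, 15, 100], [0.80, 0.15, 0.05]))
```

Under the assumed split the mean is about 10 ms versus 100 ms for sending everything to the large model, though the hardest 5% of requests now take 117 ms, so the escalation rate needs watching.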

Triton ensembles are first-class: a config.pbtxt DAG chains tokenizer (Python backend) → BERT encoder (TensorRT) → classification head (ORT) → label mapper (Python). All four run inside Triton with no network hops between stages.

Embedding sidecars in ranking/search systems produce embeddings for new entities; cached results in Redis or a vector store mean the hot path rarely runs the sidecar, but when it does, latency is predictable.

Figure 12.5: Triton Ensemble Pipeline

flowchart LR
    REQ[Client Request<br/>raw text] --> PRE
    subgraph TRITON[Triton]
        PRE[Tokenizer<br/>Python backend] --> ENC[BERT Encoder<br/>TensorRT]
        ENC --> HEAD[Classifier Head<br/>ORT]
        HEAD --> POST[Label Mapper<br/>Python]
    end
    POST --> RESP[Response<br/>label + score]
    style PRE fill:#d29922,color:#fff
    style ENC fill:#1f6feb,color:#fff
    style HEAD fill:#1f6feb,color:#fff
    style POST fill:#d29922,color:#fff

LLM Serving: vLLM, TGI, Triton + TensorRT-LLM

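LLM serving is different: generation is autoregressive, sequence lengths vary wildly, and the KV cache dominates GPU memory. vLLM introduced two breakthroughs:

PagedAttention: the KV cache is stored in fixed-size pages that requests claim dynamically as their sequences grow, which avoids contiguous-allocation fragmentation.
Continuous batching: the scheduler re-forms the batch at every token step, so new requests join and finished ones leave without waiting for the whole batch to complete.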

The result: on LLaMA-2-7B at 100 concurrent requests, vLLM hits ~15,243 tok/s vs TGI's ~4,156 tok/s — roughly 3.7× higher throughput, widening to ~24× at extreme concurrency.
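The paged KV-cache idea (fixed-size pages claimed only as a sequence grows) can be sketched as a toy allocator. This is an illustration of the bookkeeping only, not vLLM's implementation; the class name, page count, and page size are hypothetical, and no attention computation is modeled.

```python
class PagedKVCache:
    """Toy page allocator: each request holds a list of page ids, not one contiguous region."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size                # tokens per page
        self.free_pages = list(range(num_pages))
        self.page_table = {}                      # request id -> list of page ids
        self.lengths = {}                         # request id -> tokens stored

    def append_token(self, request_id):
        pages = self.page_table.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(pages) * self.page_size:     # last page is full (or none yet)
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            pages.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's pages to the pool for immediate reuse."""
        self.free_pages.extend(self.page_table.pop(request_id, []))
        self.lengths.pop(request_id, None)

kv = PagedKVCache(num_pages=4, page_size=16)
for _ in range(17):
    kv.append_token("A")          # 17 tokens -> 2 pages
for _ in range(16):
    kv.append_token("B")          # 16 tokens -> exactly 1 page
pages_a = len(kv.page_table["A"])
pages_b = len(kv.page_table["B"])
kv.release("A")                   # A finishes; its pages return to the pool
free_after_release = len(kv.free_pages)
print(pages_a, pages_b, free_after_release)
```

Because a request wastes at most one partially filled page, memory that a contiguous allocator would reserve for the maximum sequence length stays available to other requests.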

TGI wins on time-to-first-token (1.3–2× lower than vLLM at low concurrency) and on Hugging Face ecosystem integration (safetensors, HF Hub, auth, observability). Triton + TensorRT-LLM compiles weights/graphs into a hardware-specific engine, yielding slightly higher peak throughput than vLLM on H100 — at the cost of tens of minutes of cold-start compile time versus vLLM's roughly one minute.

Animation A3: vLLM PagedAttention — KV Cache as Reusable Memory Pages

The KV cache is laid out as a grid of fixed-size pages. Requests A, B, C claim pages dynamically as their sequences grow — no contiguous-allocation fragmentation, no wasted memory between sequences.

[Animation: KV-cache page grid of fixed-size pages; Requests A (blue), B (green), and C (pink) each claim 4 pages; freed pages are immediately reusable, with no fragmentation]

| Engine | Best For | Key Technique | Throughput (LLaMA-2-7B @100) | TTFT | Cold Start |
|---|---|---|---|---|---|
| vLLM | Multi-tenant high-concurrency, batch gen | PagedAttention + continuous batching | ~15,243 tok/s | Baseline | ~1 min |
| TGI | HF-ecosystem chatbots, low concurrency | Dynamic batching + safetensors | ~4,156 tok/s (~3.7× lower) | 1.3–2× lower than vLLM | ~1 min |
| Triton + TensorRT-LLM | Enterprise fleets, fixed long-lived models | Compiled engines + ensembles | Slightly higher peak than vLLM | Variable | Tens of minutes |

Selection rule: vLLM for max throughput / minimal lock-in; TGI for low TTFT and HF integration; Triton + TensorRT-LLM for unified NVIDIA platform with dozens of model types under one control plane.

Key Points

Post-Reading Check

10. Why is a cascade often an effective p99 optimization?

It runs every model on every input in parallel.
A cheap fast model handles the easy majority; only difficult inputs escalate to the expensive model, dramatically lowering average compute per request.
It eliminates the GPU entirely.
Cascades always have higher accuracy than a single model.

11. What two breakthrough techniques distinguish vLLM as an LLM serving engine?

Static batching plus FP32 weights.
PagedAttention (paged KV-cache) and continuous batching (per-token-step scheduling).
Single-GPU dispatch and OS process isolation.
DAG ensembles and model warmup hooks.

12. Which engine is the best choice when you need maximum throughput across many concurrent LLM users with minimal vendor lock-in?

Hugging Face TGI
Triton + TensorRT-LLM (compiled engines)
vLLM
Stock PyTorch eager mode
