Study Guide: Chapter 12 — Serving Infrastructure: Latency, Throughput, and Scalability

Learning Objectives

Reason about serving SLOs in terms of tail percentiles (p95, p99) rather than averages and identify the sources of variance that produce tail latency.
Apply optimization techniques — dynamic batching, TensorRT/ONNX Runtime kernel fusion, FP16/INT8 quantization, and caching — and understand how they stack multiplicatively.
Design horizontal-scaling strategies that combine HPA, KEDA, MIG, and DCGM metrics for GPU-aware autoscaling.
Compare advanced serving topologies (cascades, Triton ensembles, vLLM, TGI, Triton + TensorRT-LLM) and select the right engine per workload.

Section 1: Latency and Throughput Fundamentals

Pre-Reading Check

1. Why are serving SLOs expressed in terms of p99 latency rather than average latency?

Averages are easier to compute on streaming metrics pipelines.

Averages mask the rare but recurring bad experiences that define user perception.

p99 is required by Kubernetes HPA to function correctly.

Average latency cannot be measured on GPUs.

2. According to the operational heuristic in the chapter, sustained GPU utilization should typically be kept under what threshold to protect p99 against bursts?

10–20%

40–50%

60–70%

95–100%

3. What is the canonical mitigation for cold-start latency taxes on GPU pods?

Scale-to-zero with aggressive KEDA polling intervals.

A warm pool with minReplicaCount > 0 plus startup warmup hooks before readiness.

Disabling JIT compilation in PyTorch.

Running CPU-only inference on the first request.

Production serving moves the conversation from "does the model work?" to "does it deliver predictions fast enough, often enough, and cheaply enough to satisfy an SLO?" The single most important habit a serving engineer must develop is to stop reasoning about average latency. Averages hide catastrophes — if a model returns in 20 ms for 99 requests and 5,000 ms for the hundredth, the average looks like a healthy 70 ms, but one in every hundred users just watched a five-second spinner.

Real serving systems are evaluated on percentiles: p50 (median), p95 (bad-day), and p99 (worst case that still happens hundreds of times per hour at scale). An SLO like "p99 < 200 ms for the recommendation endpoint" is the contract between the platform and downstream consumers; when it breaks, error budgets burn and the on-call pages. Tail latency arises from sources averages cannot see: GC pauses, cold caches, head-of-line blocking, scheduler jitter, and rare input shapes outside an optimization profile. Engineering for p99 is fundamentally about reducing variance.
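The gap between the mean and the tail can be checked with a few lines of arithmetic. The sketch below uses a made-up sample (980 fast requests and 20 slow ones, an assumption chosen for illustration) and a nearest-rank percentile helper; `percentile` is a hypothetical name, not a library function.

```python
# Hypothetical sample: 980 requests at 20 ms and 20 requests at 5,000 ms.
latencies_ms = [20] * 980 + [5000] * 20


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so no floating-point rounding is involved.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


mean_ms = sum(latencies_ms) / len(latencies_ms)

print(f"mean = {mean_ms:.1f} ms")
for p in (50, 95, 99):
    print(f"p{p}  = {percentile(latencies_ms, p)} ms")
```

In this sample the mean, p50, and p95 all look tolerable, while p99 lands on the 5,000 ms requests, which is why the SLO is written against the tail.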

Throughput (QPS or tokens/s) and latency are coupled but distinct, governed by queueing theory: as utilization approaches 100%, queue depth grows nonlinearly and latency explodes. A widely used heuristic is to hold GPU SM utilization below 60–70% in steady state.
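One way to see the nonlinearity is the textbook M/M/1 queue, where mean time in the system is service time divided by (1 - utilization). This is a simplifying assumption: a real GPU server with batching and multiple instances is not an M/M/1 queue, so treat the numbers as a shape, not a prediction. The function name `mm1_latency_ms` is hypothetical.

```python
def mm1_latency_ms(service_ms, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_ms / (1.0 - utilization)


# A 10 ms service time at increasing utilization levels.
for rho in (0.5, 0.7, 0.9, 0.99):
    print(f"utilization {rho:.2f}: mean latency {mm1_latency_ms(10, rho):.1f} ms")
```

Under this model, latency is twice the service time at 50% utilization and ten times at 90%, which is the intuition behind leaving headroom below full utilization.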

Metric	Measures	Unit	Watch For
p50	Median latency	ms	Baseline experience
p95	95th percentile	ms	Bad-day experience
p99	99th percentile	ms	SLO contract metric
Throughput	Sustained request rate	QPS, tok/s	Capacity ceiling
SM util	SMs busy	%	Stay under 60–70%
Queue depth	Pending requests	count	Leading indicator of p99

Fresh replicas pay "first-time" costs — weight loading, CUDA context creation, JIT specialization, page-cache warming — making their first dozen requests 5–10× slower than steady state. The mitigation is the warm pool: minReplicaCount > 0, plus startup warmup hooks that issue synthetic requests across all major input shapes before readiness passes. Triton supports this natively via its model-warmup config.
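The warmup-before-readiness pattern can be sketched in a few lines. Everything here is hypothetical scaffolding (the `WarmedReplica` class, the stand-in model function, the shape list); Triton's own warmup is declared in the model configuration and does not go through code like this.

```python
import time


class WarmedReplica:
    """Reports ready only after synthetic requests have covered every shape."""

    def __init__(self, model_fn, warmup_shapes):
        self.model_fn = model_fn
        self.warmup_shapes = warmup_shapes  # list of (batch, features) tuples
        self.ready = False

    def warm_up(self):
        timings_ms = {}
        for batch, features in self.warmup_shapes:
            synthetic = [[0.0] * features for _ in range(batch)]
            start = time.perf_counter()
            self.model_fn(synthetic)  # pays the first-time costs for this shape
            timings_ms[(batch, features)] = (time.perf_counter() - start) * 1000
        self.ready = True
        return timings_ms

    def readiness_probe(self):
        # A readiness endpoint would return this value to the orchestrator.
        return self.ready


# Stand-in model: sums each row. A real replica would call the loaded model.
replica = WarmedReplica(lambda rows: [sum(r) for r in rows], [(1, 16), (8, 128)])
assert replica.readiness_probe() is False
replica.warm_up()
print("ready:", replica.readiness_probe())
```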

Profiling is non-negotiable. A common finding is that the GPU kernel itself takes 8 ms while Python preprocessing eats 40 ms and JSON serialization 30 ms. The kernel is then about a tenth of the 78 ms total, so further kernel tuning buys almost nothing until the surrounding pipeline is fixed. Nsight Systems, Triton's perf_analyzer, and the PyTorch profiler show where the milliseconds actually live.
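The budget arithmetic behind that finding is worth doing explicitly. The sketch below uses the stage timings quoted above; the helper name `total_after_speedup` is hypothetical.

```python
# Stage timings from the example above, in milliseconds.
stages_ms = {"preprocess": 40.0, "gpu_kernel": 8.0, "json_serialize": 30.0}


def total_after_speedup(stages, stage, factor):
    """End-to-end latency if a single stage becomes `factor` times faster."""
    return sum(ms / factor if name == stage else ms for name, ms in stages.items())


baseline = sum(stages_ms.values())
print(f"baseline: {baseline:.0f} ms")
for name in stages_ms:
    print(f"2x faster {name}: {total_after_speedup(stages_ms, name, 2.0):.0f} ms")
```

Doubling kernel speed moves the total from 78 ms to 74 ms, while doubling preprocessing speed moves it to 58 ms, so the profile decides where the effort goes.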

Figure 12.1: Latency Budget Across a Serving Pipeline

Post-Reading Check

1. Why are serving SLOs expressed in terms of p99 latency rather than average latency?

Averages are easier to compute on streaming metrics pipelines.

Averages mask the rare but recurring bad experiences that define user perception.

p99 is required by Kubernetes HPA to function correctly.

Average latency cannot be measured on GPUs.

2. According to the operational heuristic in the chapter, sustained GPU utilization should typically be kept under what threshold to protect p99 against bursts?

10–20%

40–50%

60–70%

95–100%

3. What is the canonical mitigation for cold-start latency taxes on GPU pods?

Scale-to-zero with aggressive KEDA polling intervals.

A warm pool with minReplicaCount > 0 plus startup warmup hooks before readiness.

Disabling JIT compilation in PyTorch.

Running CPU-only inference on the first request.

Section 2: Optimization Techniques

Pre-Reading Check

4. What single tuning knob most directly determines the latency/throughput tradeoff in Triton's dynamic batching?

instance_group.count

max_queue_delay_microseconds

preferred_batch_size only

model_repository path

5. Which statement about INT8 quantization is correct?

Post-training quantization always matches QAT accuracy.

INT8 PTQ delivers 2–4× speedup over FP32 but may lose 1–2 accuracy points; QAT typically recovers most of that gap.

INT8 only works on CPUs; GPUs require FP32 weights.

Quantization increases tail latency because of additional dequantization overhead.

6. What architectural benefit does TensorRT provide over PyTorch eager execution?

It eliminates the need for a GPU entirely.

It runs Python preprocessing on a separate machine.

It compresses a graph via layer fusion, constant folding, and tactic selection, typically 3–10× lower latency.

It guarantees ONNX-Runtime compatibility for all CPU EPs.

Reducing per-request work is the highest-leverage optimization available; every saved millisecond is one less millisecond in the queue, which compounds into lower p99 and higher sustainable QPS. Four families of techniques dominate.

Dynamic Batching and Request Bucketing

GPUs are massively parallel; sending them a single 1-row matrix is like hiring a thousand cooks for one omelette. Dynamic batching collects multiple requests arriving within a 1–2 ms window into a single batch dispatch, amortizing per-call overhead. Triton exposes this via config.pbtxt:

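# Dynamic batching requires max_batch_size > 0 elsewhere in this config.pbtxt.
dynamic_batching {
  # Batch sizes the scheduler tries to form when enough requests are queued.
  preferred_batch_size: [4, 8, 16]
  # Longest a request may wait for batch-mates: 2000 µs = 2 ms.
  max_queue_delay_microseconds: 2000
}
# Two instances of the model per GPU, so batches can execute concurrently.
instance_group { kind: KIND_GPU count: 2 }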

The max_queue_delay_microseconds knob is the central tuning dial: too large and unlucky requests sit too long; too small and most batches are size 1. The sweet spot keeps delay well below compute time (e.g., 1–2 ms of delay for a 15–50 ms model). Done correctly, dynamic batching delivers 2–10× throughput on transformer/CV workloads with minimal latency cost.
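The effect of the delay knob can be seen in a toy simulation. This is a deliberately simplified model, not Triton's scheduler: it ignores GPU busy time and preferred batch sizes, and closes a batch only when the oldest request has waited past the delay or the batch is full. The function name and arrival times are hypothetical.

```python
def form_batches(arrivals_us, max_queue_delay_us, max_batch_size):
    """Group arrival timestamps (microseconds) into batches."""
    batches, current = [], []
    for t in sorted(arrivals_us):
        waited_too_long = current and t - current[0] > max_queue_delay_us
        batch_full = len(current) == max_batch_size
        if waited_too_long or batch_full:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 500, 1500, 2500, 6000, 6100]  # hypothetical arrival times in µs
for delay in (2000, 50):
    sizes = [len(b) for b in form_batches(arrivals, delay, max_batch_size=16)]
    print(f"delay {delay} µs -> batch sizes {sizes}")
```

With a 2,000 µs delay the six requests form batches of 3, 1, and 2; with a 50 µs delay every batch has size 1, which is the "too small" failure mode described above.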

Request bucketing pads variable inputs to fixed sizes (16, 32, 64, 128) so TensorRT can build optimization profiles per bucket; rare lengths no longer spike p99.
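A minimal sketch of the bucketing step, assuming token-ID lists as input and a pad ID of 0 (both assumptions for illustration). The bucket sizes are the ones listed above; `pick_bucket` and `pad_to_bucket` are hypothetical helper names.

```python
import bisect

BUCKETS = (16, 32, 64, 128)  # fixed sequence lengths, sorted ascending


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that can hold `length` tokens."""
    index = bisect.bisect_left(buckets, length)
    if index == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[index]


def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Right-pad a sequence so its length matches its bucket."""
    target = pick_bucket(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))


print(pick_bucket(17), len(pad_to_bucket([101, 2023, 102])))
```

Inputs longer than the largest bucket raise an error here; a production system would need to truncate, reject, or route them to a fallback path.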

TensorRT and ONNX Runtime

TensorRT takes a frozen graph and produces a hardware-specialized engine through layer fusion (Conv+Bias+ReLU → 1 kernel), constant folding, layout transformation, and tactic selection (benchmarking multiple GEMM implementations). A 150-node BERT graph compresses to ~20–30 fused ops — a 3–10× latency reduction over PyTorch eager. ONNX Runtime is the portable alternative: an Execution Provider abstraction lets the same model target CPU, CUDA, TensorRT, or DirectML, with constant folding, node fusion, and shape inference at load time.

Quantization

FP16 uses Tensor Cores for 1.5–2× throughput with negligible accuracy loss. INT8 yields 2–4× over FP32. PTQ (post-training, with calibration data) is quick but may lose 1–2 points; QAT (quantization-aware training, with fake-quant nodes in the forward pass) typically recovers most of that gap.
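The calibration step in PTQ can be illustrated with symmetric per-tensor quantization, where the scale is derived from the largest absolute value seen in the calibration data. This is one simple calibration scheme chosen for illustration (an assumption; production toolchains offer others), written in plain Python with hypothetical function names.

```python
def calibrate_scale(calibration_values):
    """Max-abs calibration: map the largest observed magnitude to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    if max_abs == 0:
        raise ValueError("calibration data is all zeros")
    return max_abs / 127.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clip to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


scale = calibrate_scale([-2.0, 0.5, 1.27])
weights = [1.0, -0.3, 2.0, 10.0]  # 10.0 lies outside the calibrated range
restored = dequantize(quantize(weights, scale), scale)
for w, r in zip(weights, restored):
    print(f"{w:6.2f} -> {r:6.3f} (error {abs(w - r):.3f})")
```

Values inside the calibrated range come back within half a quantization step, while the out-of-range value is clipped; that clipping and rounding error is the source of the accuracy loss that QAT trains the model to tolerate.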

Optimization	Speedup	Accuracy	When
Dynamic batching	2–10× throughput	None	Always for transformer/CV
TensorRT kernel fusion	3–10× latency	None	Stable NVIDIA prod
FP16	1.5–2× throughput	Negligible	Default for modern GPUs
INT8 PTQ	2–4× over FP32	0–2 pts loss	Calibration data available
INT8 QAT	2–4× over FP32	Near-zero loss	When PTQ accuracy too low
Caching	∞ on hits	None	Repeated inputs

Caching

The fastest inference is the one you never run. A two-tier cache (in-process LRU + Redis) can absorb a large fraction of traffic before it ever reaches the GPU. Cache-key design is the engineering judgment — content hash for embeddings, (user_id, candidate_id, model_version) tuples for ranking.
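A two-tier lookup might be structured as below. This is a sketch under stated assumptions: a plain dictionary stands in for Redis so the example stays self-contained, the key is a content hash that also includes the model version, and `TwoTierCache` is a hypothetical class, not part of any library.

```python
import hashlib
from collections import OrderedDict


class TwoTierCache:
    """In-process LRU in front of a shared store, in front of the model."""

    def __init__(self, compute_fn, local_capacity=2, remote=None):
        self.compute_fn = compute_fn
        self.local = OrderedDict()          # tier 1: per-process LRU
        self.local_capacity = local_capacity
        self.remote = {} if remote is None else remote  # tier 2: Redis stand-in
        self.model_calls = 0

    @staticmethod
    def key_for(payload, model_version):
        # Including the version prevents serving results from an old model.
        return hashlib.sha256(f"{model_version}:{payload}".encode()).hexdigest()

    def get(self, payload, model_version="v1"):
        key = self.key_for(payload, model_version)
        if key in self.local:
            self.local.move_to_end(key)     # mark as most recently used
            return self.local[key]
        if key in self.remote:
            value = self.remote[key]
        else:
            self.model_calls += 1           # only full misses reach the model
            value = self.compute_fn(payload)
            self.remote[key] = value
        self.local[key] = value
        while len(self.local) > self.local_capacity:
            self.local.popitem(last=False)  # evict least recently used
        return value


cache = TwoTierCache(compute_fn=len, local_capacity=1)
for text in ("a", "a", "bb", "a"):
    cache.get(text)
print("model calls:", cache.model_calls)
```

Four lookups trigger two model calls: the repeated "a" is a local hit, and the final "a" is served from the shared tier after being evicted locally.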

Figure 12.2: Dynamic Batching Sequence

Post-Reading Check

4. What single tuning knob most directly determines the latency/throughput tradeoff in Triton's dynamic batching?

instance_group.count

max_queue_delay_microseconds

preferred_batch_size only

model_repository path

5. Which statement about INT8 quantization is correct?

Post-training quantization always matches QAT accuracy.

INT8 PTQ delivers 2–4× speedup over FP32 but may lose 1–2 accuracy points; QAT typically recovers most of that gap.

INT8 only works on CPUs; GPUs require FP32 weights.

Quantization increases tail latency because of additional dequantization overhead.

6. What architectural benefit does TensorRT provide over PyTorch eager execution?

It eliminates the need for a GPU entirely.

It runs Python preprocessing on a separate machine.

It compresses a graph via layer fusion, constant folding, and tactic selection, typically 3–10× lower latency.

It guarantees ONNX-Runtime compatibility for all CPU EPs.

Section 3: Scaling Strategies

Pre-Reading Check

7. What is the principal advantage of KEDA over vanilla HPA for GPU serving?

KEDA replaces all of Kubernetes scheduling.

KEDA adds event-source awareness (Kafka, SQS, Prometheus) and first-class scale-to-zero on top of HPA-like behavior.

KEDA forces all pods to run on CPU instances.

KEDA disables stabilization windows.

8. Why are short scale-up and long scale-down stabilization windows recommended?

Kubernetes requires symmetric windows.

To respond quickly to spikes while avoiding thrashing — cold starts make tearing down GPU pods expensive if they're needed again soon.

Long scale-down reduces cluster cost during off-peak hours by removing pods faster.

Short scale-up is mandated by the DCGM Exporter.

9. What does NVIDIA Multi-Instance GPU (MIG) provide that GPU time-slicing does not?

Higher peak FLOPs per slice.

Hardware-level isolation with dedicated compute, memory bandwidth, and L2 cache per instance.

Automatic quantization to INT8.

Free network bandwidth between slices.

Once a single replica is tuned, the next problem is replicating it. Horizontal scaling expands capacity by adding pods; vertical scaling expands by using bigger GPUs or partitioning existing ones.

Horizontal Autoscaling: HPA + KEDA

Kubernetes' HPA scales replicas by metrics — CPU/memory by default, GPU and application metrics in practice. KEDA extends this with event-source-aware scaling (Kafka, SQS, Redis, Prometheus, dozens of others) and first-class scale-to-zero. The best-practice pattern combines them: KEDA drives event-driven scaling from queue depth and Prometheus signals; HPA-style behavior reacts to GPU and latency metrics.

Raw GPU utilization alone is a poor signal — combine it with QPS per pod, queue depth, and SLO breach rate. Metrics flow from NVIDIA's DCGM Exporter (DCGM_FI_PROF_GR_ENGINE_ACTIVE, SM/memory utilization per GPU or MIG slice) into Prometheus, then into HPA via the Prometheus Adapter or into KEDA via its Prometheus scaler.

Stabilization windows matter enormously: scale-up fast (30–60 s) so the system responds to spikes; scale-down slow (300–600 s) so it does not thrash by tearing down GPU pods needed two minutes later.
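The asymmetry can be modeled in a few lines. The replica formula is the standard HPA ratio, ceil(current × metric / target); the stabilization logic is a simplified model in which scale-down follows the highest recommendation inside its window and scale-up follows the lowest inside its own. The function names, the default windows, and the sample history are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """Standard HPA ratio: ceil(current * metric / target)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def stabilized_replicas(history, now_s, current, up_window_s=30, down_window_s=300):
    """history: list of (timestamp_s, recommended_replicas) tuples."""
    recent_up = [r for t, r in history if now_s - t <= up_window_s]
    recent_down = [r for t, r in history if now_s - t <= down_window_s]
    up_target = min(recent_up) if recent_up else current
    down_target = max(recent_down) if recent_down else current
    if up_target > current:
        return up_target      # every recent recommendation asks for more
    if down_target < current:
        return down_target    # no recommendation in the long window needs this many
    return current


print(desired_replicas(4, current_metric=90, target_metric=60))

history = [(0, 8), (200, 3), (290, 3)]  # hypothetical recommendations over time
print(stabilized_replicas(history, now_s=300, current=8))
print(stabilized_replicas(history, now_s=520, current=8))
```

At t = 300 s the spike that asked for 8 replicas is still inside the scale-down window, so the deployment holds at 8; by t = 520 s it has aged out and the deployment drops to 3.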

Figure 12.3: HPA + KEDA Signal Flow

MIG and GPU Sharing

A full A100/H100 is overkill for a 7B quantized model using 12 GB. MIG partitions one physical GPU into multiple isolated instances, each with dedicated compute, memory bandwidth, and L2 cache. An A100 40GB exposes seven compute slices, so it can be split into 7× 1g.5gb, or into mixed sizes such as 2× 1g.5gb + 1× 2g.10gb + 1× 3g.20gb. On Kubernetes with the NVIDIA GPU Operator, MIG is applied via node labels (nvidia.com/mig.config=all-1g.5gb) and consumed via resource requests (nvidia.com/mig-1g.5gb: 1). Time-slicing shares one GPU without hardware isolation; it is simpler to configure but offers no QoS, so MIG is the right choice when SLOs matter.
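A quick budget check helps when planning a mixed layout. The sketch below assumes an A100 40GB with seven compute slices and eight 5 GB memory slices, and counts what each profile consumes. Passing this check is necessary but not sufficient: real MIG placement also has position rules that this sketch does not model. The names are hypothetical.

```python
# (compute slices, memory slices) per profile on an A100 40GB.
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
COMPUTE_SLICES = 7
MEMORY_SLICES = 8


def slice_usage(mix):
    """mix maps profile name -> instance count; returns (compute, memory)."""
    compute = sum(A100_40GB_PROFILES[p][0] * n for p, n in mix.items())
    memory = sum(A100_40GB_PROFILES[p][1] * n for p, n in mix.items())
    return compute, memory


def fits_budget(mix):
    compute, memory = slice_usage(mix)
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES


for mix in ({"1g.5gb": 7},
            {"1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1},
            {"2g.10gb": 4}):
    print(mix, slice_usage(mix), "fits" if fits_budget(mix) else "over budget")
```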

Load Balancing and Multi-Region

Round-robin is the lazy default and usually wrong. Use least-loaded routing, session affinity for streaming LLM responses, and shadow routing for offline candidate comparison. Multi-region deployments place replicas in multiple cloud regions with DNS or Anycast routing — at the cost of cross-region weight replication and feature-store consistency.
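Least-loaded routing can be as simple as tracking in-flight requests per replica and picking the minimum. The sketch below is one possible shape, with a hypothetical `LeastLoadedRouter` class and replica names; it is single-threaded and omits health checks.

```python
class LeastLoadedRouter:
    """Routes each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self.in_flight = {name: 0 for name in replicas}

    def acquire(self):
        # Ties break on replica name so the choice is deterministic.
        name = min(self.in_flight, key=lambda r: (self.in_flight[r], r))
        self.in_flight[name] += 1
        return name

    def release(self, name):
        # Call when the replica finishes (or fails) the request.
        self.in_flight[name] -= 1


router = LeastLoadedRouter(["gpu-a", "gpu-b"])
first, second, third = router.acquire(), router.acquire(), router.acquire()
router.release(second)          # gpu-b finishes early
print(first, second, third, router.acquire())
```

Unlike round-robin, the fourth request goes to the replica that just freed up, even though strict alternation would have sent it elsewhere.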

Figure 12.4: MIG Partitioning of an A100

Post-Reading Check

7. What is the principal advantage of KEDA over vanilla HPA for GPU serving?

KEDA replaces all of Kubernetes scheduling.

KEDA adds event-source awareness (Kafka, SQS, Prometheus) and first-class scale-to-zero on top of HPA-like behavior.

KEDA forces all pods to run on CPU instances.

KEDA disables stabilization windows.

8. Why are short scale-up and long scale-down stabilization windows recommended?

Kubernetes requires symmetric windows.

To respond quickly to spikes while avoiding thrashing — cold starts make tearing down GPU pods expensive if they're needed again soon.

Long scale-down reduces cluster cost during off-peak hours by removing pods faster.

Short scale-up is mandated by the DCGM Exporter.

9. What does NVIDIA Multi-Instance GPU (MIG) provide that GPU time-slicing does not?

Higher peak FLOPs per slice.

Hardware-level isolation with dedicated compute, memory bandwidth, and L2 cache per instance.

Automatic quantization to INT8.

Free network bandwidth between slices.

Section 4: Advanced Serving Topologies

Pre-Reading Check

10. Why is a cascade often an effective p99 optimization?

It runs every model on every input in parallel.

A cheap fast model handles the easy majority; only difficult inputs escalate to the expensive model, dramatically lowering average compute per request.

It eliminates the GPU entirely.

Cascades always have higher accuracy than a single model.

11. What two breakthrough techniques distinguish vLLM as an LLM serving engine?

Static batching plus FP32 weights.

PagedAttention (paged KV-cache) and continuous batching (per-token-step scheduling).

Single-GPU dispatch and OS process isolation.

DAG ensembles and model warmup hooks.

12. Which engine is the best choice when you need maximum throughput across many concurrent LLM users with minimal vendor lock-in?

Hugging Face TGI

Triton + TensorRT-LLM (compiled engines)

vLLM

Stock PyTorch eager mode

One-model-per-endpoint is increasingly rare. Modern topologies chain models, route through stages, and run specialized engines for specialized workloads — especially LLMs.

Ensembles, Cascades, Sidecars

Ensembles combine predictions (average, vote, stack). Cascades chain models sequentially: a 2 ms keyword filter handles obvious cases, a 15 ms CNN handles ambiguous images, only the hardest cases escalate to a 100 ms multimodal foundation model — slashing average compute without losing accuracy on hard inputs.
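The escalation logic and its cost can be sketched as follows. The stage latencies are the ones quoted above; the stand-in predictors and the traffic fractions (30% reaching the CNN, 5% reaching the foundation model) are hypothetical values chosen for illustration.

```python
def run_cascade(x, stages):
    """stages: list of (name, cost_ms, predict_fn); predict_fn returns
    (label, is_confident). Stops at the first confident stage."""
    total_ms = 0
    for name, cost_ms, predict in stages:
        total_ms += cost_ms               # escalated requests pay every stage
        label, confident = predict(x)
        if confident:
            break
    return label, name, total_ms


def expected_latency_ms(costs_ms, reach_fractions):
    """Average cost when reach_fractions[i] of traffic reaches stage i."""
    return sum(c * f for c, f in zip(costs_ms, reach_fractions))


# Stand-in predictors; real stages would call actual models.
stages = [
    ("keyword_filter", 2, lambda x: ("spam", "spam" in x)),
    ("cnn", 15, lambda x: ("ok", len(x) < 10)),
    ("foundation_model", 100, lambda x: ("needs_review", True)),
]

for text in ("spam offer", "cat", "a long ambiguous input"):
    print(text, "->", run_cascade(text, stages))
print("expected ms:", expected_latency_ms([2, 15, 100], [1.0, 0.3, 0.05]))
```

Under these assumed fractions the average request costs about 11.5 ms, while a request that escalates all the way pays 117 ms, so the cascade lowers average compute but the hardest inputs still define the tail.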

Triton ensembles are first-class: a config.pbtxt DAG chains tokenizer (Python backend) → BERT encoder (TensorRT) → classification head (ONNX Runtime) → label mapper (Python backend). All four run inside Triton with no network hops between stages.

Embedding sidecars in ranking/search systems produce embeddings for new entities; cached results in Redis or a vector store mean the hot path rarely runs the sidecar, but when it does, latency is predictable.

Figure 12.5: Triton Ensemble Pipeline

LLM Serving: vLLM, TGI, Triton + TensorRT-LLM

LLM serving is different: generation is autoregressive, sequence lengths vary wildly, and the KV cache dominates GPU memory. vLLM introduced two breakthroughs:

PagedAttention: manages the KV cache the way an operating system pages virtual memory, using fixed-size pages that are reusable across requests, with finished sequences freeing their pages. This reduces memory fragmentation by 19–27% versus contiguous layouts.
Continuous batching: new requests join the active batch at every decoding step; completed sequences free resources immediately. Keeps GPU utilization at 85–92% even under heterogeneous load.

The result: on LLaMA-2-7B at 100 concurrent requests, vLLM hits ~15,243 tok/s vs TGI's ~4,156 tok/s — roughly 3.7× higher throughput, widening to ~24× at extreme concurrency.
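The paging idea behind PagedAttention can be sketched as a small page allocator. This is a conceptual illustration only, not vLLM's implementation: it tracks which pages each sequence owns and stores no tensors. The class name, page count, and page size are hypothetical.

```python
class PagedKVCache:
    """Tracks page ownership per sequence; pages return to the pool on finish."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size            # tokens per page
        self.free_pages = list(range(num_pages))
        self.page_tables = {}                 # seq_id -> list of page ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.page_tables.get(seq_id, [])
        if length == len(table) * self.page_size:   # all owned pages are full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table = table + [self.free_pages.pop()]
        self.page_tables[seq_id] = table
        self.lengths[seq_id] = length + 1

    def finish(self, seq_id):
        # A completed sequence frees its pages immediately for other requests.
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(17):
    cache.append_token("req-a")   # 17 tokens need two 16-token pages
cache.append_token("req-b")       # a new request joins mid-flight
print("free pages:", len(cache.free_pages))
cache.finish("req-a")
print("free pages after req-a finishes:", len(cache.free_pages))
```

Because pages are allocated one at a time as tokens arrive, a sequence never reserves space for its maximum possible length, and a request joining at any decoding step only needs one free page to start.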

TGI wins on time-to-first-token (TTFT), which is 1.3–2× lower than vLLM's at low concurrency, and on Hugging Face ecosystem integration (safetensors, HF Hub, auth, observability). Triton + TensorRT-LLM compiles weights and graphs into a hardware-specific engine, yielding slightly higher peak throughput than vLLM on H100, at the cost of tens of minutes of cold-start compile time versus vLLM's roughly one minute.

Engine	Best For	Key Technique	Throughput (LLaMA-2-7B @100)	TTFT	Cold Start
vLLM	Multi-tenant high-concurrency, batch gen	PagedAttention + continuous batching	~15,243 tok/s	Baseline	~1 min
TGI	HF-ecosystem chatbots, low concurrency	Dynamic batching + safetensors	~4,156 tok/s (~3.7× lower)	1.3–2× lower than vLLM	~1 min
Triton + TensorRT-LLM	Enterprise fleets, fixed long-lived models	Compiled engines + ensembles	Slightly higher peak than vLLM	Variable	Tens of minutes

Selection rule: vLLM for maximum throughput and minimal lock-in; TGI for low TTFT and HF integration; Triton + TensorRT-LLM for a unified NVIDIA platform with dozens of model types under one control plane.

Post-Reading Check

10. Why is a cascade often an effective p99 optimization?

It runs every model on every input in parallel.

A cheap fast model handles the easy majority; only difficult inputs escalate to the expensive model, dramatically lowering average compute per request.

It eliminates the GPU entirely.

Cascades always have higher accuracy than a single model.

11. What two breakthrough techniques distinguish vLLM as an LLM serving engine?

Static batching plus FP32 weights.

PagedAttention (paged KV-cache) and continuous batching (per-token-step scheduling).

Single-GPU dispatch and OS process isolation.

DAG ensembles and model warmup hooks.

12. Which engine is the best choice when you need maximum throughput across many concurrent LLM users with minimal vendor lock-in?

Hugging Face TGI

Triton + TensorRT-LLM (compiled engines)

vLLM

Stock PyTorch eager mode
