Deploy models using KServe, Triton, and vLLM on Kubernetes
Configure autoscaling for inference workloads based on GPU utilization and request metrics
Implement canary deployments and A/B testing for model rollouts
Optimize inference latency and throughput with batching and model optimization techniques
Pre-Study Assessment
Answer these questions before reading the material to gauge your current understanding.
Pre-Quiz
1. What is the primary purpose of KServe's InferenceService Custom Resource?
A) To train machine learning models on GPU clusters
B) To provide a single declarative abstraction for deploying, autoscaling, and routing to model endpoints
C) To monitor GPU utilization across the cluster
D) To convert models from PyTorch format to ONNX
2. What problem does vLLM's PagedAttention solve?
A) It compresses model weights to reduce download size
B) It eliminates the need for GPU memory entirely by using CPU offloading
C) It manages KV-cache memory in non-contiguous pages, reducing fragmentation and enabling more concurrent requests
D) It distributes model layers across multiple nodes automatically
3. In a canary deployment for model serving, what happens to user traffic?
A) All traffic is switched to the new model version immediately
B) A small percentage of traffic is routed to the new version while the majority stays on the stable version
C) Traffic is paused entirely until the new model is validated
D) Only admin users can access the new model version
4. Why is configuring /dev/shm critical for LLM pods on Kubernetes?
A) It controls the maximum number of GPUs a pod can use
B) Pods default to 64MB shared memory, and AI frameworks need much more for inter-process communication
C) It determines the network bandwidth allocated to the pod
D) It stores the model weights permanently for faster restarts
5. What is the key advantage of KEDA over standard HPA for LLM workloads?
A) KEDA supports more programming languages
B) KEDA can scale on LLM-specific metrics like pending queue depth and token throughput
C) KEDA is faster at starting new pods
D) KEDA requires fewer cluster resources to operate
6. What does quantization (e.g., GPTQ, AWQ) achieve for LLM serving?
A) It increases model accuracy by adding more parameters
B) It reduces weight precision to lower memory footprint and increase throughput with minimal accuracy loss
C) It encrypts model weights for security
D) It converts models to run on CPUs only
7. What is the role of the Transformer component in a KServe InferenceService?
A) It trains the model on new data
B) It handles pre-processing and post-processing of requests before and after the predictor
C) It converts the model from one framework to another
D) It provides GPU scheduling for the cluster
8. How does continuous batching differ from static batching in LLM serving?
A) Continuous batching processes one request at a time for better accuracy
B) Continuous batching inserts new requests as previous ones complete, keeping GPU utilization high
C) Continuous batching requires more GPU memory than static batching
D) Continuous batching is only available for CPU-based inference
9. What is Model Mesh in the KServe ecosystem?
A) A network mesh that encrypts traffic between model servers
B) A multi-model serving capability that co-locates many models in shared pods to reduce overhead
C) A visualization tool for model architectures
D) A storage backend for model weights
10. When should you use scale-to-zero for model serving endpoints?
A) Always, to save maximum cost
B) For latency-sensitive LLM endpoints that need instant responses
C) For smaller models with low traffic where cold start time is acceptable
D) Only for batch inference workloads
11. Which model server is best suited for heterogeneous multi-model deployments with mixed frameworks?
A) vLLM
B) TGI (Text Generation Inference)
C) Triton Inference Server
D) TorchServe
12. What distinguishes A/B testing from canary deployment in model rollouts?
A) A/B testing is faster to set up
B) A/B testing routes traffic based on user cohort or headers, while canary uses random percentage splitting
C) Canary deployments require more replicas
D) A/B testing does not support automatic rollback
Section 1: Model Serving Architectures
Online vs. Batch Inference
Not all inference looks the same. Two fundamental patterns govern how predictions are generated:
| Pattern | Trigger | Latency Target | Typical Use Case |
|---|---|---|---|
| Online (real-time) | Single HTTP/gRPC request | < 200ms | Chatbots, recommendation APIs, fraud detection |
| Batch | Scheduled job or queue | Minutes to hours | Offline scoring, ETL enrichment, bulk predictions |
Online inference prioritizes low latency: a user waits for each response, so the serving system must return results quickly. Batch inference prioritizes throughput: thousands of records are processed together. On Kubernetes, online serving uses Deployment or InferenceService resources fronted by a Service, while batch inference is better expressed as a Job or CronJob.
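The batch pattern can be expressed directly as a CronJob. The following is a minimal sketch; the image name, schedule, and S3 paths are illustrative assumptions, not part of any standard:

```yaml
# Nightly batch scoring as a Kubernetes CronJob (names/paths are placeholders).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-scoring
spec:
  schedule: "0 2 * * *"          # run at 02:00 every day
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: scorer
              image: my-registry/batch-scorer:latest   # assumed custom image
              args: ["--input", "s3://data/pending", "--output", "s3://data/scored"]
```

Because a Job tracks completion and retries, failed batch runs surface as failed Jobs rather than silently degraded online latency.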
Model Server Options
Choosing the right model server is one of the most consequential decisions in your serving stack:
| Server | Best For | Key Feature | Protocols |
|---|---|---|---|
| TorchServe | PyTorch models, custom handlers | Native PyTorch integration | REST, gRPC |
| Triton | Multi-framework, mixed workloads | Multi-model concurrency, dynamic batching | REST, gRPC, HTTP/2 |
| vLLM | LLM text generation | PagedAttention, continuous batching | OpenAI-compatible REST |
| TGI | LLM serving (HuggingFace) | Flash Attention, tensor parallelism | REST |
Sidecar and Standalone Topologies
Standalone: The model server runs as the primary container in a pod. This is simpler and appropriate for most cases.
Sidecar: A lightweight proxy or transformer runs alongside the model server container. KServe uses this pattern to inject its storage initializer as an init container that downloads model weights, and optionally attaches a transformer container for pre/post-processing.
Key Takeaway
Choose online or batch inference based on latency requirements
Pick your model server based on model framework and workload mix
Triton excels at heterogeneous multi-model deployments; vLLM is the default for high-throughput LLM serving
Section 2: KServe (formerly KFServing)
KServe is the leading Kubernetes-native model serving platform, providing a single InferenceService CRD that abstracts away autoscaling, networking, health checking, and runtime configuration. It began as KFServing within the Kubeflow project and is now developed as a standalone open-source project.
The InferenceService Custom Resource
A minimal InferenceService definition specifies a predictor with a model format and storage URI. When applied, KServe launches a storage initializer init container, starts the serving runtime, configures ingress routing, and optionally enables autoscaling.
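As a sketch, such a minimal definition might look like the following; the bucket path mirrors the sklearn sample from the KServe documentation, and the resource name is illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris            # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

Applying this single resource is enough for KServe to fetch the model, select a serving runtime, and expose a routable endpoint.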
A full InferenceService can include up to three logical components:
| Component | Purpose | Example |
|---|---|---|
| Predictor | Core inference -- runs the model | vLLM server, Triton, sklearn-server |
| Transformer | Pre/post-processing pipeline | Tokenization, feature engineering |
| Explainer | Model explainability | SHAP values, LIME explanations |
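A predictor paired with a custom transformer might be sketched as follows; the storage URI and transformer image are placeholder assumptions:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bert-classifier         # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/bert"                   # illustrative path
  transformer:
    containers:
      - name: kserve-container
        image: my-registry/bert-tokenizer:latest       # assumed custom image
```

Requests hit the transformer first (e.g., for tokenization), then the predictor, then flow back through the transformer for post-processing.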
KServe Request Flow
Model Storage and Model Mesh
The storage initializer is an init container that KServe injects into every predictor pod. It supports multiple storage backends: GCS (gs://), S3 (s3://), Azure Blob (azureblob://), HTTP/HTTPS, and PersistentVolumeClaims (pvc://).
Model Mesh is KServe's multi-model serving capability. Instead of one pod per model, Model Mesh co-locates many models within a shared pool of serving pods, dramatically reducing per-model overhead when maintaining hundreds of smaller models. It maintains an in-memory cache and loads/unloads models on demand.
Key Takeaway
KServe's InferenceService CRD is the central abstraction for production ML serving on Kubernetes
The predictor/transformer/explainer decomposition encourages clean separation of concerns
Model Mesh enables cost-efficient multi-model deployments
Section 3: LLM Serving on Kubernetes
Large language models present unique serving challenges. A 70B parameter model in FP16 requires approximately 140GB of GPU VRAM. The autoregressive generation pattern creates inherently sequential workloads that are hard to batch naively.
vLLM and PagedAttention
vLLM is the dominant open-source LLM serving engine. Its defining innovation is PagedAttention, which manages the KV cache using virtual memory techniques borrowed from OS page tables. Instead of reserving large contiguous VRAM blocks, PagedAttention stores the KV cache in non-contiguous "pages," eliminating internal fragmentation; the vLLM paper reports up to 24x higher throughput than naive HuggingFace Transformers serving.
A critical Kubernetes-specific detail: pods default to only 64MB of shared memory (/dev/shm). Without a memory-backed emptyDir (medium: Memory) mounted at /dev/shm, frameworks that use shared memory for inter-process communication (e.g., NCCL during tensor-parallel initialization) can crash while loading large models.
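The relevant pod spec fragment is a short sketch; the image tag and sizeLimit below are assumptions to tune for your model:

```yaml
# Enlarging /dev/shm for a vLLM pod via a memory-backed emptyDir.
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest    # assumed image tag
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory                  # backed by RAM, not node disk
        sizeLimit: 16Gi                 # illustrative; size to your workload
```

Note that memory-backed emptyDir usage counts against the container's memory limit, so budget for it in the pod's resource requests.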
HuggingFace's TGI is an alternative LLM server supporting Flash Attention 2, continuous batching, and tensor parallelism. It integrates deeply with the HuggingFace Hub ecosystem.
Quantization: GPTQ, AWQ, GGUF
Quantization converts model weights from high-precision formats (FP16) to lower-precision formats (INT4), trading a small amount of accuracy for significant memory and speed gains:
| Format | Precision | Memory Reduction | Supported By |
|---|---|---|---|
| GPTQ | INT4 | ~4x vs FP16 | vLLM, TGI, Triton |
| AWQ | INT4 | ~4x vs FP16 | vLLM, TGI |
| GGUF | 2-8 bit mixed | Up to 8x vs FP16 | llama.cpp, Ollama |
A 13B parameter model in FP16 requires ~26GB VRAM. The GPTQ INT4 version requires ~7GB -- fitting on a single A10G or consumer RTX 3090.
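The arithmetic behind these figures is straightforward; a small weights-only helper (illustrative, not a library function) makes it explicit:

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """GPU memory needed just for model weights, in decimal GB.

    Weights only -- activations, KV cache, and framework overhead
    come on top, so treat the result as a lower bound.
    """
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_vram_gb(13, 16))  # 26.0 -> the ~26GB FP16 figure above
print(weight_vram_gb(13, 4))   # 6.5  -> close to the ~7GB GPTQ figure (runtime overhead adds the rest)
print(weight_vram_gb(70, 16))  # 140.0 -> the 70B FP16 example from earlier in this section
```

This is why INT4 quantization moves a 13B model from A100 territory down to a single A10G or RTX 3090.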
Continuous Batching and Speculative Decoding
Continuous batching keeps the GPU pipeline full at all times. As soon as a sequence completes, a new request from the queue immediately fills its slot in the next forward pass, keeping GPU utilization high and minimizing time-to-first-token.
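The scheduling idea can be shown with a toy simulation (a sketch, not how any real engine is implemented): each request is a count of remaining decode steps, and freed batch slots are refilled from the queue immediately.

```python
from collections import deque

def continuous_batching(jobs, max_batch):
    """Toy scheduler: refill batch slots from the queue as soon as a
    sequence finishes, instead of waiting for the whole batch (static)."""
    queue = deque(jobs)            # each job = decode steps remaining
    running, completed_at = {}, {}
    step, next_id = 0, 0
    while queue or running:
        # the "continuous" part: fill any free slot right away
        while queue and len(running) < max_batch:
            running[next_id] = queue.popleft()
            next_id += 1
        step += 1                  # one forward pass over the batch
        for jid in list(running):
            running[jid] -= 1
            if running[jid] == 0:
                completed_at[jid] = step
                del running[jid]
    return step, completed_at

total, done = continuous_batching([5, 1, 1, 1], max_batch=2)
print(total)  # 5 -- static batches [5,1] then [1,1] would need 5 + 1 = 6 steps
```

The short requests slot in beside the long one instead of stalling behind it, which is exactly why GPU utilization stays high under mixed-length traffic.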
Speculative decoding uses a small draft model to propose several tokens, which the large target model verifies in parallel. When the draft model is right, multiple tokens are produced per forward pass.
Key Takeaway
vLLM's PagedAttention and continuous batching are the two most impactful optimizations for LLM serving throughput
Quantization (AWQ or GPTQ) lets you fit models whose FP16 footprint is 2-4x larger than your GPU's VRAM
Always configure /dev/shm for LLM pods on Kubernetes
Section 4: Autoscaling and Traffic Management
HPA with Custom Metrics
The Horizontal Pod Autoscaler (HPA) scales workloads based on observed metrics. For inference services, useful signals include CPU utilization, GPU utilization (DCGM metrics), request latency (p99), request concurrency, and pending queue depth.
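As a sketch, an autoscaling/v2 HPA targeting GPU utilization might look like this, assuming the DCGM exporter's DCGM_FI_DEV_GPU_UTIL metric is exposed through the Prometheus Adapter; the deployment name and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server            # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # requires DCGM exporter + Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "70"           # target ~70% GPU utilization per pod
```

Without the Prometheus Adapter (or an equivalent custom-metrics bridge), the HPA can only see CPU and memory, which correlate poorly with GPU-bound inference load.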
| Autoscaler | Trigger | Best For | Scale-to-Zero |
|---|---|---|---|
| KPA (Knative) | Request concurrency | Serverless, bursty traffic | Yes |
| HPA | CPU, memory, custom Prometheus metrics | Stable, predictable traffic | No (min 1) |
| KEDA | Any event source | LLM serving, event-driven | Yes |
HPA Autoscaling in Response to Load
Scale-to-Zero with Knative or KEDA
Scale-to-zero completely removes all pods when no traffic is present, and scales back up on the first request. Knative Pod Autoscaler (KPA) is KServe's default autoscaler in serverless mode. KEDA extends HPA with support for custom event sources and LLM-specific metrics.
The tradeoff with scale-to-zero for LLMs is cold start time. A 7B parameter model takes 30-90 seconds to load, making scale-to-zero unsuitable for latency-sensitive endpoints. For LLMs, minReplicas: 1 is usually the right choice.
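A KEDA ScaledObject scaling on pending queue depth could be sketched as follows. The vllm:num_requests_waiting query assumes vLLM's Prometheus metrics are scraped into the cluster's Prometheus; names, addresses, and thresholds are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue-scaler
spec:
  scaleTargetRef:
    name: vllm-server            # assumed Deployment name
  minReplicaCount: 1             # avoid cold starts, per the tradeoff above
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # assumed endpoint
        query: sum(vllm:num_requests_waiting)              # pending-queue gauge
        threshold: "10"          # target pending requests per replica
```

Scaling on queue depth reacts to saturation directly, whereas GPU utilization can read near 100% long before requests actually start queueing.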
Canary Rollouts and Traffic Splitting
Deploying a new model version to 100% of traffic immediately is high risk. Canary deployments mitigate this by routing a small percentage of traffic to the new version while the majority continues on the stable version.
```yaml
spec:
  predictor:
    canaryTrafficPercent: 20   # 20% to new version
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://my-bucket/models/classifier/v2"
```
For automated promotion, Flagger integrates with KServe to progressively shift traffic based on Prometheus metrics, starting with a small canary percentage and automatically promoting or rolling back.
Canary Deployment: Gradual Traffic Rollout
A/B Testing and Load Balancing
A/B testing routes traffic based on user identity or request metadata, enabling controlled experiments. KServe, backed by Istio, supports header-based routing (e.g., X-Experiment-Group: B).
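Under the hood, header-based routing of this kind can be expressed as an Istio VirtualService; the host and destination service names below are illustrative assumptions:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: classifier-ab
spec:
  hosts:
    - classifier.example.com                  # assumed external host
  http:
    - match:
        - headers:
            x-experiment-group:
              exact: "B"                      # cohort B gets the new model
      route:
        - destination:
            host: classifier-v2-predictor     # assumed Service name
    - route:                                  # everyone else: stable version
        - destination:
            host: classifier-v1-predictor
```

Because the split is deterministic per header rather than random, each user sees a consistent model version for the duration of the experiment.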
| Technique | Traffic Split Basis | Rollback | End State |
|---|---|---|---|
| Canary | Random percentage | Automatic (Flagger) or manual | One version wins |
| A/B Test | User cohort / header | Manual analysis | Winner by experiment |
| Blue/Green | All-or-nothing switch | Flip traffic weight | Immediate cutover |
For load balancing, LLM inference benefits from prefix cache affinity (routing users to replicas that have cached their system prompt), request length awareness, and least-connection routing rather than naive round-robin.
Key Takeaway
Combine KPA or KEDA for responsive autoscaling with KServe's native traffic splitting for safe model rollouts
Scale-to-zero saves GPU costs for low-traffic models but requires cold start tolerance
For LLMs, KEDA's ability to scale on queue depth and token throughput makes it the most effective autoscaling option
Post-Study Assessment
Now that you have studied the material, answer the same questions again to measure your learning.
Post-Quiz
1. What is the primary purpose of KServe's InferenceService Custom Resource?
A) To train machine learning models on GPU clusters
B) To provide a single declarative abstraction for deploying, autoscaling, and routing to model endpoints
C) To monitor GPU utilization across the cluster
D) To convert models from PyTorch format to ONNX
2. What problem does vLLM's PagedAttention solve?
A) It compresses model weights to reduce download size
B) It eliminates the need for GPU memory entirely by using CPU offloading
C) It manages KV-cache memory in non-contiguous pages, reducing fragmentation and enabling more concurrent requests
D) It distributes model layers across multiple nodes automatically
3. In a canary deployment for model serving, what happens to user traffic?
A) All traffic is switched to the new model version immediately
B) A small percentage of traffic is routed to the new version while the majority stays on the stable version
C) Traffic is paused entirely until the new model is validated
D) Only admin users can access the new model version
4. Why is configuring /dev/shm critical for LLM pods on Kubernetes?
A) It controls the maximum number of GPUs a pod can use
B) Pods default to 64MB shared memory, and AI frameworks need much more for inter-process communication
C) It determines the network bandwidth allocated to the pod
D) It stores the model weights permanently for faster restarts
5. What is the key advantage of KEDA over standard HPA for LLM workloads?
A) KEDA supports more programming languages
B) KEDA can scale on LLM-specific metrics like pending queue depth and token throughput
C) KEDA is faster at starting new pods
D) KEDA requires fewer cluster resources to operate
6. What does quantization (e.g., GPTQ, AWQ) achieve for LLM serving?
A) It increases model accuracy by adding more parameters
B) It reduces weight precision to lower memory footprint and increase throughput with minimal accuracy loss
C) It encrypts model weights for security
D) It converts models to run on CPUs only
7. What is the role of the Transformer component in a KServe InferenceService?
A) It trains the model on new data
B) It handles pre-processing and post-processing of requests before and after the predictor
C) It converts the model from one framework to another
D) It provides GPU scheduling for the cluster
8. How does continuous batching differ from static batching in LLM serving?
A) Continuous batching processes one request at a time for better accuracy
B) Continuous batching inserts new requests as previous ones complete, keeping GPU utilization high
C) Continuous batching requires more GPU memory than static batching
D) Continuous batching is only available for CPU-based inference
9. What is Model Mesh in the KServe ecosystem?
A) A network mesh that encrypts traffic between model servers
B) A multi-model serving capability that co-locates many models in shared pods to reduce overhead
C) A visualization tool for model architectures
D) A storage backend for model weights
10. When should you use scale-to-zero for model serving endpoints?
A) Always, to save maximum cost
B) For latency-sensitive LLM endpoints that need instant responses
C) For smaller models with low traffic where cold start time is acceptable
D) Only for batch inference workloads
11. Which model server is best suited for heterogeneous multi-model deployments with mixed frameworks?
A) vLLM
B) TGI (Text Generation Inference)
C) Triton Inference Server
D) TorchServe
12. What distinguishes A/B testing from canary deployment in model rollouts?
A) A/B testing is faster to set up
B) A/B testing routes traffic based on user cohort or headers, while canary uses random percentage splitting
C) Canary deployments require more replicas
D) A/B testing does not support automatic rollback