Chapter 5: Model Serving and Inference

Learning Objectives

By the end of this chapter, you should be able to:

- Distinguish online from batch inference and select a model server for your framework mix
- Deploy models declaratively with KServe's InferenceService, including transformer and explainer components
- Serve large language models with vLLM or TGI, using quantization and continuous batching to control cost
- Configure autoscaling with HPA, KPA, or KEDA, and manage rollouts with canary, A/B, and blue/green strategies

Pre-Study Assessment

Answer these questions before reading the material to gauge your current understanding.

Pre-Quiz

1. What is the primary purpose of KServe's InferenceService Custom Resource?

A) To train machine learning models on GPU clusters
B) To provide a single declarative abstraction for deploying, autoscaling, and routing to model endpoints
C) To monitor GPU utilization across the cluster
D) To convert models from PyTorch format to ONNX

2. What problem does vLLM's PagedAttention solve?

A) It compresses model weights to reduce download size
B) It eliminates the need for GPU memory entirely by using CPU offloading
C) It manages KV-cache memory in non-contiguous pages, reducing fragmentation and enabling more concurrent requests
D) It distributes model layers across multiple nodes automatically

3. In a canary deployment for model serving, what happens to user traffic?

A) All traffic is switched to the new model version immediately
B) A small percentage of traffic is routed to the new version while the majority stays on the stable version
C) Traffic is paused entirely until the new model is validated
D) Only admin users can access the new model version

4. Why is configuring /dev/shm critical for LLM pods on Kubernetes?

A) It controls the maximum number of GPUs a pod can use
B) Pods default to 64MB shared memory, and AI frameworks need much more for inter-process communication
C) It determines the network bandwidth allocated to the pod
D) It stores the model weights permanently for faster restarts

5. What is the key advantage of KEDA over standard HPA for LLM workloads?

A) KEDA supports more programming languages
B) KEDA can scale on LLM-specific metrics like pending queue depth and token throughput
C) KEDA is faster at starting new pods
D) KEDA requires fewer cluster resources to operate

6. What does quantization (e.g., GPTQ, AWQ) achieve for LLM serving?

A) It increases model accuracy by adding more parameters
B) It reduces weight precision to lower memory footprint and increase throughput with minimal accuracy loss
C) It encrypts model weights for security
D) It converts models to run on CPUs only

7. What is the role of the Transformer component in a KServe InferenceService?

A) It trains the model on new data
B) It handles pre-processing and post-processing of requests before and after the predictor
C) It converts the model from one framework to another
D) It provides GPU scheduling for the cluster

8. How does continuous batching differ from static batching in LLM serving?

A) Continuous batching processes one request at a time for better accuracy
B) Continuous batching inserts new requests as previous ones complete, keeping GPU utilization high
C) Continuous batching requires more GPU memory than static batching
D) Continuous batching is only available for CPU-based inference

9. What is Model Mesh in the KServe ecosystem?

A) A network mesh that encrypts traffic between model servers
B) A multi-model serving capability that co-locates many models in shared pods to reduce overhead
C) A visualization tool for model architectures
D) A storage backend for model weights

10. When should you use scale-to-zero for model serving endpoints?

A) Always, to save maximum cost
B) For latency-sensitive LLM endpoints that need instant responses
C) For smaller models with low traffic where cold start time is acceptable
D) Only for batch inference workloads

11. Which model server is best suited for heterogeneous multi-model deployments with mixed frameworks?

A) vLLM
B) TGI (Text Generation Inference)
C) Triton Inference Server
D) TorchServe

12. What distinguishes A/B testing from canary deployment in model rollouts?

A) A/B testing is faster to set up
B) A/B testing routes traffic based on user cohort or headers, while canary uses random percentage splitting
C) Canary deployments require more replicas
D) A/B testing does not support automatic rollback

Section 1: Model Serving Architectures

Online vs. Batch Inference

Not all inference looks the same. Two fundamental patterns govern how predictions are generated:

Pattern | Trigger | Latency Target | Typical Use Case
Online (real-time) | Single HTTP/gRPC request | < 200ms | Chatbots, recommendation APIs, fraud detection
Batch | Scheduled job or queue | Minutes to hours | Offline scoring, ETL enrichment, bulk predictions

Online inference prioritizes low latency: a user waits for each response, so the serving system must return results quickly. Batch inference prioritizes throughput: thousands of records are processed together. On Kubernetes, online serving uses Deployment or InferenceService resources fronted by a Service, while batch inference is better expressed as a Job or CronJob.
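The batch pattern maps naturally onto a CronJob. A minimal sketch, where the scorer image and the S3 paths are hypothetical placeholders:

```yaml
# Illustrative nightly batch-scoring job; image and paths are examples only.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-scoring
spec:
  schedule: "0 2 * * *"            # run at 02:00 every night
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: scorer
              image: registry.example.com/batch-scorer:latest
              args:
                - "--input=s3://my-bucket/records/"
                - "--output=s3://my-bucket/predictions/"
```

Because a Job tolerates retries and has no latency target, it can run on cheaper preemptible or spot nodes that would be risky for online serving.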

Model Server Options

Choosing the right model server is one of the most consequential decisions in your serving stack:

Server | Best For | Key Feature | Protocols
TorchServe | PyTorch models, custom handlers | Native PyTorch integration | REST, gRPC
Triton | Multi-framework, mixed workloads | Multi-model concurrency, dynamic batching | REST, gRPC, HTTP/2
vLLM | LLM text generation | PagedAttention, continuous batching | OpenAI-compatible REST
TGI | LLM serving (HuggingFace) | Flash Attention, tensor parallelism | REST

Sidecar and Standalone Topologies

Standalone: The model server runs as the primary container in a pod. This is simpler and appropriate for most cases.

Sidecar: A lightweight proxy or transformer runs alongside the model server container. KServe uses this pattern to inject its storage initializer as an init container that downloads model weights, and optionally attaches a transformer container for pre/post-processing.

Key Takeaway

Match the serving pattern to the workload: online inference runs behind a Service for low latency, batch inference runs as a Job or CronJob for throughput, and the model server should be chosen for your framework mix (Triton for heterogeneous fleets, vLLM or TGI for LLMs, TorchServe for PyTorch).

Section 2: KServe (formerly KFServing)

KServe is the leading Kubernetes-native model serving platform, providing a single InferenceService CRD that abstracts away autoscaling, networking, health checking, and runtime configuration. It graduated from the Kubeflow project and is now a standalone CNCF project.

The InferenceService Custom Resource

A minimal InferenceService definition specifies a predictor with a model format and storage URI. When applied, KServe launches a storage initializer init container, starts the serving runtime, configures ingress routing, and optionally enables autoscaling.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"

Predictor, Transformer, and Explainer Components

A full InferenceService can include up to three logical components:

Component | Purpose | Example
Predictor | Core inference -- runs the model | vLLM server, Triton, sklearn-server
Transformer | Pre/post-processing pipeline | Tokenization, feature engineering
Explainer | Model explainability | SHAP values, LIME explanations
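An InferenceService with an attached transformer might be sketched as follows. The service name and container image are hypothetical; KServe routes each request through the transformer before and after the predictor:

```yaml
# Illustrative transformer + predictor spec; the image name is hypothetical.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "text-classifier"
spec:
  transformer:
    containers:
      - name: kserve-container
        image: registry.example.com/tokenizer-transformer:latest
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/classifier"
```

Keeping tokenization in the transformer lets the predictor stay a stock serving runtime, and the two components can be scaled and updated independently.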

KServe Request Flow

[Diagram: KServe request flow. (1) The client sends an HTTP/gRPC request to the InferenceService. (2) The Transformer pre-processes the input. (3-4) The Predictor runs model inference. (5) The Transformer post-processes and returns the response. (6-7) Optionally, the Explainer produces a SHAP/LIME explanation alongside the result.]

Model Storage and Model Mesh

The storage initializer is an init container that KServe injects into every predictor pod. It supports multiple storage backends: GCS (gs://), S3 (s3://), Azure Blob (azureblob://), HTTP/HTTPS, and PersistentVolumeClaims (pvc://).

Model Mesh is KServe's multi-model serving capability. Instead of one pod per model, Model Mesh co-locates many models within a shared pool of serving pods, dramatically reducing per-model overhead when maintaining hundreds of smaller models. It maintains an in-memory cache and loads/unloads models on demand.

Key Takeaway

KServe's InferenceService CRD is a single declarative resource covering predictor, transformer, and explainer components, while Model Mesh extends it to high-density multi-model serving by packing many models into a shared pool of serving pods.

Section 3: LLM Serving on Kubernetes

Large language models present unique serving challenges. A 70B parameter model in FP16 requires approximately 140GB of GPU VRAM. The autoregressive generation pattern creates inherently sequential workloads that are hard to batch naively.

vLLM and PagedAttention

vLLM is the dominant open-source LLM serving engine. Its defining innovation is PagedAttention, which manages the KV cache using virtual-memory techniques borrowed from OS page tables. Instead of reserving large contiguous VRAM blocks, PagedAttention stores the KV cache in non-contiguous "pages," nearly eliminating fragmentation and fitting many more concurrent requests into the same VRAM -- the vLLM project reports up to 24x higher throughput than naive HuggingFace Transformers serving.

A critical Kubernetes-specific detail: pods default to only 64MB of shared memory (/dev/shm). Frameworks that rely on shared-memory IPC -- notably NCCL for tensor-parallel inference -- need far more, and large models can crash during initialization unless /dev/shm is backed by an emptyDir with medium: Memory.

volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 8Gi    # critical for LLM IPC
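Putting the pieces together, a vLLM Deployment might look like this sketch. The model name and resource sizes are illustrative; the --model flag and OpenAI-compatible port 8000 are standard vLLM options:

```yaml
# Illustrative vLLM Deployment; model and sizes are examples only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm }
  template:
    metadata:
      labels: { app: vllm }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.2"]
          ports:
            - containerPort: 8000     # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm     # enlarged shared memory for IPC
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
```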

Text Generation Inference (TGI)

HuggingFace's TGI is an alternative LLM server supporting Flash Attention 2, continuous batching, and tensor parallelism. It integrates deeply with the HuggingFace Hub ecosystem.
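A TGI container can be sketched similarly; this fragment slots into a pod spec, the model ID is an example, and --model-id with the container's default port 80 follow TGI's documented conventions:

```yaml
# Illustrative TGI container spec; the model ID is an example.
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args: ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2"]
    ports:
      - containerPort: 80     # TGI's default listen port
    resources:
      limits:
        nvidia.com/gpu: 1
```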

Quantization: GPTQ, AWQ, GGUF

Quantization converts model weights from high-precision formats (FP16) to lower-precision formats (INT4), trading a small amount of accuracy for significant memory and speed gains:

Format | Precision | Memory Reduction | Supported By
GPTQ | INT4 | ~4x vs FP16 | vLLM, TGI, Triton
AWQ | INT4 | ~4x vs FP16 | vLLM, TGI
GGUF | 2-8 bit mixed | Up to 8x | llama.cpp, Ollama

A 13B parameter model in FP16 requires ~26GB VRAM. The GPTQ INT4 version requires ~7GB -- fitting on a single A10G or consumer RTX 3090.
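Serving a quantized checkpoint usually just means pointing the server at the quantized weights; with vLLM, the --quantization flag selects the scheme. A sketch, where the checkpoint name is an example of a community-published AWQ model:

```yaml
# Illustrative container args for serving an AWQ-quantized model with vLLM.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "TheBloke/Llama-2-13B-AWQ"    # example quantized checkpoint
      - "--quantization"
      - "awq"
```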

Continuous Batching and Speculative Decoding

Continuous batching keeps the GPU pipeline full at all times. As soon as a sequence completes, a new request from the queue immediately fills its slot in the next forward pass, keeping GPU utilization high and minimizing time-to-first-token.

Speculative decoding uses a small draft model to propose several tokens, which the large target model verifies in parallel. When the draft model is right, multiple tokens are produced per forward pass.

Key Takeaway

LLM serving is dominated by GPU memory management: PagedAttention, quantization, and continuous batching together determine how many concurrent requests a GPU can sustain, and pods need an enlarged /dev/shm for multi-process inference.

Section 4: Autoscaling and Traffic Management

HPA with Custom Metrics

The Horizontal Pod Autoscaler (HPA) scales workloads based on observed metrics. For inference services, useful signals include CPU utilization, GPU utilization (DCGM metrics), request latency (p99), request concurrency, and pending queue depth.
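An HPA scaling on a custom metric might be sketched as follows; this assumes a Prometheus adapter exposes a hypothetical per-pod queue-depth metric named inference_queue_depth through the custom metrics API:

```yaml
# Illustrative HPA on a custom queue-depth metric; the metric name is
# hypothetical and must be exposed by an adapter (e.g., prometheus-adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"    # scale out above 10 queued requests per pod
```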

Autoscaler | Trigger | Best For | Scale-to-Zero
KPA (Knative) | Request concurrency | Serverless, bursty traffic | Yes
HPA | CPU, memory, custom Prometheus metrics | Stable, predictable traffic | No (min 1)
KEDA | Any event source | LLM serving, event-driven | Yes

HPA Autoscaling in Response to Load

[Diagram: HPA scaling through five phases against an 80% threshold. Phase 1: normal load (40%, 2 replicas). Phase 2: load rising (70%, approaching threshold). Phase 3: threshold exceeded (90%), HPA scales to 3 replicas. Phase 4: load distributed (55% across 3 replicas, stable). Phase 5: load drops (20%), HPA scales back down to 2 replicas.]

Scale-to-Zero with Knative or KEDA

Scale-to-zero completely removes all pods when no traffic is present, and scales back up on the first request. Knative Pod Autoscaler (KPA) is KServe's default autoscaler in serverless mode. KEDA extends HPA with support for custom event sources and LLM-specific metrics.

The tradeoff with scale-to-zero for LLMs is cold start time. A 7B parameter model takes 30-90 seconds to load, making scale-to-zero unsuitable for latency-sensitive endpoints. For LLMs, minReplicas: 1 is usually the right choice.
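With KEDA, scaling on an LLM-specific signal is expressed as a ScaledObject with a Prometheus trigger. In this sketch the Prometheus address is illustrative, and the query assumes vLLM's exported queue-depth metric; adjust both for your environment:

```yaml
# Illustrative KEDA ScaledObject scaling a vLLM deployment on queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1       # keep one warm replica to avoid LLM cold starts
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(vllm:num_requests_waiting)   # pending request queue depth
        threshold: "10"
```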

Canary Rollouts and Traffic Splitting

Deploying a new model version to 100% of traffic immediately is high risk. Canary deployments mitigate this by routing a small percentage of traffic to the new version while the majority continues on the stable version.

spec:
  predictor:
    canaryTrafficPercent: 20    # 20% to new version
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://my-bucket/models/classifier/v2"

For automated promotion, Flagger integrates with KServe to progressively shift traffic based on Prometheus metrics, starting with a small canary percentage and automatically promoting or rolling back.
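A Flagger Canary resource encodes this policy declaratively; the intervals and thresholds below are illustrative, and request-success-rate is one of Flagger's built-in metric checks:

```yaml
# Illustrative Flagger canary policy; thresholds are examples.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: classifier
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: classifier
  service:
    port: 80
  analysis:
    interval: 1m          # evaluate metrics every minute
    threshold: 5          # roll back after 5 failed checks
    stepWeight: 10        # shift 10% of traffic per step
    maxWeight: 50         # promote fully once 50% passes analysis
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99         # require >= 99% successful requests
        interval: 1m
```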

Canary Deployment: Gradual Traffic Rollout

[Diagram: gradual canary rollout. Step 1: v1 serves 100% of traffic while the v2 canary is deployed. Steps 2-5: as metrics stay healthy, canary traffic grows from 10% to 30%, 50%, and 80%. Step 6: v2 is promoted to 100% and becomes the new stable version.]

A/B Testing and Load Balancing

A/B testing routes traffic based on user identity or request metadata, enabling controlled experiments. KServe, backed by Istio, supports header-based routing (e.g., X-Experiment-Group: B).
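With Istio underneath, header-based A/B routing can be sketched as a VirtualService; the hostname and destination service names are illustrative (Istio expects lowercase header names in match rules):

```yaml
# Illustrative Istio VirtualService routing experiment cohort B by header.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: classifier-ab
spec:
  hosts:
    - classifier.example.com
  http:
    - match:
        - headers:
            x-experiment-group:
              exact: "B"
      route:
        - destination:
            host: classifier-v2    # experiment group gets the new model
    - route:
        - destination:
            host: classifier-v1    # everyone else stays on stable
```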

Technique | Traffic Split Basis | Rollback | End State
Canary | Random percentage | Automatic (Flagger) or manual | One version wins
A/B Test | User cohort / header | Manual analysis | Winner by experiment
Blue/Green | All-or-nothing switch | Flip traffic weight | Immediate cutover

For load balancing, LLM inference benefits from prefix cache affinity (routing users to replicas that have cached their system prompt), request length awareness, and least-connection routing rather than naive round-robin.

Key Takeaway

Scale on signals that reflect inference load (queue depth, latency, concurrency) rather than CPU alone, reserve scale-to-zero for small cold-start-tolerant models, and ship new model versions through canary or A/B rollouts with metric-driven promotion and rollback.

Post-Study Assessment

Now that you have studied the material, answer the same questions again to measure your learning.

Post-Quiz

1. What is the primary purpose of KServe's InferenceService Custom Resource?

A) To train machine learning models on GPU clusters
B) To provide a single declarative abstraction for deploying, autoscaling, and routing to model endpoints
C) To monitor GPU utilization across the cluster
D) To convert models from PyTorch format to ONNX

2. What problem does vLLM's PagedAttention solve?

A) It compresses model weights to reduce download size
B) It eliminates the need for GPU memory entirely by using CPU offloading
C) It manages KV-cache memory in non-contiguous pages, reducing fragmentation and enabling more concurrent requests
D) It distributes model layers across multiple nodes automatically

3. In a canary deployment for model serving, what happens to user traffic?

A) All traffic is switched to the new model version immediately
B) A small percentage of traffic is routed to the new version while the majority stays on the stable version
C) Traffic is paused entirely until the new model is validated
D) Only admin users can access the new model version

4. Why is configuring /dev/shm critical for LLM pods on Kubernetes?

A) It controls the maximum number of GPUs a pod can use
B) Pods default to 64MB shared memory, and AI frameworks need much more for inter-process communication
C) It determines the network bandwidth allocated to the pod
D) It stores the model weights permanently for faster restarts

5. What is the key advantage of KEDA over standard HPA for LLM workloads?

A) KEDA supports more programming languages
B) KEDA can scale on LLM-specific metrics like pending queue depth and token throughput
C) KEDA is faster at starting new pods
D) KEDA requires fewer cluster resources to operate

6. What does quantization (e.g., GPTQ, AWQ) achieve for LLM serving?

A) It increases model accuracy by adding more parameters
B) It reduces weight precision to lower memory footprint and increase throughput with minimal accuracy loss
C) It encrypts model weights for security
D) It converts models to run on CPUs only

7. What is the role of the Transformer component in a KServe InferenceService?

A) It trains the model on new data
B) It handles pre-processing and post-processing of requests before and after the predictor
C) It converts the model from one framework to another
D) It provides GPU scheduling for the cluster

8. How does continuous batching differ from static batching in LLM serving?

A) Continuous batching processes one request at a time for better accuracy
B) Continuous batching inserts new requests as previous ones complete, keeping GPU utilization high
C) Continuous batching requires more GPU memory than static batching
D) Continuous batching is only available for CPU-based inference

9. What is Model Mesh in the KServe ecosystem?

A) A network mesh that encrypts traffic between model servers
B) A multi-model serving capability that co-locates many models in shared pods to reduce overhead
C) A visualization tool for model architectures
D) A storage backend for model weights

10. When should you use scale-to-zero for model serving endpoints?

A) Always, to save maximum cost
B) For latency-sensitive LLM endpoints that need instant responses
C) For smaller models with low traffic where cold start time is acceptable
D) Only for batch inference workloads

11. Which model server is best suited for heterogeneous multi-model deployments with mixed frameworks?

A) vLLM
B) TGI (Text Generation Inference)
C) Triton Inference Server
D) TorchServe

12. What distinguishes A/B testing from canary deployment in model rollouts?

A) A/B testing is faster to set up
B) A/B testing routes traffic based on user cohort or headers, while canary uses random percentage splitting
C) Canary deployments require more replicas
D) A/B testing does not support automatic rollback

Answer Explanations