Chapter 5: Model Serving and Inference

Learning Objectives

By the end of this chapter, you should be able to:

- Distinguish online from batch inference and select a model server for your framework mix
- Deploy models declaratively with KServe's InferenceService, including transformer and explainer components
- Serve large language models with vLLM or TGI, using quantization and continuous batching to control cost
- Configure autoscaling with HPA, KPA, or KEDA, and manage rollouts with canary, A/B, and blue/green strategies

Pre-Study Assessment

Answer these questions before reading the material to gauge your current understanding.

Pre-Quiz

1. What is the primary purpose of KServe's InferenceService Custom Resource?

A) To train machine learning models on GPU clusters
B) To provide a single declarative abstraction for deploying, autoscaling, and routing to model endpoints
C) To monitor GPU utilization across the cluster
D) To convert models from PyTorch format to ONNX

2. What problem does vLLM's PagedAttention solve?

A) It compresses model weights to reduce download size
B) It eliminates the need for GPU memory entirely by using CPU offloading
C) It manages KV-cache memory in non-contiguous pages, reducing fragmentation and enabling more concurrent requests
D) It distributes model layers across multiple nodes automatically

3. In a canary deployment for model serving, what happens to user traffic?

A) All traffic is switched to the new model version immediately
B) A small percentage of traffic is routed to the new version while the majority stays on the stable version
C) Traffic is paused entirely until the new model is validated
D) Only admin users can access the new model version

4. Why is configuring /dev/shm critical for LLM pods on Kubernetes?

A) It controls the maximum number of GPUs a pod can use
B) Pods default to 64MB shared memory, and AI frameworks need much more for inter-process communication
C) It determines the network bandwidth allocated to the pod
D) It stores the model weights permanently for faster restarts

5. What is the key advantage of KEDA over standard HPA for LLM workloads?

A) KEDA supports more programming languages
B) KEDA can scale on LLM-specific metrics like pending queue depth and token throughput
C) KEDA is faster at starting new pods
D) KEDA requires fewer cluster resources to operate

6. What does quantization (e.g., GPTQ, AWQ) achieve for LLM serving?

A) It increases model accuracy by adding more parameters
B) It reduces weight precision to lower memory footprint and increase throughput with minimal accuracy loss
C) It encrypts model weights for security
D) It converts models to run on CPUs only

7. What is the role of the Transformer component in a KServe InferenceService?

A) It trains the model on new data
B) It handles pre-processing and post-processing of requests before and after the predictor
C) It converts the model from one framework to another
D) It provides GPU scheduling for the cluster

8. How does continuous batching differ from static batching in LLM serving?

A) Continuous batching processes one request at a time for better accuracy
B) Continuous batching inserts new requests as previous ones complete, keeping GPU utilization high
C) Continuous batching requires more GPU memory than static batching
D) Continuous batching is only available for CPU-based inference

9. What is Model Mesh in the KServe ecosystem?

A) A network mesh that encrypts traffic between model servers
B) A multi-model serving capability that co-locates many models in shared pods to reduce overhead
C) A visualization tool for model architectures
D) A storage backend for model weights

10. When should you use scale-to-zero for model serving endpoints?

A) Always, to save maximum cost
B) For latency-sensitive LLM endpoints that need instant responses
C) For smaller models with low traffic where cold start time is acceptable
D) Only for batch inference workloads

11. Which model server is best suited for heterogeneous multi-model deployments with mixed frameworks?

A) vLLM
B) TGI (Text Generation Inference)
C) Triton Inference Server
D) TorchServe

12. What distinguishes A/B testing from canary deployment in model rollouts?

A) A/B testing is faster to set up
B) A/B testing routes traffic based on user cohort or headers, while canary uses random percentage splitting
C) Canary deployments require more replicas
D) A/B testing does not support automatic rollback

Section 1: Model Serving Architectures

Online vs. Batch Inference

Not all inference looks the same. Two fundamental patterns govern how predictions are generated:

Pattern | Trigger | Latency Target | Typical Use Case
Online (real-time) | Single HTTP/gRPC request | < 200ms | Chatbots, recommendation APIs, fraud detection
Batch | Scheduled job or queue | Minutes to hours | Offline scoring, ETL enrichment, bulk predictions

Online inference prioritizes low latency: a user waits for each response, so the serving system must return results quickly. Batch inference prioritizes throughput: thousands of records are processed together. On Kubernetes, online serving uses Deployment or InferenceService resources fronted by a Service, while batch inference is better expressed as a Job or CronJob.
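The batch pattern maps naturally onto a CronJob. A minimal sketch, where the scorer image and the S3 paths are hypothetical placeholders:

```yaml
# Illustrative nightly batch-scoring job; image and paths are examples only.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-scoring
spec:
  schedule: "0 2 * * *"            # run at 02:00 every night
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: scorer
              image: registry.example.com/batch-scorer:latest
              args:
                - "--input=s3://my-bucket/records/"
                - "--output=s3://my-bucket/predictions/"
```

Because a Job tolerates retries and has no latency target, it can run on cheaper preemptible or spot nodes that would be risky for online serving.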

Model Server Options

Choosing the right model server is one of the most consequential decisions in your serving stack:

Server | Best For | Key Feature | Protocols
TorchServe | PyTorch models, custom handlers | Native PyTorch integration | REST, gRPC
Triton | Multi-framework, mixed workloads | Multi-model concurrency, dynamic batching | REST, gRPC, HTTP/2
vLLM | LLM text generation | PagedAttention, continuous batching | OpenAI-compatible REST
TGI | LLM serving (HuggingFace) | Flash Attention, tensor parallelism | REST

Sidecar and Standalone Topologies

Standalone: The model server runs as the primary container in a pod. This is simpler and appropriate for most cases.

Sidecar: A lightweight proxy or transformer runs alongside the model server container. KServe uses this pattern to inject its storage initializer as an init container that downloads model weights, and optionally attaches a transformer container for pre/post-processing.

Key Takeaway

Match the serving pattern to the workload: online inference runs behind a Service for low latency, batch inference runs as a Job or CronJob for throughput, and the model server should be chosen for your framework mix (Triton for heterogeneous fleets, vLLM or TGI for LLMs, TorchServe for PyTorch).

Section 2: KServe (formerly KFServing)

KServe is the leading Kubernetes-native model serving platform, providing a single InferenceService CRD that abstracts away autoscaling, networking, health checking, and runtime configuration. It graduated from the Kubeflow project and is now a standalone CNCF project.

The InferenceService Custom Resource

A minimal InferenceService definition specifies a predictor with a model format and storage URI. When applied, KServe launches a storage initializer init container, starts the serving runtime, configures ingress routing, and optionally enables autoscaling.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"

Predictor, Transformer, and Explainer Components

A full InferenceService can include up to three logical components:

Component | Purpose | Example
Predictor | Core inference -- runs the model | vLLM server, Triton, sklearn-server
Transformer | Pre/post-processing pipeline | Tokenization, feature engineering
Explainer | Model explainability | SHAP values, LIME explanations
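An InferenceService with an attached transformer might be sketched as follows. The service name and container image are hypothetical; KServe routes each request through the transformer before and after the predictor:

```yaml
# Illustrative transformer + predictor spec; the image name is hypothetical.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "text-classifier"
spec:
  transformer:
    containers:
      - name: kserve-container
        image: registry.example.com/tokenizer-transformer:latest
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://my-bucket/models/classifier"
```

Keeping tokenization in the transformer lets the predictor stay a stock serving runtime, and the two components can be scaled and updated independently.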

KServe Request Flow

[Diagram: KServe request flow. (1) The client sends an HTTP/gRPC request to the InferenceService. (2) The Transformer pre-processes the input. (3-4) The Predictor runs model inference. (5) The Transformer post-processes and returns the response. (6-7) Optionally, the Explainer produces a SHAP/LIME explanation alongside the result.]

Model Storage and Model Mesh

The storage initializer is an init container that KServe injects into every predictor pod. It supports multiple storage backends: GCS (gs://), S3 (s3://), Azure Blob (azureblob://), HTTP/HTTPS, and PersistentVolumeClaims (pvc://).

Model Mesh is KServe's multi-model serving capability. Instead of one pod per model, Model Mesh co-locates many models within a shared pool of serving pods, dramatically reducing per-model overhead when maintaining hundreds of smaller models. It maintains an in-memory cache and loads/unloads models on demand.

Key Takeaway

KServe's InferenceService CRD is a single declarative resource covering predictor, transformer, and explainer components, while Model Mesh extends it to high-density multi-model serving by packing many models into a shared pool of serving pods.

Section 3: LLM Serving on Kubernetes

Large language models present unique serving challenges. A 70B parameter model in FP16 requires approximately 140GB of GPU VRAM. The autoregressive generation pattern creates inherently sequential workloads that are hard to batch naively.

vLLM and PagedAttention

vLLM is the dominant open-source LLM serving engine. Its defining innovation is PagedAttention, which manages the KV cache using virtual-memory techniques borrowed from OS page tables. Instead of reserving large contiguous VRAM blocks, PagedAttention stores the KV cache in non-contiguous "pages," nearly eliminating fragmentation and fitting many more concurrent requests into the same VRAM -- the vLLM project reports up to 24x higher throughput than naive HuggingFace Transformers serving.

A critical Kubernetes-specific detail: pods default to only 64MB of shared memory (/dev/shm). Frameworks that rely on shared-memory IPC -- notably NCCL for tensor-parallel inference -- need far more, and large models can crash during initialization unless /dev/shm is backed by an emptyDir with medium: Memory.

volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 8Gi    # critical for LLM IPC
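Putting the pieces together, a vLLM Deployment might look like this sketch. The model name and resource sizes are illustrative; the --model flag and OpenAI-compatible port 8000 are standard vLLM options:

```yaml
# Illustrative vLLM Deployment; model and sizes are examples only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm }
  template:
    metadata:
      labels: { app: vllm }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.2"]
          ports:
            - containerPort: 8000     # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm     # enlarged shared memory for IPC
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
```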

Text Generation Inference (TGI)

HuggingFace's TGI is an alternative LLM server supporting Flash Attention 2, continuous batching, and tensor parallelism. It integrates deeply with the HuggingFace Hub ecosystem.
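A TGI container can be sketched similarly; this fragment slots into a pod spec, the model ID is an example, and --model-id with the container's default port 80 follow TGI's documented conventions:

```yaml
# Illustrative TGI container spec; the model ID is an example.
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args: ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2"]
    ports:
      - containerPort: 80     # TGI's default listen port
    resources:
      limits:
        nvidia.com/gpu: 1
```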

Quantization: GPTQ, AWQ, GGUF

Quantization converts model weights from high-precision formats (FP16) to lower-precision formats (INT4), trading a small amount of accuracy for significant memory and speed gains:

Format | Precision | Memory Reduction | Supported By
GPTQ | INT4 | ~4x vs FP16 | vLLM, TGI, Triton
AWQ | INT4 | ~4x vs FP16 | vLLM, TGI
GGUF | 2-8 bit mixed | Up to 8x | llama.cpp, Ollama

A 13B parameter model in FP16 requires ~26GB VRAM. The GPTQ INT4 version requires ~7GB -- fitting on a single A10G or consumer RTX 3090.
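Serving a quantized checkpoint usually just means pointing the server at the quantized weights; with vLLM, the --quantization flag selects the scheme. A sketch, where the checkpoint name is an example of a community-published AWQ model:

```yaml
# Illustrative container args for serving an AWQ-quantized model with vLLM.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "TheBloke/Llama-2-13B-AWQ"    # example quantized checkpoint
      - "--quantization"
      - "awq"
```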

Continuous Batching and Speculative Decoding

Continuous batching keeps the GPU pipeline full at all times. As soon as a sequence completes, a new request from the queue immediately fills its slot in the next forward pass, keeping GPU utilization high and minimizing time-to-first-token.

Speculative decoding uses a small draft model to propose several tokens, which the large target model verifies in parallel. When the draft model is right, multiple tokens are produced per forward pass.

Key Takeaway

LLM serving is dominated by GPU memory management: PagedAttention, quantization, and continuous batching together determine how many concurrent requests a GPU can sustain, and pods need an enlarged /dev/shm for multi-process inference.

Section 4: Autoscaling and Traffic Management

HPA with Custom Metrics

The Horizontal Pod Autoscaler (HPA) scales workloads based on observed metrics. For inference services, useful signals include CPU utilization, GPU utilization (DCGM metrics), request latency (p99), request concurrency, and pending queue depth.
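An HPA scaling on a custom metric might be sketched as follows; this assumes a Prometheus adapter exposes a hypothetical per-pod queue-depth metric named inference_queue_depth through the custom metrics API:

```yaml
# Illustrative HPA on a custom queue-depth metric; the metric name is
# hypothetical and must be exposed by an adapter (e.g., prometheus-adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"    # scale out above 10 queued requests per pod
```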

Autoscaler | Trigger | Best For | Scale-to-Zero
KPA (Knative) | Request concurrency | Serverless, bursty traffic | Yes
HPA | CPU, memory, custom Prometheus metrics | Stable, predictable traffic | No (min 1)
KEDA | Any event source | LLM serving, event-driven | Yes

HPA Autoscaling in Response to Load

[Diagram: HPA scaling through five phases against an 80% threshold. Phase 1: normal load (40%, 2 replicas). Phase 2: load rising (70%, approaching threshold). Phase 3: threshold exceeded (90%), HPA scales to 3 replicas. Phase 4: load distributed (55% across 3 replicas, stable). Phase 5: load drops (20%), HPA scales back down to 2 replicas.]

Scale-to-Zero with Knative or KEDA

Scale-to-zero completely removes all pods when no traffic is present, and scales back up on the first request. Knative Pod Autoscaler (KPA) is KServe's default autoscaler in serverless mode. KEDA extends HPA with support for custom event sources and LLM-specific metrics.

The tradeoff with scale-to-zero for LLMs is cold start time. A 7B parameter model takes 30-90 seconds to load, making scale-to-zero unsuitable for latency-sensitive endpoints. For LLMs, minReplicas: 1 is usually the right choice.
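With KEDA, scaling on an LLM-specific signal is expressed as a ScaledObject with a Prometheus trigger. In this sketch the Prometheus address is illustrative, and the query assumes vLLM's exported queue-depth metric; adjust both for your environment:

```yaml
# Illustrative KEDA ScaledObject scaling a vLLM deployment on queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1       # keep one warm replica to avoid LLM cold starts
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(vllm:num_requests_waiting)   # pending request queue depth
        threshold: "10"
```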

Canary Rollouts and Traffic Splitting

Deploying a new model version to 100% of traffic immediately is high risk. Canary deployments mitigate this by routing a small percentage of traffic to the new version while the majority continues on the stable version.

spec:
  predictor:
    canaryTrafficPercent: 20    # 20% to new version
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://my-bucket/models/classifier/v2"

For automated promotion, Flagger integrates with KServe to progressively shift traffic based on Prometheus metrics, starting with a small canary percentage and automatically promoting or rolling back.
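A Flagger Canary resource encodes this policy declaratively; the intervals and thresholds below are illustrative, and request-success-rate is one of Flagger's built-in metric checks:

```yaml
# Illustrative Flagger canary policy; thresholds are examples.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: classifier
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: classifier
  service:
    port: 80
  analysis:
    interval: 1m          # evaluate metrics every minute
    threshold: 5          # roll back after 5 failed checks
    stepWeight: 10        # shift 10% of traffic per step
    maxWeight: 50         # promote fully once 50% passes analysis
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99         # require >= 99% successful requests
        interval: 1m
```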

Canary Deployment: Gradual Traffic Rollout

[Diagram: gradual canary rollout. Step 1: v1 serves 100% of traffic while the v2 canary is deployed. Steps 2-5: as metrics stay healthy, canary traffic grows from 10% to 30%, 50%, and 80%. Step 6: v2 is promoted to 100% and becomes the new stable version.]

A/B Testing and Load Balancing

A/B testing routes traffic based on user identity or request metadata, enabling controlled experiments. KServe, backed by Istio, supports header-based routing (e.g., X-Experiment-Group: B).
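With Istio underneath, header-based A/B routing can be sketched as a VirtualService; the hostname and destination service names are illustrative (Istio expects lowercase header names in match rules):

```yaml
# Illustrative Istio VirtualService routing experiment cohort B by header.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: classifier-ab
spec:
  hosts:
    - classifier.example.com
  http:
    - match:
        - headers:
            x-experiment-group:
              exact: "B"
      route:
        - destination:
            host: classifier-v2    # experiment group gets the new model
    - route:
        - destination:
            host: classifier-v1    # everyone else stays on stable
```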

Technique | Traffic Split Basis | Rollback | End State
Canary | Random percentage | Automatic (Flagger) or manual | One version wins
A/B Test | User cohort / header | Manual analysis | Winner by experiment
Blue/Green | All-or-nothing switch | Flip traffic weight | Immediate cutover

For load balancing, LLM inference benefits from prefix cache affinity (routing users to replicas that have cached their system prompt), request length awareness, and least-connection routing rather than naive round-robin.

Key Takeaway

Scale on signals that reflect inference load (queue depth, latency, concurrency) rather than CPU alone, reserve scale-to-zero for small cold-start-tolerant models, and ship new model versions through canary or A/B rollouts with metric-driven promotion and rollback.

Post-Study Assessment

Now that you have studied the material, answer the same questions again to measure your learning.

Post-Quiz

1. What is the primary purpose of KServe's InferenceService Custom Resource?

A) To train machine learning models on GPU clusters
B) To provide a single declarative abstraction for deploying, autoscaling, and routing to model endpoints
C) To monitor GPU utilization across the cluster
D) To convert models from PyTorch format to ONNX

2. What problem does vLLM's PagedAttention solve?

A) It compresses model weights to reduce download size
B) It eliminates the need for GPU memory entirely by using CPU offloading
C) It manages KV-cache memory in non-contiguous pages, reducing fragmentation and enabling more concurrent requests
D) It distributes model layers across multiple nodes automatically

3. In a canary deployment for model serving, what happens to user traffic?

A) All traffic is switched to the new model version immediately
B) A small percentage of traffic is routed to the new version while the majority stays on the stable version
C) Traffic is paused entirely until the new model is validated
D) Only admin users can access the new model version

4. Why is configuring /dev/shm critical for LLM pods on Kubernetes?

A) It controls the maximum number of GPUs a pod can use
B) Pods default to 64MB shared memory, and AI frameworks need much more for inter-process communication
C) It determines the network bandwidth allocated to the pod
D) It stores the model weights permanently for faster restarts

5. What is the key advantage of KEDA over standard HPA for LLM workloads?

A) KEDA supports more programming languages
B) KEDA can scale on LLM-specific metrics like pending queue depth and token throughput
C) KEDA is faster at starting new pods
D) KEDA requires fewer cluster resources to operate

6. What does quantization (e.g., GPTQ, AWQ) achieve for LLM serving?

A) It increases model accuracy by adding more parameters
B) It reduces weight precision to lower memory footprint and increase throughput with minimal accuracy loss
C) It encrypts model weights for security
D) It converts models to run on CPUs only

7. What is the role of the Transformer component in a KServe InferenceService?

A) It trains the model on new data
B) It handles pre-processing and post-processing of requests before and after the predictor
C) It converts the model from one framework to another
D) It provides GPU scheduling for the cluster

8. How does continuous batching differ from static batching in LLM serving?

A) Continuous batching processes one request at a time for better accuracy
B) Continuous batching inserts new requests as previous ones complete, keeping GPU utilization high
C) Continuous batching requires more GPU memory than static batching
D) Continuous batching is only available for CPU-based inference

9. What is Model Mesh in the KServe ecosystem?

A) A network mesh that encrypts traffic between model servers
B) A multi-model serving capability that co-locates many models in shared pods to reduce overhead
C) A visualization tool for model architectures
D) A storage backend for model weights

10. When should you use scale-to-zero for model serving endpoints?

A) Always, to save maximum cost
B) For latency-sensitive LLM endpoints that need instant responses
C) For smaller models with low traffic where cold start time is acceptable
D) Only for batch inference workloads

11. Which model server is best suited for heterogeneous multi-model deployments with mixed frameworks?

A) vLLM
B) TGI (Text Generation Inference)
C) Triton Inference Server
D) TorchServe

12. What distinguishes A/B testing from canary deployment in model rollouts?

A) A/B testing is faster to set up
B) A/B testing routes traffic based on user cohort or headers, while canary uses random percentage splitting
C) Canary deployments require more replicas
D) A/B testing does not support automatic rollback

Answer Explanations