Chapter 1: AI Fundamentals and Workload Types

Learning Objectives

Section 1: AI and Machine Learning Workload Types

Pre-Quiz: AI and Machine Learning Workload Types

1. A financial services company needs its chatbot to answer questions about policies that change weekly. Which approach best balances accuracy with operational simplicity?

Retrain the entire foundation model every week with updated policy data
Use a RAG pipeline that retrieves current policy documents at inference time
Fine-tune the model on each new policy document as it is released
Cache all previous chatbot responses and serve them for similar queries

2. A team is training a 175-billion-parameter language model that cannot fit into the memory of any single GPU. Which distributed training strategy directly addresses this constraint?

Data parallelism, because it splits the dataset across GPUs
Gradient checkpointing, because it reduces memory usage per layer
Model parallelism, because it partitions the model itself across multiple GPUs
Batch size reduction, because smaller batches require less GPU memory

3. An e-commerce platform experiences 10x traffic spikes during flash sales. Which inference serving characteristic is most critical for handling these spikes?

Switching from GPU to CPU inference to reduce costs
Pre-computing all possible recommendations in batch mode before the sale
Autoscaling inference pods horizontally to match demand in real time
Increasing the precision of model weights from 8-bit to 32-bit

4. Why is a vector database considered critical infrastructure in a RAG pipeline, rather than a traditional relational database?

Vector databases are cheaper to operate than relational databases
Vector databases can store and query high-dimensional embeddings for fast similarity search
Relational databases cannot store text data at the scale RAG requires
Vector databases automatically generate embeddings from raw documents

5. What distinguishes a foundation model from a traditional task-specific AI model in terms of how organizations deploy them?

Foundation models are always smaller and faster than task-specific models
Foundation models can only be used for text generation tasks
Foundation models are pre-trained on broad data and adapted to many downstream tasks via fine-tuning or prompting
Foundation models do not require GPU resources for inference

An AI workload is any computational task performed as part of building, deploying, or operating an AI system. Different workload types place radically different demands on compute, storage, and networking. Understanding these categories is essential for designing data center AI infrastructure.

Key Points

RAG Workloads

Retrieval-Augmented Generation (RAG) improves generative AI accuracy by querying a vector database at inference time, finding relevant documents, and injecting that context into the prompt before the model generates its response. This avoids costly retraining when information changes frequently.

```mermaid
flowchart LR
    A["User Query"] --> B["Embedding\nGeneration"]
    B --> C["Vector Storage\n& Indexing"]
    D["Documents"] --> E["Embedding\nGeneration"]
    E --> C
    B --> F["Retrieval Query\nExecution"]
    F --> C
    C --> F
    F --> G["Relevant\nContext"]
    G --> H["LLM Inference\n(Generation)"]
    A --> H
    H --> I["Augmented\nResponse"]
    style A fill:#4a90d9,color:#fff
    style I fill:#2ecc71,color:#fff
    style C fill:#e67e22,color:#fff
    style H fill:#9b59b6,color:#fff
```

| RAG Pipeline Stage | Function | Infrastructure Need |
| --- | --- | --- |
| Embedding Generation | Convert documents and queries into numerical vectors | GPU or CPU compute for embedding models |
| Vector Storage & Indexing | Store high-dimensional embeddings for fast retrieval | Vector databases (Weaviate, Pinecone, Milvus) |
| Retrieval Query Execution | Find the most relevant documents via similarity search | Low-latency compute and network I/O |
| Inference (Generation) | Produce the final response using retrieved context | GPU-accelerated model serving |
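
The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is an illustrative toy: the hand-written three-dimensional vectors stand in for real embedding-model output, and the Python list stands in for a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=2):
    """Return the top_k documents most similar to the query embedding."""
    scored = sorted(index, key=lambda d: cosine_similarity(query_vec, d["vec"]),
                    reverse=True)
    return [d["text"] for d in scored[:top_k]]

def build_prompt(query, context_docs):
    """Inject retrieved context into the prompt before generation."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Toy index: in practice these vectors come from an embedding model
# and live in a vector database (Weaviate, Pinecone, Milvus).
index = [
    {"text": "Refunds are processed within 14 days.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.", "vec": [0.0, 0.2, 0.9]},
    {"text": "Refund requests require an order number.", "vec": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "How do refunds work?"
docs = retrieve(query_vec, index, top_k=2)
prompt = build_prompt("How do refunds work?", docs)
```

Because the similarity search returns only the refund-related documents, the model generates its answer from current policy text rather than from stale training data.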

Training and Distributed Training

Training teaches a model to recognize patterns by exposing it to large datasets and iteratively adjusting weights. Large-scale training requires GPU clusters and high-performance storage (NVMe SSDs, Lustre) to keep GPUs fed with data.

Distributed training splits workloads across multiple processors using two primary strategies:

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Data Parallelism | Dataset split across GPUs; each holds a full model copy; gradients synchronized | Models that fit in one GPU but need faster training on large datasets |
| Model Parallelism | Model itself is partitioned across GPUs; each computes a portion | Models too large to fit in a single GPU's memory |

```mermaid
flowchart TD
    subgraph DP["Data Parallelism"]
        direction TB
        DS["Full Dataset"] --> S1["Shard 1"]
        DS --> S2["Shard 2"]
        DS --> S3["Shard N"]
        S1 --> G1["GPU 1\n(Full Model Copy)"]
        S2 --> G2["GPU 2\n(Full Model Copy)"]
        S3 --> G3["GPU N\n(Full Model Copy)"]
        G1 --> SYNC["Gradient\nSynchronization"]
        G2 --> SYNC
        G3 --> SYNC
        SYNC --> UP["Updated\nModel"]
    end
    subgraph MP["Model Parallelism"]
        direction TB
        IN["Input Data"] --> L1["GPU 1\n(Layers 1-4)"]
        L1 --> L2["GPU 2\n(Layers 5-8)"]
        L2 --> L3["GPU N\n(Layers 9-12)"]
        L3 --> OUT["Output"]
    end
    style DP fill:#eaf2f8,stroke:#2980b9
    style MP fill:#fdf2e9,stroke:#e67e22
    style SYNC fill:#2ecc71,color:#fff
    style OUT fill:#2ecc71,color:#fff
```

Hybrid Parallelism: Cutting-edge models like GPT and PaLM combine both data and model parallelism. Within a node, GPUs connect via NVLink; across nodes, InfiniBand provides the high-speed fabric. Optimizing this communication is critical -- inter-node overhead can negate the benefit of adding more GPUs.
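
The data-parallel pattern can be illustrated with a toy one-parameter model: each "replica" computes a gradient on its shard, the gradients are averaged (standing in for an all-reduce), and a single synchronized update is applied. With equal shard sizes, the averaged shard gradients equal the full-batch gradient exactly.

```python
def gradient(w, batch):
    """Gradient of MSE loss for the model y = w * x over (x, y) pairs."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

def data_parallel_step(w, shards, lr=0.01):
    """One synchronized data-parallel update: each 'GPU' computes a
    gradient on its shard, gradients are averaged (all-reduce), and
    every replica applies the same weight update."""
    grads = [gradient(w, shard) for shard in shards]  # one per replica
    avg_grad = sum(grads) / len(grads)                # gradient synchronization
    return w - lr * avg_grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # dataset split across two replicas
w = data_parallel_step(0.0, shards)
```

This is why data parallelism scales training throughput without changing the optimization result, as long as the gradient synchronization step keeps up.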

Inference Workloads and Serving Patterns

Inference uses a trained model to make predictions on new data. Unlike training, inference is an ongoing, production-facing workload with strict latency requirements.

```mermaid
flowchart TD
    TM["Trained Model"] --> SP["Serving Patterns"]
    SP --> RT["Real-time\n(Online)"]
    SP --> BA["Batch"]
    SP --> ST["Streaming"]
    SP --> ED["Edge"]
    RT --> RT_D["Single requests\n< 100ms latency\ne.g., Chatbots, Fraud Detection"]
    BA --> BA_D["Scheduled bulk jobs\nMinutes-to-hours latency\ne.g., Nightly Recommendations"]
    ST --> ST_D["Continuous data streams\nLow latency, high throughput\ne.g., Video Analytics, IoT"]
    ED --> ED_D["On-device inference\nUltra-low latency, offline\ne.g., Autonomous Vehicles"]
    style TM fill:#9b59b6,color:#fff
    style RT fill:#e74c3c,color:#fff
    style BA fill:#3498db,color:#fff
    style ST fill:#2ecc71,color:#fff
    style ED fill:#f39c12,color:#fff
```

| Serving Pattern | Latency Requirement | Example Use Case |
| --- | --- | --- |
| Real-time (Online) | Sub-second (typically < 100ms) | Chatbot responses, fraud detection |
| Batch | Minutes to hours | Nightly product recommendation updates |
| Streaming | Low latency, high throughput | Video surveillance, IoT sensor analysis |
| Edge | Ultra-low latency, offline capable | Autonomous vehicles, factory floor inspection |
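
Real-time serving systems commonly trade a few milliseconds of latency for much higher GPU throughput via dynamic batching: requests queue until the batch is full or the oldest request has waited too long. A minimal sketch (class name and thresholds are illustrative, not from any particular serving framework):

```python
import time
from collections import deque

class DynamicBatcher:
    """Group incoming requests into batches for GPU-efficient inference.
    Flushes when the batch is full or the oldest request has waited
    longer than max_wait_ms (the latency/throughput trade-off)."""

    def __init__(self, max_batch=4, max_wait_ms=10):
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.queue = deque()  # (request, arrival_time) pairs

    def submit(self, request, now=None):
        now = time.monotonic() if now is None else now
        self.queue.append((request, now))

    def maybe_flush(self, now=None):
        """Return a batch to run, or None if it is worth waiting longer."""
        now = time.monotonic() if now is None else now
        if not self.queue:
            return None
        full = len(self.queue) >= self.max_batch
        expired = (now - self.queue[0][1]) >= self.max_wait
        if full or expired:
            n = min(self.max_batch, len(self.queue))
            return [self.queue.popleft()[0] for _ in range(n)]
        return None
```

The `now` parameter is injectable purely so the behavior can be tested deterministically; a serving loop would call `maybe_flush()` with the real clock.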

Generative AI and Foundation Models

Generative AI creates new content (text, images, code, audio, video) rather than classifying existing data. These systems are built on foundation models -- large-scale models pre-trained on broad, diverse datasets that can be adapted to many downstream tasks.

Animation Placeholder: Interactive comparison of AI workload types -- drag sliders for compute, latency, and storage requirements to see how each workload type (Training, Inference, RAG, GenAI) maps to infrastructure needs.
Animation Placeholder: Animated RAG pipeline flow -- watch a user query travel through embedding generation, vector retrieval, context injection, and LLM response generation step by step.
Post-Quiz: AI and Machine Learning Workload Types

1. A financial services company needs its chatbot to answer questions about policies that change weekly. Which approach best balances accuracy with operational simplicity?

Retrain the entire foundation model every week with updated policy data
Use a RAG pipeline that retrieves current policy documents at inference time
Fine-tune the model on each new policy document as it is released
Cache all previous chatbot responses and serve them for similar queries

2. A team is training a 175-billion-parameter language model that cannot fit into the memory of any single GPU. Which distributed training strategy directly addresses this constraint?

Data parallelism, because it splits the dataset across GPUs
Gradient checkpointing, because it reduces memory usage per layer
Model parallelism, because it partitions the model itself across multiple GPUs
Batch size reduction, because smaller batches require less GPU memory

3. An e-commerce platform experiences 10x traffic spikes during flash sales. Which inference serving characteristic is most critical for handling these spikes?

Switching from GPU to CPU inference to reduce costs
Pre-computing all possible recommendations in batch mode before the sale
Autoscaling inference pods horizontally to match demand in real time
Increasing the precision of model weights from 8-bit to 32-bit

4. Why is a vector database considered critical infrastructure in a RAG pipeline, rather than a traditional relational database?

Vector databases are cheaper to operate than relational databases
Vector databases can store and query high-dimensional embeddings for fast similarity search
Relational databases cannot store text data at the scale RAG requires
Vector databases automatically generate embeddings from raw documents

5. What distinguishes a foundation model from a traditional task-specific AI model in terms of how organizations deploy them?

Foundation models are always smaller and faster than task-specific models
Foundation models can only be used for text generation tasks
Foundation models are pre-trained on broad data and adapted to many downstream tasks via fine-tuning or prompting
Foundation models do not require GPU resources for inference

Section 2: The AI Lifecycle

Pre-Quiz: The AI Lifecycle

1. A hospital's diagnostic imaging model has been in production for one year. Clinicians report declining accuracy. What is the most likely root cause from a lifecycle perspective?

The model's source code has developed bugs over time
Model drift -- the statistical properties of incoming data have changed since the model was trained
The model's weights decay naturally over time without use
The GPU hardware has degraded, causing computation errors

2. Why is data preparation typically considered the most time-consuming phase of the AI lifecycle?

Because modern GPUs are too slow to process raw data
Because it involves collection, cleansing, feature engineering, and labeling -- each requiring significant effort and domain expertise
Because data must be converted to a proprietary format before any model can use it
Because regulatory requirements mandate a minimum data preparation duration

3. A team splits their dataset 70/15/15 into training, validation, and test sets. What is the primary purpose of the validation set?

To provide additional training data when the model's accuracy is low
To detect overfitting by evaluating the model on data it has never seen during training
To generate labeled examples for supervised learning
To benchmark the model's speed in a production environment

4. What makes the AI lifecycle fundamentally iterative rather than a one-time linear process?

AI models must be rebuilt from scratch whenever new data arrives
Monitoring reveals performance changes that feed back into data collection and retraining
Regulatory bodies require a fixed number of iteration cycles before deployment
Each iteration uses a completely different algorithm to avoid bias

5. A company is deploying a new model version. Which deployment strategy minimizes risk by gradually shifting traffic from the old to the new model?

Big-bang deployment -- replace the old model all at once
Canary deployment -- route a small percentage of traffic to the new model first
Shadow deployment -- run the new model but discard its outputs
Offline deployment -- test the new model only on historical data

The AI lifecycle is the end-to-end process of developing, deploying, and maintaining AI systems. It unfolds across three overarching phases -- design, development, and deployment -- and is fundamentally iterative: each phase is revisited many times as teams refine their approach.

Key Points

Data Collection, Preparation, and Labeling

| Activity | Purpose | Common Challenge |
| --- | --- | --- |
| Collection | Gather sufficient, representative raw data | Data silos, privacy regulations, incomplete coverage |
| Cleansing | Ensure data accuracy and consistency | Scale of dirty data, ambiguous error patterns |
| Feature Engineering | Create informative model inputs from raw data | Requires domain expertise, risk of data leakage |
| Labeling | Provide ground truth for supervised learning | Expensive, subjective, requires quality control |

Model Selection, Training, and Validation

Model selection involves comparing algorithms or architectures for the task at hand. Training iteratively adjusts model weights to minimize a loss function, configured with an optimizer (Adam, SGD), learning rate, batch size, and epoch count.
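
That training loop can be sketched for a one-parameter model. Real training substitutes a deep network, an optimizer such as Adam, and GPU-resident tensors, but the structure -- epochs, mini-batches, gradient step scaled by the learning rate -- is the same:

```python
def train(data, lr=0.05, epochs=200, batch_size=2):
    """Minimal gradient-descent loop for the model y = w * x with MSE loss."""
    w = 0.0
    for _ in range(epochs):
        for i in range(0, len(data), batch_size):   # iterate mini-batches
            batch = data[i:i + batch_size]
            # dL/dw for MSE loss over this mini-batch
            grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= lr * grad                          # weight update
    return w

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # true relationship: y = 3x
w = train(data)
```

After enough epochs the weight converges to the slope of the underlying data, which is exactly what "iteratively adjusting weights to minimize a loss function" means in miniature.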

Validation evaluates performance on unseen data to detect overfitting. A common practice is to split the dataset into three parts:

Training set (e.g., 70%) -- fits the model's weights
Validation set (e.g., 15%) -- tunes hyperparameters and flags overfitting during development
Test set (e.g., 15%) -- provides a final, unbiased performance estimate
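
A 70/15/15 split can be sketched as a simple shuffle-and-slice (the function name, fractions, and seed are illustrative):

```python
import random

def split_dataset(data, train=0.70, val=0.15, seed=42):
    """Shuffle and split a dataset into train/validation/test subsets.
    The test fraction is whatever remains after train and validation."""
    rng = random.Random(seed)   # fixed seed for reproducible splits
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * train)
    n_val = round(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = split_dataset(list(range(100)))
```

Shuffling before slicing matters: without it, any ordering in the raw data (by date, by class, by source) leaks into the split and biases the validation estimate.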

Deployment, Monitoring, and Retraining

Deployment packages the model -- often as a containerized microservice -- and exposes it through API endpoints. Modern deployments use canary or blue-green strategies to minimize rollout risk.
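
One common way to implement a canary split is deterministic hash-based routing, which pins each caller to a single model version while shifting roughly the configured fraction of traffic. A hypothetical sketch (function and version names are illustrative):

```python
import hashlib

def route(request_id, canary_percent=5):
    """Deterministically route a request to 'canary' or 'stable'.
    Hashing the request/user ID keeps each caller on one model
    version while ~canary_percent of traffic hits the new model."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]          # stable bucket in 0..65535
    return "canary" if bucket % 100 < canary_percent else "stable"

# Fraction of 10,000 simulated users routed to the canary (~0.05)
share = sum(route(f"user-{i}") == "canary" for i in range(10_000)) / 10_000
```

If the canary's error rate or latency degrades, `canary_percent` drops back to zero and the stable version keeps serving; otherwise the percentage ramps up until the new model takes all traffic.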

Model Drift: When the statistical properties of input data change over time, model performance degrades. A fraud detection model trained on pre-pandemic patterns may lose accuracy as consumer behavior shifts. Continuous monitoring and scheduled retraining close this loop.
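
A crude drift monitor compares recent feature statistics against the training-time baseline; production systems typically use formal tests such as PSI or Kolmogorov-Smirnov, but the idea can be sketched as a mean-shift score:

```python
import statistics

def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, measured in
    baseline standard deviations (a crude drift signal)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(recent) - mu) / sigma

# Feature values seen at training time vs. two production windows
baseline = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1]
stable   = [10.1, 9.9, 10.3, 9.7]    # looks like the training data
shifted  = [14.0, 15.2, 13.8, 14.6]  # distribution has moved

needs_retraining = drift_score(baseline, shifted) > 2.0  # alert threshold
```

When the score crosses the threshold, the monitoring system triggers the retraining loop shown in the lifecycle diagram.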

```mermaid
stateDiagram-v2
    [*] --> DataCollection: Start
    DataCollection: Data Collection & Preparation
    ModelTraining: Model Training & Selection
    Validation: Validation & Testing
    Deployment: Deployment
    Monitoring: Monitoring & Maintenance
    Retraining: Retraining & Updates
    DataCollection --> ModelTraining
    ModelTraining --> Validation
    Validation --> Deployment
    Deployment --> Monitoring
    Monitoring --> Retraining: Drift detected or\nperformance degrades
    Retraining --> DataCollection: New data ingested
    note right of Monitoring
        Track accuracy, latency, throughput, data drift
    end note
    note left of DataCollection
        Collection, cleansing, feature engineering, labeling
    end note
```

Animation Placeholder: Interactive AI lifecycle wheel -- click each phase to expand details showing the activities, tools, and infrastructure requirements at each stage, with animated arrows showing the feedback loop.
Post-Quiz: The AI Lifecycle

1. A hospital's diagnostic imaging model has been in production for one year. Clinicians report declining accuracy. What is the most likely root cause from a lifecycle perspective?

The model's source code has developed bugs over time
Model drift -- the statistical properties of incoming data have changed since the model was trained
The model's weights decay naturally over time without use
The GPU hardware has degraded, causing computation errors

2. Why is data preparation typically considered the most time-consuming phase of the AI lifecycle?

Because modern GPUs are too slow to process raw data
Because it involves collection, cleansing, feature engineering, and labeling -- each requiring significant effort and domain expertise
Because data must be converted to a proprietary format before any model can use it
Because regulatory requirements mandate a minimum data preparation duration

3. A team splits their dataset 70/15/15 into training, validation, and test sets. What is the primary purpose of the validation set?

To provide additional training data when the model's accuracy is low
To detect overfitting by evaluating the model on data it has never seen during training
To generate labeled examples for supervised learning
To benchmark the model's speed in a production environment

4. What makes the AI lifecycle fundamentally iterative rather than a one-time linear process?

AI models must be rebuilt from scratch whenever new data arrives
Monitoring reveals performance changes that feed back into data collection and retraining
Regulatory bodies require a fixed number of iteration cycles before deployment
Each iteration uses a completely different algorithm to avoid bias

5. A company is deploying a new model version. Which deployment strategy minimizes risk by gradually shifting traffic from the old to the new model?

Big-bang deployment -- replace the old model all at once
Canary deployment -- route a small percentage of traffic to the new model first
Shadow deployment -- run the new model but discard its outputs
Offline deployment -- test the new model only on historical data

Section 3: AI-ML Clusters and Models

Pre-Quiz: AI-ML Clusters and Models

1. A Kubernetes cluster has both GPU and CPU nodes. Training jobs should only run on GPU nodes, while data preprocessing should only run on CPU nodes. Which Kubernetes mechanism enforces this placement?

Horizontal Pod Autoscaler (HPA)
Node selectors with taints and tolerations
Vertical Pod Autoscaler (VPA)
Cluster Autoscaler

2. An organization wants to deploy an AI model for legal document review but cannot afford the compute to train a model from scratch. What is the most practical approach?

Use a rule-based system instead of AI
Fine-tune a pre-trained foundation model on a smaller domain-specific dataset
Train a small model from scratch on publicly available legal data only
Deploy a pre-trained model without any adaptation and accept lower accuracy

3. Why are model registries considered essential for enterprise AI operations?

They automatically train models on new data as it arrives
They provide centralized version control, governance, and lifecycle management for models
They replace the need for Kubernetes in model deployment
They compress model weights to reduce storage costs

4. A model serving 10,000 requests per second uses 32-bit floating-point weights. An engineer proposes quantizing to 8-bit. What is the primary trade-off of this optimization?

The model will require more GPU memory but run faster
The model will use less memory and run faster, with a potential small decrease in accuracy
The model will become more accurate because lower precision reduces noise
The model cannot be quantized once it has been deployed to production

5. What is the relationship between HPA, VPA, and Cluster Autoscaler in managing AI workloads on Kubernetes?

They are mutually exclusive -- only one can be active at a time
HPA scales pod count, VPA adjusts individual pod resources, and Cluster Autoscaler adds or removes nodes to meet overall demand
HPA manages GPU allocation, VPA manages CPU allocation, and Cluster Autoscaler manages network bandwidth
They all perform the same function but at different time intervals

An AI-ML cluster is a collection of interconnected compute nodes -- typically equipped with GPUs or accelerators -- that work together to execute AI workloads. These clusters are managed by orchestrators such as Kubernetes, Slurm, and Ray.

Key Points

Cluster Architecture and Management

```mermaid
graph TD
    subgraph CP["Control Plane"]
        API["API Server"]
        SCHED["Scheduler"]
        CM["Controller Manager"]
    end
    subgraph Workers["Worker Nodes"]
        subgraph GPU_N["GPU Node"]
            P1["Training Pod"]
            P2["Inference Pod"]
        end
        subgraph CPU_N["CPU Node"]
            P3["Data Preprocessing Pod"]
            P4["Feature Engineering Pod"]
        end
    end
    API --> SCHED
    SCHED --> GPU_N
    SCHED --> CPU_N
    HPA["Horizontal Pod\nAutoscaler (HPA)"] -->|"Scale pod replicas"| Workers
    VPA["Vertical Pod\nAutoscaler (VPA)"] -->|"Adjust pod resources"| Workers
    CA["Cluster\nAutoscaler"] -->|"Add/remove nodes"| Workers
    style CP fill:#2c3e50,color:#fff
    style GPU_N fill:#8e44ad,color:#fff
    style CPU_N fill:#2980b9,color:#fff
    style HPA fill:#27ae60,color:#fff
    style VPA fill:#e67e22,color:#fff
    style CA fill:#c0392b,color:#fff
```

| Scaling Mechanism | Function | AI/ML Use Case |
| --- | --- | --- |
| Horizontal Pod Autoscaler (HPA) | Adds or removes pod replicas based on metrics | Scaling inference endpoints during traffic spikes |
| Vertical Pod Autoscaler (VPA) | Adjusts CPU and memory for individual pods | Right-sizing training job resource allocations |
| Cluster Autoscaler | Adds or removes worker nodes | Provisioning additional GPU nodes for large training jobs |
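
The HPA's core scaling rule is proportional: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to configured bounds. A sketch of that calculation:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100):
    """Kubernetes HPA scaling rule: grow or shrink the replica count in
    proportion to how far the observed metric is from its target."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# Flash-sale spike: utilization jumps from the 60% target to 95%
# across 4 inference pods, so the HPA scales out.
replicas = desired_replicas(4, current_metric=95, target_metric=60)
```

The same rule scales back in when the metric falls below target, which is what makes horizontal autoscaling suitable for bursty inference traffic.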

Kubernetes uses resource requests and limits to guarantee GPU and memory allocation while preventing resource monopolization. Node selectors and taints/tolerations direct workloads to appropriate hardware -- training jobs to GPU nodes, preprocessing to CPU nodes.
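
A hypothetical pod spec showing these mechanisms together. The `accelerator: nvidia-gpu` label, the `dedicated=gpu` taint, and the image name are illustrative choices for this sketch; the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # illustrative name
spec:
  nodeSelector:
    accelerator: nvidia-gpu     # assumes GPU nodes carry this label
  tolerations:
  - key: dedicated              # assumes GPU nodes are tainted
    operator: Equal             # dedicated=gpu:NoSchedule to repel
    value: gpu                  # non-GPU workloads
    effect: NoSchedule
  containers:
  - name: trainer
    image: example.com/trainer:latest   # placeholder image
    resources:
      requests:
        cpu: "8"
        memory: 64Gi
        nvidia.com/gpu: 1       # guaranteed GPU allocation
      limits:
        memory: 64Gi
        nvidia.com/gpu: 1       # prevents GPU monopolization
```

Preprocessing pods would omit the GPU request and toleration, so the scheduler naturally places them on CPU nodes.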

Acquiring and Fine-Tuning Pre-Trained Models

| Approach | Data Required | Compute Cost | Time to Deploy | Best For |
| --- | --- | --- | --- | --- |
| Training from scratch | Massive (millions+ examples) | Very high (weeks on GPU clusters) | Months | Novel problem domains with no existing models |
| Fine-tuning | Moderate (hundreds to thousands) | Moderate (hours to days on GPUs) | Days to weeks | Adapting a general model to a specific domain |
| Using as-is | None (or minimal prompt examples) | Low (inference only) | Immediate | General tasks where the base model is sufficient |

Enterprise Best Practices: Use model registries for centralized governance and version control. Implement CI/CD pipelines (Jenkins, GitLab CI/CD, Argo CD) to automate build, test, and deployment. Ensure high-quality training data -- it is the foundation of effective fine-tuning.

Optimizing Performance and Utilization

Model-level optimizations:

Quantization -- reduce weight precision (e.g., FP32 to INT8) to cut memory use and raise throughput, usually with a small accuracy trade-off
Pruning -- remove weights that contribute little to the model's outputs
Knowledge distillation -- train a smaller "student" model to approximate a larger "teacher"

Infrastructure-level optimizations:

Dynamic batching -- group concurrent requests to keep GPUs fully utilized
Autoscaling -- combine HPA, VPA, and the Cluster Autoscaler to match capacity to demand
GPU sharing -- partition or time-slice GPUs so small workloads do not monopolize whole devices
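
The 32-bit-to-8-bit quantization raised in the quiz above can be sketched as symmetric post-training quantization with a single scale factor (real toolchains add per-channel scales and calibration data):

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: map float weights onto
    the int8 range [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.08, 0.91, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half the quantization step (scale / 2)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

The int8 representation needs a quarter of the memory of FP32 and maps onto faster integer arithmetic, while the bounded rounding error is the "potential small decrease in accuracy" from the quiz.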

Animation Placeholder: Interactive Kubernetes cluster visualizer -- watch pods scale up and down across GPU and CPU nodes as simulated training and inference workloads arrive, with HPA/VPA/Cluster Autoscaler indicators.
Animation Placeholder: Side-by-side model acquisition comparison -- animate the timeline, cost, and data requirements for training from scratch vs. fine-tuning vs. using a pre-trained model as-is.
Post-Quiz: AI-ML Clusters and Models

1. A Kubernetes cluster has both GPU and CPU nodes. Training jobs should only run on GPU nodes, while data preprocessing should only run on CPU nodes. Which Kubernetes mechanism enforces this placement?

Horizontal Pod Autoscaler (HPA)
Node selectors with taints and tolerations
Vertical Pod Autoscaler (VPA)
Cluster Autoscaler

2. An organization wants to deploy an AI model for legal document review but cannot afford the compute to train a model from scratch. What is the most practical approach?

Use a rule-based system instead of AI
Fine-tune a pre-trained foundation model on a smaller domain-specific dataset
Train a small model from scratch on publicly available legal data only
Deploy a pre-trained model without any adaptation and accept lower accuracy

3. Why are model registries considered essential for enterprise AI operations?

They automatically train models on new data as it arrives
They provide centralized version control, governance, and lifecycle management for models
They replace the need for Kubernetes in model deployment
They compress model weights to reduce storage costs

4. A model serving 10,000 requests per second uses 32-bit floating-point weights. An engineer proposes quantizing to 8-bit. What is the primary trade-off of this optimization?

The model will require more GPU memory but run faster
The model will use less memory and run faster, with a potential small decrease in accuracy
The model will become more accurate because lower precision reduces noise
The model cannot be quantized once it has been deployed to production

5. What is the relationship between HPA, VPA, and Cluster Autoscaler in managing AI workloads on Kubernetes?

They are mutually exclusive -- only one can be active at a time
HPA scales pod count, VPA adjusts individual pod resources, and Cluster Autoscaler adds or removes nodes to meet overall demand
HPA manages GPU allocation, VPA manages CPU allocation, and Cluster Autoscaler manages network bandwidth
They all perform the same function but at different time intervals

Your Progress

Answer Explanations