1. A financial services company needs its chatbot to answer questions about policies that change weekly. Which approach best balances accuracy with operational simplicity?
Retrain the entire foundation model every week with updated policy data
Use a RAG pipeline that retrieves current policy documents at inference time
Fine-tune the model on each new policy document as it is released
Cache all previous chatbot responses and serve them for similar queries
2. A team is training a 175-billion-parameter language model that cannot fit into the memory of any single GPU. Which distributed training strategy directly addresses this constraint?
Data parallelism, because it splits the dataset across GPUs
Gradient checkpointing, because it reduces memory usage per layer
Model parallelism, because it partitions the model itself across multiple GPUs
Batch size reduction, because smaller batches require less GPU memory
3. An e-commerce platform experiences 10x traffic spikes during flash sales. Which inference serving characteristic is most critical for handling these spikes?
Switching from GPU to CPU inference to reduce costs
Pre-computing all possible recommendations in batch mode before the sale
Autoscaling inference pods horizontally to match demand in real time
Increasing the precision of model weights from 8-bit to 32-bit
4. Why is a vector database considered critical infrastructure in a RAG pipeline, rather than a traditional relational database?
Vector databases are cheaper to operate than relational databases
Vector databases can store and query high-dimensional embeddings for fast similarity search
Relational databases cannot store text data at the scale RAG requires
Vector databases automatically generate embeddings from raw documents
5. What distinguishes a foundation model from a traditional task-specific AI model in terms of how organizations deploy them?
Foundation models are always smaller and faster than task-specific models
Foundation models can only be used for text generation tasks
Foundation models are pre-trained on broad data and adapted to many downstream tasks via fine-tuning or prompting
Foundation models do not require GPU resources for inference
An AI workload is any computational task performed as part of building, deploying, or operating an AI system. Different workload types place radically different demands on compute, storage, and networking. Understanding these categories is essential for designing data center AI infrastructure.
Key Points
- RAG supplements a model's built-in knowledge with real-time retrieval from vector databases, keeping outputs current without retraining.
- Training is compute-intensive and sustained (days/weeks), using data parallelism (split data) or model parallelism (split model) across GPU clusters.
- Inference is latency-sensitive and high-volume, served via real-time, batch, streaming, or edge patterns depending on the use case.
- Foundation models (e.g., GPT, DALL-E) are pre-trained on broad datasets and adapted via fine-tuning or prompting, representing the most resource-intensive workload category.
- Each workload type drives distinct infrastructure requirements: fast storage for training, low-latency networking for inference, vector databases for RAG, and massive GPU memory for generative AI.
RAG Workloads
Retrieval-Augmented Generation (RAG) improves generative AI accuracy by querying a vector database at inference time, finding relevant documents, and injecting that context into the prompt before the model generates its response. This avoids costly retraining when information changes frequently.
```mermaid
flowchart LR
A["User Query"] --> B["Embedding\nGeneration"]
B --> C["Vector Storage\n& Indexing"]
D["Documents"] --> E["Embedding\nGeneration"]
E --> C
B --> F["Retrieval Query\nExecution"]
F --> C
C --> F
F --> G["Relevant\nContext"]
G --> H["LLM Inference\n(Generation)"]
A --> H
H --> I["Augmented\nResponse"]
style A fill:#4a90d9,color:#fff
style I fill:#2ecc71,color:#fff
style C fill:#e67e22,color:#fff
style H fill:#9b59b6,color:#fff
```
| RAG Pipeline Stage | Function | Infrastructure Need |
| --- | --- | --- |
| Embedding Generation | Convert documents and queries into numerical vectors | GPU or CPU compute for embedding models |
| Vector Storage & Indexing | Store high-dimensional embeddings for fast retrieval | Vector databases (Weaviate, Pinecone, Milvus) |
| Retrieval Query Execution | Find the most relevant documents via similarity search | Low-latency compute and network I/O |
| Inference (Generation) | Produce the final response using retrieved context | GPU-accelerated model serving |
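The embedding and retrieval stages above can be sketched in a few lines of Python. Here `embed` is a deliberately toy bag-of-words stand-in for a real embedding model (the `VOCAB` list is illustrative), and the in-memory `index` array stands in for a vector database performing cosine-similarity search:

```python
import numpy as np

# Tiny fixed vocabulary as a stand-in for a learned embedding model
VOCAB = ["refund", "refunds", "shipping", "ships", "privacy", "data", "days", "policy"]

def embed(text: str) -> np.ndarray:
    """Toy embedding: bag-of-words counts over VOCAB, L2-normalized."""
    words = [w.strip("?:.,") for w in text.lower().split()]
    vec = np.array([float(words.count(v)) for v in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# "Vector database": pre-embedded policy documents
docs = [
    "refund policy: refunds within 30 days",
    "shipping policy: ships in 2 business days",
    "privacy policy: we never sell user data",
]
index = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 1) -> list:
    """Similarity search: dot product of unit vectors = cosine similarity."""
    scores = index @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# Inject the retrieved context into the prompt before generation
context = retrieve("how do refunds work?")
prompt = f"Context: {context}\n\nQuestion: how do refunds work?"
```

A production pipeline replaces `embed` with a model such as a sentence transformer and the `index` array with an approximate-nearest-neighbor index, but the data flow is the same.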
Training and Distributed Training
Training teaches a model to recognize patterns by exposing it to large datasets and iteratively adjusting weights. Large-scale training requires GPU clusters and high-performance storage (NVMe SSDs, Lustre) to keep GPUs fed with data.
Distributed training splits workloads across multiple processors using two primary strategies:
| Strategy | How It Works | Best For |
| --- | --- | --- |
| Data Parallelism | Dataset split across GPUs; each holds a full model copy; gradients synchronized | Models that fit in one GPU but need faster training on large datasets |
| Model Parallelism | Model itself is partitioned across GPUs; each computes a portion | Models too large to fit in a single GPU's memory |
```mermaid
flowchart TD
subgraph DP["Data Parallelism"]
direction TB
DS["Full Dataset"] --> S1["Shard 1"]
DS --> S2["Shard 2"]
DS --> S3["Shard N"]
S1 --> G1["GPU 1\n(Full Model Copy)"]
S2 --> G2["GPU 2\n(Full Model Copy)"]
S3 --> G3["GPU N\n(Full Model Copy)"]
G1 --> SYNC["Gradient\nSynchronization"]
G2 --> SYNC
G3 --> SYNC
SYNC --> UP["Updated\nModel"]
end
subgraph MP["Model Parallelism"]
direction TB
IN["Input Data"] --> L1["GPU 1\n(Layers 1-4)"]
L1 --> L2["GPU 2\n(Layers 5-8)"]
L2 --> L3["GPU N\n(Layers 9-12)"]
L3 --> OUT["Output"]
end
style DP fill:#eaf2f8,stroke:#2980b9
style MP fill:#fdf2e9,stroke:#e67e22
style SYNC fill:#2ecc71,color:#fff
style OUT fill:#2ecc71,color:#fff
```
Hybrid Parallelism: Cutting-edge models like GPT and PaLM combine both data and model parallelism. Within a node, GPUs connect via NVLink; across nodes, InfiniBand provides the high-speed fabric. Optimizing this communication is critical -- inter-node overhead can negate the benefit of adding more GPUs.
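Data parallelism's gradient-synchronization step can be illustrated with plain NumPy. The four "workers" below are simulated in one process, and averaging their shard gradients plays the role of the all-reduce; the linear-regression "model" is a stand-in for a neural network:

```python
import numpy as np

# Toy linear-regression problem standing in for a neural network
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))        # full training set
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)    # model weights, replicated on every "GPU"
lr = 0.1

def local_gradient(w, X_shard, y_shard):
    """MSE gradient computed independently on one worker's data shard."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

for step in range(200):
    # Scatter: each of 4 simulated workers sees only its shard of the data
    grads = [local_gradient(w, Xs, ys)
             for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
    # "All-reduce": average the per-worker gradients so every replica
    # applies the identical update (equal shard sizes keep this exact)
    w -= lr * np.mean(grads, axis=0)
```

In a real cluster the all-reduce is a collective operation over NVLink or InfiniBand (e.g., NCCL), which is exactly the communication cost the hybrid-parallelism note above warns about.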
Inference Workloads and Serving Patterns
Inference uses a trained model to make predictions on new data. Unlike training, inference is an ongoing, production-facing workload with strict latency requirements.
```mermaid
flowchart TD
TM["Trained Model"] --> SP["Serving Patterns"]
SP --> RT["Real-time\n(Online)"]
SP --> BA["Batch"]
SP --> ST["Streaming"]
SP --> ED["Edge"]
RT --> RT_D["Single requests\n< 100ms latency\ne.g., Chatbots, Fraud Detection"]
BA --> BA_D["Scheduled bulk jobs\nMinutes-to-hours latency\ne.g., Nightly Recommendations"]
ST --> ST_D["Continuous data streams\nLow latency, high throughput\ne.g., Video Analytics, IoT"]
ED --> ED_D["On-device inference\nUltra-low latency, offline\ne.g., Autonomous Vehicles"]
style TM fill:#9b59b6,color:#fff
style RT fill:#e74c3c,color:#fff
style BA fill:#3498db,color:#fff
style ST fill:#2ecc71,color:#fff
style ED fill:#f39c12,color:#fff
```
| Serving Pattern | Latency Requirement | Example Use Case |
| --- | --- | --- |
| Real-time (Online) | Sub-second (typically < 100ms) | Chatbot responses, fraud detection |
| Batch | Minutes to hours | Nightly product recommendation updates |
| Streaming | Low latency, high throughput | Video surveillance, IoT sensor analysis |
| Edge | Ultra-low latency, offline capable | Autonomous vehicles, factory floor inspection |
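The request batching behind high-throughput serving can be sketched as a toy micro-batcher. A production server (Triton-style dynamic batching, for example) would also flush on a deadline so early requests are not stranded; this sketch flushes only when the batch fills:

```python
from dataclasses import dataclass, field

@dataclass
class MicroBatcher:
    """Queue requests and run the model once per batch instead of once per request."""
    max_batch: int = 4
    pending: list = field(default_factory=list)

    def submit(self, request, run_model):
        self.pending.append(request)
        if len(self.pending) >= self.max_batch:
            return self.flush(run_model)
        return None  # caller waits; a real server would also flush on a deadline

    def flush(self, run_model):
        batch, self.pending = self.pending, []
        return run_model(batch)  # one forward pass amortized over the whole batch

calls = []  # record how many requests each model invocation served

def run_model(batch):
    """Toy 'model': upper-cases each input; stands in for a batched forward pass."""
    calls.append(len(batch))
    return [x.upper() for x in batch]

b = MicroBatcher(max_batch=3)
b.submit("a", run_model)            # queued
b.submit("b", run_model)            # queued
out = b.submit("c", run_model)      # batch full: one model call serves all three
```

Batching trades a small amount of per-request latency for much higher GPU utilization, which is why it pairs naturally with the real-time serving pattern above.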
Generative AI and Foundation Models
Generative AI creates new content (text, images, code, audio, video) rather than classifying existing data. These systems are built on foundation models -- large-scale models pre-trained on broad, diverse datasets that can be adapted to many downstream tasks.
- Training a foundation model can cost millions of dollars and take weeks on thousands of GPUs
- Serving requires optimization (quantization, batching, caching) to control costs at scale
- Fine-tuning for a specific domain is far more economical, requiring a fraction of the original compute
Animation Placeholder: Interactive comparison of AI workload types -- drag sliders for compute, latency, and storage requirements to see how each workload type (Training, Inference, RAG, GenAI) maps to infrastructure needs.
Animation Placeholder: Animated RAG pipeline flow -- watch a user query travel through embedding generation, vector retrieval, context injection, and LLM response generation step by step.
1. A hospital's diagnostic imaging model has been in production for one year. Clinicians report declining accuracy. What is the most likely root cause from a lifecycle perspective?
The model's source code has developed bugs over time
Model drift -- the statistical properties of incoming data have changed since the model was trained
The model's weights decay naturally over time without use
The GPU hardware has degraded, causing computation errors
2. Why is data preparation typically considered the most time-consuming phase of the AI lifecycle?
Because modern GPUs are too slow to process raw data
Because it involves collection, cleansing, feature engineering, and labeling -- each requiring significant effort and domain expertise
Because data must be converted to a proprietary format before any model can use it
Because regulatory requirements mandate a minimum data preparation duration
3. A team splits their dataset 70/15/15 into training, validation, and test sets. What is the primary purpose of the validation set?
To provide additional training data when the model's accuracy is low
To detect overfitting by evaluating the model on data it has never seen during training
To generate labeled examples for supervised learning
To benchmark the model's speed in a production environment
4. What makes the AI lifecycle fundamentally iterative rather than a one-time linear process?
AI models must be rebuilt from scratch whenever new data arrives
Monitoring reveals performance changes that feed back into data collection and retraining
Regulatory bodies require a fixed number of iteration cycles before deployment
Each iteration uses a completely different algorithm to avoid bias
5. A company is deploying a new model version. Which deployment strategy minimizes risk by gradually shifting traffic from the old to the new model?
Big-bang deployment -- replace the old model all at once
Canary deployment -- route a small percentage of traffic to the new model first
Shadow deployment -- run the new model but discard its outputs
Offline deployment -- test the new model only on historical data
The AI lifecycle is the end-to-end process of developing, deploying, and maintaining AI systems. It unfolds across three overarching phases -- design, development, and deployment -- and is fundamentally iterative: each phase is revisited many times as teams refine their approach.
Key Points
- Data preparation (collection, cleansing, feature engineering, labeling) is typically the most time-consuming and impactful phase -- poor data quality propagates through the entire pipeline.
- Model development involves selecting an architecture, training on prepared data, and validating on held-out data to detect overfitting.
- Deployment packages models as containerized services with API endpoints, often using canary or blue-green strategies to minimize risk.
- Monitoring tracks accuracy, latency, throughput, and data drift in production; drift causes performance degradation over time.
- The monitoring-retraining feedback loop is what makes the lifecycle iterative -- production insights drive new data collection and model updates.
Data Collection, Preparation, and Labeling
| Activity | Purpose | Common Challenge |
| --- | --- | --- |
| Collection | Gather sufficient, representative raw data | Data silos, privacy regulations, incomplete coverage |
| Cleansing | Ensure data accuracy and consistency | Scale of dirty data, ambiguous error patterns |
| Feature Engineering | Create informative model inputs from raw data | Requires domain expertise, risk of data leakage |
| Labeling | Provide ground truth for supervised learning | Expensive, subjective, requires quality control |
Model Selection, Training, and Validation
Model selection involves comparing algorithms or architectures for the task at hand. Training iteratively adjusts model weights to minimize a loss function, configured with an optimizer (Adam, SGD), learning rate, batch size, and epoch count.
Validation evaluates performance on unseen data to detect overfitting:
- Holdout validation: Split data into training/validation/test sets (e.g., 70/15/15)
- Cross-validation: Rotate which subset serves as validation across multiple runs
- A/B testing: Compare model performance against a baseline in production
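A minimal holdout split along the 70/15/15 lines above can be sketched with NumPy; it is index-based, so it works for any array-like dataset:

```python
import numpy as np

def holdout_split(n_samples, train=0.70, val=0.15, seed=42):
    """Shuffle sample indices, then cut them into train/validation/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(round(n_samples * train))
    n_val = int(round(n_samples * val))
    return (idx[:n_train],                     # fit weights here
            idx[n_train:n_train + n_val],      # tune and detect overfitting here
            idx[n_train + n_val:])             # touch once, for the final estimate

train_idx, val_idx, test_idx = holdout_split(1000)  # 700 / 150 / 150 samples
```

Shuffling before splitting matters: if the data is ordered (by time, class, or source), an unshuffled split gives the validation set a different distribution than the training set.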
Deployment, Monitoring, and Retraining
Deployment packages the model (often as a containerized microservice) with API endpoints. Modern deployments use canary or blue-green strategies to minimize rollout risk.
Model Drift: When the statistical properties of input data change over time, model performance degrades. A fraud detection model trained on pre-pandemic patterns may lose accuracy as consumer behavior shifts. Continuous monitoring and scheduled retraining close this loop.
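One simple drift signal is a z-score comparing a feature's live mean against its training-time mean. Production monitoring stacks typically use richer tests (population stability index, Kolmogorov-Smirnov), but the idea can be sketched on synthetic data:

```python
import numpy as np

def mean_shift_zscore(reference, live):
    """How many standard errors the live mean has moved from the reference mean."""
    ref_mean, ref_std = reference.mean(), reference.std(ddof=1)
    standard_error = ref_std / np.sqrt(len(live))
    return abs(live.mean() - ref_mean) / standard_error

rng = np.random.default_rng(1)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # feature at training time
stable = rng.normal(loc=0.0, scale=1.0, size=500)      # production, same distribution
drifted = rng.normal(loc=0.5, scale=1.0, size=500)     # production, mean has shifted

# A z-score of a few (say, above 3-5) is a signal to investigate and likely retrain
print(mean_shift_zscore(reference, stable))   # small: no drift alarm
print(mean_shift_zscore(reference, drifted))  # large: drift alarm
```

This is exactly the monitoring-to-retraining feedback loop in the diagram that follows: the alarm triggers new data collection, not a code fix.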
```mermaid
stateDiagram-v2
[*] --> DataCollection: Start
DataCollection: Data Collection & Preparation
ModelTraining: Model Training & Selection
Validation: Validation & Testing
Deployment: Deployment
Monitoring: Monitoring & Maintenance
Retraining: Retraining & Updates
DataCollection --> ModelTraining
ModelTraining --> Validation
Validation --> Deployment
Deployment --> Monitoring
Monitoring --> Retraining: Drift detected or\nperformance degrades
Retraining --> DataCollection: New data ingested
note right of Monitoring
Track accuracy, latency,
throughput, data drift
end note
note left of DataCollection
Collection, cleansing,
feature engineering, labeling
end note
```
Animation Placeholder: Interactive AI lifecycle wheel -- click each phase to expand details showing the activities, tools, and infrastructure requirements at each stage, with animated arrows showing the feedback loop.
1. A Kubernetes cluster has both GPU and CPU nodes. Training jobs should only run on GPU nodes, while data preprocessing should only run on CPU nodes. Which Kubernetes mechanism enforces this placement?
Horizontal Pod Autoscaler (HPA)
Node selectors with taints and tolerations
Vertical Pod Autoscaler (VPA)
Cluster Autoscaler
2. An organization wants to deploy an AI model for legal document review but cannot afford the compute to train a model from scratch. What is the most practical approach?
Use a rule-based system instead of AI
Fine-tune a pre-trained foundation model on a smaller domain-specific dataset
Train a small model from scratch on publicly available legal data only
Deploy a pre-trained model without any adaptation and accept lower accuracy
3. Why are model registries considered essential for enterprise AI operations?
They automatically train models on new data as it arrives
They provide centralized version control, governance, and lifecycle management for models
They replace the need for Kubernetes in model deployment
They compress model weights to reduce storage costs
4. A model serving 10,000 requests per second uses 32-bit floating-point weights. An engineer proposes quantizing to 8-bit. What is the primary trade-off of this optimization?
The model will require more GPU memory but run faster
The model will use less memory and run faster, with a potential small decrease in accuracy
The model will become more accurate because lower precision reduces noise
The model cannot be quantized once it has been deployed to production
5. What is the relationship between HPA, VPA, and Cluster Autoscaler in managing AI workloads on Kubernetes?
They are mutually exclusive -- only one can be active at a time
HPA scales pod count, VPA adjusts individual pod resources, and Cluster Autoscaler adds or removes nodes to meet overall demand
HPA manages GPU allocation, VPA manages CPU allocation, and Cluster Autoscaler manages network bandwidth
They all perform the same function but at different time intervals
An AI-ML cluster is a collection of interconnected compute nodes -- typically equipped with GPUs or accelerators -- that work together to execute AI workloads. These clusters are managed by orchestrators such as Kubernetes, Slurm, and Ray.
Key Points
- Kubernetes is the de facto standard for AI-ML cluster orchestration, using a control plane (API Server, Scheduler, Controller Manager) and worker nodes.
- Three scaling mechanisms -- HPA (pod replicas), VPA (pod resource sizing), and Cluster Autoscaler (node count) -- work together for comprehensive resource management.
- Pre-trained models and fine-tuning dramatically reduce cost, data needs, and time-to-deployment compared to training from scratch.
- Model registries and CI/CD pipelines are essential for version control, governance, and automated deployment of AI models in enterprise environments.
- Production optimization combines model-level techniques (quantization, batching, distillation) with infrastructure-level controls (resource limits, node affinity, autoscaling).
Cluster Architecture and Management
```mermaid
graph TD
subgraph CP["Control Plane"]
API["API Server"]
SCHED["Scheduler"]
CM["Controller Manager"]
end
subgraph Workers["Worker Nodes"]
subgraph GPU_N["GPU Node"]
P1["Training Pod"]
P2["Inference Pod"]
end
subgraph CPU_N["CPU Node"]
P3["Data Preprocessing Pod"]
P4["Feature Engineering Pod"]
end
end
API --> SCHED
SCHED --> GPU_N
SCHED --> CPU_N
HPA["Horizontal Pod\nAutoscaler (HPA)"] -->|"Scale pod replicas"| Workers
VPA["Vertical Pod\nAutoscaler (VPA)"] -->|"Adjust pod resources"| Workers
CA["Cluster\nAutoscaler"] -->|"Add/remove nodes"| Workers
style CP fill:#2c3e50,color:#fff
style GPU_N fill:#8e44ad,color:#fff
style CPU_N fill:#2980b9,color:#fff
style HPA fill:#27ae60,color:#fff
style VPA fill:#e67e22,color:#fff
style CA fill:#c0392b,color:#fff
```
| Scaling Mechanism | Function | AI/ML Use Case |
| --- | --- | --- |
| Horizontal Pod Autoscaler (HPA) | Adds or removes pod replicas based on metrics | Scaling inference endpoints during traffic spikes |
| Vertical Pod Autoscaler (VPA) | Adjusts CPU and memory for individual pods | Right-sizing training job resource allocations |
| Cluster Autoscaler | Adds or removes worker nodes | Provisioning additional GPU nodes for large training jobs |
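HPA's core scaling rule is a simple ratio, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), which can be sketched directly (the utilization figures below are illustrative):

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Kubernetes HPA rule: desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Flash sale: inference pods hit 90% average utilization against a 60% target
print(hpa_desired_replicas(4, 90, 60))  # 6 replicas: scale out
# Traffic subsides to 30% utilization
print(hpa_desired_replicas(6, 30, 60))  # 3 replicas: scale in
```

The real controller adds tolerances, stabilization windows, and min/max replica bounds around this formula, but the ratio is the heart of how inference endpoints track traffic spikes.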
Kubernetes uses resource requests and limits to guarantee GPU and memory allocation while preventing resource monopolization. Node selectors and taints/tolerations direct workloads to appropriate hardware -- training jobs to GPU nodes, preprocessing to CPU nodes.
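As a sketch, a training pod pinned to GPU nodes might combine a node selector with a toleration like the following. The `accelerator` label, the `gpu` taint key, and the image name are illustrative, not Kubernetes defaults; `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  nodeSelector:
    accelerator: nvidia-gpu    # only schedule onto nodes carrying this label
  tolerations:
  - key: gpu                   # GPU nodes are tainted gpu=true:NoSchedule, so
    operator: Equal            # only pods tolerating the taint may land there
    value: "true"
    effect: NoSchedule
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 1      # device-plugin resource; GPUs are requested via limits
```

The taint keeps CPU-only workloads (preprocessing, feature engineering) off scarce GPU nodes, while the selector keeps the training pod off CPU nodes.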
Acquiring and Fine-Tuning Pre-Trained Models
| Approach | Data Required | Compute Cost | Time to Deploy | Best For |
| --- | --- | --- | --- | --- |
| Training from scratch | Massive (millions+ examples) | Very high (weeks on GPU clusters) | Months | Novel problem domains with no existing models |
| Fine-tuning | Moderate (hundreds to thousands) | Moderate (hours to days on GPUs) | Days to weeks | Adapting a general model to a specific domain |
| Using as-is | None (or minimal prompt examples) | Low (inference only) | Immediate | General tasks where the base model is sufficient |
Enterprise Best Practices: Use model registries for centralized governance and version control. Implement CI/CD pipelines (Jenkins, GitLab CI/CD, Argo CD) to automate build, test, and deployment. Ensure high-quality training data -- it is the foundation of effective fine-tuning.
Optimizing Performance and Utilization
Model-level optimizations:
- Quantization: Reduce numerical precision (32-bit to 8-bit) to decrease memory usage and increase inference speed with minimal accuracy loss
- Batching: Group multiple inference requests to maximize GPU utilization
- Model distillation: Train a smaller "student" model to mimic a larger "teacher" model for resource-constrained environments
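The memory/accuracy trade-off of quantization can be made concrete with a symmetric int8 scheme in NumPy. This is a sketch with one scale per tensor; real toolkits typically use per-channel scales and calibration data:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric 8-bit quantization with a single scale for the whole tensor."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # stand-in for a weight tensor
q, scale = quantize_int8(w)

print(q.nbytes, w.nbytes)  # 1000 vs 4000 bytes: 4x memory saving
# Reconstruction error is bounded by half a quantization step
print(np.abs(dequantize(q, scale) - w).max())
```

The 4x memory reduction is exact; the accuracy cost shows up as the bounded rounding error per weight, which is why quantized models usually lose only a small amount of accuracy.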
Infrastructure-level optimizations:
- Use Kubernetes resource requests/limits for guaranteed GPU and memory allocation
- Apply node selectors, taints, and tolerations for hardware-appropriate scheduling
- Implement autoscaling (HPA, VPA, Cluster Autoscaler) to match resources to demand
- Schedule training jobs during off-peak hours to maximize cluster utilization
Animation Placeholder: Interactive Kubernetes cluster visualizer -- watch pods scale up and down across GPU and CPU nodes as simulated training and inference workloads arrive, with HPA/VPA/Cluster Autoscaler indicators.
Animation Placeholder: Side-by-side model acquisition comparison -- animate the timeline, cost, and data requirements for training from scratch vs. fine-tuning vs. using a pre-trained model as-is.