Chapter 8: Networking and Security for AI Workloads

Learning Objectives

Pre-Study Assessment

1. What is the primary advantage of RDMA over traditional TCP/IP for distributed training?

It supports higher MTU sizes on Ethernet networks
It allows the NIC to read/write remote memory without involving the remote CPU, eliminating kernel buffer copies
It uses UDP instead of TCP to reduce connection overhead
It compresses gradient data before transmission to reduce bandwidth

2. What does GPUDirect RDMA eliminate from the data path during distributed training?

The need for a network switch between GPU nodes
The CPU-mediated copies between GPU memory and the NIC
The requirement for InfiniBand hardware
The gradient synchronization step in AllReduce

3. What is the role of Multus CNI in a Kubernetes AI cluster?

It replaces the default CNI plugin with a higher-performance alternative
It enables pods to have multiple network interfaces by delegating to multiple CNI plugins
It encrypts all pod-to-pod traffic using IPsec
It enforces NetworkPolicy rules at Layer 7

4. What should be the first NetworkPolicy applied to every AI namespace?

Allow all egress to the internet for package updates
Allow all ingress from the monitoring namespace
A default-deny policy that blocks all ingress and egress
A policy that limits bandwidth per pod

5. In Istio's STRICT mTLS mode, what happens to unencrypted connections?

They are encrypted automatically by the sidecar proxy
They are logged and allowed through for backward compatibility
They are rejected at the sidecar proxy before reaching the application
They are routed through a separate unencrypted channel

6. Why is granting cluster-admin to training job service accounts a security risk?

It prevents the training job from accessing GPU resources
It creates a lateral movement vector with full cluster access if the job is compromised
It causes RBAC policy conflicts with namespace-scoped roles
It increases the memory footprint of the training pod

7. Which Pod Security Standard profile should be applied to AI inference namespaces in multi-tenant clusters?

Privileged
Baseline
Restricted
Custom (via OPA/Gatekeeper only)

8. What is the recommended way to provide secrets (API keys, credentials) to AI training pods?

As environment variables from a ConfigMap
Hardcoded in the container image
As files mounted via volumeMounts with restrictive permissions (0400)
As command-line arguments to the container entrypoint

9. What does SR-IOV provide to Kubernetes pods?

Software-defined virtual NICs with shared DMA queues
Hardware-isolated Virtual Functions with their own DMA engines, providing near-native NIC performance
Automatic load balancing across multiple physical NICs
Encrypted tunnels between pods on different nodes

10. For HIPAA compliance on a Kubernetes AI cluster, which combination of controls is required?

RBAC only, since access control covers all HIPAA requirements
PV encryption + TLS + RBAC + Secrets management
NetworkPolicy + Pod Security Standards only
Service mesh with mTLS only, since encryption covers all requirements

11. Why should training pods be denied egress to the public internet?

Internet traffic would consume bandwidth needed for gradient synchronization
Training pods have no legitimate need for internet access and allowing it creates a data exfiltration path
Kubernetes DNS does not resolve external domains by default
Internet access would trigger Pod Security Admission violations

12. What is the purpose of tainting GPU nodes with nvidia.com/gpu=true:NoSchedule?

To enable GPUDirect RDMA on those nodes
To prevent non-GPU workloads from being scheduled on expensive GPU nodes
To label nodes for the NVIDIA device plugin discovery
To enforce Pod Security Standards on GPU pods

Section 1: High-Performance Networking for Distributed Training

Ordinary TCP/IP networking was designed for flexibility and correctness, not for the microsecond-latency, near-zero-CPU-overhead requirements of distributed deep learning. When a training job executes an AllReduce collective across 256 GPUs, each process must exchange gradient tensors with its peers hundreds of times per training step. The bottleneck is not GPU computation — it is the time gradients spend waiting in CPU buffers and kernel queues.

RDMA and InfiniBand with Kubernetes

Remote Direct Memory Access (RDMA) allows a network adapter to read from and write to the memory of a remote host without involving the remote CPU. This eliminates the copy-to-kernel-buffer and context-switch overhead that dominates conventional networking at high message rates.

InfiniBand is the dominant physical transport for RDMA in HPC and AI data centers, providing link-level reliability, congestion management, and switch fabrics rated at 200 Gb/s (HDR) or 400 Gb/s (NDR) per port, with end-to-end latency measured in single-digit microseconds.

The NVIDIA Network Operator manages the full driver and plugin lifecycle to expose InfiniBand or RoCE devices to Kubernetes pods: MOFED/DOCA-OFED drivers, the RDMA shared device plugin (advertising rdma/ib resources), and the nvidia-peermem module for GPU-NIC peer-to-peer DMA.
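Once the operator has the device plugin running, a pod requests an RDMA device alongside its GPU through the resources block. The sketch below assumes the plugin advertises the resource as rdma/ib (the exact name depends on plugin configuration) and uses a placeholder image name:

```yaml
# Sketch: a training pod requesting one GPU plus one RDMA device.
# The resource name rdma/ib and the image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: "1"
        rdma/ib: "1"
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]   # RDMA verbs need to pin (lock) memory regions
```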

GPUDirect RDMA for GPU-to-GPU Communication

GPUDirect RDMA extends RDMA into the GPU memory address space. Without it, gradient data travels: GPU memory → CPU pinned buffer → NIC → Network → NIC → CPU pinned buffer → GPU memory. With GPUDirect, the NIC can DMA directly into and out of the GPU's BAR memory region, eliminating both CPU copies.

NCCL can achieve 90–95% of theoretical InfiniBand bandwidth with GPUDirect RDMA, versus 40–60% through a CPU-mediated path.

Animation: RDMA vs TCP Networking

Animation summary: the traditional TCP/IP path runs GPU memory → CPU/kernel socket buffer → NIC → network → NIC → socket buffer → GPU memory (8 hops, 4 CPU copies, kernel context switches, ~50–100 µs latency). The RDMA kernel-bypass path runs GPU BAR memory → RDMA NIC → InfiniBand → RDMA NIC → GPU BAR memory via PCIe peer-to-peer DMA (4 hops, zero CPU copies, ~1–2 µs latency), roughly 50–100x lower latency.

SR-IOV for High-Bandwidth Network Interfaces

SR-IOV (Single Root I/O Virtualization) partitions a single physical NIC into multiple lightweight Virtual Functions (VFs). Each VF has its own DMA engine, queues, and interrupt lines. When assigned exclusively to a pod, the pod gets near-native NIC performance with hardware-enforced isolation.

Deploying SR-IOV in Kubernetes requires three components: the SR-IOV CNI plugin (moves a VF into the pod network namespace), the SR-IOV device plugin (advertises VF resources to the scheduler), and the SR-IOV Network Operator (automates VF creation).

Multus CNI for Secondary Network Attachments

Multus CNI acts as a meta-CNI plugin multiplexer. It creates the primary interface using the cluster's default CNI, then creates additional interfaces using other CNI plugins. This separates management traffic (Kubernetes API, health probes) from high-speed gradient communication on a dedicated InfiniBand or SR-IOV network.
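A secondary attachment is declared as a NetworkAttachmentDefinition and referenced from a pod annotation; Multus creates eth0 via the default CNI and the extra interface via the named definition. The sketch below wires an SR-IOV VF as the secondary network (the names sriov-ib, nvidia.com/sriov_ib, and the IPAM range are illustrative, and the whereabouts IPAM plugin is assumed to be installed):

```yaml
# Sketch: SR-IOV secondary network attached through Multus.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-ib
  annotations:
    # Ties this network to the VF pool advertised by the SR-IOV device plugin
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_ib
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  annotations:
    # Secondary interface; eth0 stays on the cluster's default CNI
    k8s.v1.cni.cncf.io/networks: sriov-ib
spec:
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/sriov_ib: "1"   # reserves one VF for this pod
```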

Table: RDMA Transport Options
Transport       | Bandwidth         | Latency  | Notes
InfiniBand HDR  | 200 Gb/s per port | ~1 µs    | Industry standard for HPC/AI
InfiniBand NDR  | 400 Gb/s per port | <1 µs    | Latest generation
RoCEv2          | 100 Gb/s per port | 2–5 µs   | Runs on existing Ethernet switches
iWARP           | 25–100 Gb/s       | 5–10 µs  | Software-friendly, higher latency

Key Takeaway

High-performance distributed training depends on removing the CPU from the data path: RDMA bypasses kernel buffers, GPUDirect RDMA extends that bypass into GPU memory, SR-IOV gives pods hardware-isolated NIC access, and Multus attaches those fast interfaces alongside the default cluster network.

Section 2: Network Policies and Isolation

In a shared AI cluster, different teams run workloads with different data sensitivity levels. Without explicit network isolation, those workloads can communicate freely at the pod level. Network isolation is a defense-in-depth strategy that starts with denying everything and explicitly allowing only what is needed.

Kubernetes NetworkPolicy for AI Namespace Isolation

The most important first step is a default-deny policy applied to every AI namespace. This blocks all ingress and egress by default, then you add explicit allow rules only for what each workload legitimately needs.
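Such a policy is short: an empty podSelector matches every pod in the namespace, and listing both policy types with no allow rules denies all traffic. The namespace name below is illustrative:

```yaml
# Default-deny for an AI namespace: selects every pod, allows nothing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-training   # placeholder namespace
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```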

Training pods should have no internet egress — there is no legitimate reason for a gradient synchronization process to reach the public internet, and allowing it creates an exfiltration path.
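On top of default-deny, trainers get only the egress they need: DNS plus the internal services they read from and write to. A sketch matching the allow rules described in this section (namespace and label names are illustrative):

```yaml
# Sketch: allow trainer pods DNS plus egress to the data-storage
# namespace; with default-deny in place, internet egress stays blocked.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: trainer-egress
  namespace: ai-training
spec:
  podSelector:
    matchLabels:
      role: trainer
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: data-storage
    ports:
    - protocol: TCP
      port: 443
  - ports:                 # cluster DNS
    - protocol: UDP
      port: 53
```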

Service Mesh (Istio, Linkerd)

For inference workloads, a NetworkPolicy alone is insufficient. A service mesh injects a sidecar proxy into every pod, transparently intercepting all connections to provide: traffic management (canary deployments, A/B testing), observability (per-route latency p50/p95/p99), and security (mTLS between all pods).

mTLS for Secure Inter-Pod Communication

Mutual TLS (mTLS) means both ends of a connection present and verify certificates. Each pod's identity is cryptographically bound to its Kubernetes service account. In Istio's STRICT mode, unencrypted or unauthenticated connections are rejected at the sidecar proxy before reaching the application.
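In Istio, STRICT mode is switched on with a PeerAuthentication resource scoped to a namespace (or mesh-wide when applied to the root namespace). The namespace name below is illustrative:

```yaml
# Enforce mTLS for every workload in the namespace;
# plaintext connections are rejected at the sidecar.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: ai-inference   # placeholder namespace
spec:
  mtls:
    mode: STRICT
```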

Animation: Network Policy Enforcement

Animation summary: in the ai-training namespace, training and eval pods operate under default deny. Explicit rules allow egress to object storage in data-storage (TCP 443), to the model registry in model-registry (TCP 5000), and DNS (UDP 53); egress to the public internet and cross-namespace ingress from a research namespace are blocked because no rule permits them.

Key Takeaway

Start every AI namespace from default-deny and add narrowly scoped allow rules; for inference traffic, layer a service mesh with STRICT mTLS on top so every connection is both encrypted and identity-verified.

Section 3: Security Best Practices

RBAC for Multi-Team AI Clusters

RBAC misconfiguration consistently ranks among the leading root causes in Kubernetes security-incident surveys. The single most damaging mistake is granting cluster-admin to training job service accounts. A well-structured cluster uses namespace-scoped roles that match what each job type actually needs.

A training job needs to: read ConfigMaps (hyperparameters), read Secrets (data credentials), and list Pods (workers). It does not need to list nodes, modify RBAC policies, or access other namespaces.
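Those needs translate directly into a namespace-scoped Role and RoleBinding. The namespace and service account names below are illustrative:

```yaml
# Minimal role for a training-job service account,
# matching the three needs listed above and nothing more.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job
  namespace: ai-training
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get"]            # read hyperparameters and data credentials
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]    # discover peer workers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job
  namespace: ai-training
subjects:
- kind: ServiceAccount
  name: trainer             # placeholder SA name
  namespace: ai-training
roleRef:
  kind: Role
  name: training-job
  apiGroup: rbac.authorization.k8s.io
```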

Table: RBAC Roles for AI Cluster Personas
Persona         | Scope                   | Allowed                              | Denied
Data scientist  | Own namespace           | create/delete Jobs, read Logs        | create Roles, other namespaces
ML engineer     | Own NS + model-registry | create Deployments, push to registry | delete Namespaces, modify ClusterRoles
Training job SA | Own namespace           | get ConfigMap/Secret, list Pods      | Everything else
Platform admin  | Cluster-wide            | All (with MFA + audit)               | —

Pod Security Standards and Admission Controllers

Pod Security Admission (PSA) enforces one of three built-in profiles at the namespace level: Privileged (unrestricted), Baseline (blocks known privilege escalations such as privileged containers, host namespaces, and hostPath volumes), and Restricted (the hardened, least-privilege profile appropriate for multi-tenant inference namespaces).

The warn and audit modes surface violations without blocking pods, which is useful during migration before switching to enforce.
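Namespaces opt in via labels; one label per mode. A sketch enforcing the Restricted profile on an inference namespace (the namespace name is illustrative):

```yaml
# Enforce the Restricted profile; warn and audit at the same level
# so violations also show up in kubectl output and audit logs.
apiVersion: v1
kind: Namespace
metadata:
  name: ai-inference
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```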

Image Scanning for ML Container Vulnerabilities

ML containers can reach 20–30 GB, meaning more surface area and more CVEs. Integrate scanning (Trivy, Grype, Snyk Container) into CI/CD, and use admission controllers (Kyverno, OPA/Gatekeeper) to reject pods from untrusted registries or with critical CVEs.
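As a sketch of the admission-control side, a Kyverno ClusterPolicy can reject any pod whose images come from outside an approved registry (registry.example.com is a placeholder; CVE-based blocking would be configured separately in the scanner integration):

```yaml
# Sketch: Kyverno policy rejecting images from untrusted registries.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: trusted-registry-only
spec:
  validationFailureAction: Enforce   # reject, rather than just report
  rules:
  - name: require-trusted-registry
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Images must come from registry.example.com"
      pattern:
        spec:
          containers:
          - image: "registry.example.com/*"
```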

Secrets Management

The most important rule: mount secrets as files, not environment variables. Environment variables are visible in /proc/<pid>/environ, readable by any process running as the same user, and frequently leak into logs and crash dumps. File-mounted secrets with 0400 permissions are far more restrictive.
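A secret volume with a restrictive defaultMode implements this pattern. Secret, pod, and image names below are illustrative:

```yaml
# Mount an API key as a read-only file rather than an env var.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    volumeMounts:
    - name: data-creds
      mountPath: /var/run/secrets/data-creds
      readOnly: true
  volumes:
  - name: data-creds
    secret:
      secretName: data-credentials
      defaultMode: 0400   # owner read-only
```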

For frequently changing secrets or audit trail requirements, use External Secrets Operator or Vault Agent Injector with HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.

Animation: Defense-in-Depth Security Layers

Animation summary: security layers nest around the AI workload from outside in — RBAC (identity layer: namespace-scoped roles, least-privilege service accounts), Pod Security Standards (admission layer: Baseline/Restricted profiles, GPU node taints), NetworkPolicy (network layer: default-deny with explicit egress allows), mTLS (transport layer: Istio STRICT mode, certificate-bound identity), and encryption (data layer: etcd and PVs at rest, TLS in transit, FIPS 140-2). Each layer stops a different attack class; compromising one layer does not give direct access to the inner workload.

Key Takeaway

Apply least privilege at every layer: namespace-scoped RBAC roles, Restricted Pod Security profiles, scanned images admitted only from trusted registries, and secrets mounted as read-only files rather than environment variables.

Section 4: Data Security and Compliance

Encryption at Rest and in Transit

Encryption at rest applies at two levels in Kubernetes: etcd encryption (Secrets and sensitive API objects encrypted with aescbc or aesgcm via KMS provider) and persistent volume encryption (AWS EBS encryption, GCP CMEK, or LUKS-encrypted block devices on-premises).
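The etcd side is configured through an EncryptionConfiguration file passed to the API server. A minimal sketch using AES-GCM (in production a KMS provider would typically be listed first; the key material here is a placeholder):

```yaml
# API server encryption-at-rest config: Secrets encrypted with AES-GCM,
# with the identity provider kept last so existing plaintext can still be read.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - aesgcm:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>   # placeholder
  - identity: {}
```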

Encryption in transit means TLS on all paths carrying training data. Note that RDMA bypasses the kernel network stack and therefore bypasses TLS; for workloads that require encrypted gradient communication, encryption must be enforced below the RDMA layer, for example via IPsec or link-level encryption offloaded to the NIC.

Data Access Auditing and Lineage Tracking

Configure Kubernetes audit logging to capture RequestResponse-level events for sensitive resources (secrets, PVCs). For AI-specific data lineage, tools like MLflow, DVC, and Kubeflow Pipelines track which dataset version produced which model checkpoint.
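An audit Policy expressing that rule looks like the sketch below; note that RequestResponse-level entries include object bodies, so the audit sink itself must be access-controlled accordingly:

```yaml
# Audit policy: full request/response bodies for Secret and PVC access,
# metadata only for everything else.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets", "persistentvolumeclaims"]
- level: Metadata   # catch-all for all other requests
```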

Compliance Frameworks

Table: Compliance Framework Requirements Mapping
Framework     | Key Requirement                        | Kubernetes Implementation
SOC 2 Type II | Logical access controls, audit logging | RBAC least-privilege + API server audit logs
HIPAA         | PHI encryption at rest and in transit  | PV encryption + TLS + RBAC + Secrets mgmt
GDPR          | Data minimization, right to erasure    | Namespace isolation + audit logs + retention policies
FedRAMP       | FIPS 140-2, continuous monitoring      | FIPS-mode etcd encryption + Falco runtime monitoring
NIST AI RMF   | Model documentation, risk assessment   | MLflow tracking + OPA policy-as-code

Treat compliance as a set of Kubernetes namespaces with different security boundaries. A HIPAA-scoped namespace gets: enforce=restricted PSA, dedicated node pool with encrypted root volumes, default-deny NetworkPolicy, mandatory mTLS, and audit logs exported to an immutable log store.

Runtime threat detection using Falco watches the Linux syscall stream and alerts on compromise indicators: spawning shells inside containers, reading /etc/shadow, or outbound connections on unusual ports.
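Falco ships default rules covering exactly these indicators; a simplified sketch of a custom rule for the shell-spawn case (not Falco's verbatim built-in rule) looks like this:

```yaml
# Sketch of a Falco rule flagging interactive shells inside containers.
# Falco's bundled ruleset includes a similar, more refined rule.
- rule: Shell Spawned in Container
  desc: Detect a shell process started inside a container
  condition: >
    evt.type = execve and container.id != host
    and proc.name in (bash, sh, zsh)
  output: "Shell in container (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING
```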

Key Takeaway

Compliance frameworks map to concrete Kubernetes controls: encryption at rest and in transit, immutable audit logging, and isolated namespaces with enforced policies, supplemented by runtime threat detection such as Falco.

Key Terms

Term                   | Definition
RDMA                   | Remote Direct Memory Access: a networking technique allowing a NIC to read/write remote host memory without involving the remote CPU
InfiniBand             | High-speed, low-latency interconnect fabric for HPC/AI clusters, providing 200–400+ Gb/s per port
GPUDirect RDMA         | NVIDIA technology enabling direct DMA between GPU memory and a NIC over PCIe, bypassing CPU copies
SR-IOV                 | Single Root I/O Virtualization: partitions a physical NIC into hardware-isolated Virtual Functions
Multus CNI             | Meta-CNI plugin enabling pods to have multiple network interfaces by delegating to multiple CNI plugins
NetworkPolicy          | Kubernetes resource defining Layer 3/4 rules controlling pod-to-pod and pod-to-external communication
mTLS                   | Mutual TLS: both client and server present and verify certificates for cryptographic identity
RBAC                   | Role-Based Access Control using Roles/ClusterRoles and Bindings to grant permissions
Pod Security Standards | Three built-in Kubernetes profiles (Privileged, Baseline, Restricted) enforced via Pod Security Admission
Admission Controller   | API server plugin that intercepts requests and can validate, mutate, or reject them before objects are persisted

Post-Study Assessment

1. What is the primary advantage of RDMA over traditional TCP/IP for distributed training?

It supports higher MTU sizes on Ethernet networks
It allows the NIC to read/write remote memory without involving the remote CPU, eliminating kernel buffer copies
It uses UDP instead of TCP to reduce connection overhead
It compresses gradient data before transmission to reduce bandwidth

2. What does GPUDirect RDMA eliminate from the data path during distributed training?

The need for a network switch between GPU nodes
The CPU-mediated copies between GPU memory and the NIC
The requirement for InfiniBand hardware
The gradient synchronization step in AllReduce

3. What is the role of Multus CNI in a Kubernetes AI cluster?

It replaces the default CNI plugin with a higher-performance alternative
It enables pods to have multiple network interfaces by delegating to multiple CNI plugins
It encrypts all pod-to-pod traffic using IPsec
It enforces NetworkPolicy rules at Layer 7

4. What should be the first NetworkPolicy applied to every AI namespace?

Allow all egress to the internet for package updates
Allow all ingress from the monitoring namespace
A default-deny policy that blocks all ingress and egress
A policy that limits bandwidth per pod

5. In Istio's STRICT mTLS mode, what happens to unencrypted connections?

They are encrypted automatically by the sidecar proxy
They are logged and allowed through for backward compatibility
They are rejected at the sidecar proxy before reaching the application
They are routed through a separate unencrypted channel

6. Why is granting cluster-admin to training job service accounts a security risk?

It prevents the training job from accessing GPU resources
It creates a lateral movement vector with full cluster access if the job is compromised
It causes RBAC policy conflicts with namespace-scoped roles
It increases the memory footprint of the training pod

7. Which Pod Security Standard profile should be applied to AI inference namespaces in multi-tenant clusters?

Privileged
Baseline
Restricted
Custom (via OPA/Gatekeeper only)

8. What is the recommended way to provide secrets (API keys, credentials) to AI training pods?

As environment variables from a ConfigMap
Hardcoded in the container image
As files mounted via volumeMounts with restrictive permissions (0400)
As command-line arguments to the container entrypoint

9. What does SR-IOV provide to Kubernetes pods?

Software-defined virtual NICs with shared DMA queues
Hardware-isolated Virtual Functions with their own DMA engines, providing near-native NIC performance
Automatic load balancing across multiple physical NICs
Encrypted tunnels between pods on different nodes

10. For HIPAA compliance on a Kubernetes AI cluster, which combination of controls is required?

RBAC only, since access control covers all HIPAA requirements
PV encryption + TLS + RBAC + Secrets management
NetworkPolicy + Pod Security Standards only
Service mesh with mTLS only, since encryption covers all requirements

11. Why should training pods be denied egress to the public internet?

Internet traffic would consume bandwidth needed for gradient synchronization
Training pods have no legitimate need for internet access and allowing it creates a data exfiltration path
Kubernetes DNS does not resolve external domains by default
Internet access would trigger Pod Security Admission violations

12. What is the purpose of tainting GPU nodes with nvidia.com/gpu=true:NoSchedule?

To enable GPUDirect RDMA on those nodes
To prevent non-GPU workloads from being scheduled on expensive GPU nodes
To label nodes for the NVIDIA device plugin discovery
To enforce Pod Security Standards on GPU pods

Answer Explanations