Chapter 8: Networking and Security for AI Workloads

Learning Objectives

Pre-Study Assessment

1. What is the primary advantage of RDMA over traditional TCP/IP for distributed training?

It supports higher MTU sizes on Ethernet networks
It allows the NIC to read/write remote memory without involving the remote CPU, eliminating kernel buffer copies
It uses UDP instead of TCP to reduce connection overhead
It compresses gradient data before transmission to reduce bandwidth

2. What does GPUDirect RDMA eliminate from the data path during distributed training?

The need for a network switch between GPU nodes
The CPU-mediated copies between GPU memory and the NIC
The requirement for InfiniBand hardware
The gradient synchronization step in AllReduce

3. What is the role of Multus CNI in a Kubernetes AI cluster?

It replaces the default CNI plugin with a higher-performance alternative
It enables pods to have multiple network interfaces by delegating to multiple CNI plugins
It encrypts all pod-to-pod traffic using IPsec
It enforces NetworkPolicy rules at Layer 7

4. What should be the first NetworkPolicy applied to every AI namespace?

Allow all egress to the internet for package updates
Allow all ingress from the monitoring namespace
A default-deny policy that blocks all ingress and egress
A policy that limits bandwidth per pod

5. In Istio's STRICT mTLS mode, what happens to unencrypted connections?

They are encrypted automatically by the sidecar proxy
They are logged and allowed through for backward compatibility
They are rejected at the sidecar proxy before reaching the application
They are routed through a separate unencrypted channel

6. Why is granting cluster-admin to training job service accounts a security risk?

It prevents the training job from accessing GPU resources
It creates a lateral movement vector with full cluster access if the job is compromised
It causes RBAC policy conflicts with namespace-scoped roles
It increases the memory footprint of the training pod

7. Which Pod Security Standard profile should be applied to AI inference namespaces in multi-tenant clusters?

Privileged
Baseline
Restricted
Custom (via OPA/Gatekeeper only)

8. What is the recommended way to provide secrets (API keys, credentials) to AI training pods?

As environment variables from a ConfigMap
Hardcoded in the container image
As files mounted via volumeMounts with restrictive permissions (0400)
As command-line arguments to the container entrypoint

9. What does SR-IOV provide to Kubernetes pods?

Software-defined virtual NICs with shared DMA queues
Hardware-isolated Virtual Functions with their own DMA engines, providing near-native NIC performance
Automatic load balancing across multiple physical NICs
Encrypted tunnels between pods on different nodes

10. For HIPAA compliance on a Kubernetes AI cluster, which combination of controls is required?

RBAC only, since access control covers all HIPAA requirements
PV encryption + TLS + RBAC + Secrets management
NetworkPolicy + Pod Security Standards only
Service mesh with mTLS only, since encryption covers all requirements

11. Why should training pods be denied egress to the public internet?

Internet traffic would consume bandwidth needed for gradient synchronization
Training pods have no legitimate need for internet access and allowing it creates a data exfiltration path
Kubernetes DNS does not resolve external domains by default
Internet access would trigger Pod Security Admission violations

12. What is the purpose of tainting GPU nodes with nvidia.com/gpu=true:NoSchedule?

To enable GPUDirect RDMA on those nodes
To prevent non-GPU workloads from being scheduled on expensive GPU nodes
To label nodes for the NVIDIA device plugin discovery
To enforce Pod Security Standards on GPU pods

Section 1: High-Performance Networking for Distributed Training

Ordinary TCP/IP networking was designed for flexibility and correctness, not for the microsecond-latency, near-zero-CPU-overhead requirements of distributed deep learning. When a training job executes an AllReduce collective across 256 GPUs, each process must exchange gradient tensors with its peers hundreds of times per training step. The bottleneck is not GPU computation — it is the time gradients spend waiting in CPU buffers and kernel queues.

RDMA and InfiniBand with Kubernetes

Remote Direct Memory Access (RDMA) allows a network adapter to read from and write to the memory of a remote host without involving the remote CPU. This eliminates the copy-to-kernel-buffer and context-switch overhead that dominates conventional networking at high message rates.

InfiniBand is the dominant physical transport for RDMA in HPC and AI data centers, providing link-level reliability, congestion management, and switch fabrics rated at 200 Gb/s (HDR) or 400 Gb/s (NDR) per port, with end-to-end latency measured in single-digit microseconds.

The NVIDIA Network Operator manages the full driver and plugin lifecycle to expose InfiniBand or RoCE devices to Kubernetes pods: MOFED/DOCA-OFED drivers, the RDMA shared device plugin (advertising rdma/ib resources), and the nvidia-peermem module for GPU-NIC peer-to-peer DMA.
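Once the operator has the device plugin running, a pod requests an RDMA device alongside its GPU through the resources block. The sketch below assumes the plugin advertises the resource as rdma/ib (the exact name depends on plugin configuration) and uses a placeholder image name:

```yaml
# Sketch: a training pod requesting one GPU plus one RDMA device.
# The resource name rdma/ib and the image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: "1"
        rdma/ib: "1"
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]   # RDMA verbs need to pin (lock) memory regions
```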

GPUDirect RDMA for GPU-to-GPU Communication

GPUDirect RDMA extends RDMA into the GPU memory address space. Without it, gradient data travels: GPU memory → CPU pinned buffer → NIC → Network → NIC → CPU pinned buffer → GPU memory. With GPUDirect, the NIC can DMA directly into and out of the GPU's BAR memory region, eliminating both CPU copies.

NCCL can achieve 90–95% of theoretical InfiniBand bandwidth with GPUDirect RDMA, versus 40–60% through a CPU-mediated path.

Animation: RDMA vs TCP Networking

Animation summary: the traditional TCP/IP path runs GPU memory → CPU/kernel socket buffer → NIC → network → NIC → socket buffer → GPU memory (8 hops, 4 CPU copies, kernel context switches, ~50–100 µs latency). The RDMA kernel-bypass path runs GPU BAR memory → RDMA NIC → InfiniBand → RDMA NIC → GPU BAR memory via PCIe peer-to-peer DMA (4 hops, zero CPU copies, ~1–2 µs latency), roughly 50–100x lower latency.

SR-IOV for High-Bandwidth Network Interfaces

SR-IOV (Single Root I/O Virtualization) partitions a single physical NIC into multiple lightweight Virtual Functions (VFs). Each VF has its own DMA engine, queues, and interrupt lines. When assigned exclusively to a pod, the pod gets near-native NIC performance with hardware-enforced isolation.

Deploying SR-IOV in Kubernetes requires three components: the SR-IOV CNI plugin (moves a VF into the pod network namespace), the SR-IOV device plugin (advertises VF resources to the scheduler), and the SR-IOV Network Operator (automates VF creation).

Multus CNI for Secondary Network Attachments

Multus CNI acts as a meta-CNI plugin multiplexer. It creates the primary interface using the cluster's default CNI, then creates additional interfaces using other CNI plugins. This separates management traffic (Kubernetes API, health probes) from high-speed gradient communication on a dedicated InfiniBand or SR-IOV network.
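A secondary attachment is declared as a NetworkAttachmentDefinition and referenced from a pod annotation; Multus creates eth0 via the default CNI and the extra interface via the named definition. The sketch below wires an SR-IOV VF as the secondary network (the names sriov-ib, nvidia.com/sriov_ib, and the IPAM range are illustrative, and the whereabouts IPAM plugin is assumed to be installed):

```yaml
# Sketch: SR-IOV secondary network attached through Multus.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-ib
  annotations:
    # Ties this network to the VF pool advertised by the SR-IOV device plugin
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_ib
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  annotations:
    # Secondary interface; eth0 stays on the cluster's default CNI
    k8s.v1.cni.cncf.io/networks: sriov-ib
spec:
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/sriov_ib: "1"   # reserves one VF for this pod
```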

Table: RDMA Transport Options
Transport       | Bandwidth         | Latency  | Notes
InfiniBand HDR  | 200 Gb/s per port | ~1 µs    | Industry standard for HPC/AI
InfiniBand NDR  | 400 Gb/s per port | <1 µs    | Latest generation
RoCEv2          | 100 Gb/s per port | 2–5 µs   | Runs on existing Ethernet switches
iWARP           | 25–100 Gb/s       | 5–10 µs  | Software-friendly, higher latency

Key Takeaway

High-performance distributed training depends on removing the CPU from the data path: RDMA bypasses kernel buffers, GPUDirect RDMA extends that bypass into GPU memory, SR-IOV gives pods hardware-isolated NIC access, and Multus attaches those fast interfaces alongside the default cluster network.

Section 2: Network Policies and Isolation

In a shared AI cluster, different teams run workloads with different data sensitivity levels. Without explicit network isolation, those workloads can communicate freely at the pod level. Network isolation is a defense-in-depth strategy that starts with denying everything and explicitly allowing only what is needed.

Kubernetes NetworkPolicy for AI Namespace Isolation

The most important first step is a default-deny policy applied to every AI namespace. This blocks all ingress and egress by default, then you add explicit allow rules only for what each workload legitimately needs.
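Such a policy is short: an empty podSelector matches every pod in the namespace, and listing both policy types with no allow rules denies all traffic. The namespace name below is illustrative:

```yaml
# Default-deny for an AI namespace: selects every pod, allows nothing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-training   # placeholder namespace
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```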

Training pods should have no internet egress — there is no legitimate reason for a gradient synchronization process to reach the public internet, and allowing it creates an exfiltration path.
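On top of default-deny, trainers get only the egress they need: DNS plus the internal services they read from and write to. A sketch matching the allow rules described in this section (namespace and label names are illustrative):

```yaml
# Sketch: allow trainer pods DNS plus egress to the data-storage
# namespace; with default-deny in place, internet egress stays blocked.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: trainer-egress
  namespace: ai-training
spec:
  podSelector:
    matchLabels:
      role: trainer
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: data-storage
    ports:
    - protocol: TCP
      port: 443
  - ports:                 # cluster DNS
    - protocol: UDP
      port: 53
```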

Service Mesh (Istio, Linkerd)

For inference workloads, a NetworkPolicy alone is insufficient. A service mesh injects a sidecar proxy into every pod, transparently intercepting all connections to provide: traffic management (canary deployments, A/B testing), observability (per-route latency p50/p95/p99), and security (mTLS between all pods).

mTLS for Secure Inter-Pod Communication

Mutual TLS (mTLS) means both ends of a connection present and verify certificates. Each pod's identity is cryptographically bound to its Kubernetes service account. In Istio's STRICT mode, unencrypted or unauthenticated connections are rejected at the sidecar proxy before reaching the application.
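In Istio, STRICT mode is switched on with a PeerAuthentication resource scoped to a namespace (or mesh-wide when applied to the root namespace). The namespace name below is illustrative:

```yaml
# Enforce mTLS for every workload in the namespace;
# plaintext connections are rejected at the sidecar.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: ai-inference   # placeholder namespace
spec:
  mtls:
    mode: STRICT
```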

Animation: Network Policy Enforcement

Animation summary: in the ai-training namespace, training and eval pods operate under default deny. Explicit rules allow egress to object storage in data-storage (TCP 443), to the model registry in model-registry (TCP 5000), and DNS (UDP 53); egress to the public internet and cross-namespace ingress from a research namespace are blocked because no rule permits them.

Key Takeaway

Start every AI namespace from default-deny and add narrowly scoped allow rules; for inference traffic, layer a service mesh with STRICT mTLS on top so every connection is both encrypted and identity-verified.

Section 3: Security Best Practices

RBAC for Multi-Team AI Clusters

RBAC misconfiguration consistently ranks among the leading root causes in Kubernetes security-incident surveys. The single most damaging mistake is granting cluster-admin to training job service accounts. A well-structured cluster uses namespace-scoped roles that match what each job type actually needs.

A training job needs to: read ConfigMaps (hyperparameters), read Secrets (data credentials), and list Pods (workers). It does not need to list nodes, modify RBAC policies, or access other namespaces.
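Those needs translate directly into a namespace-scoped Role and RoleBinding. The namespace and service account names below are illustrative:

```yaml
# Minimal role for a training-job service account,
# matching the three needs listed above and nothing more.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job
  namespace: ai-training
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get"]            # read hyperparameters and data credentials
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]    # discover peer workers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job
  namespace: ai-training
subjects:
- kind: ServiceAccount
  name: trainer             # placeholder SA name
  namespace: ai-training
roleRef:
  kind: Role
  name: training-job
  apiGroup: rbac.authorization.k8s.io
```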

Table: RBAC Roles for AI Cluster Personas
Persona         | Scope                   | Allowed                              | Denied
Data scientist  | Own namespace           | create/delete Jobs, read Logs        | create Roles, other namespaces
ML engineer     | Own NS + model-registry | create Deployments, push to registry | delete Namespaces, modify ClusterRoles
Training job SA | Own namespace           | get ConfigMap/Secret, list Pods      | Everything else
Platform admin  | Cluster-wide            | All (with MFA + audit)               | —

Pod Security Standards and Admission Controllers

Pod Security Admission (PSA) enforces one of three built-in profiles at the namespace level: Privileged (unrestricted), Baseline (blocks known privilege escalations such as privileged containers, host namespaces, and hostPath volumes), and Restricted (the hardened, least-privilege profile appropriate for multi-tenant inference namespaces).

The warn and audit modes surface violations without blocking pods, which is useful during migration before switching to enforce.
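Namespaces opt in via labels; one label per mode. A sketch enforcing the Restricted profile on an inference namespace (the namespace name is illustrative):

```yaml
# Enforce the Restricted profile; warn and audit at the same level
# so violations also show up in kubectl output and audit logs.
apiVersion: v1
kind: Namespace
metadata:
  name: ai-inference
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```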

Image Scanning for ML Container Vulnerabilities

ML containers can reach 20–30 GB, meaning more surface area and more CVEs. Integrate scanning (Trivy, Grype, Snyk Container) into CI/CD, and use admission controllers (Kyverno, OPA/Gatekeeper) to reject pods from untrusted registries or with critical CVEs.
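As a sketch of the admission-control side, a Kyverno ClusterPolicy can reject any pod whose images come from outside an approved registry (registry.example.com is a placeholder; CVE-based blocking would be configured separately in the scanner integration):

```yaml
# Sketch: Kyverno policy rejecting images from untrusted registries.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: trusted-registry-only
spec:
  validationFailureAction: Enforce   # reject, rather than just report
  rules:
  - name: require-trusted-registry
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Images must come from registry.example.com"
      pattern:
        spec:
          containers:
          - image: "registry.example.com/*"
```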

Secrets Management

The most important rule: mount secrets as files, not environment variables. Environment variables are visible in /proc/<pid>/environ, readable by any process running as the same user, and frequently leak into logs and crash dumps. File-mounted secrets with 0400 permissions are far more restrictive.
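A secret volume with a restrictive defaultMode implements this pattern. Secret, pod, and image names below are illustrative:

```yaml
# Mount an API key as a read-only file rather than an env var.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    volumeMounts:
    - name: data-creds
      mountPath: /var/run/secrets/data-creds
      readOnly: true
  volumes:
  - name: data-creds
    secret:
      secretName: data-credentials
      defaultMode: 0400   # owner read-only
```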

For frequently changing secrets or audit trail requirements, use External Secrets Operator or Vault Agent Injector with HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.

Animation: Defense-in-Depth Security Layers

Animation summary: security layers nest around the AI workload from outside in — RBAC (identity layer: namespace-scoped roles, least-privilege service accounts), Pod Security Standards (admission layer: Baseline/Restricted profiles, GPU node taints), NetworkPolicy (network layer: default-deny with explicit egress allows), mTLS (transport layer: Istio STRICT mode, certificate-bound identity), and encryption (data layer: etcd and PVs at rest, TLS in transit, FIPS 140-2). Each layer stops a different attack class; compromising one layer does not give direct access to the inner workload.

Key Takeaway

Apply least privilege at every layer: namespace-scoped RBAC roles, Restricted Pod Security profiles, scanned images admitted only from trusted registries, and secrets mounted as read-only files rather than environment variables.

Section 4: Data Security and Compliance

Encryption at Rest and in Transit

Encryption at rest applies at two levels in Kubernetes: etcd encryption (Secrets and sensitive API objects encrypted with aescbc or aesgcm via KMS provider) and persistent volume encryption (AWS EBS encryption, GCP CMEK, or LUKS-encrypted block devices on-premises).
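The etcd side is configured through an EncryptionConfiguration file passed to the API server. A minimal sketch using AES-GCM (in production a KMS provider would typically be listed first; the key material here is a placeholder):

```yaml
# API server encryption-at-rest config: Secrets encrypted with AES-GCM,
# with the identity provider kept last so existing plaintext can still be read.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - aesgcm:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>   # placeholder
  - identity: {}
```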

Encryption in transit means TLS on all paths carrying training data. Note that RDMA bypasses the kernel network stack and therefore bypasses TLS; for workloads that require encrypted gradient communication, encryption must be enforced below the RDMA layer, for example via IPsec or link-level encryption offloaded to the NIC.

Data Access Auditing and Lineage Tracking

Configure Kubernetes audit logging to capture RequestResponse-level events for sensitive resources (secrets, PVCs). For AI-specific data lineage, tools like MLflow, DVC, and Kubeflow Pipelines track which dataset version produced which model checkpoint.
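An audit Policy expressing that rule looks like the sketch below; note that RequestResponse-level entries include object bodies, so the audit sink itself must be access-controlled accordingly:

```yaml
# Audit policy: full request/response bodies for Secret and PVC access,
# metadata only for everything else.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets", "persistentvolumeclaims"]
- level: Metadata   # catch-all for all other requests
```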

Compliance Frameworks

Table: Compliance Framework Requirements Mapping
Framework     | Key Requirement                        | Kubernetes Implementation
SOC 2 Type II | Logical access controls, audit logging | RBAC least-privilege + API server audit logs
HIPAA         | PHI encryption at rest and in transit  | PV encryption + TLS + RBAC + Secrets mgmt
GDPR          | Data minimization, right to erasure    | Namespace isolation + audit logs + retention policies
FedRAMP       | FIPS 140-2, continuous monitoring      | FIPS-mode etcd encryption + Falco runtime monitoring
NIST AI RMF   | Model documentation, risk assessment   | MLflow tracking + OPA policy-as-code

Treat compliance as a set of Kubernetes namespaces with different security boundaries. A HIPAA-scoped namespace gets: enforce=restricted PSA, dedicated node pool with encrypted root volumes, default-deny NetworkPolicy, mandatory mTLS, and audit logs exported to an immutable log store.

Runtime threat detection using Falco watches the Linux syscall stream and alerts on compromise indicators: spawning shells inside containers, reading /etc/shadow, or outbound connections on unusual ports.
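Falco ships default rules covering exactly these indicators; a simplified sketch of a custom rule for the shell-spawn case (not Falco's verbatim built-in rule) looks like this:

```yaml
# Sketch of a Falco rule flagging interactive shells inside containers.
# Falco's bundled ruleset includes a similar, more refined rule.
- rule: Shell Spawned in Container
  desc: Detect a shell process started inside a container
  condition: >
    evt.type = execve and container.id != host
    and proc.name in (bash, sh, zsh)
  output: "Shell in container (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING
```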

Key Takeaway

Compliance frameworks map to concrete Kubernetes controls: encryption at rest and in transit, immutable audit logging, and isolated namespaces with enforced policies, supplemented by runtime threat detection such as Falco.

Key Terms

Term                   | Definition
RDMA                   | Remote Direct Memory Access: a networking technique allowing a NIC to read/write remote host memory without involving the remote CPU
InfiniBand             | High-speed, low-latency interconnect fabric for HPC/AI clusters, providing 200–400+ Gb/s per port
GPUDirect RDMA         | NVIDIA technology enabling direct DMA between GPU memory and a NIC over PCIe, bypassing CPU copies
SR-IOV                 | Single Root I/O Virtualization: partitions a physical NIC into hardware-isolated Virtual Functions
Multus CNI             | Meta-CNI plugin enabling pods to have multiple network interfaces by delegating to multiple CNI plugins
NetworkPolicy          | Kubernetes resource defining Layer 3/4 rules controlling pod-to-pod and pod-to-external communication
mTLS                   | Mutual TLS: both client and server present and verify certificates for cryptographic identity
RBAC                   | Role-Based Access Control using Roles/ClusterRoles and Bindings to grant permissions
Pod Security Standards | Three built-in Kubernetes profiles (Privileged, Baseline, Restricted) enforced via Pod Security Admission
Admission Controller   | API server plugin that intercepts requests and can validate, mutate, or reject them before objects are persisted

Post-Study Assessment

1. What is the primary advantage of RDMA over traditional TCP/IP for distributed training?

It supports higher MTU sizes on Ethernet networks
It allows the NIC to read/write remote memory without involving the remote CPU, eliminating kernel buffer copies
It uses UDP instead of TCP to reduce connection overhead
It compresses gradient data before transmission to reduce bandwidth

2. What does GPUDirect RDMA eliminate from the data path during distributed training?

The need for a network switch between GPU nodes
The CPU-mediated copies between GPU memory and the NIC
The requirement for InfiniBand hardware
The gradient synchronization step in AllReduce

3. What is the role of Multus CNI in a Kubernetes AI cluster?

It replaces the default CNI plugin with a higher-performance alternative
It enables pods to have multiple network interfaces by delegating to multiple CNI plugins
It encrypts all pod-to-pod traffic using IPsec
It enforces NetworkPolicy rules at Layer 7

4. What should be the first NetworkPolicy applied to every AI namespace?

Allow all egress to the internet for package updates
Allow all ingress from the monitoring namespace
A default-deny policy that blocks all ingress and egress
A policy that limits bandwidth per pod

5. In Istio's STRICT mTLS mode, what happens to unencrypted connections?

They are encrypted automatically by the sidecar proxy
They are logged and allowed through for backward compatibility
They are rejected at the sidecar proxy before reaching the application
They are routed through a separate unencrypted channel

6. Why is granting cluster-admin to training job service accounts a security risk?

It prevents the training job from accessing GPU resources
It creates a lateral movement vector with full cluster access if the job is compromised
It causes RBAC policy conflicts with namespace-scoped roles
It increases the memory footprint of the training pod

7. Which Pod Security Standard profile should be applied to AI inference namespaces in multi-tenant clusters?

Privileged
Baseline
Restricted
Custom (via OPA/Gatekeeper only)

8. What is the recommended way to provide secrets (API keys, credentials) to AI training pods?

As environment variables from a ConfigMap
Hardcoded in the container image
As files mounted via volumeMounts with restrictive permissions (0400)
As command-line arguments to the container entrypoint

9. What does SR-IOV provide to Kubernetes pods?

Software-defined virtual NICs with shared DMA queues
Hardware-isolated Virtual Functions with their own DMA engines, providing near-native NIC performance
Automatic load balancing across multiple physical NICs
Encrypted tunnels between pods on different nodes

10. For HIPAA compliance on a Kubernetes AI cluster, which combination of controls is required?

RBAC only, since access control covers all HIPAA requirements
PV encryption + TLS + RBAC + Secrets management
NetworkPolicy + Pod Security Standards only
Service mesh with mTLS only, since encryption covers all requirements

11. Why should training pods be denied egress to the public internet?

Internet traffic would consume bandwidth needed for gradient synchronization
Training pods have no legitimate need for internet access and allowing it creates a data exfiltration path
Kubernetes DNS does not resolve external domains by default
Internet access would trigger Pod Security Admission violations

12. What is the purpose of tainting GPU nodes with nvidia.com/gpu=true:NoSchedule?

To enable GPUDirect RDMA on those nodes
To prevent non-GPU workloads from being scheduled on expensive GPU nodes
To label nodes for the NVIDIA device plugin discovery
To enforce Pod Security Standards on GPU pods

Answer Explanations