Chapter 8: Networking and Security for AI Workloads
Learning Objectives
Configure high-performance networking for distributed training using RDMA, GPUDirect RDMA, SR-IOV, and Multus CNI
Implement Kubernetes NetworkPolicy and service mesh solutions to isolate AI workload traffic
Secure model artifacts, training data, and inference endpoints with encryption, secrets management, and mTLS
Apply RBAC and Pod Security Standards to harden AI namespaces against privilege escalation and lateral movement
Pre-Study Assessment
1. What is the primary advantage of RDMA over traditional TCP/IP for distributed training?
It supports higher MTU sizes on Ethernet networks
It allows the NIC to read/write remote memory without involving the remote CPU, eliminating kernel buffer copies
It uses UDP instead of TCP to reduce connection overhead
It compresses gradient data before transmission to reduce bandwidth
2. What does GPUDirect RDMA eliminate from the data path during distributed training?
The need for a network switch between GPU nodes
The CPU-mediated copies between GPU memory and the NIC
The requirement for InfiniBand hardware
The gradient synchronization step in AllReduce
3. What is the role of Multus CNI in a Kubernetes AI cluster?
It replaces the default CNI plugin with a higher-performance alternative
It enables pods to have multiple network interfaces by delegating to multiple CNI plugins
It encrypts all pod-to-pod traffic using IPsec
It enforces NetworkPolicy rules at Layer 7
4. What should be the first NetworkPolicy applied to every AI namespace?
Allow all egress to the internet for package updates
Allow all ingress from the monitoring namespace
A default-deny policy that blocks all ingress and egress
A policy that limits bandwidth per pod
5. In Istio's STRICT mTLS mode, what happens to unencrypted connections?
They are encrypted automatically by the sidecar proxy
They are logged and allowed through for backward compatibility
They are rejected at the sidecar proxy before reaching the application
They are routed through a separate unencrypted channel
6. Why is granting cluster-admin to training job service accounts a security risk?
It prevents the training job from accessing GPU resources
It creates a lateral movement vector with full cluster access if the job is compromised
It causes RBAC policy conflicts with namespace-scoped roles
It increases the memory footprint of the training pod
7. Which Pod Security Standard profile should be applied to AI inference namespaces in multi-tenant clusters?
Privileged
Baseline
Restricted
Custom (via OPA/Gatekeeper only)
8. What is the recommended way to provide secrets (API keys, credentials) to AI training pods?
As environment variables from a ConfigMap
Hardcoded in the container image
As files mounted via volumeMounts with restrictive permissions (0400)
As command-line arguments to the container entrypoint
9. What does SR-IOV provide to Kubernetes pods?
Software-defined virtual NICs with shared DMA queues
Hardware-isolated Virtual Functions with their own DMA engines, providing near-native NIC performance
Automatic load balancing across multiple physical NICs
Encrypted tunnels between pods on different nodes
10. For HIPAA compliance on a Kubernetes AI cluster, which combination of controls is required?
RBAC only, since access control covers all HIPAA requirements
PV encryption + TLS + RBAC + Secrets management
NetworkPolicy + Pod Security Standards only
Service mesh with mTLS only, since encryption covers all requirements
11. Why should training pods be denied egress to the public internet?
Internet traffic would consume bandwidth needed for gradient synchronization
Training pods have no legitimate need for internet access and allowing it creates a data exfiltration path
Kubernetes DNS does not resolve external domains by default
Internet access would trigger Pod Security Admission violations
12. What is the purpose of tainting GPU nodes with nvidia.com/gpu=true:NoSchedule?
To enable GPUDirect RDMA on those nodes
To prevent non-GPU workloads from being scheduled on expensive GPU nodes
To label nodes for the NVIDIA device plugin discovery
To enforce Pod Security Standards on GPU pods
Section 1: High-Performance Networking for Distributed Training
Ordinary TCP/IP networking was designed for flexibility and correctness, not for the microsecond-latency, near-zero-CPU-overhead requirements of distributed deep learning. When a training job executes an AllReduce collective across 256 GPUs, each process must exchange gradient tensors with its peers hundreds of times per training step. The bottleneck is not GPU computation — it is the time gradients spend waiting in CPU buffers and kernel queues.
RDMA and InfiniBand with Kubernetes
Remote Direct Memory Access (RDMA) allows a network adapter to read from and write to the memory of a remote host without involving the remote CPU. This eliminates the copy-to-kernel-buffer and context-switch overhead that dominates conventional networking at high message rates.
InfiniBand is the dominant physical transport for RDMA in HPC and AI data centers, providing link-level reliability, congestion management, and switch fabrics rated at 200 Gb/s (HDR) or 400 Gb/s (NDR) per port, with end-to-end latency measured in single-digit microseconds.
The NVIDIA Network Operator manages the full driver and plugin lifecycle to expose InfiniBand or RoCE devices to Kubernetes pods: MOFED/DOCA-OFED drivers, the RDMA shared device plugin (advertising rdma/ib resources), and the nvidia-peermem module for GPU-NIC peer-to-peer DMA.
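Once the operator is running, a training pod requests the advertised RDMA resource alongside its GPUs. The sketch below assumes the default rdma/ib resource name and an illustrative container image; the actual resource name depends on the device plugin configuration:

```yaml
# Sketch: worker pod requesting GPUs plus an RDMA device.
# The image tag and the rdma/ib resource name are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: nccl-worker
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example image
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/ib: 1
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]   # RDMA must pin (lock) memory regions for DMA
```

The IPC_LOCK capability is the one piece of extra privilege RDMA workloads typically need, since registered memory regions must be locked against paging.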
GPUDirect RDMA for GPU-to-GPU Communication
GPUDirect RDMA extends RDMA into the GPU memory address space. Without it, gradient data travels: GPU memory → CPU pinned buffer → NIC → Network → NIC → CPU pinned buffer → GPU memory. With GPUDirect, the NIC can DMA directly into and out of the GPU's BAR memory region, eliminating both CPU copies.
NCCL can achieve 90–95% of theoretical InfiniBand bandwidth with GPUDirect RDMA, versus 40–60% through a CPU-mediated path.
Animation: RDMA vs TCP Networking
SR-IOV for High-Bandwidth Network Interfaces
SR-IOV (Single Root I/O Virtualization) partitions a single physical NIC into multiple lightweight Virtual Functions (VFs). Each VF has its own DMA engine, queues, and interrupt lines. When assigned exclusively to a pod, the pod gets near-native NIC performance with hardware-enforced isolation.
Deploying SR-IOV in Kubernetes requires three components: the SR-IOV CNI plugin (moves a VF into the pod network namespace), the SR-IOV device plugin (advertises VF resources to the scheduler), and the SR-IOV Network Operator (automates VF creation).
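As a sketch of the operator-driven workflow (names, namespace, and selector values are illustrative and depend on the cluster and NIC model), an SriovNetworkNodePolicy might carve eight RDMA-capable VFs out of each matching Mellanox physical function:

```yaml
# Illustrative SriovNetworkNodePolicy for the SR-IOV Network Operator.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: ib-vfs
  namespace: sriov-network-operator   # operator namespace varies by install
spec:
  resourceName: sriov_ib              # pods request this resource (with the configured prefix)
  numVfs: 8                           # VFs created per matching physical function
  nicSelector:
    vendor: "15b3"                    # Mellanox/NVIDIA PCI vendor ID
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  deviceType: netdevice
  isRdma: true                        # expose the VFs as RDMA-capable
```

The operator reconciles this into VF creation on matching nodes and configures the device plugin to advertise the resource to the scheduler.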
Multus CNI for Secondary Network Attachments
Multus CNI acts as a meta-CNI plugin multiplexer. It creates the primary interface using the cluster's default CNI, then creates additional interfaces using other CNI plugins. This separates management traffic (Kubernetes API, health probes) from high-speed gradient communication on a dedicated InfiniBand or SR-IOV network.
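A minimal sketch of this pattern, assuming an SR-IOV secondary network named sriov-hs-net and the whereabouts IPAM plugin (both names and the address range are illustrative):

```yaml
# NetworkAttachmentDefinition describing the secondary high-speed network.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-hs-net
  namespace: training
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }'
---
# A pod opts in via annotation: eth0 stays on the default cluster CNI,
# while net1 lands on the high-speed SR-IOV network.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  namespace: training
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-hs-net
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example image
```

NCCL can then be pointed at the secondary interface (for example via NCCL_SOCKET_IFNAME or the InfiniBand HCA selection variables) so gradient traffic never touches the management network.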
Table: RDMA Transport Options
| Transport | Bandwidth | Latency | Notes |
|---|---|---|---|
| InfiniBand HDR | 200 Gb/s per port | ~1 µs | Industry standard for HPC/AI |
| InfiniBand NDR | 400 Gb/s per port | <1 µs | Latest generation |
| RoCEv2 | 100 Gb/s per port | 2–5 µs | Runs on existing Ethernet switches |
| iWARP | 25–100 Gb/s | 5–10 µs | Software-friendly, higher latency |
Key Takeaway
Multus CNI gives training pods a secondary high-speed network interface alongside the standard cluster network.
SR-IOV provides hardware-isolated near-native performance via Virtual Functions.
GPUDirect RDMA eliminates CPU copies by routing gradient data directly between GPU memory and the NIC.
Together, these three technologies allow NCCL collective operations to run at close to theoretical InfiniBand wire speed.
Section 2: Network Policies and Isolation
In a shared AI cluster, different teams run workloads with different data sensitivity levels. Without explicit network isolation, those workloads can communicate freely at the pod level. Network isolation is a defense-in-depth strategy that starts with denying everything and explicitly allowing only what is needed.
Kubernetes NetworkPolicy for AI Namespace Isolation
The most important first step is a default-deny policy applied to every AI namespace. This blocks all ingress and egress by default, then you add explicit allow rules only for what each workload legitimately needs.
Training pods should have no internet egress — there is no legitimate reason for a gradient synchronization process to reach the public internet, and allowing it creates an exfiltration path.
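A default-deny policy is deliberately small: selecting every pod and declaring both policy types with no allow rules blocks all traffic. The namespace name below is illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a-training     # example namespace
spec:
  podSelector: {}                # matches every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  # No ingress/egress rules listed: all traffic is denied until
  # explicitly allowed by additional policies.
```

Allow rules for intra-namespace worker-to-worker traffic and for cluster DNS are then layered on as separate, narrowly scoped NetworkPolicies.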
Service Mesh (Istio, Linkerd)
For inference workloads, a NetworkPolicy alone is insufficient. A service mesh injects a sidecar proxy into every pod, transparently intercepting all connections to provide: traffic management (canary deployments, A/B testing), observability (per-route latency p50/p95/p99), and security (mTLS between all pods).
mTLS for Secure Inter-Pod Communication
Mutual TLS (mTLS) means both ends of a connection present and verify certificates. Each pod's identity is cryptographically bound to its Kubernetes service account. In Istio's STRICT mode, unencrypted or unauthenticated connections are rejected at the sidecar proxy before reaching the application.
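In Istio this is a single-resource change; a sketch enforcing STRICT mode for an illustrative inference namespace:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: inference           # example namespace
spec:
  mtls:
    mode: STRICT                 # plaintext connections rejected at the sidecar
```

Applying the same resource in the istio-system root namespace would make STRICT the mesh-wide default, with per-namespace overrides available during migration.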
Animation: Network Policy Enforcement
Key Takeaway
Start with default-deny NetworkPolicies in every namespace and add only the egress rules each workload needs.
Introduce a service mesh for inference workloads to gain mTLS identity verification, traffic routing, and observability.
Never allow training pods to initiate connections to the public internet.
Section 3: Security Best Practices
RBAC for Multi-Team AI Clusters
RBAC misconfigurations account for over 35% of Kubernetes breaches. The most common mistake is granting cluster-admin to training job service accounts. A well-structured cluster uses namespace-scoped roles matching what each job type actually needs.
A training job needs to: read ConfigMaps (hyperparameters), read Secrets (data credentials), and list Pods (workers). It does not need to list nodes, modify RBAC policies, or access other namespaces.
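That permission set translates into a namespace-scoped Role and RoleBinding (namespace and service account names here are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job
  namespace: team-a-training
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get"]               # read hyperparameters and data credentials
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]       # discover peer worker pods
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job
  namespace: team-a-training
subjects:
  - kind: ServiceAccount
    name: training-sa            # example service account
    namespace: team-a-training
roleRef:
  kind: Role
  name: training-job
  apiGroup: rbac.authorization.k8s.io
```

Everything not granted here — node access, RBAC modification, other namespaces — is denied by default.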
Table: RBAC Roles for AI Cluster Personas
| Persona | Scope | Allowed | Denied |
|---|---|---|---|
| Data scientist | Own namespace | create/delete Jobs, read Logs | create Roles, other namespaces |
| ML engineer | Own NS + model-registry | create Deployments, push to registry | delete Namespaces, modify ClusterRoles |
| Training job SA | Own namespace | get ConfigMap/Secret, list Pods | Everything else |
| Platform admin | Cluster-wide | All (with MFA + audit) | — |
Pod Security Standards and Admission Controllers
Pod Security Admission (PSA) enforces one of three built-in profiles at the namespace level:
Privileged — No restrictions. Only for system namespaces like kube-system.
Baseline — Prevents known privilege escalations. Default for AI training namespaces.
Restricted — Enforces hardened configuration. For inference and multi-tenant environments.
The warn and audit modes surface violations without blocking pods, useful during migration before switching to enforce.
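Profiles and modes are applied as namespace labels. A sketch for an inference namespace partway through a migration toward restricted enforcement (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: inference                # example namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline    # blocks known escalations now
    pod-security.kubernetes.io/warn: restricted     # surfaces future violations to users
    pod-security.kubernetes.io/audit: restricted    # records them in audit logs
```

Once the warn/audit output is clean, the enforce label is flipped to restricted with no workload changes left to make.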
Image Scanning for ML Container Vulnerabilities
ML containers can reach 20–30 GB, meaning more surface area and more CVEs. Integrate scanning (Trivy, Grype, Snyk Container) into CI/CD, and use admission controllers (Kyverno, OPA/Gatekeeper) to reject pods from untrusted registries or with critical CVEs.
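As one hedged example of the admission-control half, a Kyverno ClusterPolicy can restrict pods to a trusted registry (the registry hostname below is illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: trusted-registry-only
spec:
  validationFailureAction: Enforce   # reject, rather than merely audit, violations
  rules:
    - name: require-trusted-registry
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from registry.example.com"
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"   # wildcard: any image under this registry
```

A companion rule (or an OPA/Gatekeeper constraint) can consume scan results to block images with critical CVEs at admission time.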
Secrets Management
The most important rule: mount secrets as files, not environment variables. Environment variables are visible in /proc/<pid>/environ, readable by any process as the same user. File-mounted secrets with 0400 permissions are far more restrictive.
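The file-mount pattern looks like this sketch (secret, pod, and path names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3     # example image
      volumeMounts:
        - name: data-creds
          mountPath: /var/run/secrets/data-creds  # credentials appear as files here
          readOnly: true
  volumes:
    - name: data-creds
      secret:
        secretName: data-credentials              # example Secret name
        defaultMode: 0400                         # owner read-only; never exposed in the environment
```

Unlike environment variables, these files never appear in /proc/&lt;pid&gt;/environ, in crash dumps of the environment, or in `kubectl describe` output.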
For frequently changing secrets or audit trail requirements, use External Secrets Operator or Vault Agent Injector with HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.
Animation: Defense-in-Depth Security Layers
Key Takeaway
At the identity layer, enforce least-privilege RBAC using namespace-scoped roles — never cluster-admin for training jobs.
At the pod layer, apply Baseline or Restricted Pod Security Standards and taint GPU nodes to prevent resource theft.
At the secrets layer, always mount credentials as files and consider external secrets managers for audit trail requirements.
Section 4: Data Security and Compliance
Encryption at Rest and in Transit
Encryption at rest applies at two levels in Kubernetes: etcd encryption (Secrets and other sensitive API objects encrypted with an aescbc or aesgcm provider, or delegated to an external KMS provider) and persistent volume encryption (AWS EBS encryption, GCP CMEK, or LUKS-encrypted block devices on-premises).
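The etcd side is configured with an EncryptionConfiguration file passed to the API server via --encryption-provider-config. The key below is a placeholder — real keys must be generated and stored securely, never committed:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - aesgcm:
          keys:
            - name: key1
              secret: "<base64-encoded 32-byte key>"   # placeholder, do not use literally
      - identity: {}   # fallback so pre-existing plaintext objects remain readable
```

Provider order matters: the first provider encrypts new writes, while later providers (here, identity) are tried only when decrypting existing data during migration.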
Encryption in transit means TLS on all paths carrying training data. Note that RDMA bypasses the host network stack and therefore bypasses TLS — for workloads that require encrypted gradient communication, options include NIC-level IPsec offload or falling back to a TCP transport with TLS, trading bandwidth for confidentiality.
Data Access Auditing and Lineage Tracking
Configure Kubernetes audit logging to capture RequestResponse-level events for sensitive resources (secrets, PVCs). For AI-specific data lineage, tools like MLflow, DVC, and Kubeflow Pipelines track which dataset version produced which model checkpoint.
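A sketch of an audit Policy that captures full request/response bodies for the sensitive resources named above, while keeping everything else at metadata level:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse       # log full request and response bodies
    resources:
      - group: ""                # core API group
        resources: ["secrets", "persistentvolumeclaims"]
  - level: Metadata              # everything else: who, what, when only
```

Rules are evaluated in order, so the catch-all Metadata rule must come last; the resulting log stream is what gets exported to the immutable store in the HIPAA-scoped setup described below.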
Treat compliance as a set of Kubernetes namespaces with different security boundaries. A HIPAA-scoped namespace gets: enforce=restricted PSA, dedicated node pool with encrypted root volumes, default-deny NetworkPolicy, mandatory mTLS, and audit logs exported to an immutable log store.
Runtime threat detection using Falco watches the Linux syscall stream and alerts on compromise indicators: spawning shells inside containers, reading /etc/shadow, or outbound connections on unusual ports.
Key Takeaway
Protect data at every layer: encryption at rest for PVs and etcd, encryption in transit via TLS and mTLS, and comprehensive access auditing.
Map compliance obligations (SOC 2, HIPAA, GDPR) to specific Kubernetes controls and enforce them programmatically via admission policies and namespace labels.
Manual processes drift over time, but policy-as-code does not.
Key Terms
| Term | Definition |
|---|---|
| RDMA | Remote Direct Memory Access — a networking technique allowing a NIC to read/write remote host memory without involving the remote CPU |
| InfiniBand | High-speed, low-latency interconnect fabric for HPC/AI clusters, providing 200–400+ Gb/s per port |
| GPUDirect RDMA | NVIDIA technology enabling direct DMA between GPU memory and a NIC over PCIe, bypassing CPU copies |
| SR-IOV | Single Root I/O Virtualization — partitions a physical NIC into hardware-isolated Virtual Functions |
| Multus CNI | Meta-CNI plugin enabling pods to have multiple network interfaces by delegating to multiple CNI plugins |
| NetworkPolicy | Kubernetes resource defining Layer 3/4 rules controlling pod-to-pod and pod-to-external communication |
| mTLS | Mutual TLS — both client and server present and verify certificates for cryptographic identity |
| RBAC | Role-Based Access Control using Roles/ClusterRoles and Bindings to grant permissions |
| Pod Security Standards | Three built-in Kubernetes profiles (Privileged, Baseline, Restricted) enforced via Pod Security Admission |
| Admission Controller | API server plugin that intercepts requests and can validate, mutate, or reject them before objects are persisted |
Post-Study Assessment
1. What is the primary advantage of RDMA over traditional TCP/IP for distributed training?
It supports higher MTU sizes on Ethernet networks
It allows the NIC to read/write remote memory without involving the remote CPU, eliminating kernel buffer copies
It uses UDP instead of TCP to reduce connection overhead
It compresses gradient data before transmission to reduce bandwidth
2. What does GPUDirect RDMA eliminate from the data path during distributed training?
The need for a network switch between GPU nodes
The CPU-mediated copies between GPU memory and the NIC
The requirement for InfiniBand hardware
The gradient synchronization step in AllReduce
3. What is the role of Multus CNI in a Kubernetes AI cluster?
It replaces the default CNI plugin with a higher-performance alternative
It enables pods to have multiple network interfaces by delegating to multiple CNI plugins
It encrypts all pod-to-pod traffic using IPsec
It enforces NetworkPolicy rules at Layer 7
4. What should be the first NetworkPolicy applied to every AI namespace?
Allow all egress to the internet for package updates
Allow all ingress from the monitoring namespace
A default-deny policy that blocks all ingress and egress
A policy that limits bandwidth per pod
5. In Istio's STRICT mTLS mode, what happens to unencrypted connections?
They are encrypted automatically by the sidecar proxy
They are logged and allowed through for backward compatibility
They are rejected at the sidecar proxy before reaching the application
They are routed through a separate unencrypted channel
6. Why is granting cluster-admin to training job service accounts a security risk?
It prevents the training job from accessing GPU resources
It creates a lateral movement vector with full cluster access if the job is compromised
It causes RBAC policy conflicts with namespace-scoped roles
It increases the memory footprint of the training pod
7. Which Pod Security Standard profile should be applied to AI inference namespaces in multi-tenant clusters?
Privileged
Baseline
Restricted
Custom (via OPA/Gatekeeper only)
8. What is the recommended way to provide secrets (API keys, credentials) to AI training pods?
As environment variables from a ConfigMap
Hardcoded in the container image
As files mounted via volumeMounts with restrictive permissions (0400)
As command-line arguments to the container entrypoint
9. What does SR-IOV provide to Kubernetes pods?
Software-defined virtual NICs with shared DMA queues
Hardware-isolated Virtual Functions with their own DMA engines, providing near-native NIC performance
Automatic load balancing across multiple physical NICs
Encrypted tunnels between pods on different nodes
10. For HIPAA compliance on a Kubernetes AI cluster, which combination of controls is required?
RBAC only, since access control covers all HIPAA requirements
PV encryption + TLS + RBAC + Secrets management
NetworkPolicy + Pod Security Standards only
Service mesh with mTLS only, since encryption covers all requirements
11. Why should training pods be denied egress to the public internet?
Internet traffic would consume bandwidth needed for gradient synchronization
Training pods have no legitimate need for internet access and allowing it creates a data exfiltration path
Kubernetes DNS does not resolve external domains by default
Internet access would trigger Pod Security Admission violations
12. What is the purpose of tainting GPU nodes with nvidia.com/gpu=true:NoSchedule?
To enable GPUDirect RDMA on those nodes
To prevent non-GPU workloads from being scheduled on expensive GPU nodes
To label nodes for the NVIDIA device plugin discovery
To enforce Pod Security Standards on GPU pods