Chapter 11: Cisco UCS Configuration for AI Workloads
Learning Objectives
Configure domain profiles and service profiles on Cisco UCS for AI workloads
Implement power, storage, and NTP policies for AI compute nodes
Configure LAN connectivity, vNIC, and QoS on UCS
Design system classes and QoS policies for AI traffic on UCS
Section 1: UCS Domain Profiles and Service Profiles
Pre-Quiz: Domain Profiles and Service Profiles
1. What is a UCS Domain Profile in Intersight?
A policy that configures a single server's BIOS settings
A top-level construct that configures a Fabric Interconnect pair
A template for creating VLANs across multiple switches
A storage configuration for boot-from-SAN
2. What does "stateless computing" mean in UCS?
Servers do not retain any data after power-off
Server identity is abstracted from physical hardware and can migrate between servers
The server runs without an operating system
Fabric Interconnects operate without configuration
3. Which four policy categories are used in a Server Profile in Intersight Managed Mode?
Compute, Network, Storage, Management
BIOS, Boot, Power, Thermal
LAN, SAN, VLAN, VSAN
Domain, Server, Adapter, QoS
4. Why is template-based provisioning essential for AI clusters?
It reduces the number of VLANs needed
It ensures consistent configuration across all GPU nodes and enables rapid replacement
It eliminates the need for power policies
It automatically enables RoCE on all vNICs
5. When a VLAN policy referenced by multiple domain profiles is updated, what happens?
Only the first domain profile receives the update
All domain profiles referencing it must be manually redeployed
Every domain profile referencing that policy inherits the change automatically
The update is queued until the next maintenance window
Key Points
A Domain Profile configures a Fabric Interconnect pair (ports, VLANs, VSANs, NTP, QoS system classes). Domain profile templates enable reuse across multiple AI clusters.
A Service Profile (UCSM) / Server Profile (IMM) abstracts server identity (UUID, MAC, WWNN, WWPN, boot policy) from hardware -- enabling stateless computing and workload mobility.
Server Profiles in IMM organize policies into four categories: Compute, Network, Storage, Management.
Template-based provisioning lets you define a golden configuration once and derive hundreds of identical server profiles. Updates to the template sync automatically to all derived profiles.
Identity pools (MAC, WWPN, UUID) ensure each derived profile gets unique identifiers while sharing identical policy configuration.
Domain Profile Architecture
A UCS Domain Profile is the top-level configuration construct in Cisco Intersight that represents and configures a pair of Fabric Interconnects (FIs). It encapsulates all the policies that define FI behavior: port configurations, port channels, VLANs, VSANs, and network control settings. A single domain policy (such as a VLAN policy) can be assigned to any number of domain profiles -- updating the policy once propagates changes to all referencing profiles automatically.
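The shared-policy model described above can be sketched in a few lines of Python. This is a hypothetical illustration (class and attribute names are invented, not Intersight API objects): each domain profile holds a reference to a single policy object, so one update is visible to every referencing profile.

```python
# Minimal sketch (hypothetical names, not Intersight API objects): one shared
# policy instance referenced by many domain profiles, so a single update
# propagates to every profile automatically.

class VlanPolicy:
    def __init__(self, name, vlans):
        self.name = name
        self.vlans = set(vlans)

class DomainProfile:
    def __init__(self, name, vlan_policy):
        self.name = name
        self.vlan_policy = vlan_policy  # a reference, not a copy

    def effective_vlans(self):
        return sorted(self.vlan_policy.vlans)

ai_vlans = VlanPolicy("AI-VLANs", [100, 200])
profiles = [DomainProfile(f"ai-cluster-{i}", ai_vlans) for i in range(3)]

ai_vlans.vlans.add(300)  # update the policy once...
# ...and all referencing profiles see the change, no redeployment of copies
assert all(p.effective_vlans() == [100, 200, 300] for p in profiles)
```

The key design choice mirrored here is reference semantics: if each profile stored its own copy of the VLAN list, every update would require touching every profile.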
| Policy | Purpose | AI Relevance |
| --- | --- | --- |
| Port Policy | Configures FI port roles and port channels | Ensures sufficient 100G/200G uplinks for GPU traffic |
| VLAN Policy | Configures L2 broadcast domains | Segregates AI training, storage, and management traffic |
| VSAN Policy | Configures Fibre Channel domains | Enables boot-from-SAN for stateless AI nodes |
| Network Control Policy | CDP, LLDP, MAC settings | Required for proper DCBX negotiation with upstream switches |
| NTP Policy | Time synchronization | Critical for distributed training coordination |
| QoS System Class | Traffic prioritization | Enables no-drop classes for RoCE/RDMA |
Service Profile and Server Profile Design
Cisco UCS implements stateless computing through service profiles (UCS Manager) and server profiles (Intersight Managed Mode). A service profile abstracts the complete server identity -- UUID, MAC addresses, WWNN, WWPN, boot policy, firmware level, and BIOS settings -- from the physical hardware. When migrated to another server, the entire identity moves with it.
| Category | Policies | AI Focus |
| --- | --- | --- |
| Compute | BIOS, Boot Order, Power | GPU-optimized BIOS settings, UEFI boot, power no-cap |
| Network | LAN Connectivity, SAN Connectivity, Adapter Policies | vNIC configuration, RoCE enablement, jumbo MTU |
| Storage | Local disk, SAN storage, Boot-from-SAN | M.2 RAID1 boot, NVMe data drives, SAN boot targets |
| Management | IPMI, Serial over LAN, SNMP, Syslog | Monitoring, out-of-band access, log collection |
Template-Based Provisioning for AI Clusters
For AI clusters where tens or hundreds of identically configured GPU nodes are required, template-based provisioning is essential. Server Profile Templates in Intersight (or Service Profile Templates in UCS Manager) let you define a golden configuration once and derive individual profiles from it. Any modification to the template automatically syncs to all derived profiles.
Worked Example: Creating a GPU Node Server Profile Template in Intersight -- (1) Create template with name AI-GPU-Node-Template, (2) attach Compute policies (GPU-optimized BIOS, UEFI boot, no-cap power), (3) attach Network policies (two RoCE-enabled vNICs, MTU 9000, no-drop QoS), (4) attach Storage (M.2 RAID1 boot), (5) attach Management (SNMP, Syslog), (6) derive individual server profiles and associate each to a physical server.
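The derive-from-template step can be sketched as follows. This is an illustrative model, not Intersight code: the template dict, pool prefix, and profile names are invented, but the pattern — shared policy configuration plus per-profile unique identity drawn from a pool — is the one described above.

```python
# Hypothetical sketch of template-based derivation with an identity pool:
# every derived profile shares the template's policies but draws a unique MAC.
import itertools

def mac_pool(prefix="00:25:B5:00:00"):
    # Generator yielding sequential MACs under an illustrative pool prefix.
    for i in itertools.count():
        yield f"{prefix}:{i:02X}"

TEMPLATE = {
    "bios": "GPU-Optimized",
    "boot": "UEFI-M2-RAID1",
    "power": "no-cap",
    "qos": "Platinum-NoDrop",
}

def derive_profiles(template, count, pool):
    # Each derived profile = template policies + a unique name and identity.
    return [dict(template, name=f"ai-gpu-node-{n:03d}", mac=next(pool))
            for n in range(1, count + 1)]

profiles = derive_profiles(TEMPLATE, 100, mac_pool())
assert len({p["mac"] for p in profiles}) == 100              # unique identities
assert all(p["qos"] == "Platinum-NoDrop" for p in profiles)  # shared policy
```

This is why identity pools matter: without them, deriving a hundred profiles from one golden configuration would produce a hundred identical (and therefore conflicting) MAC addresses.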
Animation: Drag-and-drop domain profile assembly -- attach policies (Port, VLAN, VSAN, NTP, QoS) to a domain profile template, then derive multiple domain profiles for AI clusters.
Post-Quiz: Domain Profiles and Service Profiles
1. A domain profile template is updated to add a new VLAN. What happens to the three domain profiles derived from it?
Nothing -- derived profiles are snapshots at creation time
All three automatically inherit the new VLAN
Only the most recently deployed profile inherits it
The update is rejected because derived profiles are locked
2. Which construct provides stateless computing in UCS Manager?
Domain profile
Service profile
VLAN policy
Power policy
3. In Intersight Managed Mode, which policy category includes BIOS and boot order?
Network
Storage
Compute
Management
4. Where do derived server profiles get unique MAC addresses and WWPNs?
They are manually assigned by the administrator
From identity pools referenced by the template
From the physical server's hardware ROM
From the Fabric Interconnect's MAC table
5. A domain profile configures which level of UCS infrastructure?
Individual server blades
The Fabric Interconnect pair
GPU adapter cards
Storage arrays
Section 2: Power and NTP Policies
Pre-Quiz: Power and NTP Policies
1. How much power can a single NVIDIA H100 GPU draw?
150W
350W
700W
1200W
2. Which power redundancy mode is recommended for AI deployments?
Non-Redundant
N+1 Redundancy
Grid Redundancy
Active-Standby
3. What does "no-cap" power priority mean in UCS?
The server has unlimited power from the grid
The blade is prioritized over others during dynamic power rebalancing
Power capping is disabled for the entire chassis
The PSUs run at maximum output at all times
4. Why is NTP critical for AI training clusters?
It controls GPU clock speeds
Distributed training frameworks rely on synchronized timing for barrier operations
It determines the training batch size
It is required to boot the operating system
5. What does Extended Power Capacity provide on UCS X-Series?
Doubles the number of available PSU slots
Increases total power allocation by 15%
Enables hot-swap of GPU modules
Adds battery backup for uninterruptible operation
Key Points
Grid Redundancy is the default and recommended power mode for AI -- it protects against full power circuit loss (e.g., PDU failure).
No-cap power priority ensures GPU blades are prioritized during power rebalancing; prevents GPU throttling under contention.
Extended Power Capacity on X-Series increases the chassis power budget by 15% -- critical for 8-GPU servers exceeding 6,000W.
UCS dynamically rebalances power: active blades can borrow from idle blades; priority (no-cap > high > medium > low) determines allocation under contention.
NTP is configured at the FI level. Best practice: at least 2 redundant NTP servers (stratum-1 or stratum-2). Needed for distributed training coordination, log correlation, security protocols, and performance benchmarking.
Power Policy for GPU Systems
GPU-accelerated AI servers are among the most power-hungry systems in a data center. A single NVIDIA H100 GPU draws up to 700W, and a server with eight GPUs can easily exceed 6,000W total system power. Cisco UCS power policies must ensure GPU nodes receive adequate power under all conditions.
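The arithmetic behind the ">6,000W" claim is worth making explicit. The GPU figure comes from the text; the non-GPU overhead and chassis budget below are assumed round numbers for illustration only.

```python
# Back-of-envelope power budget for an 8-GPU node. 700 W per H100 is from the
# text; OVERHEAD_W and the chassis budget are assumed illustrative values.

GPU_W, GPUS = 700, 8
OVERHEAD_W = 1200                 # assumed: CPUs, DIMMs, NICs, fans, drives
total = GPU_W * GPUS + OVERHEAD_W
assert total > 6000               # matches the ">6,000 W" claim in the text

# Extended Power Capacity raises the chassis power budget by 15%:
budget = 9000                     # hypothetical chassis budget in watts
extended = budget * 115 // 100    # integer math: +15%
print(total, extended)            # 6800 10350
```

Even with generous headroom, a single 8-GPU blade consumes most of a typical chassis budget, which is why the priority and extended-capacity settings below matter.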
| Redundancy Mode | Description | PSU Behavior on Failure | AI Recommendation |
| --- | --- | --- | --- |
| Grid Redundancy | Two independent power sources | Surviving PSUs on alternate circuit continue | Recommended for all AI deployments |
| N+1 Redundancy | One extra PSU beyond minimum | Remaining PSUs share load | Acceptable for non-critical AI dev |
| Non-Redundant | All PSUs active, no redundancy | Single PSU failure may cause outage | Never use for AI workloads |
Power Capping and Dynamic Rebalancing
UCS uses power control policies to manage how power is allocated and borrowed among blades within a chassis. During normal operation, active blades can borrow power from idle blades. When all blades are active and at their power cap, the priority determines which blades get preference. For AI workloads, use no-cap or high priority.
```mermaid
stateDiagram-v2
    [*] --> InitialAllocation: Server powers on
    InitialAllocation: Initial Power Allocation
    InitialAllocation --> NormalOperation: Power budget assigned
    NormalOperation: Normal Operation
    NormalOperation --> BorrowingPower: Blade needs more power
    BorrowingPower: Borrowing from Idle Blades
    BorrowingPower --> NormalOperation: Load decreases
    NormalOperation --> Contention: All blades active at cap
    Contention: Power Contention
    Contention --> Throttled: Low-priority blade
    Contention --> FullPower: No-cap / High-priority blade
    Throttled: GPU Throttled
    FullPower: Full Power Maintained
    FullPower --> NormalOperation: Contention resolves
    Throttled --> NormalOperation: Contention resolves
```
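The contention behavior in the state diagram can be sketched as a simple priority allocator. This is a conceptual model, not the actual UCS firmware algorithm: blade names, wattages, and the chassis budget are made-up values, but the ordering rule (no-cap satisfied first, lower priorities absorb the shortfall) follows the text.

```python
# Sketch of priority-based power allocation under contention: no-cap blades
# are satisfied first; lower-priority blades absorb the shortfall (throttle).
PRIORITY = {"no-cap": 0, "high": 1, "medium": 2, "low": 3}

def allocate(chassis_budget, blades):
    """blades: list of (name, priority, demand_watts) -> {name: granted_watts}."""
    granted, remaining = {}, chassis_budget
    for name, prio, demand in sorted(blades, key=lambda b: PRIORITY[b[1]]):
        granted[name] = min(demand, remaining)
        remaining -= granted[name]
    return granted

blades = [("gpu-1", "no-cap", 6800), ("gpu-2", "high", 6800),
          ("dev-1", "low", 2000)]
alloc = allocate(15000, blades)
assert alloc["gpu-1"] == 6800        # no-cap blade keeps full power
assert alloc["dev-1"] < 2000         # low-priority blade is throttled
```

Setting GPU blades to no-cap simply moves them to the front of this ordering, which is exactly what prevents GPU throttling when the chassis budget is exhausted.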
| Power Feature | Default Setting | AI-Optimized Setting | Impact |
| --- | --- | --- | --- |
| Redundancy Mode | Grid | Grid | Protects against full circuit loss |
| Power Control Priority | Medium | No-Cap | Prevents GPU throttling under load |
| Extended Power Capacity | Disabled | Enabled | +15% power budget for GPU headroom |
| Power Save Mode | Enabled | Evaluate per deployment | May turn off unused PSUs to save energy |
NTP for AI Clusters
NTP is applied at the Fabric Interconnect level and is common to the FI pair. The NTP policy accepts one to four NTP server addresses. Accurate time synchronization is critical for:
Distributed training coordination -- frameworks like Horovod and NCCL rely on synchronized timing for barrier operations and gradient aggregation
Log correlation -- troubleshooting stalled training jobs across 64+ GPU nodes requires aligned timestamps
Security protocols -- TLS, SSH, and Kerberos depend on acceptable clock skew
Performance benchmarking -- NVIDIA Nsight and DCGM need consistent time references for cross-node comparisons
Best practice: configure at least two NTP servers, preferring internal stratum-1 or stratum-2 sources.
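A quick way to reason about synchronization quality is worst-case pairwise skew across the cluster. The sketch below is illustrative: in practice the per-node offsets come from chrony or ntpq on each host, and the node names and offset values here are made up.

```python
# Illustrative clock-skew check across training nodes. Real deployments pull
# per-node offsets from chrony/ntpq; the sample values below are made up.

def max_skew_us(offsets_us):
    """Worst-case pairwise skew (microseconds) given per-node NTP offsets."""
    return max(offsets_us.values()) - min(offsets_us.values())

offsets = {"gpu-node-01": 400, "gpu-node-02": -900, "gpu-node-03": 1100}
assert max_skew_us(offsets) == 2000   # 1100 - (-900) microseconds
```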
Animation: Power contention simulation -- show 4 blades in a chassis competing for power, with priority-based allocation and GPU throttling visualization when low-priority blades lose budget.
Post-Quiz: Power and NTP Policies
1. An AI chassis has all blades active at maximum load. Which blade gets full power first?
The blade with the most GPUs
The blade with no-cap power priority
The blade that powered on first
All blades share equally regardless of priority
2. Extended Power Capacity on UCS X-Series increases the power budget by what percentage?
5%
10%
15%
25%
3. At which UCS level is NTP configured?
Individual blade BIOS
Fabric Interconnect level
Per-vNIC adapter policy
Storage controller
4. How many NTP servers should be configured as a best practice?
Exactly one for consistency
At least two for redundancy
At least five for accuracy
NTP is not needed if all servers are in the same rack
5. Which power redundancy mode should NEVER be used for AI workloads?
Grid Redundancy
N+1 Redundancy
Non-Redundant
Active-Standby
Section 3: Storage Policies on UCS
Pre-Quiz: Storage Policies
1. What is the recommended boot drive configuration for AI compute nodes on UCS?
Single NVMe drive in RAID0
Two M.2 drives in RAID1
Four SAS drives in RAID5
USB flash drive
2. Why are M.2 boot drives preferred for AI servers?
They are the cheapest storage option
They free up PCIe slots and drive bays for GPUs and NVMe data storage
They provide the highest IOPS for training data
They support RAID5 for better redundancy
3. What is boot-from-SAN?
Booting from a local SAN-attached NVMe drive
Booting an OS from external SAN-based storage rather than a local disk
Using SAN storage as swap space during training
A method to install the OS over the network via PXE
4. What is the default RAID mode of the UCS-M2-HWRAID controller?
RAID1
RAID0
JBOD
RAID5
5. Which FC zoning model is the default recommendation for most deployments?
Single initiator, multiple targets
Multiple initiators, single target
Single initiator, single target
Fabric-wide zoning
Key Points
Use M.2 RAID1 for OS boot on AI servers -- frees all PCIe slots and front-panel bays for GPUs and NVMe data drives.
Two M.2 RAID controllers available: UCS-M2-HWRAID (SATA, RAID1 only) and UCS-M2-NVRAID (NVMe, RAID0/1, recommended for new builds).
The default mode for UCS-M2-HWRAID is JBOD -- it must be explicitly reconfigured to RAID1 for production.
Boot-from-SAN enables true stateless computing: when a service profile migrates, the new server boots from the same SAN OS image. Requires vHBAs, WWNN/WWPN, and proper FC zoning.
NVMe local storage should be reserved for dataset staging, model checkpoints, and scratch space -- not OS boot.
FC zoning: Single initiator, single target (one zone per vHBA-storage port pair) is the default and clearest for troubleshooting.
Local Disk Policies
The best practice for AI compute nodes is two disks in RAID1 as a boot drive, keeping the OS separate from data storage. M.2 boot drives have become the preferred approach because they free up all PCIe slots and front-panel drive bays for GPU cards and NVMe data storage.
| Controller | Model | Supported RAID | Boot Mode | Notes |
| --- | --- | --- | --- | --- |
| UCS-M2-HWRAID | SATA M.2 RAID | RAID1 only | UEFI only | Legacy option, widely deployed |
| UCS-M2-NVRAID | NVMe M.2 RAID | RAID0, RAID1 | UEFI only | Higher performance, recommended for new builds |
With the OS on M.2 drives, the remaining NVMe slots can be dedicated to high-speed dataset staging, model checkpoint storage, and scratch space. NVMe local storage provides the lowest latency for these operations, critical when training jobs need to load datasets of hundreds of GB to multiple TB quickly.
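The key gotchas from this section (the JBOD default on UCS-M2-HWRAID, RAID1 for boot, UEFI-only controllers) can be captured in a small validator. This is an illustrative sketch with invented field names, not a real policy schema.

```python
# Sketch validating a local-disk policy for an AI boot volume. Field names are
# hypothetical; the rules encode the gotchas from the text (JBOD default on
# UCS-M2-HWRAID, RAID1 for boot, UEFI-only M.2 controllers).

def check_boot_policy(policy):
    issues = []
    if policy.get("raid") == "JBOD":
        issues.append("controller left at JBOD default: no OS mirroring")
    elif policy.get("raid") != "RAID1":
        issues.append("boot volume should be two M.2 drives in RAID1")
    if policy.get("boot_mode") != "UEFI":
        issues.append("M.2 controllers boot in UEFI mode only")
    return issues

assert check_boot_policy({"raid": "JBOD", "boot_mode": "UEFI"})       # flags JBOD
assert not check_boot_policy({"raid": "RAID1", "boot_mode": "UEFI"})  # clean
```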
SAN Connectivity and Boot-from-SAN
Boot from SAN allows servers to boot an OS from external SAN-based storage rather than a local disk. This is central to UCS's stateless computing model. When a service profile migrates, the new server boots from the exact same OS image on the SAN.
Configuring Boot-from-SAN
1. Open the Service Profile / Server Profile storage settings and navigate to vHBAs
2. Assign a WWNN (static or from a pool)
3. Click Add SAN Boot, specify the vHBA name and primary/secondary path
4. Enter the WWPN of the storage target and the appropriate LUN ID
5. Configure FC zoning: initiator (vHBA) to target (storage array)
| Zoning Model | Description | When to Use |
| --- | --- | --- |
| Single Initiator, Single Target | One zone per vHBA-storage port pair; two members per zone | Default for most deployments; clearest troubleshooting |
| Single Initiator, Multiple Targets | One zone per vHBA containing all its target ports | When zone count may reach or exceed platform limits |
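The difference between the two zoning models is easiest to see as zone generation. The sketch below is illustrative (vHBA and storage-port names are placeholders, not real WWPNs): single-initiator/single-target zone counts grow multiplicatively, which is what eventually pushes large fabrics toward the multi-target model.

```python
# Sketch of the two FC zoning models: single-initiator/single-target yields one
# two-member zone per vHBA-target pair; single-initiator/multi-target collapses
# each vHBA's targets into one zone. Names are placeholders, not real WWPNs.

def sis_zones(initiators, targets):
    # One zone per (initiator, target) pair -- two members each.
    return [{"name": f"z_{i}_{t}", "members": [i, t]}
            for i in initiators for t in targets]

def simt_zones(initiators, targets):
    # One zone per initiator containing all of its targets.
    return [{"name": f"z_{i}", "members": [i, *targets]} for i in initiators]

vhbas = ["vhba0", "vhba1"]
ports = ["sp_a1", "sp_a2"]
assert len(sis_zones(vhbas, ports)) == 4    # grows multiplicatively
assert len(simt_zones(vhbas, ports)) == 2   # fewer zones near platform limits
```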
```mermaid
graph TD
    subgraph Server["AI GPU Server"]
        subgraph Boot["Boot Storage"]
            M2C["M.2 RAID Controller"]
            M2A["M.2 Drive A"] --> M2C
            M2B["M.2 Drive B"] --> M2C
            M2C -->|RAID1 Mirror| OS["OS Boot Volume (UEFI)"]
        end
        subgraph Data["Data Storage (PCIe Slots)"]
            NV1["NVMe Drive 1"]
            NV2["NVMe Drive 2"]
            NV3["NVMe Drive N"]
        end
        subgraph SAN["SAN Connectivity"]
            VHBA0["vHBA0 (Primary Path)"]
            VHBA1["vHBA1 (Secondary Path)"]
        end
    end
    NV1 -->|"Dataset Staging"| GPU["GPU Training Jobs"]
    NV2 -->|"Checkpoints"| GPU
    NV3 -->|"Scratch Space"| GPU
    VHBA0 -->|FC Fabric A| SA["Storage Array (Boot LUN)"]
    VHBA1 -->|FC Fabric B| SA
    SA -.->|"Boot-from-SAN (stateless)"| OS
    style Boot fill:#e8f4e8
    style Data fill:#e8e8f4
    style SAN fill:#f4e8e8
```
Best Practices for AI Storage Configuration
| Practice | Rationale |
| --- | --- |
| Use M.2 RAID1 for OS boot | Frees PCIe slots for GPUs; provides OS redundancy |
| Use UEFI boot mode exclusively | Required for M.2 controllers; standard for modern AI servers |
| Configure boot-from-SAN for stateless nodes | Enables rapid server replacement without OS reinstallation |
| Dedicate NVMe drives to dataset staging | Minimizes I/O bottleneck during training data loading |
| Use single-drive RAID0 only when one disk is present | Provides a bootable virtual drive when mirroring is not possible (no redundancy) |
Animation: AI server storage architecture walkthrough -- show M.2 RAID1 boot path, NVMe data path to GPUs, and SAN boot failover between dual vHBAs across two FC fabrics.
Post-Quiz: Storage Policies
1. Why must UCS-M2-HWRAID be explicitly reconfigured for production AI deployments?
It defaults to RAID0, which has no redundancy
It defaults to JBOD mode, which provides no mirroring
It defaults to RAID5, which is too slow
It does not support UEFI boot by default
2. What happens when a service profile with boot-from-SAN migrates to a new physical server?
The OS must be reinstalled on the new server
The new server boots from the same SAN OS image using the migrated identity
The SAN storage is automatically replicated to the new server's local disk
Boot-from-SAN does not support profile migration
3. Which M.2 RAID controller is recommended for new AI server builds?
UCS-M2-HWRAID
UCS-M2-NVRAID
UCS-M2-SATARAID
Any controller works equally well
4. What should NVMe local storage be used for on AI servers?
OS boot volume
Dataset staging, model checkpoints, and scratch space
Backup of the SAN boot LUN
VLAN configuration storage
5. In boot-from-SAN configuration, what identity must be assigned to each vHBA?
IP address and subnet mask
WWNN and WWPN
MAC address and VLAN
UUID and serial number
Section 4: LAN Connectivity and QoS on UCS
Pre-Quiz: LAN Connectivity and QoS
1. What MTU should be configured on vNICs carrying RoCEv2 AI training traffic?
1500
4096
9000
16000
2. Which QoS system class and CoS value are used for RoCEv2 on UCS?
Gold, CoS 4
Platinum, CoS 5, no-drop
Silver, CoS 3, drop
Best Effort, CoS 0
3. What mechanism prevents packet drops for RoCEv2 traffic?
TCP retransmission
Priority Flow Control (PFC)
Link aggregation
VLAN trunking
4. RoCEv2 on UCS cannot coexist with which feature on the same vNIC?
VLAN tagging
NVGRE, NetFlow, or VMQ
Jumbo frames
RSS (Receive Side Scaling)
5. Why must QoS configuration be consistent across UCS and upstream Nexus switches?
Different vendors require different CoS values
A PFC mismatch at any point causes RDMA packet drops
Nexus switches do not support no-drop classes
UCS cannot communicate with Nexus without identical firmware
Key Points
AI training vNICs need: MTU 9000 (jumbo frames), dedicated AI VLAN, RoCE-enabled adapter policy, and Platinum no-drop QoS (CoS 5).
The Platinum system class with CoS 5 and no-drop triggers PFC, which pauses transmission when buffers fill rather than dropping packets -- essential for RDMA.
End-to-end QoS consistency is mandatory: the UCS Fabric Interconnects and the upstream Nexus 9000 switches must all apply matching PFC and ECN settings on the same CoS value.
RoCEv2 cannot coexist with NVGRE, NetFlow, or VMQ on the same vNIC. Requires VIC 1400 or 15000 series adapters (M5+).
Adapter policy tuning: enable RSS, maximize TX/RX queues, set queue pairs (min 4, up to 8192), and use static interrupt coalescing for sustained high-throughput AI traffic. Disable adaptive interrupt coalescing at >80% link utilization.
Typical AI cluster VLAN design: RoCEv2 training (Platinum, MTU 9000), Storage (Gold, MTU 9000), Management (Best Effort, MTU 1500), Provisioning (Best Effort, MTU 1500).
vNIC Configuration
The LAN Connectivity Policy defines how vNICs connect to the network. For AI workloads, vNIC configuration is where network performance is won or lost.
| Parameter | Description | AI-Optimized Setting |
| --- | --- | --- |
| VLAN Assignment | Native and allowed VLANs | Dedicated VLANs for AI training, storage, management |
| MAC Address | Static or from pool | Pool-based for template-driven provisioning |
| MTU | Maximum Transmission Unit | 9000 (jumbo frames) for RDMA/RoCE |
| Failover | Active/standby behavior | Enabled for resiliency |
| Adapter Policy | Determines vNIC behavior | RoCE-enabled policy |
| QoS Policy | Assigns system class to traffic | No-drop class for RDMA interfaces |
| Network Control Policy | CDP, LLDP, MAC settings | LLDP enabled for DCBX negotiation |
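An AI-optimized vNIC definition pulls all of these parameters together. The helper below is a hypothetical sketch — the field names and policy names are illustrative, not Intersight API attributes — but it shows the full set of settings a training vNIC should carry.

```python
# Hypothetical helper assembling an AI-optimized vNIC definition with the
# settings from the table; names are illustrative, not Intersight API fields.

def ai_vnic(name, vlan, mac_pool="AI-MAC-Pool"):
    return {
        "name": name,
        "vlan": vlan,                  # dedicated AI training VLAN
        "mac_source": mac_pool,        # pool-based for template provisioning
        "mtu": 9000,                   # jumbo frames for RDMA/RoCE
        "failover": True,              # resiliency across fabrics
        "adapter_policy": "AI-RoCEv2", # RoCE-enabled adapter policy
        "qos_policy": "AI-RoCE-QoS",   # maps to the Platinum no-drop class
        "network_control": "LLDP-On",  # DCBX negotiation with upstream
    }

nic = ai_vnic("eth-roce-a", vlan=100)
assert nic["mtu"] == 9000 and nic["failover"]
```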
LAN Connectivity and VLAN Design
| VLAN Purpose | Traffic Type | MTU | QoS Class |
| --- | --- | --- | --- |
| AI Training / GPU-to-GPU | RoCEv2 RDMA | 9000 | Platinum (no-drop, CoS 5) |
| Storage (NVMe-oF/iSCSI) | Storage I/O | 9000 | Gold or Platinum |
| Management | IPMI, SSH, Intersight | 1500 | Best Effort |
| Provisioning / PXE | OS deployment | 1500 | Best Effort |
QoS System Classes for AI Traffic
Cisco UCS Manager supports multiple QoS system classes configured at LAN > LAN Cloud > QoS System Class. These map to CoS values and determine how the Fabric Interconnect prioritizes and queues traffic. Enabling RoCE requires configuring Platinum with CoS 5 as no-drop, which triggers Priority Flow Control (PFC). A single dropped RDMA packet forces an expensive transport-layer retransmission, destroying RDMA's latency advantage.
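The behavioral difference between a drop class and a no-drop class can be shown with a toy queue model. This is a deliberately simplified illustration (real PFC operates on buffer thresholds and pause frames per CoS), but it captures why no-drop matters for RDMA: congestion produces back-pressure instead of loss.

```python
# Toy model of drop vs. no-drop queuing. A drop-class queue discards frames
# when full; a no-drop (PFC) queue asserts pause so nothing is lost. This is
# a simplification -- real PFC pauses the sender per CoS via pause frames.

def enqueue(queue, depth, frames, no_drop):
    dropped = paused = 0
    for f in range(frames):
        if len(queue) < depth:
            queue.append(f)
        elif no_drop:
            paused += 1   # PFC pause: sender backs off, frame is retried
        else:
            dropped += 1  # drop class: frame is lost, RDMA must recover
    return dropped, paused

dropped, paused = enqueue([], depth=8, frames=12, no_drop=False)
assert dropped == 4 and paused == 0   # lossy: 4 frames lost
dropped, paused = enqueue([], depth=8, frames=12, no_drop=True)
assert dropped == 0 and paused == 4   # lossless: pauses instead of drops
```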
```mermaid
flowchart LR
    subgraph Server["GPU Server"]
        VNIC["vNIC (RoCEv2 Enabled)"]
        AP["Adapter Policy: Queue Pairs, RSS, Interrupt Coalescing"]
        QP["QoS Policy (Platinum)"]
    end
    subgraph FI["Fabric Interconnect"]
        SC["QoS System Class: Platinum = CoS 5 No-Drop"]
        PFC1["PFC Enabled on CoS 5"]
    end
    subgraph Nexus["Upstream Nexus 9000"]
        PFC2["PFC Enabled on CoS 5"]
        ECN["ECN Configured for RDMA Class"]
    end
    subgraph Dest["Destination GPU Server"]
        VNIC2["vNIC (RoCEv2 Enabled)"]
    end
    AP --> VNIC
    QP --> VNIC
    VNIC -->|"CoS 5 Tagged, MTU 9000"| SC
    SC --> PFC1
    PFC1 -->|"Lossless Path"| PFC2
    PFC2 --> ECN
    ECN -->|"Lossless Path"| VNIC2
    style VNIC fill:#4a90d9,color:#fff
    style VNIC2 fill:#4a90d9,color:#fff
    style SC fill:#d94a4a,color:#fff
    style PFC1 fill:#d94a4a,color:#fff
    style PFC2 fill:#d94a4a,color:#fff
    style ECN fill:#d94a4a,color:#fff
```
| QoS System Class | CoS Value | Drop Policy | Typical Use |
| --- | --- | --- | --- |
| Platinum | 5 | No-Drop | RoCEv2 / RDMA for AI training |
| Gold | 4 | Drop | Storage traffic (iSCSI, NVMe-oF) |
| Silver | 2 | Drop | Standard application traffic |
| Bronze | 1 | Drop | Background / bulk transfers |
| Best Effort | 0 | Drop | Management, default traffic |
Adapter-Level QoS for RoCE
Beyond system-class configuration, the adapter policy on each vNIC must be tuned for RoCEv2. Cisco provides predefined adapter policies, though custom user-defined policies are recommended for Linux RDMA AI training workloads.
| Adapter Policy | RoCE Mode | Use Case |
| --- | --- | --- |
| Win-HPN-SMBd | RoCEv2 Mode 1 | Windows HPN with SMB Direct |
| MQ-SMBd | RoCEv2 Mode 2 | Multi-queue SMB Direct |
| Custom (user-defined) | Configurable | Linux RDMA for AI training (recommended) |
RoCEv2 constraints: Cannot coexist with NVGRE, NetFlow, or VMQ on the same vNIC. Requires VIC 1400 or VIC 15000 series adapters (M5+ servers). Supports up to 2 RoCEv2-enabled vNICs per adapter and 4 virtual ports per adapter interface. Queue pairs: minimum 4, maximum up to 8192 (platform-dependent).
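These constraints lend themselves to an automated pre-deployment check. The sketch below encodes the rules stated above; the function name and input shape are invented for illustration, and the adapter-model check is a simple string prefix rather than a real hardware inventory lookup.

```python
# Sketch encoding the RoCEv2 constraints from the text as a validator:
# VIC 1400/15000 adapters, max 2 RoCEv2 vNICs per adapter, no coexistence
# with NVGRE/NetFlow/VMQ, queue pairs between 4 and 8192.

INCOMPATIBLE = {"NVGRE", "NetFlow", "VMQ"}

def validate_roce(adapter_model, roce_vnics, features, queue_pairs):
    errors = []
    if not (adapter_model.startswith("VIC 14") or adapter_model.startswith("VIC 15")):
        errors.append("RoCEv2 requires VIC 1400 or VIC 15000 series adapters")
    if roce_vnics > 2:
        errors.append("at most 2 RoCEv2-enabled vNICs per adapter")
    clash = INCOMPATIBLE & set(features)
    if clash:
        errors.append(f"RoCEv2 cannot coexist with {sorted(clash)} on the same vNIC")
    if not 4 <= queue_pairs <= 8192:
        errors.append("queue pairs must be between 4 and 8192")
    return errors

assert not validate_roce("VIC 1467", 2, ["RSS"], 1024)   # valid AI config
assert len(validate_roce("VIC 1387", 3, ["VMQ"], 2)) == 4  # every rule violated
```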
| Tuning Parameter | AI Recommendation |
| --- | --- |
| Interrupt Coalescing | Static coalescing with tuned intervals for sustained high throughput |
| Adaptive Interrupt Coalescing | Disable for AI workloads at >80% link utilization |
| Receive Side Scaling (RSS) | Enable on all vNICs for high-throughput data pipelines |
| TX/RX Queue Count | Maximize to enable parallel packet processing across CPU cores |
RoCEv2 Configuration Workflow
```mermaid
flowchart TD
    S1["Step 1: Enable No-Drop System Class (LAN > LAN Cloud > QoS System Class: Platinum, CoS 5, No-Drop)"]
    S2["Step 2: Create QoS Policy (LAN > Policies > QoS Policies: AI-RoCE-QoS - Platinum)"]
    S3["Step 3: Create Adapter Policy (Enable RoCEv2, set queue pairs, enable RSS, max TX/RX queues)"]
    S4["Step 4: Create LAN Connectivity Policy (Add RDMA vNIC: AI VLAN, MTU 9000, attach QoS + adapter policy)"]
    S5["Step 5: Verify Upstream Switches (Nexus 9000: PFC on CoS 5, ECN for same traffic class)"]
    S6["Step 6: Attach to Server Profile Template (Reference LAN Connectivity Policy in AI-GPU-Node-Template)"]
    S1 --> S2 --> S3 --> S4 --> S5 --> S6
    S1 -.->|"System-level config"| FI["Fabric Interconnect"]
    S4 -.->|"Per-server config"| SP["Server Profile"]
    S5 -.->|"Network-level config"| NX["Nexus 9000"]
    style S1 fill:#d94a4a,color:#fff
    style S2 fill:#d97a4a,color:#fff
    style S3 fill:#d9b34a,color:#fff
    style S4 fill:#7bc47f,color:#fff
    style S5 fill:#4a90d9,color:#fff
    style S6 fill:#7a4ad9,color:#fff
```
Animation: End-to-end RoCEv2 packet flow -- trace a tagged CoS 5 packet from GPU server vNIC through the Fabric Interconnect (PFC queuing), across uplink to Nexus 9000 (PFC + ECN), and into the destination GPU server vNIC. Highlight lossless behavior at each hop.
Post-Quiz: LAN Connectivity and QoS
1. What is the first step in configuring RoCEv2 on UCS Manager?
Create the LAN Connectivity Policy
Enable the Platinum no-drop system class with CoS 5
Configure the adapter policy with queue pairs
Verify upstream Nexus switch PFC settings
2. What happens if PFC is configured on UCS but NOT on the upstream Nexus switches?
Traffic falls back to TCP automatically
RDMA packets will be dropped at the mismatch point
The Nexus switches auto-negotiate PFC via DCBX
Only management traffic is affected
3. Why should Adaptive Interrupt Coalescing be disabled for AI workloads at high utilization?
It consumes too much CPU
It provides no latency benefit when link utilization exceeds 80%
It conflicts with RoCEv2
It causes packet drops
4. How many RoCEv2-enabled vNICs can be configured per VIC adapter?
1
2
4
8
5. Which adapter policy type is recommended for Linux RDMA AI training on UCS?