NDFC exposes REST APIs for automation (IaC) via Terraform, Ansible, Nexus-as-Code (NaC), and Python.
DCNM has reached End of Life -- all modern capabilities are exclusive to NDFC.
Platform Architecture and Unified Services
Cisco Nexus Dashboard is the unified operations platform for data center network management. With version 4.1, three previously separate services -- Fabric Controller (NDFC), Orchestrator (NDO), and Insights -- were consolidated into a single-image installation. Exam questions may reference them individually, but they operate within one platform.
Microburst detection, INT, anomaly correlation and remediation
NDFC for AI Fabric Provisioning
NDFC is the comprehensive management and automation solution for Cisco Nexus and MDS platforms running NX-OS. For AI workloads it provides:
Built-in AI fabric templates that pre-configure PFC and ECN for lossless RoCEv2 transport -- hours of manual work completed in minutes.
Support for Nexus 9364E 800G switches with Cisco Intelligent Packet Flow algorithms optimized for AI training and inference.
Centralized configuration push with drift detection that alarms when devices deviate from intended state.
VXLAN EVPN fabric management with BGP EVPN control plane for multi-tenancy and mobility across data centers.
Analogy: Think of NDFC as a building contractor's project management software. Rather than visiting each site (switch) with paper blueprints, the contractor loads the approved blueprint (fabric template) into the system, which pushes instructions to every subcontractor (device) simultaneously and flags deviations.
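NDFC's REST API can drive all of this programmatically. The sketch below builds (but does not send) the login and fabric-creation requests. The hostname and credentials are placeholders, and the endpoint path and template name are assumptions based on typical NDFC 12.x deployments -- confirm them against your controller's API reference.

```python
import json

# Hypothetical controller address -- replace with your own.
ND_HOST = "https://nd.example.com"

def build_login_request(username: str, password: str) -> tuple[str, dict]:
    """Build the Nexus Dashboard login request (POST /login).
    Payload shape follows the typical ND/NDFC login API -- verify
    against your controller's API documentation."""
    url = f"{ND_HOST}/login"
    payload = {"userName": username, "userPasswd": password, "domain": "local"}
    return url, payload

def build_fabric_create_request(fabric_name: str, bgp_asn: str) -> tuple[str, dict]:
    """Build a fabric-creation request for the NDFC LAN-fabric REST API.
    The path and the 'Easy_Fabric' VXLAN EVPN template name are typical
    NDFC 12.x values -- treat them as assumptions."""
    url = f"{ND_HOST}/appcenter/cisco/ndfc/api/v1/lan-fabric/rest/control/fabrics"
    payload = {
        "fabricName": fabric_name,
        "templateName": "Easy_Fabric",   # VXLAN EVPN fabric template
        "nvPairs": {"BGP_AS": bgp_asn},  # minimal illustrative parameter set
    }
    return url, payload

url, body = build_fabric_create_request("ai-backend", "65001")
print(url)
print(json.dumps(body, indent=2))
```

In practice these requests would be sent with an HTTP client, reusing the token returned by the login call; the same payloads are what a Terraform provider or Ansible module assembles under the hood.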
Multi-Site Orchestration with NDO
NDO provides centralized network and policy management across ACI, Cloud ACI, and VXLAN EVPN sites:
Maintains VLAN, VRF, and network configuration consistency across sites
Automates multi-fabric connectivity between NX-OS VXLAN EVPN and ACI
Enables seamless L2/L3 network extension with end-to-end policy
Streamlines security and segmentation for GPO-aware VXLAN fabrics
Telemetry and Insights for AI Fabrics
Nexus Dashboard Insights collects streaming telemetry from every fabric node. For AI workloads -- where a single congested link can stall training across thousands of GPUs -- this visibility is critical.
| Capability | Description |
| --- | --- |
| Anomaly Detection | Learns baseline behavior and flags deviations automatically |
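The baseline-and-deviation idea behind anomaly detection can be illustrated with a minimal z-score check. This is a conceptual sketch only -- Insights uses far richer models -- and the utilization samples below are invented:

```python
from statistics import mean, stdev

def detect_anomaly(baseline: list[float], current: float, k: float = 3.0) -> bool:
    """Flag `current` as anomalous if it deviates more than k standard
    deviations from the learned baseline (a simple z-score rule used
    here purely for illustration)."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(current - mu) > k * sigma

# Link-utilization samples (percent) learned as the baseline
history = [41.0, 39.5, 40.2, 40.8, 39.9, 40.4]
print(detect_anomaly(history, 40.6))  # False -- within normal variation
print(detect_anomaly(history, 97.0))  # True  -- microburst-level spike
```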
Key Points
Contracts are the mechanism that defines allowed inter-EPG communication (e.g., permitting RoCEv2 between GPU EPGs).
APIC is the centralized SDN controller (typically 3-node cluster) for fabric discovery, policy programming, and health monitoring.
AI design pattern: separate East-West GPU traffic (lossless, RoCEv2) from North-South storage/external traffic using distinct VRFs and bridge domains.
ACI spine-leaf topology ensures every GPU is exactly two hops from every other GPU.
APIC integrates with VMware vCenter, Kubernetes, and OpenStack for dynamic endpoint mapping.
ACI Fabric Architecture for AI
Cisco ACI is the SDN solution for data centers, providing application-driven policy management through a declarative, object-oriented framework. It uses a spine-leaf topology in which intent is expressed as policy objects and enforced fabric-wide by APIC.
The ACI Policy Model
Understanding this hierarchy is essential for the exam:
graph TD
T["Tenant"] --> VRF["VRF (L3 Forwarding Domain)"]
T --> AP["Application Profile"]
VRF --> BD["Bridge Domain"]
BD --> SUB["Subnet"]
AP --> EPG["Endpoint Group (EPG)"]
EPG --> EP["Endpoints (Servers, VMs, Containers)"]
EPG --> CON["Contracts (Inter-EPG Communication)"]
style T fill:#1a5276,color:#fff
style VRF fill:#2874a6,color:#fff
style AP fill:#2874a6,color:#fff
style BD fill:#2e86c1,color:#fff
style EPG fill:#2e86c1,color:#fff
style SUB fill:#5dade2,color:#fff
style EP fill:#85c1e9,color:#000
style CON fill:#85c1e9,color:#000
| Component | Definition | AI Infrastructure Relevance |
| --- | --- | --- |
| Tenant | Logical container for policies; unit of isolation | Separate tenants for AI training vs. inference, or per business unit |
| VRF | Unique L3 forwarding and policy domain | Isolates GPU backend traffic from general DC traffic |
| Bridge Domain | Forwarding policy providing VLAN-like behavior | Maps to specific AI cluster segments needing L2 adjacency |
| Application Profile | Container for EPGs within a tenant | Groups all EPGs of a single AI app (e.g., training pipeline) |
| EPG | Named logical entity containing endpoints | GPU servers in one EPG, storage in another, management in a third |
| Contract | Policy enabling inter-EPG communication | Permits RoCEv2 between GPU EPGs; allows storage access for checkpointing |
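As a rough sketch of how this hierarchy is expressed programmatically, the following builds an APIC REST payload creating a tenant, VRF, bridge domain, application profile, and EPG. The managed-object class names (fvTenant, fvCtx, fvBD, fvAp, fvAEPg) are standard ACI classes, and such payloads are typically POSTed to /api/mod/uni.json; the object names are illustrative, and the exact structure should be verified against the APIC REST API guide.

```python
import json

def tenant_payload(tenant: str, vrf: str, bd: str, ap: str, epg: str) -> dict:
    """Build an APIC REST payload for the hierarchy
    Tenant -> VRF -> Bridge Domain -> Application Profile -> EPG."""
    return {
        "fvTenant": {
            "attributes": {"name": tenant},
            "children": [
                {"fvCtx": {"attributes": {"name": vrf}}},
                {"fvBD": {
                    "attributes": {"name": bd},
                    "children": [
                        # Bind the bridge domain to its VRF
                        {"fvRsCtx": {"attributes": {"tnFvCtxName": vrf}}},
                    ],
                }},
                {"fvAp": {
                    "attributes": {"name": ap},
                    "children": [
                        {"fvAEPg": {
                            "attributes": {"name": epg},
                            "children": [
                                # Bind the EPG to its bridge domain
                                {"fvRsBd": {"attributes": {"tnFvBDName": bd}}},
                            ],
                        }},
                    ],
                }},
            ],
        }
    }

body = tenant_payload("ai-training", "gpu-backend-vrf", "gpu-bd",
                      "training-app", "gpu-workers")
print(json.dumps(body, indent=2))
```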
flowchart LR
subgraph EW["East-West Traffic (GPU Backend)"]
direction LR
GPU1["GPU Worker EPG"] -- "RoCEv2 Lossless" --> GPU2["Parameter Server EPG"]
GPU2 -- "Gradient Updates All-Reduce" --> GPU1
end
subgraph NS["North-South Traffic (Storage/External)"]
direction TB
GPU3["GPU Worker EPG"] -- "NFS/iSCSI Checkpointing" --> STOR["Training Storage EPG"]
EXT["External API Clients"] -- "Inference Requests" --> GPU3
end
style EW fill:#f9e79f,color:#000
style NS fill:#aed6f1,color:#000
East-West (Inter-GPU): The dominant pattern during training. GPUs exchange gradient updates via all-reduce operations. Requires ultra-low latency and lossless delivery -- a single dropped packet can stall the entire collective across thousands of GPUs.
North-South (Storage/External): Data ingestion, model checkpointing, and inference API serving. High-bandwidth but more tolerant of minor latency variations.
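A contract permitting the East-West RoCEv2 pattern reduces to a filter matching UDP destination port 4791 (the IANA-assigned RoCEv2 port). The sketch below builds that filter and contract as an APIC payload using the standard vzFilter/vzEntry/vzBrCP/vzSubj classes; the tenant and object names are illustrative assumptions.

```python
import json

def rocev2_contract_payload(contract: str, filter_name: str) -> dict:
    """Build filter + contract objects that permit RoCEv2 (UDP/4791)
    between GPU EPGs. Object names are illustrative."""
    return {
        "fvTenant": {
            "attributes": {"name": "ai-training"},
            "children": [
                {"vzFilter": {
                    "attributes": {"name": filter_name},
                    "children": [
                        {"vzEntry": {"attributes": {
                            "name": "rocev2",
                            "etherT": "ip",
                            "prot": "udp",
                            "dFromPort": "4791",  # RoCEv2 destination UDP port
                            "dToPort": "4791",
                        }}},
                    ],
                }},
                {"vzBrCP": {
                    "attributes": {"name": contract},
                    "children": [
                        {"vzSubj": {
                            "attributes": {"name": "rocev2-subj"},
                            "children": [
                                {"vzRsSubjFiltAtt": {"attributes": {
                                    "tnVzFilterName": filter_name}}},
                            ],
                        }},
                    ],
                }},
            ],
        }
    }

payload = rocev2_contract_payload("gpu-rocev2", "allow-rocev2")
print(json.dumps(payload, indent=2))
```

The GPU EPGs would then provide and consume this contract to open the RoCEv2 path while everything else stays denied by default.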
APIC as the SDN Controller
The APIC cluster (typically three controllers) serves as the centralized policy engine:
Fabric discovery -- automatically discovers spine and leaf switches via LLDP
Policy programming -- translates high-level intent into concrete forwarding rules
Health monitoring -- tracks health scores at tenant, application, EPG, and endpoint levels
Spine-leaf topology ensures every GPU is two hops from every other GPU.
North-South storage access: separate bridge domains and contracts with appropriate QoS.
Animation Slot: ACI policy model hierarchy builder -- interactive drag-and-drop showing how Tenants, VRFs, Bridge Domains, Application Profiles, EPGs, and Contracts relate, with an AI training tenant example.
Post-Quiz: APIC and ACI
1. In an AI training ACI design, GPU backend traffic should be placed in a separate VRF because:
A) ACI requires one VRF per physical switch
B) It isolates high-bandwidth lossless RoCEv2 traffic from other data center traffic
C) VRFs are only used for external routing
D) Bridge Domains cannot span multiple VRFs
2. A Contract in ACI performs which function?
A) Assigns IP addresses to endpoints
B) Defines allowed communication between EPGs via filters and subjects
C) Configures VXLAN tunnel endpoints
D) Provisions physical switch ports
3. How does APIC discover spine and leaf switches in the ACI fabric?
A) Manual IP address registration
B) Automatic discovery via LLDP
C) DHCP snooping
D) BGP peering auto-negotiation
4. Why is a single dropped packet in East-West GPU traffic so impactful?
A) It causes the entire ACI fabric to reconverge
B) It triggers retransmission that can stall the entire collective operation across thousands of GPUs
C) It forces APIC to reprogram all leaf switches
D) It disables PFC on the affected link
5. In a spine-leaf ACI topology, how many hops separate any two GPUs?
A) One
B) Two
C) Three
D) It varies based on fabric size
Section 3: Hyperfabric Deployment
Pre-Quiz: Hyperfabric
1. What distinguishes Hyperfabric's operational model from Nexus Dashboard and ACI?
A) It uses on-premises APIC controllers
B) It uses a cloud controller managed by Cisco
C) It requires manual CLI configuration
D) It only supports NX-OS switches
2. Hyperfabric full stack AI infrastructure is compliant with which NVIDIA architecture?
A) NVIDIA DGX BasePOD
B) NVIDIA Enterprise Reference Architecture (ERA)
C) NVIDIA HGX Blueprint
D) NVIDIA SuperPOD
3. What provisioning model does Hyperfabric use for Day-1 deployment?
A) Manual CLI bootstrap
B) POAP with DHCP
C) Zero-touch plug-and-play provisioning
D) Ansible push from local server
4. How many distinct fabric tiers does the Hyperfabric AI cluster architecture provide?
A) One unified fabric
B) Two (compute and storage)
C) Three (backend, frontend, storage)
D) Four (backend, frontend, storage, management)
5. What management approach does Hyperfabric use during Day-2 operations?
A) SNMP-based polling
B) Assertion-based management
C) Manual health checks
D) Syslog-driven automation
Key Points
Hyperfabric uses a cloud controller managed by Cisco -- fundamentally different from on-prem controllers (ND/APIC).
Full stack AI option is NVIDIA ERA-compliant, integrating Cisco Silicon One switches, UCS C885A servers with NVIDIA HGX GPUs, and storage.
Zero-touch plug-and-play provisioning: power on switches, cloud controller claims and configures -- operational in minutes.
Assertion-based management: continuously validates operational state matches defined intent.
Intersight (purchased separately) provides detailed GPU server and storage management alongside Hyperfabric.
Architecture and Cloud-Managed Model
Cisco Nexus Hyperfabric uses a cloud controller managed by Cisco to design, deploy, and manage fabrics located anywhere -- primary data centers, colocation facilities, and edge sites. This eliminates the operational burden of maintaining on-premises controllers.
Analogy: The difference between self-hosting your email server (Nexus Dashboard/ACI) versus using a managed email service (Hyperfabric). Both deliver email, but the managed service eliminates platform maintenance overhead.
Hyperfabric Full Stack AI Infrastructure
The full stack option is a turnkey, vertically integrated AI platform that is NVIDIA ERA-compliant:
| Component | Specifics |
| --- | --- |
| Networking | Cisco Silicon One switches (6000 Series and N9100/N9300 Series) |
| Compute | Cisco UCS C885A M8 Rack Servers with NVIDIA HGX GPUs |
| Storage | Integrated high-throughput storage systems |
| Management | Nexus Hyperfabric cloud controller for end-to-end lifecycle |
Day-0 (Design and Order): Preconfigured templates for AI clusters, cloud-assisted capacity planning, integrated ordering workflow.
Day-1 (Deployment): Hardware arrives, switches power on and auto-connect to the cloud controller via zero-touch plug-and-play. Fully operational fabric in minutes -- no manual CLI.
Day-2 (Operations): Continuous loss/latency monitoring, assertion-based validation against defined intent, cloud-managed firmware upgrades, and auto-provisioned scaling.
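Assertion-based management can be pictured as a continuous diff between declared intent and observed operational state. The following is a conceptual sketch of that loop, not Cisco's implementation; the interface names and MTU value are invented:

```python
def validate_assertions(intent: dict, operational: dict) -> list[str]:
    """Compare declared intent against observed operational state and
    return the list of violated assertions. A conceptual model of
    assertion-based management for illustration only."""
    violations = []
    for key, expected in intent.items():
        actual = operational.get(key)
        if actual != expected:
            violations.append(f"{key}: expected {expected!r}, observed {actual!r}")
    return violations

# Declared intent vs. telemetry-derived operational state (invented values)
intent = {"leaf1/eth1": "up", "leaf1/eth2": "up", "fabric-mtu": 9216}
operational = {"leaf1/eth1": "up", "leaf1/eth2": "down", "fabric-mtu": 9216}
print(validate_assertions(intent, operational))
# -> ["leaf1/eth2: expected 'up', observed 'down'"]
```

Running this comparison continuously, rather than on demand, is what turns configuration management into assertion-based operations.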
Three-Tier AI Cluster Network Architecture
graph TD
CC["Hyperfabric Cloud Controller (Managed by Cisco)"]
CC --> BF
CC --> FF
CC --> SF
subgraph BF["Backend Fabric"]
BS["Backend Spine"] --- BL1["Backend Leaf"]
BS --- BL2["Backend Leaf"]
BL1 --- GPU1["GPU Node"]
BL1 --- GPU2["GPU Node"]
BL2 --- GPU3["GPU Node"]
BL2 --- GPU4["GPU Node"]
end
subgraph FF["Frontend Fabric"]
FS["Frontend Spine"] --- FL["Frontend Leaf"]
FL --- MGMT["Management / Scheduling"]
FL --- DATA["Data Ingestion"]
end
subgraph SF["Storage Fabric"]
SS["Storage Spine"] --- SL["Storage Leaf"]
SL --- ST1["Storage Array"]
SL --- ST2["Storage Array"]
end
style CC fill:#1a5276,color:#fff
style BF fill:#fadbd8,color:#000
style FF fill:#d5f5e3,color:#000
style SF fill:#d6eaf8,color:#000
| Fabric Tier | Purpose | Characteristics |
| --- | --- | --- |
| Backend Fabric | GPU-to-GPU communication | Low-latency, lossless, optimized for collective operations (all-reduce, all-gather) |
| Frontend Fabric | Management and data ingestion | Standard DC connectivity for job scheduling, monitoring, data loading |
| Storage Fabric | High-throughput data access | Dedicated bandwidth for training data retrieval and model checkpointing |
Integration with Cisco Ecosystem
Hyperfabric does not operate in isolation in the full stack AI model:
Cisco Intersight (purchased separately) manages GPU servers and storage servers in detail.
The Hyperfabric cloud controller handles high-level assertion-based control across network, GPU, and storage layers.
Together they provide centralized control from configuration through daily operations.
Animation Slot: Hyperfabric Day-0 to Day-2 lifecycle animation -- showing the progression from template selection and ordering, through zero-touch provisioning with cloud controller auto-claim, to assertion-based monitoring and node scaling.
Post-Quiz: Hyperfabric
1. Hyperfabric's cloud controller is:
A) Deployed on-premises as a VM cluster
B) Managed by Cisco in the cloud
C) A feature within Nexus Dashboard
D) An APIC running in public cloud
2. The purpose of the three-tier fabric architecture (backend, frontend, storage) is to:
A) Reduce the number of switches needed
B) Eliminate contention between GPU collective operations and storage I/O
C) Provide redundancy for the cloud controller
D) Enable multi-tenant isolation
3. During Day-1 Hyperfabric deployment, what happens after switches are powered on?
A) An engineer SSHs into each switch for initial config
B) They auto-connect to the cloud controller via zero-touch plug-and-play
C) NDFC discovers them via LLDP
D) Ansible runs a bootstrap playbook
4. Assertion-based management in Hyperfabric means:
A) Engineers write unit tests for switch configurations
B) The system continuously validates that operational state matches defined intent
C) SNMP traps trigger automated remediation scripts
D) Configuration changes require approval assertions from two admins
5. Which compute platform is included in the Hyperfabric full stack AI infrastructure?
A) Cisco UCS B-Series blade servers
B) Cisco UCS C885A M8 Rack Servers with NVIDIA HGX GPUs
C) Cisco HyperFlex HX-Series
D) Third-party GPU servers only
Section 4: Intersight for Infrastructure Management
Pre-Quiz: Intersight
1. What type of platform is Cisco Intersight?
A) On-premises-only fabric controller
B) SaaS infrastructure management platform
C) Cloud-managed network switch OS
D) GPU workload scheduler
2. Which Intersight deployment model is designed for fully air-gapped environments?
A) SaaS
B) Connected Virtual Appliance (CVA)
C) Private Virtual Appliance (PVA)
D) Hybrid Cloud Appliance
3. Which IaC tools does Intersight support? (Select the best answer)
A) Only Terraform
B) Only Ansible and Python
C) Terraform, Ansible, PowerShell SDK, and Python SDK
D) Only REST API with no pre-built integrations
4. What does Intersight check server components against to ensure compatibility?
A) Cisco TAC case database
B) Hardware Compatibility Lists (HCL)
C) NVIDIA GPU driver matrix only
D) VMware compatibility guides
5. Intersight's primary management focus is on:
A) Network fabric orchestration
B) Compute and cross-domain infrastructure lifecycle
C) Application deployment and containers
D) DNS and load balancing
Key Points
Intersight is a SaaS platform for compute and cross-domain infrastructure lifecycle management.
Manages all UCS server form factors including the AI-optimized UCS C885A with NVIDIA HGX.
Full IaC support: Terraform, Ansible, PowerShell SDK, Python SDK.
Firmware compliance: HCL checking, firmware policies, compliance dashboards, rolling upgrades.
Integrates with Nexus Dashboard for cross-domain visibility across network, compute, and storage.
Workflow designer with automatic Python/PowerShell code generation.
Platform Overview
Cisco Intersight provides unified, intelligent management of Cisco UCS compute infrastructure from core to edge. While Nexus Dashboard focuses on network fabric and Hyperfabric provides cloud-managed full-stack solutions, Intersight concentrates on compute and cross-domain infrastructure lifecycle management.
Core Capabilities
| Capability Domain | What Intersight Manages |
| --- | --- |
| Compute | All UCS server form factors: rack, blade, modular (including AI-optimized UCS C885A with NVIDIA HGX) |
| Networking | Nexus 9000 switches in NX-OS mode with inventory views and switch configuration |
| Storage | HyperFlex, NetApp, Pure Storage, Hitachi integration |
| Virtualization | VMware vSphere, Microsoft Hyper-V |
| Automation | Drag-and-drop workflow designer with auto Python/PowerShell code generation |
Deployment Options
flowchart TD
INT["Cisco Intersight Platform"]
INT --> SAAS["SaaS (Cloud-Hosted at intersight.com)"]
INT --> CVA["Connected Virtual Appliance (On-Prem + Cloud Analytics)"]
INT --> PVA["Private Virtual Appliance (Fully Air-Gapped)"]
SAAS --> S1["Standard Enterprise with Internet"]
CVA --> C1["Local Data Processing + Cloud Features"]
PVA --> P1["Government / Defense No External Connectivity"]
SAAS -.->|"Full cloud features"| CLOUD["Cisco Cloud Services"]
CVA -.->|"Selective connectivity"| CLOUD
PVA -.->|"No connection"| AIR["Air-Gapped Network"]
style INT fill:#1a5276,color:#fff
style SAAS fill:#2e86c1,color:#fff
style CVA fill:#2e86c1,color:#fff
style PVA fill:#2e86c1,color:#fff
Firmware Compliance Management
HCL checking: Automated validation of server components against Cisco's validated firmware/driver combinations
Firmware policies: Define target firmware versions enforced across all managed servers
Compliance dashboards: Real-time visibility into compliant vs. non-compliant servers and hardware advisories
Rolling firmware upgrades: Orchestrated updates minimizing disruption to running AI workloads
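Conceptually, HCL checking is a lookup of each server's model/firmware/driver combination against the validated list. A minimal local model of that check (with invented model and version strings) might look like:

```python
def check_hcl_compliance(server: dict, hcl: list[dict]) -> bool:
    """Return True if the server's (model, firmware, driver) combination
    appears in the HCL -- a simplified local model of Intersight's
    automated HCL validation, for illustration only."""
    return any(
        entry["model"] == server["model"]
        and entry["firmware"] == server["firmware"]
        and server["driver"] in entry["validated_drivers"]
        for entry in hcl
    )

# Invented HCL entry and server inventory record
hcl = [{
    "model": "UCS C885A M8",
    "firmware": "4.3(2)",
    "validated_drivers": ["nenic 2.0", "mlx5 23.10"],
}]
server = {"model": "UCS C885A M8", "firmware": "4.3(2)", "driver": "mlx5 23.10"}
print(check_hcl_compliance(server, hcl))  # True -- validated combination
```

A compliance dashboard is then just this check aggregated across every managed server, with non-compliant results feeding firmware policies and rolling upgrades.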
Intersight and Nexus Dashboard Integration
Intersight provides global management of UCS, HyperFlex, APIC, and Nexus Dashboard from a single pane of glass. This cross-domain visibility is essential for AI troubleshooting: when a training job slows, the root cause could be network congestion, compute thermal throttling, or storage I/O bottleneck. Unified visibility across all three domains accelerates root cause analysis.
Animation Slot: Intersight deployment model comparison -- interactive selector showing SaaS, CVA, and PVA deployments with data flow diagrams, highlighting which features are available in each model and when to choose each for AI infrastructure.
Post-Quiz: Intersight
1. An organization handling classified defense data needs Intersight but cannot have any external network connectivity. Which deployment model should they use?
A) SaaS
B) Connected Virtual Appliance (CVA)
C) Private Virtual Appliance (PVA)
D) Hyperfabric cloud controller
2. Which Intersight feature validates that server firmware and drivers match Cisco's tested combinations?
A) Compliance dashboards
B) Hardware Compatibility List (HCL) checking
C) Rolling firmware upgrades
D) Terraform server profiles
3. How does Intersight's integration with Nexus Dashboard benefit AI infrastructure troubleshooting?
A) It replaces Nexus Dashboard Insights
B) It provides cross-domain visibility across network, compute, and storage for faster root cause analysis
C) It enables Intersight to configure ACI policies
D) It migrates all management to a single APIC cluster
4. What unique automation feature does the Intersight workflow designer provide?
A) Natural language configuration input
B) Drag-and-drop design with automatic Python/PowerShell code generation
C) AI-driven auto-remediation without human input
D) Direct GPU kernel compilation
5. When comparing orchestration platforms, which one is best suited for greenfield AI-first deployments?