Architecture and Validated Design
A Cisco AI POD is a pre-validated, full-stack infrastructure bundle that combines Cisco UCS compute servers, Cisco Nexus networking switches, partner storage systems, NVIDIA GPUs, and a management software stack into a single orderable solution. Each AI POD is backed by a Cisco Validated Design (CVD) specifying exactly how the components were tested and configured in Cisco's labs.
```mermaid
graph TD
POD["Cisco AI POD"]
POD --> COMPUTE["Compute Tier"]
POD --> NETWORK["Networking Tier"]
POD --> STORAGE["Storage Tier"]
POD --> MGMT["Management Tier"]
COMPUTE --> C885["UCS C885A M8<br/>Training / Fine-Tuning<br/>8x NVIDIA H200 GPUs"]
COMPUTE --> C845["UCS C845A M8<br/>Inferencing<br/>RTX PRO 6000 Blackwell GPUs"]
NETWORK --> N9332["Nexus 9332D-GX2B<br/>32-port 400GbE"]
NETWORK --> N9364D["Nexus 9364D-GX2A<br/>64-port 400GbE"]
NETWORK --> N9364E["Nexus 9364E-SG2<br/>64-port 800GbE"]
STORAGE --> NETAPP["NetApp AFF<br/>NVMe / NFS / GDS"]
STORAGE --> PURE["Pure Storage<br/>FlashBlade//S"]
STORAGE --> VAST["VAST Data<br/>Parallel File System"]
MGMT --> INTERSIGHT["Cisco Intersight"]
MGMT --> NXDASH["Nexus Dashboard"]
MGMT --> SPLUNK["Splunk Observability"]
```
| Tier | Components | Key Specifications |
| --- | --- | --- |
| Compute | Cisco UCS C885A M8 | 8-GPU, 8RU rack server; NVIDIA HGX architecture; 8x NVIDIA H200 GPUs via NVLink |
| Compute | Cisco UCS C845A M8 | NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (inferencing) |
| Networking | Nexus 9332D-GX2B | 1RU, 32-port 400GbE switch |
| Networking | Nexus 9364D-GX2A | 2RU, 64-port 400GbE switch |
| Networking | Nexus 9364E-SG2 | 2RU, 64-port 800GbE switch |
| Storage | NetApp All-Flash FAS (AFF) | NVMe-based; NFS, NFS over RDMA, GPU Direct Storage (GDS) |
| Storage | Pure Storage FlashBlade//S | High-performance all-flash, optimized for AI/ML |
| Storage | VAST Data | High-performance parallel file system |
| Management | Cisco Intersight | UCS server lifecycle management |
| Management | Nexus Dashboard | Network fabric management and monitoring |
| Management | Splunk Observability Cloud | End-to-end full-stack visibility |
NVIDIA Enterprise Reference Architecture (ERA)
The ERA 2-8-9-400 designation encodes the validated topology: 2 servers per building block, 8 GPUs (H200) per server, Nexus 9000 series switching, and 400 Gbps of per-port bandwidth. ERA compliance certifies that the hardware and cabling topology have been validated by NVIDIA for maximum performance.
```mermaid
flowchart LR
ERA["ERA 2-8-9-400"] --> S["2 Servers per building block"]
ERA --> G["8 GPUs per server H200"]
ERA --> N["9 Nexus 9000 series switches"]
ERA --> B["400 Gbps per-port bandwidth"]
```
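The encoding above can be transcribed as a small helper. This is a minimal sketch for illustration only; the `parse_era` function and the returned field names are invented, not part of any NVIDIA or Cisco tooling:

```python
# Hypothetical parser for an ERA "2-8-9-400" style designation, following
# the breakdown described above: servers per building block, GPUs per
# server, Nexus 9000 series switching, and per-port bandwidth in Gbps.

def parse_era(designation: str) -> dict:
    servers, gpus, switch_series, gbps = designation.split("-")
    return {
        "servers_per_block": int(servers),
        "gpus_per_server": int(gpus),
        "switch_family": f"Nexus {switch_series}000 series",
        "port_gbps": int(gbps),
        # Derived figure: total GPUs in one building block.
        "gpus_per_block": int(servers) * int(gpus),
    }

print(parse_era("2-8-9-400"))
```

A 2-8-9-400 building block therefore contributes 16 GPUs, which is the unit of horizontal scaling used in the sizing scenarios below.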
GPU Compute and Networking
The UCS C885A M8 houses 8 NVIDIA H200 GPUs interconnected through NVIDIA NVLink, providing GPU-to-GPU bandwidth far beyond what PCIe alone can deliver. On the networking side, Nexus 9000 series switches deliver lossless, low-latency Ethernet at up to 800 Gbps per port, using priority flow control (PFC) and explicit congestion notification (ECN) to prevent packet loss during gradient synchronization.
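Why per-port bandwidth matters for gradient synchronization can be seen with a back-of-envelope estimate. The sketch below assumes the standard ring all-reduce algorithm, in which each of N GPUs transfers 2(N-1)/N times the gradient size over its slowest link; it is a bandwidth-only lower bound that ignores latency and compute overlap, and the function name is ours, not from any library:

```python
# Bandwidth-only lower bound for one ring all-reduce, under the standard
# ring-algorithm cost model: each GPU moves 2*(N-1)/N * gradient_bytes.

def allreduce_time_seconds(gradient_bytes: float, num_gpus: int,
                           link_gbps: float) -> float:
    """Transfer time estimate, ignoring latency and compute/comm overlap."""
    bytes_moved = 2 * (num_gpus - 1) / num_gpus * gradient_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8  # Gbps -> bytes/s
    return bytes_moved / link_bytes_per_s

# Example: 10 GB of gradients across 64 GPUs on 400G vs. 800G links.
t_400 = allreduce_time_seconds(10e9, 64, 400)
t_800 = allreduce_time_seconds(10e9, 64, 800)
print(f"400G: {t_400 * 1000:.1f} ms   800G: {t_800 * 1000:.1f} ms")
```

Doubling per-port bandwidth halves this bound, which is why the 800G Nexus 9364E-SG2 is attractive for large training backends.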
Deployment Scenarios and Sizing
Large-Scale Training: Multiple AI PODs with UCS C885A M8 servers for training foundation models. The high-bandwidth backend network (400G/800G per port) handles all-reduce operations across dozens or hundreds of GPUs.
Fine-Tuning: Smaller AI POD configurations (2-4 C885A servers) for adapting pre-trained models to domain-specific data.
High-Throughput Inferencing: UCS C845A M8 with RTX PRO 6000 Blackwell GPUs, optimized for throughput per watt and cost-efficiency for serving millions of inference requests.
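The training and fine-tuning scenarios above can be sized from the ERA building-block figures (8 GPUs per C885A server, 2 servers per block). A minimal sizing sketch, with the function name and rounding policy assumed for illustration rather than taken from Cisco sizing guidance:

```python
import math

GPUS_PER_SERVER = 8    # UCS C885A M8 with 8x H200
SERVERS_PER_BLOCK = 2  # ERA 2-8-9-400 building block

def size_training_pod(target_gpus: int) -> dict:
    """Round a target GPU count up to whole servers, then whole blocks."""
    servers = math.ceil(target_gpus / GPUS_PER_SERVER)
    blocks = math.ceil(servers / SERVERS_PER_BLOCK)
    return {"servers": servers,
            "building_blocks": blocks,
            "provisioned_gpus": blocks * SERVERS_PER_BLOCK * GPUS_PER_SERVER}

print(size_training_pod(50))  # 50 GPUs -> 7 servers -> 4 blocks -> 64 GPUs
```

Note that capacity lands on building-block boundaries: asking for 50 GPUs provisions 64, which is the over-provisioning trade-off inherent in block-based scaling.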
Animation: AI POD scaling from a 2-server building block to a multi-rack training cluster, showing how gradient synchronization flows across the lossless Ethernet backend.
Architecture and the Deep Network Model
Cisco AI Canvas is a generative AI workspace for IT operations, purpose-built on Cisco's Deep Network Model -- a domain-specific LLM trained on over 40 million tokens of Cisco networking knowledge including CCIE-level materials. At 8 to 30 billion parameters, the model is deliberately compact so it can run at the edge, even in air-gapped environments with no internet connectivity, while outperforming larger general-purpose models on networking tasks by up to 20%.
AgenticOps and Workflow Orchestration
AI Canvas introduces AgenticOps, a paradigm where AI agents act autonomously based on context rather than waiting for step-by-step human instruction. Key capabilities include:
- Autonomous action: Agents recommend or execute configuration changes independently.
- Dynamic runbook generation: Customized troubleshooting procedures created on the fly based on observed symptoms.
- Continuous learning: Agents improve by learning from the outcomes of previous actions.
- Structured reasoning: Complex troubleshooting broken into transparent, auditable steps.
```mermaid
sequenceDiagram
participant Op as Operator
participant Asst as Cisco AI Assistant
participant Agent as AI Agent(s)
participant Infra as Infrastructure
Op->>Asst: Describe issue in natural language
Asst->>Agent: Identify symptoms and delegate
Agent->>Infra: Collect telemetry (ThousandEyes, Meraki, Splunk)
Infra-->>Agent: Return observability data
Agent->>Agent: Structured reasoning and root cause analysis
Agent->>Agent: Generate dynamic runbook
Agent-->>Asst: Diagnosis + recommended actions
Asst-->>Op: Present findings and remediation plan
Op->>Asst: Approve action
Asst->>Agent: Execute configuration change
Agent->>Infra: Apply remediation
Infra-->>Agent: Confirm change
Agent->>Agent: Continuous learning (record outcome)
Agent-->>Asst: Resolution confirmed
Asst-->>Op: Issue resolved
```
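The workflow in the sequence diagram can be sketched as a single loop iteration. This is an assumed structure, not Cisco's implementation; every callable and field name here is invented to show the shape of the cycle (collect, reason, approve, remediate, record):

```python
# Hypothetical AgenticOps cycle: telemetry -> diagnosis -> human approval
# -> remediation -> outcome recorded for continuous learning.

def agentic_ops_cycle(issue, collect, diagnose, approve, remediate, history):
    telemetry = collect(issue)                 # e.g. ThousandEyes/Meraki/Splunk
    diagnosis, runbook = diagnose(issue, telemetry)
    if not approve(diagnosis, runbook):        # operator stays in the loop
        return {"status": "awaiting_approval", "diagnosis": diagnosis}
    result = remediate(runbook)
    history.append((issue, diagnosis, result))  # continuous-learning record
    return {"status": "resolved" if result else "failed", "diagnosis": diagnosis}

# Toy run with stub callables:
history = []
out = agentic_ops_cycle(
    issue="BGP flap on edge-1",
    collect=lambda i: {"logs": ["%BGP-5-ADJCHANGE ..."]},
    diagnose=lambda i, t: ("unstable peer", ["shutdown peer", "clear session"]),
    approve=lambda d, r: True,
    remediate=lambda r: True,
    history=history,
)
print(out["status"])  # resolved
```

The approval gate reflects the diagram's explicit "Approve action" step: agents act autonomously up to the point of change execution, where the operator confirms.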
Integration and Unified Observability
AI Canvas consolidates telemetry from across the Cisco ecosystem into a single, real-time view: ThousandEyes for internet and WAN path visibility, Meraki for wireless, switching, and SD-WAN telemetry, and Splunk for log analytics and security events. This eliminates the "swivel chair" problem of switching between multiple dashboards. Visualizations are generated dynamically and tailored to each specific incident.
Use Cases
| Use Case | How AI Canvas Helps |
| --- | --- |
| Incident triage | Correlates alerts from multiple sources, identifies root cause, suggests remediation |
| Capacity planning | Analyzes historical trends and predicts when thresholds will be exceeded |
| Change validation | Generates pre-change and post-change verification procedures dynamically |
| Cross-domain troubleshooting | Coordinates agents across network, security, and cloud domains |
| Team collaboration | Shared dashboards, persistent sessions, invite-based collaboration |
Animation: AgenticOps workflow showing an operator describing a network issue in natural language, the AI Assistant delegating to specialized agents, collecting telemetry, and executing autonomous remediation.
Architecture and Capabilities
Cisco Nexus Hyperfabric AI is a cloud-managed, full-stack AI infrastructure solution delivered as a combined hardware, software, and service offering. The hardware stack includes Cisco Silicon One switches, Cisco 6000 Series Switches, N9100 and N9300 Series Switches for spine/leaf fabrics, and Cisco UCS C885A M8 Rack Servers for GPU compute. The entire stack is NVIDIA ERA-compliant.
The Cloud Controller
The defining feature is the cloud controller -- a scalable, distributed, multitenant service at hyperfabric.cisco.com. It manages fabrics regardless of geographic location and handles: fabric design and blueprint creation, zero-touch provisioning, continuous monitoring, firmware upgrades, and integration with Cisco Commerce for automated quoting/ordering.
Blueprint-Based Provisioning
A blueprint contains: physical components (switches, optics, servers, airflow, power specs), cabling plan, bill of materials integrated with Cisco Commerce, and logical configuration (VLANs, routing protocols, fabric parameters). The desired end-state is declared once, and the system drives all devices toward that state automatically.
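The declarative model described above (declare the end-state once, let the system drive devices toward it) is essentially a reconciliation loop. A minimal sketch, with the blueprint keys and device-state shape invented purely for illustration:

```python
# Hypothetical blueprint reconciliation: diff desired state against actual
# device state and apply only the changes needed to converge.

def reconcile(blueprint: dict, device_state: dict) -> list:
    """Return (key, actual, desired) for each drifted setting and apply it."""
    changes = []
    for key, desired in blueprint.items():
        if device_state.get(key) != desired:
            changes.append((key, device_state.get(key), desired))
            device_state[key] = desired  # apply the change (stubbed here)
    return changes

blueprint = {"vlan_100": "enabled", "ospf_area": "0", "mtu": 9216}
device = {"vlan_100": "disabled", "mtu": 9216}
drift = reconcile(blueprint, device)
print(drift)                 # the two drifted settings
print(device == blueprint)   # True once reconciled
```

Running the same loop continuously is also what enables the "actual vs. intended" blueprint verification referenced in the Day 2 lifecycle below: an empty drift list means the fabric matches its blueprint.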
The Deployment Lifecycle
```mermaid
stateDiagram-v2
[*] --> Day0
state "Day 0: Design and Plan" as Day0 {
d0a: Access cloud portal
d0b: Visual designer tool
d0c: Generate blueprint with BOM and cabling
d0d: Order via Cisco Commerce
d0a --> d0b
d0b --> d0c
d0c --> d0d
}
Day0 --> Day1
state "Day 1: Deploy and Validate" as Day1 {
d1a: Rack and cable switches and servers
d1b: Zero-touch plug-and-play provisioning
d1c: Real-time topology validation
d1a --> d1b
d1b --> d1c
}
Day1 --> Day2
state "Day 2+: Operate and Scale" as Day2 {
d2a: Continuous cloud monitoring
d2b: Firmware upgrades and scaling
d2c: Blueprint-verification protocol
d2a --> d2b
d2b --> d2c
d2c --> d2a
}
```
Hyperfabric AI vs. Traditional Fabric Management (NDFC)
| Aspect | Hyperfabric AI | Traditional NDFC |
| --- | --- | --- |
| Deployment model | Cloud-managed fabric-as-a-service | On-premises management platform |
| Target user | IT generalists, application teams, DevOps | Network engineers with deep CLI expertise |
| Provisioning | Blueprint-based, zero-touch plug-and-play | Intent-based via NDFC UI or CLI templates |
| Monitoring | Blueprint-verification protocol (actual vs. intended) | Syslog, SNMP, streaming telemetry |
| Hardware model | Turnkey, vertically integrated full stack | Software overlay on heterogeneous hardware |
| AI optimization | Premium AI tier, GPU/DPU integration, ERA-compliant | General-purpose fabric controller |
| Expertise required | Minimal | Significant networking expertise |
| Scaling | Cloud-native, multi-geography from single portal | Within single NDFC instance or cluster |
Animation: Side-by-side comparison showing a traditional NDFC deployment (engineer configuring switches via CLI) vs. Hyperfabric (IT generalist using cloud portal with blueprint auto-provisioning devices).
Mapping Workloads to Solutions
```mermaid
flowchart TD
START(["What is the primary need?"]) --> Q1{"New AI infrastructure or operations optimization?"}
Q1 -->|"Operations optimization"| CANVAS["AI Canvas: AgenticOps + unified observability"]
Q1 -->|"New AI infrastructure"| Q2{"Existing Cisco/networking expertise on staff?"}
Q2 -->|"Yes"| Q3{"Greenfield site or existing Nexus/UCS?"}
Q2 -->|"No -- IT generalists"| HF["Hyperfabric AI: Cloud-managed, zero-touch"]
Q3 -->|"Existing infrastructure"| PODS["AI PODs: Pre-validated building blocks"]
Q3 -->|"Greenfield deployment"| Q4{"Want full lifecycle cloud management?"}
Q4 -->|"Yes"| HF
Q4 -->|"No"| PODS
PODS --> COMBO{"Need intelligent Day 2 operations?"}
HF --> COMBO
COMBO -->|"Yes"| PLUS["Add AI Canvas for AgenticOps"]
COMBO -->|"No"| DONE(["Solution selected"])
PLUS --> DONE
```
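The decision tree above can be transcribed as a function for quick what-if checks. The parameter names and the reduction of each branch to a boolean are our simplification; real selection involves more factors:

```python
# The decision flowchart, flattened into code. Branch logic mirrors the
# diagram: operations need -> AI Canvas; otherwise expertise and
# greenfield/lifecycle questions select Hyperfabric AI or AI PODs, with
# AI Canvas as an optional Day 2 add-on.

def recommend(need: str, has_network_expertise: bool = True,
              greenfield: bool = False, wants_cloud_lifecycle: bool = False,
              needs_day2_intelligence: bool = False) -> list:
    if need == "operations":
        return ["AI Canvas"]
    # need == "infrastructure"
    if not has_network_expertise:
        base = "Hyperfabric AI"
    elif greenfield and wants_cloud_lifecycle:
        base = "Hyperfabric AI"
    else:
        base = "AI PODs"
    rec = [base]
    if needs_day2_intelligence:
        rec.append("AI Canvas")
    return rec

print(recommend("operations"))                                   # ['AI Canvas']
print(recommend("infrastructure", has_network_expertise=False))  # ['Hyperfabric AI']
print(recommend("infrastructure", greenfield=True,
                wants_cloud_lifecycle=True,
                needs_day2_intelligence=True))  # ['Hyperfabric AI', 'AI Canvas']
```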
| Scenario | Recommended Solution | Rationale |
| --- | --- | --- |
| Enterprise with existing Nexus/UCS wants AI training | AI PODs | Leverages existing expertise; integrates with current tools |
| Greenfield AI deployment, limited networking staff | Hyperfabric AI | Cloud-managed lifecycle; blueprint-based provisioning reduces errors |
| Multi-vendor alert fatigue, slow incident resolution | AI Canvas | Unifies observability; AgenticOps automates troubleshooting |
| Large-scale AI factory needing infra + ops intelligence | Hyperfabric AI + AI Canvas | Hyperfabric manages fabric lifecycle; AI Canvas optimizes Day 2 ops |
| Budget-conscious AI inference serving | AI PODs (Inferencing) | RTX PRO 6000 GPUs optimize cost-per-inference |
Scalability Considerations
| Solution | Scaling Model | Practical Limits | Scaling Trigger |
| --- | --- | --- | --- |
| AI PODs | Horizontal -- add more POD building blocks | Physical data center capacity and fabric design | Need for additional GPU compute or storage |
| AI Canvas | Cloud-native SaaS | Scales with Cisco cloud infrastructure | Growth in monitored devices, incidents, or team size |
| Hyperfabric AI | Cloud-managed -- add fabrics from portal | Multi-geography from single controller | New sites, fabric expansions, geographic growth |
Total Cost of Ownership (TCO)
| Cost Category | AI PODs | AI Canvas | Hyperfabric AI |
| --- | --- | --- | --- |
| Hardware acquisition | Bundled pricing | N/A (software/service) | Full-stack pricing (HW+SW+service) |
| Deployment labor | Reduced ~50% via CVDs | N/A | Minimized via zero-touch + blueprints |
| Operational expertise | Requires skilled engineers | Reduces specialized troubleshooting need | Enables IT generalists |
| Management infra | Intersight + Nexus Dashboard | SaaS-delivered | Cloud controller included |
| Ongoing maintenance | Standard Cisco TAC support | SaaS subscription | Included in service model |
| Integration cost | Moderate | Low (native Cisco ecosystem) | Low (vertically integrated) |
Worked Example: An organization deploying 16 NVIDIA H200 GPUs can choose AI PODs (lower subscription, higher labor due to specialized engineers) or Hyperfabric AI (higher subscription, dramatically lower labor). For organizations without deep networking expertise, Hyperfabric AI often delivers lower three-year TCO despite higher per-unit pricing.
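The trade-off in the worked example reduces to simple arithmetic. All dollar figures below are hypothetical, chosen only to illustrate the structure described above (AI PODs: lower subscription, higher specialized-labor cost; Hyperfabric AI: the reverse); they are not Cisco pricing:

```python
# Three-year TCO arithmetic with invented inputs for a 16-GPU deployment.

def three_year_tco(hardware, annual_subscription, annual_labor, years=3):
    return hardware + years * (annual_subscription + annual_labor)

pods = three_year_tco(hardware=2_000_000,
                      annual_subscription=100_000,
                      annual_labor=400_000)        # skilled engineers
hyperfabric = three_year_tco(hardware=2_000_000,
                             annual_subscription=250_000,
                             annual_labor=100_000)  # IT generalists

print(f"AI PODs 3-yr TCO:     ${pods:,}")         # $3,500,000
print(f"Hyperfabric 3-yr TCO: ${hyperfabric:,}")  # $3,050,000
```

With these assumed inputs the labor savings outweigh the subscription premium over three years, matching the qualitative conclusion above; different labor rates or team structures can flip the result.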
Animation: Decision tree walkthrough -- user selects their organization's characteristics (expertise level, greenfield vs. brownfield, workload type) and the tree highlights the recommended solution path with TCO comparison bars.