Chapter 4: Cisco AI Solutions Portfolio

Learning Objectives

Section 1: Cisco AI PODs

Pre-Quiz: Cisco AI PODs

1. What does a Cisco AI POD combine into a single orderable solution?

2. What does the "8" in the NVIDIA ERA 2-8-9-400 designation represent?

3. Which UCS server model is designated for AI training and fine-tuning workloads in AI PODs?

4. What interconnect technology links the 8 GPUs inside a UCS C885A M8 server?

5. By approximately how much do CVD-backed deployment guides reduce setup time compared to building from individual components?

Key Points: Cisco AI PODs

Architecture and Validated Design

A Cisco AI POD is a pre-validated, full-stack infrastructure bundle that combines Cisco UCS compute servers, Cisco Nexus networking switches, partner storage systems, NVIDIA GPUs, and a management software stack into a single orderable solution. Each AI POD is backed by a Cisco Validated Design (CVD) specifying exactly how the components were tested and configured in Cisco's labs.

```mermaid
graph TD
    POD["Cisco AI POD"]
    POD --> COMPUTE["Compute Tier"]
    POD --> NETWORK["Networking Tier"]
    POD --> STORAGE["Storage Tier"]
    POD --> MGMT["Management Tier"]
    COMPUTE --> C885["UCS C885A M8<br/>Training / Fine-Tuning<br/>8x NVIDIA H200 GPUs"]
    COMPUTE --> C845["UCS C845A M8<br/>Inferencing<br/>RTX PRO 6000 Blackwell GPUs"]
    NETWORK --> N9332["Nexus 9332D-GX2B<br/>32-port 400GbE"]
    NETWORK --> N9364D["Nexus 9364D-GX2A<br/>64-port 400GbE"]
    NETWORK --> N9364E["Nexus 9364E-SG2<br/>64-port 800GbE"]
    STORAGE --> NETAPP["NetApp AFF<br/>NVMe / NFS / GDS"]
    STORAGE --> PURE["Pure Storage<br/>FlashBlade//S"]
    STORAGE --> VAST["VAST Data<br/>Parallel File System"]
    MGMT --> INTERSIGHT["Cisco Intersight"]
    MGMT --> NXDASH["Nexus Dashboard"]
    MGMT --> SPLUNK["Splunk Observability"]
```
| Tier | Component | Key Specifications |
|---|---|---|
| Compute | Cisco UCS C885A M8 | 8-GPU, 8RU rack server; NVIDIA HGX architecture; 8x NVIDIA H200 GPUs via NVLink |
| Compute | Cisco UCS C845A M8 | NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (inferencing) |
| Networking | Nexus 9332D-GX2B | 1RU, 32-port 400GbE switch |
| Networking | Nexus 9364D-GX2A | 2RU, 64-port 400GbE switch |
| Networking | Nexus 9364E-SG2 | 2RU, 64-port 800GbE switch |
| Storage | NetApp All-Flash FAS (AFF) | NVMe-based; NFS, NFS over RDMA, GPU Direct Storage (GDS) |
| Storage | Pure Storage FlashBlade//S | High-performance all-flash, optimized for AI/ML |
| Storage | VAST Data | High-performance parallel file system |
| Management | Cisco Intersight | UCS server lifecycle management |
| Management | Nexus Dashboard | Network fabric management and monitoring |
| Management | Splunk Observability Cloud | End-to-end full-stack visibility |

NVIDIA Enterprise Reference Architecture (ERA)

The ERA 2-8-9-400 designation encodes the validated topology: 2 servers per building block, 8 NVIDIA H200 GPUs per server, the Nexus 9000 series switch family, and 400 Gbps of per-port bandwidth. ERA compliance certifies that the hardware and cabling topology have been tested at the GPU vendor level for maximum performance.

```mermaid
flowchart LR
    ERA["ERA 2-8-9-400"] --> S["2: servers per building block"]
    ERA --> G["8: GPUs per server (H200)"]
    ERA --> N["9: Nexus 9000 series switches"]
    ERA --> B["400: Gbps per-port bandwidth"]
```
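The positional encoding can be made concrete with a small parser. This is a hypothetical helper for illustration only, not a Cisco or NVIDIA tool; the field names are assumptions:

```python
# Hypothetical helper that decodes an NVIDIA ERA designation such as
# "2-8-9-400" into its four validated-topology fields. Illustrative only.

def parse_era(designation: str) -> dict:
    """Split an ERA string into servers, GPUs, switch family, and bandwidth."""
    servers, gpus, switch_series, port_gbps = designation.split("-")
    return {
        "servers_per_building_block": int(servers),
        "gpus_per_server": int(gpus),
        "switch_family": f"Nexus {switch_series}000 series",
        "port_bandwidth_gbps": int(port_gbps),
    }

era = parse_era("2-8-9-400")
# One 2-server building block therefore contains 2 x 8 = 16 GPUs.
gpus_per_block = era["servers_per_building_block"] * era["gpus_per_server"]
```

Reading the designation this way reinforces that "9" names the Nexus 9000 switch family, not a count of nine switches.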

GPU Compute and Networking

The UCS C885A M8 houses 8 NVIDIA H200 GPUs interconnected through NVIDIA NVLink, providing GPU-to-GPU bandwidth far exceeding PCIe alone. On the networking side, Nexus 9000 series switches deliver lossless, low-latency Ethernet at up to 800 Gbps per port, configured with priority flow control (PFC) and explicit congestion notification (ECN) to ensure zero data loss during gradient synchronization.

Deployment Scenarios and Sizing

Large-Scale Training: Multiple AI PODs with UCS C885A M8 servers for training foundation models. The high-bandwidth backend network (400G/800G per port) handles all-reduce operations across dozens or hundreds of GPUs.

Fine-Tuning: Smaller AI POD configurations (2-4 C885A servers) for adapting pre-trained models to domain-specific data.

High-Throughput Inferencing: UCS C845A M8 with RTX PRO 6000 Blackwell GPUs, optimized for throughput per watt and cost-efficiency for serving millions of inference requests.
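The three scenarios above map directly onto the two AI POD server options, which can be expressed as a small lookup. This is a hypothetical sizing helper, not a Cisco tool; the notes restate the guidance in this section:

```python
# Hypothetical helper mapping the workload scenarios described above to the
# AI POD server options in this section. Not a Cisco sizing tool.

def recommend_pod_config(workload: str) -> dict:
    """Return an illustrative server/GPU choice for a workload type."""
    configs = {
        "training": {"server": "UCS C885A M8", "gpu": "NVIDIA H200",
                     "note": "multiple PODs; 400G/800G backend for all-reduce"},
        "fine-tuning": {"server": "UCS C885A M8", "gpu": "NVIDIA H200",
                        "note": "smaller POD of 2-4 servers"},
        "inferencing": {"server": "UCS C845A M8",
                        "gpu": "RTX PRO 6000 Blackwell",
                        "note": "optimized for throughput per watt"},
    }
    return configs[workload]
```

Note that training and fine-tuning share the same server platform; only the POD size differs, while inferencing moves to a different server and GPU class entirely.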

Animation: AI POD scaling from a 2-server building block to a multi-rack training cluster, showing how gradient synchronization flows across the lossless Ethernet backend.

Post-Quiz: Cisco AI PODs

1. Which storage vendor's solution is validated for GPU Direct Storage (GDS) within AI PODs?

2. Which two congestion management mechanisms are configured on Nexus switches for AI training traffic?

3. For a budget-conscious inference-serving deployment, which GPU option does the AI POD use?

4. What is the maximum per-port bandwidth supported by the Nexus 9364E-SG2 switch?

5. What management platform handles UCS server lifecycle management within an AI POD?

Section 2: Cisco AI Canvas

Pre-Quiz: Cisco AI Canvas

1. What powers Cisco AI Canvas under the hood?

2. What is AgenticOps?

3. How large is the Deep Network Model?

4. Which three observability sources does AI Canvas unify?

5. Can AI Canvas operate in air-gapped environments?

Key Points: Cisco AI Canvas

Architecture and the Deep Network Model

Cisco AI Canvas is a generative AI workspace for IT operations, purpose-built on Cisco's Deep Network Model -- a domain-specific LLM trained on over 40 million tokens of Cisco networking knowledge including CCIE-level materials. At 8 to 30 billion parameters, the model is deliberately compact so it can run at the edge, even in air-gapped environments with no internet connectivity, while outperforming larger general-purpose models on networking tasks by up to 20%.
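The claim that an 8-30 billion-parameter model can run at the edge can be sanity-checked with simple arithmetic: weight memory scales with parameter count times bytes per parameter. The figures below are generic estimates, not Cisco specifications:

```python
# Rough weight-memory estimate for an edge-deployed LLM.
# Generic arithmetic for illustration, not a Cisco sizing guide.

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

fp16_8b = weight_memory_gb(8, 2.0)    # 8B model at fp16 -> 16 GB
int4_30b = weight_memory_gb(30, 0.5)  # 30B model at 4-bit -> 15 GB
```

Both configurations fit on a single data-center GPU (activation memory and KV cache add overhead on top), which is what makes air-gapped edge deployment plausible for a model in this size range.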

AgenticOps and Workflow Orchestration

AI Canvas introduces AgenticOps, a paradigm in which AI agents act autonomously based on context rather than waiting for step-by-step human instruction. A typical troubleshooting workflow proceeds as follows:

```mermaid
sequenceDiagram
    participant Op as Operator
    participant Asst as Cisco AI Assistant
    participant Agent as AI Agent(s)
    participant Infra as Infrastructure
    Op->>Asst: Describe issue in natural language
    Asst->>Agent: Identify symptoms and delegate
    Agent->>Infra: Collect telemetry (ThousandEyes, Meraki, Splunk)
    Infra-->>Agent: Return observability data
    Agent->>Agent: Structured reasoning and root cause analysis
    Agent->>Agent: Generate dynamic runbook
    Agent-->>Asst: Diagnosis + recommended actions
    Asst-->>Op: Present findings and remediation plan
    Op->>Asst: Approve action
    Asst->>Agent: Execute configuration change
    Agent->>Infra: Apply remediation
    Infra-->>Agent: Confirm change
    Agent->>Agent: Continuous learning (record outcome)
    Agent-->>Asst: Resolution confirmed
    Asst-->>Op: Issue resolved
```

Integration and Unified Observability

AI Canvas consolidates telemetry from across the Cisco ecosystem into a single, real-time view: ThousandEyes for internet and WAN path visibility, Meraki for wireless, switching, and SD-WAN telemetry, and Splunk for log analytics and security events. This eliminates the "swivel chair" problem of switching between multiple dashboards. Visualizations are generated dynamically and tailored to each specific incident.
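Mechanically, "unified observability" means merging events from separate sources into one timeline instead of reading three dashboards. A toy sketch, with invented event fields (only the source names come from the text):

```python
# Toy sketch: merge events from three telemetry sources into a single
# timeline, the way a unified view replaces separate dashboards.
# Event fields ("t", "src", "msg") are illustrative inventions.

from operator import itemgetter

thousandeyes = [{"t": 100, "src": "ThousandEyes", "msg": "WAN path loss 4%"}]
meraki = [{"t": 98, "src": "Meraki", "msg": "AP offline: floor-2-ap-7"}]
splunk = [{"t": 103, "src": "Splunk", "msg": "auth failures spike"}]

# One chronological view across all sources.
timeline = sorted(thousandeyes + meraki + splunk, key=itemgetter("t"))
sources = [e["src"] for e in timeline]
```

An operator scanning the merged timeline sees that the Meraki AP outage precedes the WAN and auth symptoms, which is exactly the cross-source correlation the "swivel chair" workflow makes hard.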

Use Cases

| Use Case | How AI Canvas Helps |
|---|---|
| Incident triage | Correlates alerts from multiple sources, identifies root cause, suggests remediation |
| Capacity planning | Analyzes historical trends and predicts when thresholds will be exceeded |
| Change validation | Generates pre-change and post-change verification procedures dynamically |
| Cross-domain troubleshooting | Coordinates agents across network, security, and cloud domains |
| Team collaboration | Shared dashboards, persistent sessions, invite-based collaboration |
Animation: AgenticOps workflow showing an operator describing a network issue in natural language, the AI Assistant delegating to specialized agents, collecting telemetry, and executing autonomous remediation.

Post-Quiz: Cisco AI Canvas

1. How much Cisco networking knowledge was the Deep Network Model trained on?

2. What does "dynamic runbook generation" mean in the context of AgenticOps?

3. What problem does AI Canvas's unified observability solve?

4. Which third-party system can AI Canvas integrate with for ticket management?

5. What serves as the conversational interface to AgenticOps capabilities?

Section 3: Cisco Hyperfabric AI

Pre-Quiz: Cisco Hyperfabric AI

1. How is Cisco Hyperfabric AI primarily managed?

2. What serves as the "single source of truth" in Hyperfabric deployments?

3. What is the target user persona for Hyperfabric AI?

4. What provisioning method does Hyperfabric use?

5. How does Hyperfabric monitor fabric health differently from traditional approaches?

Key Points: Cisco Hyperfabric AI

Architecture and Capabilities

Cisco Nexus Hyperfabric AI is a cloud-managed, full-stack AI infrastructure solution delivered as a combined hardware, software, and service offering. The hardware stack includes Cisco Silicon One switches, Cisco 6000 Series Switches, N9100 and N9300 Series Switches for spine/leaf fabrics, and Cisco UCS C885A M8 Rack Servers for GPU compute. The entire stack is NVIDIA ERA-compliant.

The Cloud Controller

The defining feature is the cloud controller -- a scalable, distributed, multitenant service at hyperfabric.cisco.com. It manages fabrics regardless of geographic location and handles: fabric design and blueprint creation, zero-touch provisioning, continuous monitoring, firmware upgrades, and integration with Cisco Commerce for automated quoting/ordering.

Blueprint-Based Provisioning

A blueprint contains: physical components (switches, optics, servers, airflow, power specs), cabling plan, bill of materials integrated with Cisco Commerce, and logical configuration (VLANs, routing protocols, fabric parameters). The desired end-state is declared once, and the system drives all devices toward that state automatically.
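Declaring a desired end-state once and letting the system converge every device toward it is the classic declarative reconcile pattern. A minimal sketch, assuming a dict-shaped blueprint whose keys mirror the list above; none of this is the actual Hyperfabric API:

```python
# Minimal reconcile-loop sketch of blueprint-based provisioning.
# Blueprint keys mirror the section text; this is NOT the Hyperfabric API.

blueprint = {  # desired end-state, declared once
    "vlans": {10, 20, 30},
    "routing": "bgp-evpn",
}

def reconcile(device_state: dict, desired: dict) -> list:
    """Compute the actions needed to drive one device to the desired state."""
    actions = []
    for vlan in sorted(desired["vlans"] - device_state.get("vlans", set())):
        actions.append(f"add vlan {vlan}")
    if device_state.get("routing") != desired["routing"]:
        actions.append(f"set routing {desired['routing']}")
    return actions

# A freshly racked leaf is missing two VLANs and its routing config.
leaf1 = {"vlans": {10}, "routing": None}
plan = reconcile(leaf1, blueprint)
```

Running `reconcile` continuously against live device state is also, in essence, what the "blueprint-verification protocol" does on Day 2: an empty action list means actual matches intended.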

The Deployment Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Day0
    state "Day 0: Design and Plan" as Day0 {
        d0a: Access cloud portal
        d0b: Visual designer tool
        d0c: Generate blueprint with BOM and cabling
        d0d: Order via Cisco Commerce
        d0a --> d0b
        d0b --> d0c
        d0c --> d0d
    }
    Day0 --> Day1
    state "Day 1: Deploy and Validate" as Day1 {
        d1a: Rack and cable switches and servers
        d1b: Zero-touch plug-and-play provisioning
        d1c: Real-time topology validation
        d1a --> d1b
        d1b --> d1c
    }
    Day1 --> Day2
    state "Day 2+: Operate and Scale" as Day2 {
        d2a: Continuous cloud monitoring
        d2b: Firmware upgrades and scaling
        d2c: Blueprint-verification protocol
        d2a --> d2b
        d2b --> d2c
        d2c --> d2a
    }
```

Hyperfabric AI vs. Traditional Fabric Management (NDFC)

| Aspect | Hyperfabric AI | Traditional NDFC |
|---|---|---|
| Deployment model | Cloud-managed fabric-as-a-service | On-premises management platform |
| Target user | IT generalists, application teams, DevOps | Network engineers with deep CLI expertise |
| Provisioning | Blueprint-based, zero-touch plug-and-play | Intent-based via NDFC UI or CLI templates |
| Monitoring | Blueprint-verification protocol (actual vs. intended) | Syslog, SNMP, streaming telemetry |
| Hardware model | Turnkey, vertically integrated full stack | Software overlay on heterogeneous hardware |
| AI optimization | Premium AI tier, GPU/DPU integration, ERA-compliant | General-purpose fabric controller |
| Expertise required | Minimal | Significant networking expertise |
| Scaling | Cloud-native, multi-geography from single portal | Within single NDFC instance or cluster |
Animation: Side-by-side comparison showing a traditional NDFC deployment (engineer configuring switches via CLI) vs. Hyperfabric (IT generalist using cloud portal with blueprint auto-provisioning devices).

Post-Quiz: Cisco Hyperfabric AI

1. What four elements does a Hyperfabric blueprint contain?

2. During Day 1 deployment, how do switches connect to the cloud controller?

3. How does Hyperfabric integrate with Cisco Commerce?

4. What type of switching silicon does Hyperfabric AI use?

5. Cisco validated Hyperfabric by using it internally as what?

Section 4: Solution Comparison and Selection

Pre-Quiz: Solution Comparison and Selection

1. An enterprise with existing Nexus/UCS infrastructure and skilled network engineers wants to add AI training. Which solution is recommended?

2. A greenfield AI deployment with limited networking staff should choose which solution?

3. Which solution addresses multi-vendor alert fatigue and slow incident resolution?

4. How do AI PODs scale?

5. In TCO analysis, which cost category most favors Hyperfabric AI over AI PODs?

Key Points: Solution Comparison and Selection

Mapping Workloads to Solutions

```mermaid
flowchart TD
    START(["What is the primary need?"]) --> Q1{"New AI infrastructure or operations optimization?"}
    Q1 -->|"Operations optimization"| CANVAS["AI Canvas: AgenticOps + unified observability"]
    Q1 -->|"New AI infrastructure"| Q2{"Existing Cisco/networking expertise on staff?"}
    Q2 -->|"Yes"| Q3{"Greenfield site or existing Nexus/UCS?"}
    Q2 -->|"No -- IT generalists"| HF["Hyperfabric AI: Cloud-managed, zero-touch"]
    Q3 -->|"Existing infrastructure"| PODS["AI PODs: Pre-validated building blocks"]
    Q3 -->|"Greenfield deployment"| Q4{"Want full lifecycle cloud management?"}
    Q4 -->|"Yes"| HF
    Q4 -->|"No"| PODS
    PODS --> COMBO{"Need intelligent Day 2 operations?"}
    HF --> COMBO
    COMBO -->|"Yes"| PLUS["Add AI Canvas for AgenticOps"]
    COMBO -->|"No"| DONE(["Solution selected"])
    PLUS --> DONE
```
| Scenario | Recommended Solution | Rationale |
|---|---|---|
| Enterprise with existing Nexus/UCS wants AI training | AI PODs | Leverages existing expertise; integrates with current tools |
| Greenfield AI deployment, limited networking staff | Hyperfabric AI | Cloud-managed lifecycle; blueprint-based provisioning reduces errors |
| Multi-vendor alert fatigue, slow incident resolution | AI Canvas | Unifies observability; AgenticOps automates troubleshooting |
| Large-scale AI factory needing infra + ops intelligence | Hyperfabric AI + AI Canvas | Hyperfabric manages fabric lifecycle; AI Canvas optimizes Day 2 ops |
| Budget-conscious AI inference serving | AI PODs (Inferencing) | RTX PRO 6000 GPUs optimize cost-per-inference |

Scalability Considerations

| Solution | Scaling Model | Practical Limits | Scaling Trigger |
|---|---|---|---|
| AI PODs | Horizontal -- add more POD building blocks | Physical data center capacity and fabric design | Need for additional GPU compute or storage |
| AI Canvas | Cloud-native SaaS | Scales with Cisco cloud infrastructure | Growth in monitored devices, incidents, or team size |
| Hyperfabric AI | Cloud-managed -- add fabrics from portal | Multi-geography from single controller | New sites, fabric expansions, geographic growth |

Total Cost of Ownership (TCO)

| Cost Category | AI PODs | AI Canvas | Hyperfabric AI |
|---|---|---|---|
| Hardware acquisition | Bundled pricing | N/A (software/service) | Full-stack pricing (HW+SW+service) |
| Deployment labor | Reduced ~50% via CVDs | N/A | Minimized via zero-touch + blueprints |
| Operational expertise | Requires skilled engineers | Reduces specialized troubleshooting need | Enables IT generalists |
| Management infra | Intersight + Nexus Dashboard | SaaS-delivered | Cloud controller included |
| Ongoing maintenance | Standard Cisco TAC support | SaaS subscription | Included in service model |
| Integration cost | Moderate | Low (native Cisco ecosystem) | Low (vertically integrated) |

Worked Example: An organization deploying 16 NVIDIA H200 GPUs can choose AI PODs (lower subscription, higher labor due to specialized engineers) or Hyperfabric AI (higher subscription, dramatically lower labor). For organizations without deep networking expertise, Hyperfabric AI often delivers lower three-year TCO despite higher per-unit pricing.
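The labor-versus-subscription trade-off in the worked example can be made concrete with placeholder arithmetic. Every dollar figure below is invented for illustration; none is Cisco pricing:

```python
# Illustrative 3-year TCO comparison. All dollar figures are invented
# placeholders showing the labor-vs-subscription trade-off, not Cisco pricing.

def three_year_tco(hardware: float, annual_subscription: float,
                   annual_labor: float, years: int = 3) -> float:
    """Simple TCO: one-time hardware plus recurring subscription and labor."""
    return hardware + years * (annual_subscription + annual_labor)

# Same hardware spend for 16 H200 GPUs; AI PODs assume costlier specialist
# labor, Hyperfabric a higher subscription but generalist operations.
ai_pods = three_year_tco(hardware=2_000_000, annual_subscription=100_000,
                         annual_labor=400_000)
hyperfabric = three_year_tco(hardware=2_000_000, annual_subscription=250_000,
                             annual_labor=120_000)
```

With these made-up inputs, the annual labor delta ($280k) exceeds the subscription premium ($150k), so Hyperfabric comes out ahead over three years; the crossover point shifts with an organization's actual staffing costs.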

Animation: Decision tree walkthrough -- user selects their organization's characteristics (expertise level, greenfield vs. brownfield, workload type) and the tree highlights the recommended solution path with TCO comparison bars.

Post-Quiz: Solution Comparison and Selection

1. A large-scale AI factory needs both infrastructure management and intelligent operations. What combination is recommended?

2. Why might Hyperfabric AI deliver lower 3-year TCO than AI PODs despite higher per-unit pricing?

3. What is the scaling model for AI Canvas?

4. Which solution requires on-premises management platforms like Intersight and Nexus Dashboard?

5. For a budget-conscious organization focused on AI inference, which specific GPU configuration is recommended?

Your Progress

Answer Explanations