Chapter 4: Cisco AI Solutions Portfolio

Learning Objectives

Section 1: Cisco AI PODs

Pre-Quiz: Cisco AI PODs

1. What does a Cisco AI POD combine into a single orderable solution?

2. What does the "8" in the NVIDIA ERA 2-8-9-400 designation represent?

3. Which UCS server model is designated for AI training and fine-tuning workloads in AI PODs?

4. What interconnect technology links the 8 GPUs inside a UCS C885A M8 server?

5. By approximately how much do CVD-backed deployment guides reduce setup time compared to building from individual components?

Key Points: Cisco AI PODs

Architecture and Validated Design

A Cisco AI POD is a pre-validated, full-stack infrastructure bundle that combines Cisco UCS compute servers, Cisco Nexus networking switches, partner storage systems, NVIDIA GPUs, and a management software stack into a single orderable solution. Each AI POD is backed by a Cisco Validated Design (CVD) specifying exactly how the components were tested and configured in Cisco's labs.

```mermaid
graph TD
    POD["Cisco AI POD"]
    POD --> COMPUTE["Compute Tier"]
    POD --> NETWORK["Networking Tier"]
    POD --> STORAGE["Storage Tier"]
    POD --> MGMT["Management Tier"]
    COMPUTE --> C885["UCS C885A M8<br/>Training / Fine-Tuning<br/>8x NVIDIA H200 GPUs"]
    COMPUTE --> C845["UCS C845A M8<br/>Inferencing<br/>RTX PRO 6000 Blackwell GPUs"]
    NETWORK --> N9332["Nexus 9332D-GX2B<br/>32-port 400GbE"]
    NETWORK --> N9364D["Nexus 9364D-GX2A<br/>64-port 400GbE"]
    NETWORK --> N9364E["Nexus 9364E-SG2<br/>64-port 800GbE"]
    STORAGE --> NETAPP["NetApp AFF<br/>NVMe / NFS / GDS"]
    STORAGE --> PURE["Pure Storage<br/>FlashBlade//S"]
    STORAGE --> VAST["VAST Data<br/>Parallel File System"]
    MGMT --> INTERSIGHT["Cisco Intersight"]
    MGMT --> NXDASH["Nexus Dashboard"]
    MGMT --> SPLUNK["Splunk Observability"]
```
| Tier | Component | Key Specifications |
|---|---|---|
| Compute | Cisco UCS C885A M8 | 8-GPU, 8RU rack server; NVIDIA HGX architecture; 8x NVIDIA H200 GPUs via NVLink |
| Compute | Cisco UCS C845A M8 | NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (inferencing) |
| Networking | Nexus 9332D-GX2B | 1RU, 32-port 400GbE switch |
| Networking | Nexus 9364D-GX2A | 2RU, 64-port 400GbE switch |
| Networking | Nexus 9364E-SG2 | 2RU, 64-port 800GbE switch |
| Storage | NetApp All-Flash FAS (AFF) | NVMe-based; NFS, NFS over RDMA, GPU Direct Storage (GDS) |
| Storage | Pure Storage FlashBlade//S | High-performance all-flash, optimized for AI/ML |
| Storage | VAST Data | High-performance parallel file system |
| Management | Cisco Intersight | UCS server lifecycle management |
| Management | Nexus Dashboard | Network fabric management and monitoring |
| Management | Splunk Observability Cloud | End-to-end full-stack visibility |

NVIDIA Enterprise Reference Architecture (ERA)

The ERA 2-8-9-400 designation encodes the validated topology: 2 servers per building block, 8 NVIDIA H200 GPUs per server, the Nexus 9000 series switch family, and 400 Gbps of per-port bandwidth. ERA compliance certifies that the hardware and cabling topology have been tested at the GPU vendor level for maximum performance.

```mermaid
flowchart LR
    ERA["ERA 2-8-9-400"] --> S["2: servers per building block"]
    ERA --> G["8: GPUs per server (H200)"]
    ERA --> N["9: Nexus 9000 series switches"]
    ERA --> B["400: Gbps per-port bandwidth"]
```
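The positional encoding can be made concrete with a small parser. This is a hypothetical helper for illustration only, not a Cisco or NVIDIA tool; the field names are assumptions:

```python
# Hypothetical helper that decodes an NVIDIA ERA designation such as
# "2-8-9-400" into its four validated-topology fields. Illustrative only.

def parse_era(designation: str) -> dict:
    """Split an ERA string into servers, GPUs, switch family, and bandwidth."""
    servers, gpus, switch_series, port_gbps = designation.split("-")
    return {
        "servers_per_building_block": int(servers),
        "gpus_per_server": int(gpus),
        "switch_family": f"Nexus {switch_series}000 series",
        "port_bandwidth_gbps": int(port_gbps),
    }

era = parse_era("2-8-9-400")
# One 2-server building block therefore contains 2 x 8 = 16 GPUs.
gpus_per_block = era["servers_per_building_block"] * era["gpus_per_server"]
```

Reading the designation this way reinforces that "9" names the Nexus 9000 switch family, not a count of nine switches.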

GPU Compute and Networking

The UCS C885A M8 houses 8 NVIDIA H200 GPUs interconnected through NVIDIA NVLink, providing GPU-to-GPU bandwidth far exceeding PCIe alone. On the networking side, Nexus 9000 series switches deliver lossless, low-latency Ethernet at up to 800 Gbps per port, configured with priority flow control (PFC) and explicit congestion notification (ECN) to ensure zero data loss during gradient synchronization.

Deployment Scenarios and Sizing

Large-Scale Training: Multiple AI PODs with UCS C885A M8 servers for training foundation models. The high-bandwidth backend network (400G/800G per port) handles all-reduce operations across dozens or hundreds of GPUs.

Fine-Tuning: Smaller AI POD configurations (2-4 C885A servers) for adapting pre-trained models to domain-specific data.

High-Throughput Inferencing: UCS C845A M8 with RTX PRO 6000 Blackwell GPUs, optimized for throughput per watt and cost-efficiency for serving millions of inference requests.
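The three scenarios above map directly onto the two AI POD server options, which can be expressed as a small lookup. This is a hypothetical sizing helper, not a Cisco tool; the notes restate the guidance in this section:

```python
# Hypothetical helper mapping the workload scenarios described above to the
# AI POD server options in this section. Not a Cisco sizing tool.

def recommend_pod_config(workload: str) -> dict:
    """Return an illustrative server/GPU choice for a workload type."""
    configs = {
        "training": {"server": "UCS C885A M8", "gpu": "NVIDIA H200",
                     "note": "multiple PODs; 400G/800G backend for all-reduce"},
        "fine-tuning": {"server": "UCS C885A M8", "gpu": "NVIDIA H200",
                        "note": "smaller POD of 2-4 servers"},
        "inferencing": {"server": "UCS C845A M8",
                        "gpu": "RTX PRO 6000 Blackwell",
                        "note": "optimized for throughput per watt"},
    }
    return configs[workload]
```

Note that training and fine-tuning share the same server platform; only the POD size differs, while inferencing moves to a different server and GPU class entirely.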

Animation: AI POD scaling from a 2-server building block to a multi-rack training cluster, showing how gradient synchronization flows across the lossless Ethernet backend.

Post-Quiz: Cisco AI PODs

1. Which storage vendor's solution is validated for GPU Direct Storage (GDS) within AI PODs?

2. Which two congestion management mechanisms are configured on Nexus switches for AI training traffic?

3. For a budget-conscious inference-serving deployment, which GPU option does the AI POD use?

4. What is the maximum per-port bandwidth supported by the Nexus 9364E-SG2 switch?

5. What management platform handles UCS server lifecycle management within an AI POD?

Section 2: Cisco AI Canvas

Pre-Quiz: Cisco AI Canvas

1. What powers Cisco AI Canvas under the hood?

2. What is AgenticOps?

3. How large is the Deep Network Model?

4. Which three observability sources does AI Canvas unify?

5. Can AI Canvas operate in air-gapped environments?

Key Points: Cisco AI Canvas

Architecture and the Deep Network Model

Cisco AI Canvas is a generative AI workspace for IT operations, purpose-built on Cisco's Deep Network Model -- a domain-specific LLM trained on over 40 million tokens of Cisco networking knowledge including CCIE-level materials. At 8 to 30 billion parameters, the model is deliberately compact so it can run at the edge, even in air-gapped environments with no internet connectivity, while outperforming larger general-purpose models on networking tasks by up to 20%.
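The claim that an 8-30 billion-parameter model can run at the edge can be sanity-checked with simple arithmetic: weight memory scales with parameter count times bytes per parameter. The figures below are generic estimates, not Cisco specifications:

```python
# Rough weight-memory estimate for an edge-deployed LLM.
# Generic arithmetic for illustration, not a Cisco sizing guide.

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

fp16_8b = weight_memory_gb(8, 2.0)    # 8B model at fp16 -> 16 GB
int4_30b = weight_memory_gb(30, 0.5)  # 30B model at 4-bit -> 15 GB
```

Both configurations fit on a single data-center GPU (activation memory and KV cache add overhead on top), which is what makes air-gapped edge deployment plausible for a model in this size range.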

AgenticOps and Workflow Orchestration

AI Canvas introduces AgenticOps, a paradigm in which AI agents act autonomously based on context rather than waiting for step-by-step human instruction. A typical troubleshooting workflow proceeds as follows:

```mermaid
sequenceDiagram
    participant Op as Operator
    participant Asst as Cisco AI Assistant
    participant Agent as AI Agent(s)
    participant Infra as Infrastructure
    Op->>Asst: Describe issue in natural language
    Asst->>Agent: Identify symptoms and delegate
    Agent->>Infra: Collect telemetry (ThousandEyes, Meraki, Splunk)
    Infra-->>Agent: Return observability data
    Agent->>Agent: Structured reasoning and root cause analysis
    Agent->>Agent: Generate dynamic runbook
    Agent-->>Asst: Diagnosis + recommended actions
    Asst-->>Op: Present findings and remediation plan
    Op->>Asst: Approve action
    Asst->>Agent: Execute configuration change
    Agent->>Infra: Apply remediation
    Infra-->>Agent: Confirm change
    Agent->>Agent: Continuous learning (record outcome)
    Agent-->>Asst: Resolution confirmed
    Asst-->>Op: Issue resolved
```

Integration and Unified Observability

AI Canvas consolidates telemetry from across the Cisco ecosystem into a single, real-time view: ThousandEyes for internet and WAN path visibility, Meraki for wireless, switching, and SD-WAN telemetry, and Splunk for log analytics and security events. This eliminates the "swivel chair" problem of switching between multiple dashboards. Visualizations are generated dynamically and tailored to each specific incident.
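Mechanically, "unified observability" means merging events from separate sources into one timeline instead of reading three dashboards. A toy sketch, with invented event fields (only the source names come from the text):

```python
# Toy sketch: merge events from three telemetry sources into a single
# timeline, the way a unified view replaces separate dashboards.
# Event fields ("t", "src", "msg") are illustrative inventions.

from operator import itemgetter

thousandeyes = [{"t": 100, "src": "ThousandEyes", "msg": "WAN path loss 4%"}]
meraki = [{"t": 98, "src": "Meraki", "msg": "AP offline: floor-2-ap-7"}]
splunk = [{"t": 103, "src": "Splunk", "msg": "auth failures spike"}]

# One chronological view across all sources.
timeline = sorted(thousandeyes + meraki + splunk, key=itemgetter("t"))
sources = [e["src"] for e in timeline]
```

An operator scanning the merged timeline sees that the Meraki AP outage precedes the WAN and auth symptoms, which is exactly the cross-source correlation the "swivel chair" workflow makes hard.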

Use Cases

| Use Case | How AI Canvas Helps |
|---|---|
| Incident triage | Correlates alerts from multiple sources, identifies root cause, suggests remediation |
| Capacity planning | Analyzes historical trends and predicts when thresholds will be exceeded |
| Change validation | Generates pre-change and post-change verification procedures dynamically |
| Cross-domain troubleshooting | Coordinates agents across network, security, and cloud domains |
| Team collaboration | Shared dashboards, persistent sessions, invite-based collaboration |
Animation: AgenticOps workflow showing an operator describing a network issue in natural language, the AI Assistant delegating to specialized agents, collecting telemetry, and executing autonomous remediation.

Post-Quiz: Cisco AI Canvas

1. How much Cisco networking knowledge was the Deep Network Model trained on?

2. What does "dynamic runbook generation" mean in the context of AgenticOps?

3. What problem does AI Canvas's unified observability solve?

4. Which third-party system can AI Canvas integrate with for ticket management?

5. What serves as the conversational interface to AgenticOps capabilities?

Section 3: Cisco Hyperfabric AI

Pre-Quiz: Cisco Hyperfabric AI

1. How is Cisco Hyperfabric AI primarily managed?

2. What serves as the "single source of truth" in Hyperfabric deployments?

3. What is the target user persona for Hyperfabric AI?

4. What provisioning method does Hyperfabric use?

5. How does Hyperfabric monitor fabric health differently from traditional approaches?

Key Points: Cisco Hyperfabric AI

Architecture and Capabilities

Cisco Nexus Hyperfabric AI is a cloud-managed, full-stack AI infrastructure solution delivered as a combined hardware, software, and service offering. The hardware stack includes Cisco Silicon One switches, Cisco 6000 Series Switches, N9100 and N9300 Series Switches for spine/leaf fabrics, and Cisco UCS C885A M8 Rack Servers for GPU compute. The entire stack is NVIDIA ERA-compliant.

The Cloud Controller

The defining feature is the cloud controller -- a scalable, distributed, multitenant service at hyperfabric.cisco.com. It manages fabrics regardless of geographic location and handles: fabric design and blueprint creation, zero-touch provisioning, continuous monitoring, firmware upgrades, and integration with Cisco Commerce for automated quoting/ordering.

Blueprint-Based Provisioning

A blueprint contains: physical components (switches, optics, servers, airflow, power specs), cabling plan, bill of materials integrated with Cisco Commerce, and logical configuration (VLANs, routing protocols, fabric parameters). The desired end-state is declared once, and the system drives all devices toward that state automatically.
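Declaring a desired end-state once and letting the system converge every device toward it is the classic declarative reconcile pattern. A minimal sketch, assuming a dict-shaped blueprint whose keys mirror the list above; none of this is the actual Hyperfabric API:

```python
# Minimal reconcile-loop sketch of blueprint-based provisioning.
# Blueprint keys mirror the section text; this is NOT the Hyperfabric API.

blueprint = {  # desired end-state, declared once
    "vlans": {10, 20, 30},
    "routing": "bgp-evpn",
}

def reconcile(device_state: dict, desired: dict) -> list:
    """Compute the actions needed to drive one device to the desired state."""
    actions = []
    for vlan in sorted(desired["vlans"] - device_state.get("vlans", set())):
        actions.append(f"add vlan {vlan}")
    if device_state.get("routing") != desired["routing"]:
        actions.append(f"set routing {desired['routing']}")
    return actions

# A freshly racked leaf is missing two VLANs and its routing config.
leaf1 = {"vlans": {10}, "routing": None}
plan = reconcile(leaf1, blueprint)
```

Running `reconcile` continuously against live device state is also, in essence, what the "blueprint-verification protocol" does on Day 2: an empty action list means actual matches intended.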

The Deployment Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Day0
    state "Day 0: Design and Plan" as Day0 {
        d0a: Access cloud portal
        d0b: Visual designer tool
        d0c: Generate blueprint with BOM and cabling
        d0d: Order via Cisco Commerce
        d0a --> d0b
        d0b --> d0c
        d0c --> d0d
    }
    Day0 --> Day1
    state "Day 1: Deploy and Validate" as Day1 {
        d1a: Rack and cable switches and servers
        d1b: Zero-touch plug-and-play provisioning
        d1c: Real-time topology validation
        d1a --> d1b
        d1b --> d1c
    }
    Day1 --> Day2
    state "Day 2+: Operate and Scale" as Day2 {
        d2a: Continuous cloud monitoring
        d2b: Firmware upgrades and scaling
        d2c: Blueprint-verification protocol
        d2a --> d2b
        d2b --> d2c
        d2c --> d2a
    }
```

Hyperfabric AI vs. Traditional Fabric Management (NDFC)

| Aspect | Hyperfabric AI | Traditional NDFC |
|---|---|---|
| Deployment model | Cloud-managed fabric-as-a-service | On-premises management platform |
| Target user | IT generalists, application teams, DevOps | Network engineers with deep CLI expertise |
| Provisioning | Blueprint-based, zero-touch plug-and-play | Intent-based via NDFC UI or CLI templates |
| Monitoring | Blueprint-verification protocol (actual vs. intended) | Syslog, SNMP, streaming telemetry |
| Hardware model | Turnkey, vertically integrated full stack | Software overlay on heterogeneous hardware |
| AI optimization | Premium AI tier, GPU/DPU integration, ERA-compliant | General-purpose fabric controller |
| Expertise required | Minimal | Significant networking expertise |
| Scaling | Cloud-native, multi-geography from single portal | Within single NDFC instance or cluster |
Animation: Side-by-side comparison showing a traditional NDFC deployment (engineer configuring switches via CLI) vs. Hyperfabric (IT generalist using cloud portal with blueprint auto-provisioning devices).

Post-Quiz: Cisco Hyperfabric AI

1. What four elements does a Hyperfabric blueprint contain?

2. During Day 1 deployment, how do switches connect to the cloud controller?

3. How does Hyperfabric integrate with Cisco Commerce?

4. What type of switching silicon does Hyperfabric AI use?

5. Cisco validated Hyperfabric by using it internally as what?

Section 4: Solution Comparison and Selection

Pre-Quiz: Solution Comparison and Selection

1. An enterprise with existing Nexus/UCS infrastructure and skilled network engineers wants to add AI training. Which solution is recommended?

2. A greenfield AI deployment with limited networking staff should choose which solution?

3. Which solution addresses multi-vendor alert fatigue and slow incident resolution?

4. How do AI PODs scale?

5. In TCO analysis, which cost category most favors Hyperfabric AI over AI PODs?

Key Points: Solution Comparison and Selection

Mapping Workloads to Solutions

```mermaid
flowchart TD
    START(["What is the primary need?"]) --> Q1{"New AI infrastructure or operations optimization?"}
    Q1 -->|"Operations optimization"| CANVAS["AI Canvas: AgenticOps + unified observability"]
    Q1 -->|"New AI infrastructure"| Q2{"Existing Cisco/networking expertise on staff?"}
    Q2 -->|"Yes"| Q3{"Greenfield site or existing Nexus/UCS?"}
    Q2 -->|"No -- IT generalists"| HF["Hyperfabric AI: Cloud-managed, zero-touch"]
    Q3 -->|"Existing infrastructure"| PODS["AI PODs: Pre-validated building blocks"]
    Q3 -->|"Greenfield deployment"| Q4{"Want full lifecycle cloud management?"}
    Q4 -->|"Yes"| HF
    Q4 -->|"No"| PODS
    PODS --> COMBO{"Need intelligent Day 2 operations?"}
    HF --> COMBO
    COMBO -->|"Yes"| PLUS["Add AI Canvas for AgenticOps"]
    COMBO -->|"No"| DONE(["Solution selected"])
    PLUS --> DONE
```
| Scenario | Recommended Solution | Rationale |
|---|---|---|
| Enterprise with existing Nexus/UCS wants AI training | AI PODs | Leverages existing expertise; integrates with current tools |
| Greenfield AI deployment, limited networking staff | Hyperfabric AI | Cloud-managed lifecycle; blueprint-based provisioning reduces errors |
| Multi-vendor alert fatigue, slow incident resolution | AI Canvas | Unifies observability; AgenticOps automates troubleshooting |
| Large-scale AI factory needing infra + ops intelligence | Hyperfabric AI + AI Canvas | Hyperfabric manages fabric lifecycle; AI Canvas optimizes Day 2 ops |
| Budget-conscious AI inference serving | AI PODs (Inferencing) | RTX PRO 6000 GPUs optimize cost-per-inference |

Scalability Considerations

| Solution | Scaling Model | Practical Limits | Scaling Trigger |
|---|---|---|---|
| AI PODs | Horizontal -- add more POD building blocks | Physical data center capacity and fabric design | Need for additional GPU compute or storage |
| AI Canvas | Cloud-native SaaS | Scales with Cisco cloud infrastructure | Growth in monitored devices, incidents, or team size |
| Hyperfabric AI | Cloud-managed -- add fabrics from portal | Multi-geography from single controller | New sites, fabric expansions, geographic growth |

Total Cost of Ownership (TCO)

| Cost Category | AI PODs | AI Canvas | Hyperfabric AI |
|---|---|---|---|
| Hardware acquisition | Bundled pricing | N/A (software/service) | Full-stack pricing (HW+SW+service) |
| Deployment labor | Reduced ~50% via CVDs | N/A | Minimized via zero-touch + blueprints |
| Operational expertise | Requires skilled engineers | Reduces specialized troubleshooting need | Enables IT generalists |
| Management infra | Intersight + Nexus Dashboard | SaaS-delivered | Cloud controller included |
| Ongoing maintenance | Standard Cisco TAC support | SaaS subscription | Included in service model |
| Integration cost | Moderate | Low (native Cisco ecosystem) | Low (vertically integrated) |

Worked Example: An organization deploying 16 NVIDIA H200 GPUs can choose AI PODs (lower subscription, higher labor due to specialized engineers) or Hyperfabric AI (higher subscription, dramatically lower labor). For organizations without deep networking expertise, Hyperfabric AI often delivers lower three-year TCO despite higher per-unit pricing.
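The labor-versus-subscription trade-off in the worked example can be made concrete with placeholder arithmetic. Every dollar figure below is invented for illustration; none is Cisco pricing:

```python
# Illustrative 3-year TCO comparison. All dollar figures are invented
# placeholders showing the labor-vs-subscription trade-off, not Cisco pricing.

def three_year_tco(hardware: float, annual_subscription: float,
                   annual_labor: float, years: int = 3) -> float:
    """Simple TCO: one-time hardware plus recurring subscription and labor."""
    return hardware + years * (annual_subscription + annual_labor)

# Same hardware spend for 16 H200 GPUs; AI PODs assume costlier specialist
# labor, Hyperfabric a higher subscription but generalist operations.
ai_pods = three_year_tco(hardware=2_000_000, annual_subscription=100_000,
                         annual_labor=400_000)
hyperfabric = three_year_tco(hardware=2_000_000, annual_subscription=250_000,
                             annual_labor=120_000)
```

With these made-up inputs, the annual labor delta ($280k) exceeds the subscription premium ($150k), so Hyperfabric comes out ahead over three years; the crossover point shifts with an organization's actual staffing costs.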

Animation: Decision tree walkthrough -- user selects their organization's characteristics (expertise level, greenfield vs. brownfield, workload type) and the tree highlights the recommended solution path with TCO comparison bars.

Post-Quiz: Solution Comparison and Selection

1. A large-scale AI factory needs both infrastructure management and intelligent operations. What combination is recommended?

2. Why might Hyperfabric AI deliver lower 3-year TCO than AI PODs despite higher per-unit pricing?

3. What is the scaling model for AI Canvas?

4. Which solution requires on-premises management platforms like Intersight and Nexus Dashboard?

5. For a budget-conscious organization focused on AI inference, which specific GPU configuration is recommended?

Your Progress

Answer Explanations