Chapter 18: Software Management and Network Health Monitoring

Learning Objectives

Section 1: Software Image Management (SWIM)

Pre-Quiz — Section 1: Software Image Management

1. Which step in the SWIM workflow is a mandatory policy gate that must be completed before Catalyst Center will allow image distribution to proceed?

A. Import Image B. Tag as Golden Image C. Distribute Image D. Poll Task Status

2. A SWIM distribution call returns immediately with a taskId. What is the correct subsequent action?

A. Wait a fixed 10 minutes, then check device version B. Poll /dna/intent/api/v1/task/{task_id} until endTime is populated C. Issue the activation call immediately without waiting D. Check the Software Images dashboard in the GUI only

3. Which parameter in the SWIM activation API allows scheduling a device reload for a future maintenance window?

A. maintenanceWindow B. delayActivation C. scheduleAt D. activationTime

4. Which Ansible module from the cisco.dnac collection handles the full SWIM lifecycle declaratively?

A. cisco.dnac.image_distribution B. cisco.dnac.swim_workflow_manager C. cisco.dnac.software_upgrade D. cisco.ios.software_install

5. During SWIM, which step causes service interruption on the target device?

A. Import — the image binary is uploaded to Catalyst Center B. Tag as Golden — compliance policy is applied C. Distribute — the image is copied to device flash/disk D. Activate — the device reloads to boot the new image

1.1 What Is SWIM?

Software Image Management (SWIM) is Catalyst Center's lifecycle automation framework for network device operating system images. It replaces ad-hoc manual processes with a governed pipeline that enforces approval gates, tracks compliance, and coordinates upgrades at scale.

Think of SWIM as a combination of an enterprise software package manager (like apt or yum) and a change-management workflow engine. SWIM maintains a repository of network OS images and enforces the concept of a golden image — the single approved version for each device family and role.

1.2 The Five-Step SWIM Workflow

The SWIM lifecycle consists of five sequential operations. Distribution and activation are asynchronous — they return a taskId immediately and require polling for completion.

Animation: SWIM Five-Step Pipeline
1. Import: Upload binary to DNAC repository
2. Tag Golden: Mark approved for device family + site
3. Distribute: Push to device flash (no disruption)
4. Activate: Schedule reload (disruptive)
5. Poll Task: Wait for endTime or error
flowchart TD
    A([Start: Security Advisory or Version Policy]) --> B[Step 1: Import Image\nUpload binary to DNAC\nrepository via URL or file]
    B --> C{Import task\ncomplete?}
    C -- Poll taskId --> C
    C -- endTime populated --> D[Step 2: Tag as Golden\nAssign approved image to\ndevice family + role + site]
    D --> E[Step 3: Distribute\nPush image binary to\ndevice flash/disk via HTTPS/SFTP\nNo service interruption]
    E --> F{Distribution task\ncomplete?}
    F -- Poll taskId --> F
    F -- endTime populated --> G[Step 4: Activate\nSchedule reload for\nmaintenance window\nscheduleAt parameter]
    G --> H{Activation task\ncomplete?\nTimeout: 1800s}
    H -- Poll taskId --> H
    H -- endTime populated --> I([Device running\nnew golden image])
    H -- isError=true --> J([Raise RuntimeError\ncheck failureReason])
    style A fill:#1a4a7a,color:#fff,stroke:#0d2d4a
    style I fill:#1a6b3a,color:#fff,stroke:#0d3d20
    style J fill:#8b1a1a,color:#fff,stroke:#5a0d0d
    style D fill:#4a3a7a,color:#fff,stroke:#2d2050

1.3 Core SWIM API Endpoints

| Operation | Method | Endpoint |
| --- | --- | --- |
| Import image via URL | POST | /dna/intent/api/v1/image/importation/source/url |
| List imported images | GET | /dna/intent/api/v1/image/importation |
| Tag as golden image | POST | /dna/intent/api/v1/image/importation/golden |
| Distribute to device | POST | /dna/intent/api/v1/image/distribution |
| Activate on device | POST | /dna/intent/api/v1/image/activation/device |
| Check task status | GET | /dna/intent/api/v1/task/{task_id} |

All endpoints require the X-Auth-Token header obtained from /dna/system/api/v1/auth/token.
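As a sketch of this request flow (the base URL, credentials, and response handling are illustrative; the token is returned in the `Token` field of the auth response), a minimal client might look like:

```python
import requests

def get_token(base_url, username, password):
    """Obtain an X-Auth-Token from the Catalyst Center auth endpoint."""
    resp = requests.post(f"{base_url}/dna/system/api/v1/auth/token",
                         auth=(username, password))
    resp.raise_for_status()
    return resp.json()["Token"]

def auth_headers(token):
    """Header block every Intent API endpoint requires."""
    return {"X-Auth-Token": token, "Content-Type": "application/json"}

def list_images(base_url, token):
    """GET the imported-image inventory."""
    resp = requests.get(f"{base_url}/dna/intent/api/v1/image/importation",
                        headers=auth_headers(token))
    resp.raise_for_status()
    return resp.json()["response"]
```

The same `auth_headers()` helper then serves every SWIM call in the table above.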

1.4 Golden Image Compliance Enforcement

Once you tag an image as golden for a site/family/role combination, Catalyst Center continuously evaluates every device for compliance. Non-compliant devices (running a non-golden OS version) can be queried programmatically and automatically upgraded:

GET /dna/intent/api/v1/image/importation?isTaggedGolden=false&siteId=<uuid>
flowchart TD
    A([Scheduled Compliance Check]) --> B["GET /image/importation\n?isTaggedGolden=false&siteId=X"]
    B --> C{Non-compliant\ndevices found?}
    C -- No --> Z([All devices compliant\nLog and exit])
    C -- Yes --> D[Open change ticket\nor auto-initiate SWIM]
    D --> E[Tag golden image\nfor site + role]
    E --> F["POST /image/distribution\nReturns taskId"]
    F --> G["Poll /task/{taskId}\nevery 10s"]
    G --> H{task.endTime\npopulated?}
    H -- No, elapsed < timeout --> G
    H -- isError = true --> I([Raise RuntimeError\nfailureReason logged])
    H -- Yes --> J["POST /image/activation\nscheduleAt = maintenance window\nReturns taskId"]
    J --> K["Poll /task/{taskId}\nevery 10s, timeout 1800s"]
    K --> L{Activation\ncomplete?}
    L -- No --> K
    L -- isError = true --> I
    L -- Yes --> M([Device upgraded\nUpdate compliance record])
    style A fill:#1a4a7a,color:#fff,stroke:#0d2d4a
    style Z fill:#1a6b3a,color:#fff,stroke:#0d3d20
    style M fill:#1a6b3a,color:#fff,stroke:#0d3d20
    style I fill:#8b1a1a,color:#fff,stroke:#5a0d0d
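The compliance query can be assembled programmatically. This sketch only builds the URL; the base URL and site UUID are placeholders:

```python
from urllib.parse import urlencode

def noncompliance_url(base_url, site_id):
    """Build the golden-image non-compliance query for a site."""
    params = urlencode({"isTaggedGolden": "false", "siteId": site_id})
    return f"{base_url}/dna/intent/api/v1/image/importation?{params}"
```

Issue the resulting URL with the usual X-Auth-Token header; the response lists devices still running a non-golden version.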

1.5 Python SDK — Async Task Polling Pattern

The dnacentersdk library wraps all SWIM REST endpoints. The critical pattern every SWIM integration must implement is async task polling:

import time

# `api` is an authenticated dnacentersdk DNACenterAPI client created elsewhere.

def poll_task(task_id, timeout=600, interval=10):
    """Poll a Catalyst Center async task until completion or timeout."""
    elapsed = 0
    while elapsed < timeout:
        result = api.task.get_task_by_id(task_id=task_id)
        task_data = result.response
        if task_data.isError:
            raise RuntimeError(f"Task failed: {task_data.failureReason}")
        if task_data.endTime:  # endTime is populated when the task completes
            return task_data
        time.sleep(interval)
        elapsed += interval
    raise TimeoutError(f"Task {task_id} did not complete within {timeout}s")

1.6 Ansible SWIM: swim_workflow_manager

The cisco.dnac.swim_workflow_manager Ansible module handles the full lifecycle in a single task. The dnac_api_task_timeout and dnac_task_poll_interval parameters control async wait behavior. Setting taggingPriority: true supersedes any previously tagged golden image for the same combination.
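A playbook task using the module might be shaped as follows. The connection parameters are standard across the cisco.dnac collection; the keys under config are paraphrased from the module documentation and should be verified against your installed collection version:

```yaml
- name: SWIM full lifecycle (sketch; config schema abbreviated)
  cisco.dnac.swim_workflow_manager:
    dnac_host: "{{ dnac_host }}"
    dnac_username: "{{ dnac_username }}"
    dnac_password: "{{ dnac_password }}"
    dnac_api_task_timeout: 1800     # async wait behavior
    dnac_task_poll_interval: 10
    config:
      - import_image_details:       # assumed key names; check your collection version
          type: url
          url_details:
            payload:
              - source_url: "https://repo.example.com/cat9k_iosxe.17.09.04a.SPA.bin"
```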

1.7 SWIM at Scale: Scheduling Maintenance Windows

The scheduleAt parameter accepts a UTC epoch timestamp in milliseconds. Distribution can happen during business hours (non-disruptive) while activation is deferred to the weekend window — enabling fire-and-forget upgrade campaigns across hundreds of devices.
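Computing that timestamp in Python (the window date below is illustrative):

```python
from datetime import datetime, timezone

def maintenance_window_epoch_ms(year, month, day, hour, minute=0):
    """UTC epoch milliseconds, the format scheduleAt expects."""
    dt = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# Example: a Saturday 02:00 UTC window
schedule_at = maintenance_window_epoch_ms(2024, 6, 15, 2)
```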

Key Points — Section 1: SWIM

Post-Quiz — Section 1: Software Image Management

1. Which step in the SWIM workflow is a mandatory policy gate that must be completed before Catalyst Center will allow image distribution to proceed?

A. Import Image B. Tag as Golden Image C. Distribute Image D. Poll Task Status

2. A SWIM distribution call returns immediately with a taskId. What is the correct subsequent action?

A. Wait a fixed 10 minutes, then check device version B. Poll /dna/intent/api/v1/task/{task_id} until endTime is populated C. Issue the activation call immediately without waiting D. Check the Software Images dashboard in the GUI only

3. Which parameter in the SWIM activation API allows scheduling a device reload for a future maintenance window?

A. maintenanceWindow B. delayActivation C. scheduleAt D. activationTime

4. Which Ansible module from the cisco.dnac collection handles the full SWIM lifecycle declaratively?

A. cisco.dnac.image_distribution B. cisco.dnac.swim_workflow_manager C. cisco.dnac.software_upgrade D. cisco.ios.software_install

5. During SWIM, which step causes service interruption on the target device?

A. Import — the image binary is uploaded to Catalyst Center B. Tag as Golden — compliance policy is applied C. Distribute — the image is copied to device flash/disk D. Activate — the device reloads to boot the new image

Section 2: Network Health Monitoring with Catalyst Center

Pre-Quiz — Section 2: Network Health Monitoring

1. An individual device has System Health = 9, Data Plane Connectivity = 3, and Control Plane Connectivity = 8. What is its Device Health Score?

A. 9 (the highest component score) B. 6.67 (the average of all three) C. 3 (the minimum of all three) D. 20 (the sum of all three)

2. Catalyst Center's overall Network Health Score (%) is calculated as:

A. Average of all individual device scores B. Percentage of devices with a score in the 8–10 healthy range divided by total devices C. Number of healthy devices minus number of unhealthy devices D. Percentage of devices reachable via SNMP polling

3. Which Catalyst Center Assurance API endpoint returns per-device and aggregate infrastructure health scores?

A. GET /dna/intent/api/v1/client-health B. GET /dna/intent/api/v1/application-health C. GET /dna/intent/api/v1/network-health D. GET /dna/intent/api/v1/device-health

4. For Application Health scoring, which three KPIs are evaluated against CVD thresholds?

A. CPU utilization, memory usage, and interface errors B. Packet loss, network latency, and jitter C. Uptime, reachability, and SNMP response time D. Throughput, VLAN count, and spanning-tree convergence time

5. How do you retrieve Catalyst Center Assurance health data for a specific point in the past (e.g., during a reported outage)?

A. Query a separate historical archive API at /dna/intent/api/v1/history B. Pass a timestamp query parameter (epoch milliseconds) to the standard health API C. Historical data is only accessible through the Catalyst Center GUI, not via API D. Use the startTime and endTime parameters on the inventory API

2.1 The Assurance Architecture

Catalyst Center Assurance continuously collects telemetry from every managed device using SNMP polling, model-driven streaming telemetry (gRPC/gNMI), syslog ingestion, NetFlow records, and 802.11 wireless radio data. Raw telemetry is normalized, correlated, and aggregated into health scores that update every five minutes.

| Domain | What It Measures | API Endpoint |
| --- | --- | --- |
| Network Health | Infrastructure devices (switches, routers, APs, WLCs) | GET /dna/intent/api/v1/network-health |
| Client Health | Endpoint connectivity (wired and wireless) | GET /dna/intent/api/v1/client-health |
| Application Health | Business application performance (latency, loss, jitter) | GET /dna/intent/api/v1/application-health |

2.2 The Health Scoring Model

Individual Device Health Score uses a weakest-link model:

Device Health Score = MIN(System Health, Data Plane Connectivity, Control Plane Connectivity)

Scores: 8–10 = Healthy, 4–7 = Fair, 1–3 = Poor. A device scoring 9/3/8 gets an overall score of 3.

Overall Network Health Score:

Network Health Score (%) = (Count of Devices with Score 8-10) / (Total Monitored Devices) × 100

Devices in maintenance mode are excluded from this calculation.
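The two formulas can be expressed directly. This is a sketch of the scoring model, not Catalyst Center's internal code:

```python
def device_health_score(system, data_plane, control_plane):
    """Weakest-link model: the overall score is the worst component."""
    return min(system, data_plane, control_plane)

def network_health_pct(device_scores, in_maintenance=()):
    """Percentage of monitored devices scoring in the 8-10 healthy range.

    in_maintenance: indices of devices excluded from the calculation.
    """
    monitored = [s for i, s in enumerate(device_scores) if i not in in_maintenance]
    healthy = sum(1 for s in monitored if s >= 8)
    return round(100 * healthy / len(monitored), 1)
```

So the 9/3/8 device above scores 3, and a fleet of five devices scoring [9, 3, 10, 8, 5] yields a network health of 60%.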

Animation: Network Health Score Calculation
(Interactive: five campus devices (SW-CORE-01, SW-ACC-02, AP-FLOOR3, RTR-WAN-01, SW-DIST-01), each scored as MIN of its three component scores, then aggregated into the overall Network Health Score.)
flowchart LR
    subgraph TELEMETRY["Telemetry Sources"]
        T1[SNMP Polling]
        T2[gRPC/gNMI\nStreaming Telemetry]
        T3[Syslog Ingestion]
        T4[NetFlow Records]
        T5[802.11 Wireless\nRadio Data]
    end
    subgraph DEVICE_SCORE["Per-Device Scoring\n(Weakest-Link Model)"]
        D1[System Health\n1-10]
        D2[Data Plane\nConnectivity 1-10]
        D3[Control Plane\nConnectivity 1-10]
        D4["Device Score =\nMIN(D1, D2, D3)"]
        D1 --> D4
        D2 --> D4
        D3 --> D4
    end
    subgraph AGGREGATE["Aggregate Score Calculation"]
        A1["Healthy Devices\nScore 8-10"]
        A2["Fair Devices\nScore 4-7"]
        A3["Poor Devices\nScore 1-3"]
        A4["Network Health % =\nHealthy Count / Total x 100\n(maintenance mode excluded)"]
        A1 --> A4
        A2 --> A4
        A3 --> A4
    end
    TELEMETRY --> DEVICE_SCORE
    DEVICE_SCORE --> AGGREGATE
    style TELEMETRY fill:#1a2a4a,color:#fff,stroke:#0d1a2d
    style DEVICE_SCORE fill:#2a1a4a,color:#fff,stroke:#1a0d2d
    style AGGREGATE fill:#1a3a2a,color:#fff,stroke:#0d2018

2.3 Client Health Score

Client health uses the same 8–10 healthy threshold but is maintained separately for wired and wireless populations. This separation prevents a large healthy wired fleet from masking a spike in wireless issues after an AP firmware upgrade.

2.4 Application Health Score and CVD Thresholds

| Traffic Class | Latency Threshold | Packet Loss | Jitter |
| --- | --- | --- | --- |
| Voice | < 150 ms | < 1% | < 30 ms |
| Video | < 200 ms | < 1% | < 50 ms |
| Transactional | < 300 ms | < 3% | N/A |
| Bulk Data | < 500 ms | < 5% | N/A |

Thresholds are customizable per traffic class via PUT /dna/intent/api/v1/AssuranceGetHealthScoreDefinitions.
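A sketch of evaluating measured KPIs against the CVD thresholds in the table (threshold values copied from above; the function and its signature are illustrative):

```python
CVD_THRESHOLDS = {  # class: (max latency ms, max loss %, max jitter ms or None)
    "voice": (150, 1, 30),
    "video": (200, 1, 50),
    "transactional": (300, 3, None),
    "bulk": (500, 5, None),
}

def app_kpis_healthy(traffic_class, latency_ms, loss_pct, jitter_ms=None):
    """True if every applicable KPI is within its CVD threshold."""
    max_lat, max_loss, max_jit = CVD_THRESHOLDS[traffic_class]
    if latency_ms >= max_lat or loss_pct >= max_loss:
        return False
    if max_jit is not None and jitter_ms is not None and jitter_ms >= max_jit:
        return False
    return True
```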

2.5 Historical Health Queries

All three Assurance APIs accept an optional timestamp query parameter (epoch milliseconds) for point-in-time historical retrieval. Catalyst Center retains Assurance data for a configurable period (typically 90 days).

import calendar
from datetime import datetime, timezone

# target_time: the historical moment of interest, in UTC (illustrative value).
# token and get_health() come from the chapter's earlier examples.
target_time = datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc)
ts_ms = int(calendar.timegm(target_time.timetuple()) * 1000)
historical_health = get_health(token, "network-health", params={"timestamp": ts_ms})

Key Points — Section 2: Health Monitoring

Post-Quiz — Section 2: Network Health Monitoring

1. An individual device has System Health = 9, Data Plane Connectivity = 3, and Control Plane Connectivity = 8. What is its Device Health Score?

A. 9 (the highest component score) B. 6.67 (the average of all three) C. 3 (the minimum of all three) D. 20 (the sum of all three)

2. Catalyst Center's overall Network Health Score (%) is calculated as:

A. Average of all individual device scores B. Percentage of devices with a score in the 8–10 healthy range divided by total devices C. Number of healthy devices minus number of unhealthy devices D. Percentage of devices reachable via SNMP polling

3. Which Catalyst Center Assurance API endpoint returns per-device and aggregate infrastructure health scores?

A. GET /dna/intent/api/v1/client-health B. GET /dna/intent/api/v1/application-health C. GET /dna/intent/api/v1/network-health D. GET /dna/intent/api/v1/device-health

4. For Application Health scoring, which three KPIs are evaluated against CVD thresholds?

A. CPU utilization, memory usage, and interface errors B. Packet loss, network latency, and jitter C. Uptime, reachability, and SNMP response time D. Throughput, VLAN count, and spanning-tree convergence time

5. How do you retrieve Catalyst Center Assurance health data for a specific point in the past (e.g., during a reported outage)?

A. Query a separate historical archive API at /dna/intent/api/v1/history B. Pass a timestamp query parameter (epoch milliseconds) to the standard health API C. Historical data is only accessible through the Catalyst Center GUI, not via API D. Use the startTime and endTime parameters on the inventory API

Section 3: Monitoring with Meraki and SD-WAN

Pre-Quiz — Section 3: Meraki and SD-WAN Monitoring

1. How does authentication work with the Meraki Dashboard API?

A. OAuth 2.0 bearer token obtained from a token endpoint B. Session cookie obtained from POST /j_security_check C. An API key passed in the X-Cisco-Meraki-API-Key header D. Basic authentication with username and password on every request

2. Which Meraki API endpoint provides online/offline/alerting status for all devices across an entire organization in a single call?

A. GET /networks/{networkId}/devices B. GET /organizations/{orgId}/devices/statuses C. GET /organizations/{orgId}/inventory/devices D. GET /organizations/{orgId}/health/summary

3. What authentication mechanism does vManage (Cisco SD-WAN) use for its REST API?

A. API key in the X-Auth-Token header B. Session cookie from POST /j_security_check C. Bearer token from OAuth 2.0 flow D. Client certificate mutual TLS

3.1 Meraki API-Based Health Monitoring

Unlike Catalyst Center (on-premises), Meraki monitoring is cloud-native. All telemetry flows to the Meraki cloud dashboard and is accessible via REST API using an API key in the X-Cisco-Meraki-API-Key header at base URL https://api.meraki.com/api/v1/.

| Endpoint | Description |
| --- | --- |
| GET /organizations/{orgId}/devices/statuses | Online/offline/alerting status for all org devices |
| GET /networks/{networkId}/devices/{serial}/lossAndLatencyHistory | Per-device loss and latency time-series |
| GET /organizations/{orgId}/summary/top/devices/byUsage | Top devices by traffic volume |
| GET /organizations/{orgId}/uplinks/statuses | WAN uplink status for all MX appliances |

3.2 SD-WAN (vManage) Health Monitoring

vManage REST API uses session cookie auth from POST /j_security_check. Key endpoints include GET /dataservice/device for inventory/status, GET /dataservice/device/counters for OMP/BFD counters, and GET /dataservice/alarms for active fabric alarms.
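A minimal session-login sketch follows; the failure check reflects vManage's habit of returning the HTML login page with HTTP 200 on bad credentials. Note that newer vManage versions additionally require an X-XSRF-TOKEN (from GET /dataservice/client/token) on write operations, omitted here:

```python
import requests

def login_form(username, password):
    """Form payload for POST /j_security_check."""
    return {"j_username": username, "j_password": password}

def vmanage_session(host, username, password):
    """Authenticate and return a requests.Session carrying the session cookie."""
    s = requests.Session()
    r = s.post(f"https://{host}/j_security_check",
               data=login_form(username, password))
    r.raise_for_status()
    # Bad credentials come back as the HTML login page with HTTP 200
    if "<html" in r.text.lower():
        raise RuntimeError("vManage authentication failed")
    return s

# Usage (illustrative):
#   s = vmanage_session("vmanage.example.com", "admin", "pw")
#   devices = s.get("https://vmanage.example.com/dataservice/device").json()["data"]
```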

3.3 Cross-Platform Health Aggregation

Large enterprises span multiple controllers: Catalyst Center (campus), vManage (SD-WAN), and Meraki (branches). A normalization layer translates controller-specific schemas into a common format:

# Catalyst Center device → common schema
{"source": "Catalyst Center",
 "health_score": device.overallHealth,
 "status": "healthy" if device.overallHealth >= 8 else "degraded"}

# Meraki device → common schema
{"source": "Meraki", "health_score": 10 if status == "online" else 1}

# SD-WAN device → common schema
{"source": "SD-WAN", "health_score": 10 if reachability == "reachable" else 1}
flowchart LR
    subgraph SOURCES["Controller Data Sources"]
        CC["Catalyst Center\nAssurance API\nX-Auth-Token header"]
        VM["vManage API\nSD-WAN Fabric\nSession cookie auth"]
        MK["Meraki Dashboard API\nCloud-Managed Branches\nX-Cisco-Meraki-API-Key header"]
    end
    subgraph NORM["Normalization Layer\n(Python Service)"]
        N1["normalize_to_common_schema()\nsource: catalyst_center\nhealth_score, status"]
        N2["normalize_to_common_schema()\nsource: sdwan\nreachability to score"]
        N3["normalize_to_common_schema()\nsource: meraki\nonline status to score"]
    end
    subgraph DEDUP["Alert Processing"]
        AD["Correlate by 60s window"]
        AR["De-duplicate by root cause"]
        AE["Enrich with topology context"]
        AS["Suppress during maintenance"]
        AD --> AR --> AE --> AS
    end
    subgraph OUTPUTS["Downstream Systems"]
        G["Grafana Dashboard"]
        P["PagerDuty Escalation"]
        S["ServiceNow Ticketing"]
    end
    CC --> N1
    VM --> N2
    MK --> N3
    N1 --> DEDUP
    N2 --> DEDUP
    N3 --> DEDUP
    DEDUP --> G
    DEDUP --> P
    DEDUP --> S
    style SOURCES fill:#1a2a4a,color:#fff,stroke:#0d1a2d
    style NORM fill:#2a1a4a,color:#fff,stroke:#1a0d2d
    style DEDUP fill:#3a2a1a,color:#fff,stroke:#2d1a0d
    style OUTPUTS fill:#1a3a2a,color:#fff,stroke:#0d2018
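A sketch of a normalization function like the diagram's normalize_to_common_schema() (source labels and record field names are illustrative):

```python
def normalize(source, record):
    """Translate a controller-specific record into the common health schema."""
    if source == "catalyst_center":
        score = record["overallHealth"]
    elif source == "meraki":
        score = 10 if record["status"] == "online" else 1
    elif source == "sdwan":
        score = 10 if record["reachability"] == "reachable" else 1
    else:
        raise ValueError(f"unknown source: {source}")
    return {"source": source,
            "health_score": score,
            "status": "healthy" if score >= 8 else "degraded"}
```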

3.4 Alert Aggregation and Deduplication

Alert storms occur when a single upstream failure (a WAN circuit going down) generates dozens of downstream alerts across multiple controllers simultaneously. An effective aggregation layer must:

  1. Correlate by time window — group alerts arriving within a 60-second window affecting the same network segment
  2. De-duplicate by root cause — create one "WAN circuit failure" alert rather than 30 individual device alerts
  3. Enrich with topology context — understand parent-child relationships (WAN router → downstream switches → clients)
  4. Suppress during maintenance — suppress alerts for devices in scheduled maintenance windows
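The first two steps can be sketched as time-window grouping. This simplified version buckets alerts into fixed 60-second windows per segment and assumes epoch-second timestamps:

```python
from collections import defaultdict

def correlate(alerts, window_s=60):
    """Group alerts by (segment, time bucket); each group is one root-cause candidate."""
    groups = defaultdict(list)
    for a in alerts:
        bucket = a["ts"] // window_s
        groups[(a["segment"], bucket)].append(a)
    # De-duplicate: emit one synthetic alert per group instead of N device alerts
    return [{"segment": seg, "count": len(v), "first_ts": min(x["ts"] for x in v)}
            for (seg, _), v in groups.items()]
```

A production correlator would use sliding rather than fixed windows and fold in the topology enrichment and maintenance suppression steps.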

Key Points — Section 3: Meraki and SD-WAN Monitoring

Post-Quiz — Section 3: Meraki and SD-WAN Monitoring

1. How does authentication work with the Meraki Dashboard API?

A. OAuth 2.0 bearer token obtained from a token endpoint B. Session cookie obtained from POST /j_security_check C. An API key passed in the X-Cisco-Meraki-API-Key header D. Basic authentication with username and password on every request

2. Which Meraki API endpoint provides online/offline/alerting status for all devices across an entire organization in a single call?

A. GET /networks/{networkId}/devices B. GET /organizations/{orgId}/devices/statuses C. GET /organizations/{orgId}/inventory/devices D. GET /organizations/{orgId}/health/summary

3. What authentication mechanism does vManage (Cisco SD-WAN) use for its REST API?

A. API key in the X-Auth-Token header B. Session cookie from POST /j_security_check C. Bearer token from OAuth 2.0 flow D. Client certificate mutual TLS

Section 4: Automated Alerting and Remediation

Pre-Quiz — Section 4: Automated Alerting and Remediation

1. At which tier of the Self-Healing Maturity Model does ENAUTO automation skill — building Python services and Ansible playbooks that detect issues and execute corrective actions — primarily apply?

A. Tier 1 — Auto-Detection B. Tier 2 — Auto-Correlation C. Tier 3 — Auto-Remediation D. Tier 4 — Autonomous Operation

2. When registering a Catalyst Center webhook subscription, what HTTP status code must the receiver return to acknowledge successful receipt of an event?

A. 201 Created B. 204 No Content C. 200 OK D. 202 Accepted

3. Which Catalyst Center API enriches a raw event ID with root cause analysis, recommended actions, affected hosts, and historical occurrence count?

A. GET /dna/intent/api/v1/event/subscription B. GET /dna/intent/api/v1/issues/{issue_id} C. GET /dna/intent/api/v1/network-health D. GET /dna/intent/api/v1/event/webhook

4. What is the critical differentiator of NSO (Network Services Orchestrator) for multi-device remediation compared to a simple Python script?

A. NSO can push changes faster than REST API calls B. NSO provides atomic multi-device transactions with rollback — either all changes apply or none do C. NSO generates Ansible playbooks automatically D. NSO eliminates the need for device authentication

5. What percentage of network alerts does Cisco IT's production self-healing automation handle without human intervention?

A. 75% B. 95% C. 99% D. 99.998%

4.1 The Self-Healing Maturity Model

| Tier | Name | Description | Technology |
| --- | --- | --- | --- |
| 1 | Auto-Detection | Real-time visibility through continuous monitoring | Catalyst Center Assurance, Meraki alerts |
| 2 | Auto-Correlation | Intelligent grouping to identify root causes | Catalyst Center AI analytics |
| 3 | Auto-Remediation | Automated evaluation and execution of corrective actions | Python + Catalyst Center APIs, Ansible AWX |
| 4 | Autonomous Operation | Full closed-loop AI-driven autonomy | Emerging (LLM-based, 2025–2026) |

Cisco IT's production automation handles 99.998% of all network alerts without human intervention, processing millions of daily events.

4.2 Catalyst Center Event Notifications and Webhooks

Instead of polling health APIs every five minutes, subscribe to specific events — Catalyst Center pushes notifications via HTTPS POST the moment conditions change. Event domains include Assurance (health degradation, AI anomalies), SWIM (distribution/activation completion), and Network (reachability changes, interface transitions).

Two-step setup:

  1. Register a webhook destination via POST /dna/intent/api/v1/event/webhook
  2. Subscribe to specific event IDs via POST /dna/intent/api/v1/event/subscription

Common Assurance event IDs: NETWORK-DEVICES-3-250 (device unreachable), NETWORK-DEVICES-3-251 (high CPU), NETWORK-DEVICES-3-252 (memory threshold), NETWORK-CLIENTS-3-502 (client onboarding failure).
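The two request bodies might be shaped roughly as follows. The field names are paraphrased from the Events API and should be verified against your Catalyst Center version; all values are placeholders:

```python
# Step 1 body: POST /dna/intent/api/v1/event/webhook
webhook_payload = {
    "name": "noc-webhook",                      # illustrative
    "url": "https://noc.example.com/webhook",   # must be reachable from Catalyst Center
    "method": "POST",
    "trustCert": False,
}

# Step 2 body: POST /dna/intent/api/v1/event/subscription
subscription_payload = {
    "name": "device-unreachable-sub",
    "filter": {"eventIds": ["NETWORK-DEVICES-3-250"]},   # from the list above
    "subscriptionEndpoints": [
        {"instanceId": "<webhook-id-from-step-1>",       # returned by the webhook POST
         "subscriptionDetails": {"connectorType": "REST"}}
    ],
}
```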

4.3 Issue Enrichment API

Before executing remediation, enrich the raw event. The Issue Enrichment API returns root cause analysis, recommended actions, affected hosts, and historical occurrence count. Pass the issue ID in both the URL path and the entity_value header:

headers = {
    "X-Auth-Token": token,
    "entity_type": "issue_id",   # Required header
    "entity_value": issue_id
}

4.4 Flask Webhook Receiver and REMEDIATION_MAP Pattern

The central orchestration pattern is a REMEDIATION_MAP dict mapping event IDs to handler functions. Each handler receives enriched context and decides whether to auto-fix, escalate, or log:

REMEDIATION_MAP = {
    "NETWORK-DEVICES-3-250": handle_device_unreachable,
    "NETWORK-DEVICES-3-251": handle_high_cpu,
    "NETWORK-CLIENTS-3-502": handle_client_onboarding_failure,
}

The webhook endpoint must always return HTTP 200 — Catalyst Center expects acknowledgement regardless of internal processing outcome.
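A minimal receiver following this pattern (Flask chosen for illustration; the handler is a stub, and real code would enqueue work rather than block the request):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def handle_device_unreachable(event):
    print(f"would remediate device: {event.get('deviceId')}")  # stub handler

REMEDIATION_MAP = {
    "NETWORK-DEVICES-3-250": handle_device_unreachable,
}

@app.route("/webhook", methods=["POST"])
def webhook():
    event = request.get_json(silent=True) or {}
    handler = REMEDIATION_MAP.get(event.get("eventId"))
    if handler:
        handler(event)
    # Always 200: Catalyst Center expects acknowledgement regardless of outcome
    return jsonify(status="received"), 200
```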

flowchart TD
    subgraph DETECT["Detection Layer"]
        CA["Catalyst Center Assurance\nHealth scores + AI anomaly detection\nIssue correlation every 5 min"]
        CA --> EN["Event Notification System\nSubscribe per event ID\nDomains: Assurance, SWIM, Network"]
    end
    subgraph ORCHESTRATE["Orchestration Layer"]
        WR["Flask/FastAPI\nWebhook Receiver\nHTTPS POST /webhook"]
        IE["Issue Enrichment API\n/dna/intent/api/v1/issues/{id}\nRoot cause + occurrence count"]
        CE["Context Evaluation\nOccurrence threshold\nSeverity classification"]
        RD["REMEDIATION_MAP\nDispatch to handler\nby event ID"]
        WR --> IE --> CE --> RD
    end
    subgraph ACTIONS["Action Layer"]
        AF["Auto-Fix\nAnsible AWX runbook\nor NSO atomic transaction"]
        ES["Escalate\nPagerDuty / Webex / Slack"]
        TK["Ticket + Audit Log\nServiceNow / Splunk"]
    end
    subgraph FEEDBACK["Feedback Layer"]
        FB["Remediation outcomes\nRefine thresholds\nUpdate alert rules via GitOps"]
    end
    EN -- "HTTPS POST\n(eventId, deviceId, issueId)" --> WR
    RD --> AF
    RD --> ES
    RD --> TK
    AF --> FB
    ES --> FB
    TK --> FB
    FB --> CA
    style DETECT fill:#1a2a4a,color:#fff,stroke:#0d1a2d
    style ORCHESTRATE fill:#2a1a4a,color:#fff,stroke:#1a0d2d
    style ACTIONS fill:#1a3a2a,color:#fff,stroke:#0d2018
    style FEEDBACK fill:#3a2a1a,color:#fff,stroke:#2d1a0d

4.5 NSO for Multi-Device Remediation

NSO's MAAPI Python API provides atomic multi-device transactions with rollback. Changes to two devices either both commit together or neither does — preventing partial failure states that leave the network worse than before:

import ncs

# iface: interface id string, e.g. "0/0"; device names below are hypothetical
with ncs.maapi.single_write_trans("admin", "python") as t:
    root = ncs.maagic.get_root(t)
    primary = root.devices.device["edge-primary"]
    backup = root.devices.device["edge-backup"]
    try:
        primary.config.ios__interface.GigabitEthernet[iface].shutdown = True
        backup.config.ios__interface.GigabitEthernet["0/1"].shutdown = False
        t.apply()   # Atomic: both commit or neither does
    except Exception:
        t.revert()  # Roll back both devices
        raise

4.6 Notification Integrations

Webex: POST to https://webexapis.com/v1/messages with Authorization: Bearer {token} and {"roomId": ..., "text": ...}.

Slack: POST to an incoming webhook URL with an attachments payload. Color-code by severity: green (info), orange (warning), red (critical).

PagerDuty: POST to https://events.pagerduty.com/v2/enqueue with routing key and severity.

Key Points — Section 4: Alerting and Remediation

Post-Quiz — Section 4: Automated Alerting and Remediation

1. At which tier of the Self-Healing Maturity Model does ENAUTO automation skill — building Python services and Ansible playbooks that detect issues and execute corrective actions — primarily apply?

A. Tier 1 — Auto-Detection B. Tier 2 — Auto-Correlation C. Tier 3 — Auto-Remediation D. Tier 4 — Autonomous Operation

2. When registering a Catalyst Center webhook subscription, what HTTP status code must the receiver return to acknowledge successful receipt of an event?

A. 201 Created B. 204 No Content C. 200 OK D. 202 Accepted

3. Which Catalyst Center API enriches a raw event ID with root cause analysis, recommended actions, affected hosts, and historical occurrence count?

A. GET /dna/intent/api/v1/event/subscription B. GET /dna/intent/api/v1/issues/{issue_id} C. GET /dna/intent/api/v1/network-health D. GET /dna/intent/api/v1/event/webhook

4. What is the critical differentiator of NSO (Network Services Orchestrator) for multi-device remediation compared to a simple Python script?

A. NSO can push changes faster than REST API calls B. NSO provides atomic multi-device transactions with rollback — either all changes apply or none do C. NSO generates Ansible playbooks automatically D. NSO eliminates the need for device authentication

5. What percentage of network alerts does Cisco IT's production self-healing automation handle without human intervention?

A. 75% B. 95% C. 99% D. 99.998%


Answer Explanations