Chapter 5: Data, Control, and Management Plane Technologies
Learning Objectives
Differentiate data plane, control plane, and management plane functions and their design implications
Design control plane protection mechanisms to ensure network stability
Evaluate management plane architectures for scalability and security
Pre-Study Assessment
Answer these questions before studying to gauge your current understanding. You will see the same questions again after studying.
Pre-Quiz
1. An architect needs to select a data plane forwarding technology for a cost-sensitive data center leaf-spine fabric. Which technology best balances cost and performance?
Custom silicon ASICs designed by the switch vendor
Software forwarding on general-purpose CPUs
Merchant silicon ASICs from a third-party vendor like Broadcom
Discrete FPGAs on every line card
2. What is the primary purpose of Control Plane Policing (CoPP)?
To encrypt routing protocol traffic between peers
To classify and rate-limit traffic destined to the control plane CPU, protecting it from overload
To accelerate transit data plane traffic through QoS prioritization
To provide redundant paths for management traffic
3. Why do BFD and Graceful Restart fundamentally conflict with each other?
They both require excessive CPU resources and cannot run simultaneously
BFD detects forwarding failures to trigger rerouting, while GR masks control plane failures to continue forwarding -- opposite goals
BFD operates at Layer 2 while Graceful Restart operates at Layer 3
Graceful Restart requires SNMPv3, which is incompatible with BFD timers
4. Which management protocol provides native streaming telemetry using Protocol Buffers over HTTP/2?
SNMP with INFORM notifications
NETCONF with on-change subscriptions
RESTCONF with SSE (Server-Sent Events)
gNMI with Subscribe RPCs
5. A network engineer loses SSH access to a core router during a major outage. The organization uses in-band management. What design change would most directly address this problem?
Upgrading from SNMPv2c to SNMPv3
Deploying NETCONF instead of SSH for configuration
Implementing out-of-band management with dedicated management interfaces and a separate switch infrastructure
Increasing the CoPP rate limit for SSH traffic
6. During an OSPF Graceful Restart, what causes helper nodes to immediately terminate the GR procedure?
The restarting device sends an updated Grace LSA with a shorter wait period
Any relevant topology change occurs in the OSPF domain during the restart
The helper node's CPU utilization exceeds 80%
The restarting device's data plane drops below 50% forwarding capacity
7. What key capability does NETCONF provide that RESTCONF does not?
Support for JSON encoding
Use of YANG data models
Transaction support with candidate datastores and rollback on failure
Ability to retrieve operational state data
8. In a CoPP policy, which traffic class should receive the highest rate limit and priority?
ICMP and ARP traffic
Routing protocol traffic (BGP, OSPF, BFD)
Management traffic (SNMP, SSH)
The class-default catch-all
9. What is the relationship between SSO and NSF on a dual-supervisor platform?
NSF is a prerequisite for SSO to function
SSO provides state synchronization between supervisors, which is the foundation that enables NSF to continue forwarding during a control plane restart
They are independent mechanisms that serve unrelated purposes
10. A network architect is designing a programmable data plane that must support custom packet headers and in-network telemetry at hardware speed. Which technology is most appropriate?
DPDK on commodity x86 servers
P4 on programmable ASICs
Standard merchant silicon with fixed pipelines
Software forwarding with kernel bypass
11. What distinguishes Non-Stop Routing (NSR) from Graceful Restart (GR)?
NSR requires neighbor helper support while GR does not
NSR transparently fails over routing state without neighbor awareness, while GR requires cooperative helper nodes
GR is faster than NSR because it uses BFD for detection
NSR works only with BGP while GR works with all routing protocols
12. Why should TACACS+ be preferred over RADIUS for network device management plane authentication?
TACACS+ uses UDP which is faster than RADIUS over TCP
TACACS+ supports per-command authorization granularity, enabling fine-grained access control
TACACS+ encrypts only the password while RADIUS encrypts the entire packet
TACACS+ is open-source while RADIUS is proprietary
13. When a forwarding table on a data center leaf switch exceeds the TCAM capacity of its ASIC, what happens to traffic for entries that do not fit?
Traffic is silently dropped at wire speed
The ASIC automatically compresses entries using route summarization
Entries overflow to slower software lookup paths, degrading performance
The device redistributes excess routes to neighboring switches
14. Which three phases make up network convergence after a link failure?
Authentication, authorization, and accounting
Encapsulation, forwarding, and decapsulation
Detection, propagation, and computation
Classification, queuing, and scheduling
15. In a spine-leaf data center fabric, what is the recommended approach for control plane resilience on spine switches?
Enable NSF and Graceful Restart with aggressive BFD timers for maximum protection
Use a simple non-redundant control plane with BFD, relying on path diversity for redundancy
Deploy SSO with NSR and disable BFD entirely
Use in-band management with SNMP polling for failure detection
Section 1: Data Plane Design
The data plane -- also called the forwarding plane -- is where the actual work of moving packets happens. Receiving a packet on an ingress interface, looking it up against forwarding tables, and transmitting it on an egress interface are all data plane operations. Its performance directly determines the throughput, latency, and scalability of the entire network.
The Three-Plane Model
Every network device organizes its internal functions into three planes: the data plane (forwarding), the control plane (routing decisions), and the management plane (configuration and monitoring). Think of a commercial airport: the data plane is the runway system moving aircraft; the control plane is air traffic control making routing decisions; the management plane is the administration office handling scheduling and compliance.
Software Data Planes process packets using general-purpose CPUs. They offer maximum flexibility -- any forwarding behavior can be implemented or modified through software updates -- but are orders of magnitude slower than hardware alternatives. Suitable for low-throughput applications, virtual network functions, or scenarios where programmability outweighs raw performance.
Hardware Data Planes use specialized silicon -- typically ASICs or FPGAs -- to forward packets at wire speed. ASICs are purpose-built chips that are 100 to 1,000 times faster than software solutions for packet forwarding.
Animation: Packet traversal through a hardware ASIC pipeline vs. software CPU forwarding, showing the latency and throughput difference at each stage
FPGAs as a Middle Ground: Some vendors deploy FPGAs where merchant silicon cannot deliver the required performance. Embedding FPGA technology into ASICs can reduce cost by roughly 90% and power consumption by roughly 85% compared to discrete FPGAs.
P4 (Programming Protocol-Independent Packet Processors) is a domain-specific language that lets architects define custom headers, match-action tables, and forwarding logic at compile time on programmable ASICs. This enables use cases like in-network telemetry and custom encapsulations at hardware speeds.
DPDK (Data Plane Development Kit) optimizes software-based packet processing on commodity x86 hardware by bypassing the kernel networking stack. Widely used in NFV environments where virtual routers, firewalls, and load balancers run on standard servers.
Performance and Scalability Considerations
Forwarding table capacity: ASICs have finite TCAM. Exceeding table capacity forces entries into slower software lookup paths.
Pipeline depth vs. latency: More features mean more pipeline stages and more forwarding latency.
Buffer capacity: Shallow-buffered merchant silicon works for lossless DC fabrics but struggles with bursty WAN/campus traffic. Deep-buffered custom silicon addresses this at higher cost.
Key Points -- Data Plane Design
Data plane design is a multi-dimensional trade-off: performance, programmability, cost, power, and flexibility
Merchant silicon suits cost-effective data center fabrics; custom silicon suits differentiated service provider edge functions; software data planes suit agile NFV deployments
P4 enables hardware-speed custom forwarding logic; DPDK enables near-line-rate software forwarding by bypassing the kernel
TCAM overflow forces entries to slow software paths -- capacity planning is critical for data plane design
FPGAs provide a middle ground between full programmability and wire-speed ASIC performance
Section 2: Control Plane Architecture
The control plane is the brain of the network. It runs protocols -- BGP, OSPF, IS-IS, STP, BFD, LACP -- that discover topology, compute paths, and program the data plane's forwarding tables. While the data plane handles millions of packets per second, the control plane processes hundreds or thousands of protocol messages that shape how every subsequent packet is forwarded.
Routing Protocol Convergence
Network convergence -- the time for all routers to agree on a consistent topology view after a change -- involves three phases:
Detection: Recognizing failure (interface down events, hello timer expiry, or BFD)
Propagation: Distributing failure information (LSAs in OSPF, UPDATEs in BGP)
Computation: Recalculating paths and reprogramming the data plane (SPF in OSPF, best-path in BGP)
A design using BFD for sub-second detection (e.g., 50 ms intervals), prefix-independent convergence (PIC), and tuned SPF timers can achieve sub-second failover. With default OSPF hello/dead timers (10s/40s), convergence may take 40+ seconds.
graph TD
F["Link or Node Failure"] --> DET["1. Detection\n(BFD: ~50ms | OSPF Dead Timer: ~40s)"]
DET --> PROP["2. Propagation\n(LSA Flooding / BGP UPDATE)"]
PROP --> COMP["3. Computation\n(SPF Recalculation / Best-Path Selection)"]
COMP --> PROG["4. Data Plane Reprogramming\n(FIB / LFIB Update)"]
PROG --> CONV["Convergence Complete\n(Traffic on New Path)"]
style F fill:#c0392b,color:#fff
style CONV fill:#27ae60,color:#fff
Design principle: Convergence speed must be balanced against control plane stability. Aggressive timers detect failures faster but increase the risk of false positives and protocol flapping.
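The detection phase dominates the timer math above, and it is easy to quantify. The sketch below compares BFD and hello-based detection; the timer values are common examples, not universal, and note that a BFD session with a 50 ms interval and the typical detect multiplier of 3 has a worst-case detection time of about 150 ms, not 50 ms.

```python
# Hedged sketch: failure-detection contribution to convergence time.
# Timer values are common defaults/examples, not universal requirements.

def bfd_detection_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """BFD declares a failure after `detect_multiplier` consecutive
    missed packets, so worst-case detection = interval * multiplier."""
    return tx_interval_ms * detect_multiplier

def ospf_detection_ms(dead_interval_s: int) -> int:
    """Hello-based detection can take up to the full dead interval."""
    return dead_interval_s * 1000

# Aggressive BFD profile: 50 ms interval, multiplier 3 -> 150 ms worst case
print(bfd_detection_ms(50, 3))   # 150
# Default OSPF timers: 40 s dead interval -> up to 40,000 ms
print(ospf_detection_ms(40))     # 40000
```

The two-orders-of-magnitude gap in detection alone explains why sub-second failover is unreachable with default hello timers, regardless of how fast propagation and SPF computation are.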
Control Plane Policing (CoPP)
The control plane CPU is a shared, finite resource. If overwhelmed by an attacker or traffic burst, routing adjacencies drop, the management plane becomes unreachable, and the network collapses. CoPP treats the control plane as a logical interface with QoS-based filters to classify, rate-limit, and prioritize control plane traffic.
Two primary attack vectors:
Overwhelming attacks: DoS attempts flooding the CPU with control packets (e.g., spoofed TCP SYN packets aimed at BGP's port 179)
Data corruption attacks: Malicious packets injecting false routing information
CoPP implementation follows three steps using MQC:
Traffic Classification: Define important traffic classes using class maps and ACLs
Policy Definition: Assign rate limits and actions per class
Application: Apply the policy to the control-plane interface
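The three MQC steps above can be sketched in Cisco IOS-style syntax. This is an illustrative fragment, not a validated production policy: the ACL numbers and class names are assumptions, the rates mirror the example diagram in this section, the referenced ACL definitions are omitted, and exact syntax varies by platform and software release.

```
! Step 1: Classification -- class-maps matching ACLs (ACL bodies omitted)
class-map match-all COPP-ROUTING
 match access-group 120        ! BGP/OSPF/BFD (assumed ACL number)
class-map match-all COPP-MGMT
 match access-group 121        ! SSH/SNMP/NETCONF (assumed ACL number)

! Step 2: Policy definition -- per-class rate limits (bps)
policy-map COPP-POLICY
 class COPP-ROUTING
  police 500000 conform-action transmit exceed-action drop
 class COPP-MGMT
  police 100000 conform-action transmit exceed-action drop
 class class-default
  police 50000 conform-action transmit exceed-action drop

! Step 3: Application to the logical control-plane interface
control-plane
 service-policy input COPP-POLICY
```

Note the ordering principle: routing protocols get the most generous rate, management traffic less, and the catch-all default the least, so an attack in one class cannot starve a higher-priority class.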
graph TD
INB["Inbound Traffic\nto Control Plane CPU"] --> CLASS["CoPP Classification\n(class-map + ACL)"]
CLASS --> P1["Priority 1: Routing Protocols\n(BGP, OSPF, BFD)\nPolice: 500 Kbps"]
CLASS --> P2["Priority 2: Management\n(SNMP, SSH, NETCONF)\nPolice: 100 Kbps"]
CLASS --> P3["Priority 3: General\n(ICMP, ARP)\nPolice: 64 Kbps"]
CLASS --> P4["class-default\n(All Other Traffic)\nPolice: 50 Kbps"]
P1 --> CPU["Control Plane CPU\n(Protected)"]
P2 --> CPU
P3 --> CPU
P4 --> CPU
style P1 fill:#27ae60,color:#fff
style P2 fill:#2980b9,color:#fff
style P3 fill:#e67e22,color:#fff
style P4 fill:#c0392b,color:#fff
Animation: Simulated CoPP in action -- traffic streams arriving at the control plane CPU, being classified and rate-limited, with excess traffic being dropped while critical routing protocol traffic passes through
Layer 2 Control Plane Protection is equally important:
BPDU Guard: Shuts down access ports receiving unexpected BPDUs
BPDU Filter: Suppresses BPDU transmission on specific ports
DTP Disablement: Prevents trunk negotiation attacks via switchport mode access and switchport nonegotiate
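The Layer 2 protections above translate to a short access-port template. This is an illustrative Cisco IOS-style fragment with a hypothetical interface name; command availability and defaults vary by platform.

```
! Illustrative access-port hardening (interface name is hypothetical)
interface GigabitEthernet1/0/10
 switchport mode access           ! static access port
 switchport nonegotiate           ! disable DTP trunk negotiation
 spanning-tree portfast           ! host-facing port, no STP delay
 spanning-tree bpduguard enable   ! err-disable the port if a BPDU arrives
```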
High Availability: SSO, NSF, NSR, and Graceful Restart
Dual-supervisor platforms face a key design question: when one supervisor fails, should the network react as if the device failed, or mask the failure and continue forwarding?
SSO (Stateful Switchover): Real-time state sync between supervisors -- the foundation for NSF and NSR
NSF (Non-Stop Forwarding): Data plane continues forwarding using existing tables while the control plane restarts
NSR (Non-Stop Routing): Transparently fails over routing state to a redundant processor without neighbor awareness
Graceful Restart (GR): Protocol-level mechanism where neighbors (helper nodes) maintain routes during a router's control plane restart. In OSPF, helpers immediately terminate the GR procedure if any relevant topology change occurs during the restart; BGP GR has no equivalent abort condition.
graph TD
SSO["SSO\n(Stateful Switchover)\nSyncs state between supervisors"] --> NSF["NSF\n(Non-Stop Forwarding)\nData plane continues during\ncontrol plane restart"]
SSO --> NSR["NSR\n(Non-Stop Routing)\nTransparent routing failover\nNo neighbor awareness needed"]
NSF --> GR["Graceful Restart\n(Protocol-level)\nNeighbors act as helpers"]
GR --> RESTART["Restarting Device\n(NSF-capable router)"]
GR --> HELPER["Helper Node\n(Adjacent router maintains routes)"]
BFD["BFD\n(Sub-second failure detection)"] -.->|"CONFLICTS WITH"| GR
style SSO fill:#2980b9,color:#fff
style BFD fill:#c0392b,color:#fff
style GR fill:#8e44ad,color:#fff
The BFD and Graceful Restart Tension
A critical design conflict: BFD detects forwarding failures rapidly (sub-second), assuming data and control planes share fate. GR/NSF/NSR/SSO mask control plane failures to preserve forwarding, assuming the planes are independent. These are fundamentally opposite goals.
| Device Role | Recommended Approach | Rationale |
| --- | --- | --- |
| Leaf switches | NSF + GR enabled | Control plane resilience during upgrades; hitless software updates |
| Spine switches | Simple control plane + BFD | Rapid failover via BFD; redundancy through path diversity |
| Alternative | Redundant paths + simple routers + BFD | Avoid NSF/NSR/SSO complexity; rely on topology redundancy |
Control Plane Scaling Challenges
Routing table size: Full Internet BGP tables exceed 1 million prefixes
Adjacency count: Every OSPF neighbor or BGP peer consumes CPU for keepalives
Convergence storms: A single link failure in a large OSPF area can trigger SPF computation on every router simultaneously
Resource contention: Shared CPU between control and management planes means heavy routing computation can lock out management access
Mitigations: route summarization, hierarchical OSPF areas, BGP route reflectors, prefix filtering, and dedicated control plane hardware.
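The route reflector mitigation is worth quantifying, since the iBGP full-mesh requirement scales quadratically. The sketch below compares session counts; the two-reflector topology is an assumed simple design and ignores hierarchical RR arrangements.

```python
# Sketch: how BGP route reflectors tame iBGP session growth.

def full_mesh_sessions(n: int) -> int:
    """An iBGP full mesh needs a session between every pair of routers."""
    return n * (n - 1) // 2

def route_reflector_sessions(n_clients: int, n_rrs: int = 2) -> int:
    """Each client peers only with the reflectors; the reflectors mesh
    among themselves. (Simplified: no hierarchical RR clusters.)"""
    return n_clients * n_rrs + full_mesh_sessions(n_rrs)

# 50 routers fully meshed vs. 48 clients behind 2 route reflectors
print(full_mesh_sessions(50))            # 1225 sessions
print(route_reflector_sessions(48, 2))   # 97 sessions
```

Cutting 1,225 sessions to 97 directly reduces keepalive load and per-peer CPU, which is exactly the adjacency-count pressure described above.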
Key Points -- Control Plane Architecture
CoPP is mandatory for production networks: classify and rate-limit control plane traffic with routing protocols at highest priority
Network convergence has three phases (detection, propagation, computation) -- each adds delay and must be tuned for the environment
BFD and Graceful Restart have fundamentally opposing goals: rapid failure detection vs. masking failures to preserve forwarding
SSO is the foundation for NSF and NSR; NSR does not require neighbor helper support unlike GR
OSPF GR terminates on topology changes; BGP GR does not -- a critical protocol-specific difference
Section 3: Management Plane Design
The management plane provides the operational interface to the network -- how engineers configure, monitor, collect telemetry, and respond to incidents. While it carries no revenue traffic, a well-designed management plane is the difference between a network that can be operated efficiently at scale and one that becomes an operational burden.
In-Band vs. Out-of-Band Management
In-Band Management routes management traffic across the same interfaces and links that carry production data. Simpler and less expensive, but management access is lost when the production network fails -- precisely when it is needed most.
Out-of-Band (OOB) Management provides a completely separate management path using dedicated interfaces, switches, and routers. The primary objective: ensuring authorized personnel can manage infrastructure even when the production network is disrupted.
flowchart LR
subgraph PROD["Production Network"]
R1["Router A"] <--> R2["Router B"]
R2 <--> R3["Router C"]
end
subgraph OOB["Out-of-Band Management Network"]
MS["Management\nStation"] --> OOBS["OOB Switch"]
OOBS --> R1M["Router A\nmgmt0"]
OOBS --> R2M["Router B\nmgmt0"]
OOBS --> R3M["Router C\nmgmt0"]
end
ENG["Network\nEngineer"] --> MS
ENG -.->|"In-Band Path\n(lost during outage)"| R1
style OOB fill:#d5f5e3,stroke:#27ae60
style PROD fill:#fadbd8,stroke:#c0392b
| Aspect | In-Band | Out-of-Band |
| --- | --- | --- |
| Cost | Lower -- uses existing infrastructure | Higher -- dedicated hardware and links |
| Availability during outages | Lost when production fails | Independent of production state |
| Security | Shares attack surface with production | Isolated attack surface |
| Best for | Small/non-critical networks | Data centers, SPs, critical infrastructure |
OOB Design Best Practices: physical isolation via dedicated mgmt interfaces, deliberate simplicity, ACLs and RBAC for access control, strong authentication via TACACS+/RADIUS, and explicit verification that no unauthorized cross-access exists.
Animation: Split-screen showing a production network outage -- in-band management session disconnects while the OOB path remains active, allowing the engineer to diagnose and resolve the issue
Management Protocols: SNMP, NETCONF, RESTCONF, and gNMI
SNMP has been the monitoring workhorse since 1988. It follows an agent-manager model using MIB hierarchies and OIDs, was designed for monitoring rather than configuration, uses ASN.1 BER encoding over UDP, and lacks transaction support. Only SNMPv3, which adds authentication and encryption, should be deployed in production.
NETCONF (RFC 6241) -- the most mature modern protocol. Uses XML over SSH/TLS. Provides transaction support, multiple datastores (running/candidate/startup), validation before application, and rollback on failure. Four-layer architecture: transport, messages, operations, content.
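A minimal NETCONF edit-config RPC targeting the candidate datastore might look like the following. This is an illustrative sketch: the interface name and description are hypothetical, the data model shown is the standard ietf-interfaces YANG module, and a real session would follow this message with validate and commit operations.

```xml
<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <!-- Write to the candidate datastore, not the running config -->
    <target><candidate/></target>
    <config>
      <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
        <interface>
          <!-- hypothetical interface for illustration -->
          <name>GigabitEthernet0/0</name>
          <description>Uplink to spine-1</description>
        </interface>
      </interfaces>
    </config>
  </edit-config>
</rpc>
```

If a subsequent &lt;validate&gt; fails, the candidate can be thrown away with &lt;discard-changes&gt;, leaving the running configuration untouched -- the transactional safety net that distinguishes NETCONF.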
RESTCONF (RFC 8040) -- brings YANG data to the web via HTTP/HTTPS. Uses standard HTTP methods and supports XML/JSON. Stateless and web-friendly, but lacks NETCONF's transactions, locking, and candidate datastore. Unsuitable for complex multi-device workflows requiring atomicity.
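For comparison, the same ietf-interfaces data is reachable through a plain HTTP request (the hostname below is hypothetical; the yang-data media type is defined by RFC 8040):

```http
GET /restconf/data/ietf-interfaces:interfaces HTTP/1.1
Host: router.example.net
Accept: application/yang-data+json
```

Each RESTCONF request is a standalone operation applied immediately to the running datastore -- there is no candidate to validate, lock, or roll back.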
gNMI -- newest entrant from the OpenConfig Working Group. Uses gRPC with Protocol Buffers over HTTP/2 (3x-10x smaller messages than NETCONF XML). Four RPCs: Capabilities, Get, Set, Subscribe. Subscribe supports STREAM, POLL, and ONCE modes for native streaming telemetry. Related protocols: gNOI (operational commands) and gRIBI (programmatic RIB injection).
| Aspect | SNMP | NETCONF | RESTCONF | gNMI |
| --- | --- | --- | --- | --- |
| Transport | UDP/TCP (TLS) | SSH/TLS | HTTP/HTTPS | gRPC over HTTP/2 |
| Encoding | ASN.1 BER | XML | JSON or XML | Protocol Buffers |
| Data Model | SMI (MIB) | YANG | YANG | YANG |
| Transactions | No | Yes | No | Yes |
| Streaming Telemetry | No | Limited | No | Yes (native) |
| Primary Strength | Monitoring | Config management | Developer access | Telemetry + automation |
YANG is the common thread connecting NETCONF, RESTCONF, and gNMI -- a structured, protocol-independent data modeling language defined in RFC 7950.
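A toy YANG module illustrates the modeling style that NETCONF, RESTCONF, and gNMI all consume. This module is invented purely for illustration -- it is not a published standard model:

```yang
// Toy module for illustration only (not a standard/published model)
module example-interfaces {
  yang-version 1.1;
  namespace "urn:example:interfaces";
  prefix exif;

  container interfaces {
    list interface {
      key "name";                 // each interface is identified by name
      leaf name    { type string; }
      leaf enabled { type boolean; default "true"; }
    }
  }
}
```

The same model drives all three protocols: NETCONF encodes it as XML, RESTCONF as JSON or XML over HTTP paths, and gNMI as Protocol Buffer paths and values.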
Management Plane Security and Access Control
Authentication: Centralized AAA using TACACS+ (preferred for per-command authorization) or RADIUS. Local accounts as fallback only.
Authorization: RBAC ensuring operators get appropriate access levels
Accounting: Complete audit trails of access and commands
Access restriction: ACLs on VTY lines and management interfaces
Key Points -- Management Plane Design
Out-of-band management should be the default for any environment where operational continuity is critical
NETCONF provides transactions and rollback; RESTCONF does not -- making NETCONF the choice for complex multi-device configuration
gNMI's native streaming telemetry eliminates SNMP polling overhead; devices push only changed data as it occurs
These protocols are complementary, not competitive: gNMI for telemetry, NETCONF for configuration, RESTCONF for web integration
TACACS+ is preferred over RADIUS for network device management due to per-command authorization granularity
Post-Study Assessment
Now that you have studied the material, answer the same questions again. Compare your pre and post scores to measure your learning.
Post-Quiz
1. An architect needs to select a data plane forwarding technology for a cost-sensitive data center leaf-spine fabric. Which technology best balances cost and performance?
Custom silicon ASICs designed by the switch vendor
Software forwarding on general-purpose CPUs
Merchant silicon ASICs from a third-party vendor like Broadcom
Discrete FPGAs on every line card
2. What is the primary purpose of Control Plane Policing (CoPP)?
To encrypt routing protocol traffic between peers
To classify and rate-limit traffic destined to the control plane CPU, protecting it from overload
To accelerate transit data plane traffic through QoS prioritization
To provide redundant paths for management traffic
3. Why do BFD and Graceful Restart fundamentally conflict with each other?
They both require excessive CPU resources and cannot run simultaneously
BFD detects forwarding failures to trigger rerouting, while GR masks control plane failures to continue forwarding -- opposite goals
BFD operates at Layer 2 while Graceful Restart operates at Layer 3
Graceful Restart requires SNMPv3, which is incompatible with BFD timers
4. Which management protocol provides native streaming telemetry using Protocol Buffers over HTTP/2?
SNMP with INFORM notifications
NETCONF with on-change subscriptions
RESTCONF with SSE (Server-Sent Events)
gNMI with Subscribe RPCs
5. A network engineer loses SSH access to a core router during a major outage. The organization uses in-band management. What design change would most directly address this problem?
Upgrading from SNMPv2c to SNMPv3
Deploying NETCONF instead of SSH for configuration
Implementing out-of-band management with dedicated management interfaces and a separate switch infrastructure
Increasing the CoPP rate limit for SSH traffic
6. During an OSPF Graceful Restart, what causes helper nodes to immediately terminate the GR procedure?
The restarting device sends an updated Grace LSA with a shorter wait period
Any relevant topology change occurs in the OSPF domain during the restart
The helper node's CPU utilization exceeds 80%
The restarting device's data plane drops below 50% forwarding capacity
7. What key capability does NETCONF provide that RESTCONF does not?
Support for JSON encoding
Use of YANG data models
Transaction support with candidate datastores and rollback on failure
Ability to retrieve operational state data
8. In a CoPP policy, which traffic class should receive the highest rate limit and priority?
ICMP and ARP traffic
Routing protocol traffic (BGP, OSPF, BFD)
Management traffic (SNMP, SSH)
The class-default catch-all
9. What is the relationship between SSO and NSF on a dual-supervisor platform?
NSF is a prerequisite for SSO to function
SSO provides state synchronization between supervisors, which is the foundation that enables NSF to continue forwarding during a control plane restart
They are independent mechanisms that serve unrelated purposes
10. A network architect is designing a programmable data plane that must support custom packet headers and in-network telemetry at hardware speed. Which technology is most appropriate?
DPDK on commodity x86 servers
P4 on programmable ASICs
Standard merchant silicon with fixed pipelines
Software forwarding with kernel bypass
11. What distinguishes Non-Stop Routing (NSR) from Graceful Restart (GR)?
NSR requires neighbor helper support while GR does not
NSR transparently fails over routing state without neighbor awareness, while GR requires cooperative helper nodes
GR is faster than NSR because it uses BFD for detection
NSR works only with BGP while GR works with all routing protocols
12. Why should TACACS+ be preferred over RADIUS for network device management plane authentication?
TACACS+ uses UDP which is faster than RADIUS over TCP
TACACS+ supports per-command authorization granularity, enabling fine-grained access control
TACACS+ encrypts only the password while RADIUS encrypts the entire packet
TACACS+ is open-source while RADIUS is proprietary
13. When a forwarding table on a data center leaf switch exceeds the TCAM capacity of its ASIC, what happens to traffic for entries that do not fit?
Traffic is silently dropped at wire speed
The ASIC automatically compresses entries using route summarization
Entries overflow to slower software lookup paths, degrading performance
The device redistributes excess routes to neighboring switches
14. Which three phases make up network convergence after a link failure?
Authentication, authorization, and accounting
Encapsulation, forwarding, and decapsulation
Detection, propagation, and computation
Classification, queuing, and scheduling
15. In a spine-leaf data center fabric, what is the recommended approach for control plane resilience on spine switches?
Enable NSF and Graceful Restart with aggressive BFD timers for maximum protection
Use a simple non-redundant control plane with BFD, relying on path diversity for redundancy
Deploy SSO with NSR and disable BFD entirely
Use in-band management with SNMP polling for failure detection