Chapter 5: Data, Control, and Management Plane Technologies

Learning Objectives

Pre-Study Assessment

Answer these questions before studying to gauge your current understanding. You will see the same questions again after studying.

Pre-Quiz

1. An architect needs to select a data plane forwarding technology for a cost-sensitive data center leaf-spine fabric. Which technology best balances cost and performance?

Custom silicon ASICs designed by the switch vendor
Software forwarding on general-purpose CPUs
Merchant silicon ASICs from a third-party vendor like Broadcom
Discrete FPGAs on every line card

2. What is the primary purpose of Control Plane Policing (CoPP)?

To encrypt routing protocol traffic between peers
To classify and rate-limit traffic destined to the control plane CPU, protecting it from overload
To accelerate transit data plane traffic through QoS prioritization
To provide redundant paths for management traffic

3. Why do BFD and Graceful Restart fundamentally conflict with each other?

They both require excessive CPU resources and cannot run simultaneously
BFD detects forwarding failures to trigger rerouting, while GR masks control plane failures to continue forwarding -- opposite goals
BFD operates at Layer 2 while Graceful Restart operates at Layer 3
Graceful Restart requires SNMPv3, which is incompatible with BFD timers

4. Which management protocol provides native streaming telemetry using Protocol Buffers over HTTP/2?

SNMP with INFORM notifications
NETCONF with on-change subscriptions
RESTCONF with SSE (Server-Sent Events)
gNMI with Subscribe RPCs

5. A network engineer loses SSH access to a core router during a major outage. The organization uses in-band management. What design change would most directly address this problem?

Upgrading from SNMPv2c to SNMPv3
Deploying NETCONF instead of SSH for configuration
Implementing out-of-band management with dedicated management interfaces and a separate switch infrastructure
Increasing the CoPP rate limit for SSH traffic

6. During an OSPF Graceful Restart, what causes helper nodes to immediately terminate the GR procedure?

The restarting device sends an updated Grace LSA with a shorter wait period
Any relevant topology change occurs in the OSPF domain during the restart
The helper node's CPU utilization exceeds 80%
The restarting device's data plane drops below 50% forwarding capacity

7. What key capability does NETCONF provide that RESTCONF does not?

Support for JSON encoding
Use of YANG data models
Transaction support with candidate datastores and rollback on failure
Ability to retrieve operational state data

8. In a CoPP policy, which traffic class should receive the highest rate limit and priority?

ICMP and ARP traffic
Routing protocol traffic (BGP, OSPF, BFD)
Management traffic (SNMP, SSH)
The class-default catch-all

9. What is the relationship between SSO and NSF on a dual-supervisor platform?

NSF is a prerequisite for SSO to function
SSO provides state synchronization between supervisors, which is the foundation that enables NSF to continue forwarding during a control plane restart
They are independent mechanisms that serve unrelated purposes
SSO handles Layer 2 failover while NSF handles Layer 3 failover exclusively

10. A network architect is designing a programmable data plane that must support custom packet headers and in-network telemetry at hardware speed. Which technology is most appropriate?

DPDK on commodity x86 servers
P4 on programmable ASICs
Standard merchant silicon with fixed pipelines
Software forwarding with kernel bypass

11. What distinguishes Non-Stop Routing (NSR) from Graceful Restart (GR)?

NSR requires neighbor helper support while GR does not
NSR transparently fails over routing state without neighbor awareness, while GR requires cooperative helper nodes
GR is faster than NSR because it uses BFD for detection
NSR works only with BGP while GR works with all routing protocols

12. Why should TACACS+ be preferred over RADIUS for network device management plane authentication?

TACACS+ uses UDP which is faster than RADIUS over TCP
TACACS+ supports per-command authorization granularity, enabling fine-grained access control
TACACS+ encrypts only the password while RADIUS encrypts the entire packet
TACACS+ is open-source while RADIUS is proprietary

13. When a forwarding table on a data center leaf switch exceeds the TCAM capacity of its ASIC, what happens to traffic for entries that do not fit?

Traffic is silently dropped at wire speed
The ASIC automatically compresses entries using route summarization
Entries overflow to slower software lookup paths, degrading performance
The device redistributes excess routes to neighboring switches

14. Which three phases make up network convergence after a link failure?

Authentication, authorization, and accounting
Encapsulation, forwarding, and decapsulation
Detection, propagation, and computation
Classification, queuing, and scheduling

15. In a spine-leaf data center fabric, what is the recommended approach for control plane resilience on spine switches?

Enable NSF and Graceful Restart with aggressive BFD timers for maximum protection
Use a simple non-redundant control plane with BFD, relying on path diversity for redundancy
Deploy SSO with NSR and disable BFD entirely
Use in-band management with SNMP polling for failure detection

Section 1: Data Plane Design

The data plane -- also called the forwarding plane -- is where the actual work of moving packets happens. Every packet that enters an ingress interface, is looked up against forwarding tables, and exits an egress interface passes through the data plane. Its performance directly determines the throughput, latency, and scalability of the entire network.

The Three-Plane Model

Every network device organizes its internal functions into three planes: the data plane (forwarding), the control plane (routing decisions), and the management plane (configuration and monitoring). Think of a commercial airport: the data plane is the runway system moving aircraft; the control plane is air traffic control making routing decisions; the management plane is the administration office handling scheduling and compliance.

```mermaid
flowchart LR
  subgraph DP["Data Plane"]
    D1["Packet Forwarding"]
    D2["ASIC / FPGA / Software"]
    D3["Forwarding Tables"]
  end
  subgraph CP["Control Plane"]
    C1["Routing Protocols\n(BGP, OSPF, IS-IS)"]
    C2["Path Computation"]
    C3["Topology Discovery"]
  end
  subgraph MP["Management Plane"]
    M1["Configuration\n(NETCONF, gNMI)"]
    M2["Monitoring\n(SNMP, Telemetry)"]
    M3["AAA / Access Control"]
  end
  CP -- "Programs forwarding tables" --> DP
  MP -- "Configures & monitors" --> CP
  MP -- "Configures & monitors" --> DP
```

Hardware vs. Software Data Planes

Software Data Planes process packets using general-purpose CPUs. They offer maximum flexibility -- any forwarding behavior can be implemented or modified through software updates -- but are orders of magnitude slower than hardware alternatives, making them suitable for low-throughput applications, virtual network functions, or scenarios where programmability outweighs raw performance.

Hardware Data Planes use specialized silicon -- typically ASICs or FPGAs -- to forward packets at wire speed. ASICs are purpose-built chips that are 100 to 1,000 times faster than software solutions for packet forwarding.

Animation: Packet traversal through a hardware ASIC pipeline vs. software CPU forwarding, showing the latency and throughput difference at each stage

Merchant Silicon vs. Custom Silicon

| Characteristic | Merchant Silicon | Custom Silicon |
| --- | --- | --- |
| Designer | Third-party chip vendors (e.g., Broadcom, Marvell) | Equipment vendor (e.g., Cisco, Juniper) |
| Time to Market | Faster -- available off-the-shelf | Slower -- minimum 2-year R&D cycle |
| Cost | Lower unit cost, shared across vendors | Higher development investment |
| Differentiation | Limited -- same chip available to competitors | High -- unique capabilities |
| Flexibility | Constrained by vendor roadmap | Full control over feature set |

FPGAs as a Middle Ground: Some vendors deploy FPGAs where merchant silicon cannot deliver the required performance. Embedding FPGA technology into ASICs can reduce cost by 90% and power consumption by 85% compared to discrete FPGAs.

```mermaid
flowchart LR
  SW["Software\nForwarding\n(CPU-based)"] -->|"More flexible\nless performant"| FPGA["FPGA\n(Reprogrammable\nHardware)"]
  FPGA -->|"More performant\nless flexible"| MS["Merchant\nSilicon\n(Off-the-shelf ASIC)"]
  MS -->|"More differentiated\nhigher cost"| CS["Custom\nSilicon\n(Vendor ASIC)"]
  style SW fill:#4a90d9,color:#fff
  style FPGA fill:#7b68ee,color:#fff
  style MS fill:#e67e22,color:#fff
  style CS fill:#c0392b,color:#fff
```

Data Plane Programmability: P4 and DPDK

P4 (Programming Protocol-Independent Packet Processors) is a domain-specific language that lets architects define custom headers, match-action tables, and forwarding logic at compile time on programmable ASICs. This enables use cases like in-network telemetry and custom encapsulations at hardware speeds.

DPDK (Data Plane Development Kit) optimizes software-based packet processing on commodity x86 hardware by bypassing the kernel networking stack. Widely used in NFV environments where virtual routers, firewalls, and load balancers run on standard servers.
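Both technologies revolve around the match-action table: a packet is matched against table entries and a corresponding action fires. As a rough conceptual illustration only (plain Python, not P4 syntax, and nowhere near hardware speed), a toy longest-prefix-match table might look like this:

```python
import ipaddress

# Toy longest-prefix-match table illustrating the match-action
# abstraction that P4 programs compile into hardware pipelines.
# Conceptual sketch only -- real pipelines do this in silicon.
class MatchActionTable:
    def __init__(self):
        self.entries = []  # (network, action) pairs

    def add(self, prefix, action):
        self.entries.append((ipaddress.ip_network(prefix), action))
        # Keep longest prefixes first so the first hit wins.
        self.entries.sort(key=lambda e: e[0].prefixlen, reverse=True)

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        for net, action in self.entries:
            if addr in net:
                return action
        return "drop"  # default action when nothing matches

table = MatchActionTable()
table.add("10.0.0.0/8", "forward_port_1")
table.add("10.1.0.0/16", "forward_port_2")

print(table.lookup("10.1.2.3"))   # forward_port_2 (longest match wins)
print(table.lookup("192.0.2.1"))  # drop (no matching entry)
```

In a P4 target the same match-action logic is compiled into the ASIC pipeline; in DPDK it runs in user space on x86, with the kernel networking stack bypassed entirely.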

Performance and Scalability Considerations

Key Points -- Data Plane Design

Section 2: Control Plane Architecture

The control plane is the brain of the network. It runs protocols -- BGP, OSPF, IS-IS, STP, BFD, LACP -- that discover topology, compute paths, and program the data plane's forwarding tables. While the data plane handles millions of packets per second, the control plane processes hundreds or thousands of protocol messages that shape how every subsequent packet is forwarded.

Routing Protocol Convergence

Network convergence -- the time for all routers to agree on a consistent topology view after a change -- involves three phases:

  1. Detection: Recognizing failure (interface down events, hello timer expiry, or BFD)
  2. Propagation: Distributing failure information (LSAs in OSPF, UPDATEs in BGP)
  3. Computation: Recalculating paths and reprogramming the data plane (SPF in OSPF, best-path in BGP)

A design using BFD for sub-second detection, prefix-independent convergence (PIC), and tuned SPF timers can achieve sub-second failover; with default OSPF hello/dead timers (10 s/40 s), detection alone can take 40+ seconds.
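The gap between the two designs can be put into rough numbers. A minimal sketch, assuming illustrative propagation (100 ms) and SPF (200 ms) budgets and a BFD multiplier of 3 -- none of these figures come from a specific platform:

```python
# Back-of-the-envelope convergence budget: detection + propagation
# + computation. Propagation and SPF figures are illustrative
# assumptions, not measurements.
def bfd_detection_ms(tx_interval_ms, multiplier):
    # BFD declares a session down after `multiplier` consecutive
    # missed control packets.
    return tx_interval_ms * multiplier

def convergence_ms(detection_ms, propagation_ms=100, spf_ms=200):
    return detection_ms + propagation_ms + spf_ms

bfd = bfd_detection_ms(50, 3)     # 50 ms interval x 3 multiplier = 150 ms
dead_timer = 40_000               # default OSPF dead interval: 40 s

print(convergence_ms(bfd))        # 450 -- comfortably sub-second
print(convergence_ms(dead_timer)) # 40300 -- tens of seconds
```

The arithmetic makes the design point plain: with slow detection, the propagation and computation phases barely matter; with BFD, all three phases must be tuned together.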

```mermaid
graph TD
  F["Link or Node Failure"] --> DET["1. Detection\n(BFD: ~50ms | OSPF Dead Timer: ~40s)"]
  DET --> PROP["2. Propagation\n(LSA Flooding / BGP UPDATE)"]
  PROP --> COMP["3. Computation\n(SPF Recalculation / Best-Path Selection)"]
  COMP --> PROG["4. Data Plane Reprogramming\n(FIB / LFIB Update)"]
  PROG --> CONV["Convergence Complete\n(Traffic on New Path)"]
  style F fill:#c0392b,color:#fff
  style CONV fill:#27ae60,color:#fff
```
Design principle: Convergence speed must be balanced against control plane stability. Aggressive timers detect failures faster but increase the risk of false positives and protocol flapping.

Control Plane Policing (CoPP)

The control plane CPU is a shared, finite resource. If overwhelmed by an attacker or traffic burst, routing adjacencies drop, the management plane becomes unreachable, and the network collapses. CoPP treats the control plane as a logical interface with QoS-based filters to classify, rate-limit, and prioritize control plane traffic.

Two primary attack vectors:

CoPP implementation follows three steps using the Modular QoS CLI (MQC):

  1. Traffic Classification: Define important traffic classes using class maps and ACLs
  2. Policy Definition: Assign rate limits and actions per class
  3. Application: Apply the policy to the control-plane interface
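The three steps map onto configuration roughly as in the IOS-style sketch below. The class name, ACL number, and rates are illustrative only, and exact syntax varies by platform and software version:

```
! Step 1: classify -- ACL 120 (hypothetical) matches routing protocols
access-list 120 permit tcp any any eq bgp
access-list 120 permit ospf any any
class-map match-all COPP-ROUTING
 match access-group 120

! Step 2: define the policy -- rate-limit each class
policy-map COPP-POLICY
 class COPP-ROUTING
  police 500000 conform-action transmit exceed-action drop
 class class-default
  police 50000 conform-action transmit exceed-action drop

! Step 3: apply to the control-plane interface
control-plane
 service-policy input COPP-POLICY
```

Note that routing protocol traffic gets the largest rate allowance, while class-default catches everything unclassified at a low rate -- the prioritization shown in the diagram below.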
```mermaid
graph TD
  INB["Inbound Traffic\nto Control Plane CPU"] --> CLASS["CoPP Classification\n(class-map + ACL)"]
  CLASS --> P1["Priority 1: Routing Protocols\n(BGP, OSPF, BFD)\nPolice: 500 Kbps"]
  CLASS --> P2["Priority 2: Management\n(SNMP, SSH, NETCONF)\nPolice: 100 Kbps"]
  CLASS --> P3["Priority 3: General\n(ICMP, ARP)\nPolice: 64 Kbps"]
  CLASS --> P4["class-default\n(All Other Traffic)\nPolice: 50 Kbps"]
  P1 --> CPU["Control Plane CPU\n(Protected)"]
  P2 --> CPU
  P3 --> CPU
  P4 --> CPU
  style P1 fill:#27ae60,color:#fff
  style P2 fill:#2980b9,color:#fff
  style P3 fill:#e67e22,color:#fff
  style P4 fill:#c0392b,color:#fff
```
Animation: Simulated CoPP in action -- traffic streams arriving at the control plane CPU, being classified and rate-limited, with excess traffic being dropped while critical routing protocol traffic passes through

Layer 2 Control Plane Protection is equally important:

Layer 3 Control Plane Protection uses routing protocol authentication: BGP MD5 + TTL security, OSPF MD5/SHA (OSPFv3 uses IPsec), EIGRP/RIPv2 keychain-based MD5.

Graceful Restart, NSF, NSR, and SSO

Dual-supervisor platforms face a key design question: when one supervisor fails, should the network react as if the device failed, or mask the failure and continue forwarding?

```mermaid
graph TD
  SSO["SSO\n(Stateful Switchover)\nSyncs state between supervisors"] --> NSF["NSF\n(Non-Stop Forwarding)\nData plane continues during\ncontrol plane restart"]
  SSO --> NSR["NSR\n(Non-Stop Routing)\nTransparent routing failover\nNo neighbor awareness needed"]
  NSF --> GR["Graceful Restart\n(Protocol-level)\nNeighbors act as helpers"]
  GR --> RESTART["Restarting Device\n(NSF-capable router)"]
  GR --> HELPER["Helper Node\n(Adjacent router maintains routes)"]
  BFD["BFD\n(Sub-second failure detection)"] -.->|"CONFLICTS WITH"| GR
  style SSO fill:#2980b9,color:#fff
  style BFD fill:#c0392b,color:#fff
  style GR fill:#8e44ad,color:#fff
```

The BFD and Graceful Restart Tension

A critical design conflict: BFD detects forwarding failures rapidly (sub-second), assuming data and control planes share fate. GR/NSF/NSR/SSO mask control plane failures to preserve forwarding, assuming the planes are independent. These are fundamentally opposite goals.

| Device Role | Recommended Approach | Rationale |
| --- | --- | --- |
| Leaf switches | NSF + GR enabled | Control plane resilience during upgrades; hitless software updates |
| Spine switches | Simple control plane + BFD | Rapid failover via BFD; redundancy through path diversity |
| Alternative | Redundant paths + simple routers + BFD | Avoid NSF/NSR/SSO complexity; rely on topology redundancy |
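In IOS-style syntax the two roles translate roughly as follows (a sketch only; the interface name is a placeholder and command availability varies by platform):

```
! Leaf: mask supervisor restarts -- NSF/Graceful Restart under OSPF
router ospf 1
 nsf ietf

! Spine: detect fast and reroute -- BFD with ~150 ms detection
! (50 ms interval x 3 multiplier)
interface Ethernet1/1
 bfd interval 50 min_rx 50 multiplier 3
 ip ospf bfd
```

The key discipline is to pick one philosophy per role: a device should not be configured both to mask its own control plane failures and to be rapidly routed around by its neighbors.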

Control Plane Scaling Challenges

Mitigations: route summarization, hierarchical OSPF areas, BGP route reflectors, prefix filtering, and dedicated control plane hardware.

Key Points -- Control Plane Architecture

Section 3: Management Plane Design

The management plane provides the operational interface to the network -- how engineers configure, monitor, collect telemetry, and respond to incidents. While it carries no revenue traffic, a well-designed management plane is the difference between a network that can be operated efficiently at scale and one that becomes an operational burden.

In-Band vs. Out-of-Band Management

In-Band Management routes management traffic across the same interfaces and links that carry production data. Simpler and less expensive, but management access is lost when the production network fails -- precisely when it is needed most.

Out-of-Band (OOB) Management provides a completely separate management path using dedicated interfaces, switches, and routers. The primary objective: ensuring authorized personnel can manage infrastructure even when the production network is disrupted.

```mermaid
flowchart LR
  subgraph PROD["Production Network"]
    R1["Router A"] <--> R2["Router B"]
    R2 <--> R3["Router C"]
  end
  subgraph OOB["Out-of-Band Management Network"]
    MS["Management\nStation"] --> OOBS["OOB Switch"]
    OOBS --> R1M["Router A\nmgmt0"]
    OOBS --> R2M["Router B\nmgmt0"]
    OOBS --> R3M["Router C\nmgmt0"]
  end
  ENG["Network\nEngineer"] --> MS
  ENG -.->|"In-Band Path\n(lost during outage)"| R1
  style OOB fill:#d5f5e3,stroke:#27ae60
  style PROD fill:#fadbd8,stroke:#c0392b
```
| Aspect | In-Band | Out-of-Band |
| --- | --- | --- |
| Cost | Lower -- uses existing infrastructure | Higher -- dedicated hardware and links |
| Availability during outages | Lost when production fails | Independent of production state |
| Security | Shares attack surface with production | Isolated attack surface |
| Best for | Small/non-critical networks | Data centers, SPs, critical infrastructure |

OOB Design Best Practices: physical isolation via dedicated mgmt interfaces, deliberate simplicity, ACLs and RBAC for access control, strong authentication via TACACS+/RADIUS, and explicit verification that no unauthorized cross-access exists.

Animation: Split-screen showing a production network outage -- in-band management session disconnects while the OOB path remains active, allowing the engineer to diagnose and resolve the issue

Management Protocol Evolution: SNMP, NETCONF, RESTCONF, gNMI

SNMP has been the monitoring workhorse since 1988. Agent-manager model using MIB hierarchies and OIDs. Designed for monitoring, not configuration. Uses ASN.1 BER encoding over UDP. Lacks transaction support. Only SNMPv3 should be deployed in production.

NETCONF (RFC 6241) -- the most mature modern protocol. Uses XML over SSH/TLS. Provides transaction support, multiple datastores (running/candidate/startup), validation before application, and rollback on failure. Four-layer architecture: transport, messages, operations, content.
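In code, the candidate-datastore workflow looks roughly like the sketch below, using the open-source ncclient library. The host, credentials, and interface name are placeholders; the payload targets the standard ietf-interfaces YANG model.

```python
import xml.etree.ElementTree as ET

IETF_IF_NS = "urn:ietf:params:xml:ns:yang:ietf-interfaces"

def build_interface_config(name, description):
    """Build a <config> payload against the ietf-interfaces YANG model."""
    return f"""
<config xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <interfaces xmlns="{IETF_IF_NS}">
    <interface>
      <name>{name}</name>
      <description>{description}</description>
    </interface>
  </interfaces>
</config>"""

def push_config(host, username, password, payload):
    """Edit the candidate datastore, validate, then commit atomically."""
    from ncclient import manager  # pip install ncclient
    with manager.connect(host=host, port=830, username=username,
                         password=password, hostkey_verify=False) as m:
        m.edit_config(target="candidate", config=payload)
        m.validate(source="candidate")  # check before touching running config
        m.commit()                      # atomic: applied as a unit or not at all

payload = build_interface_config("GigabitEthernet0/0", "uplink to spine-1")
print(ET.fromstring(payload).find(f".//{{{IETF_IF_NS}}}name").text)
# GigabitEthernet0/0
```

The edit-validate-commit sequence is the capability RESTCONF lacks: a failed validate or commit leaves the running configuration untouched.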

RESTCONF (RFC 8040) -- brings YANG data to the web via HTTP/HTTPS. Uses standard HTTP methods and supports XML/JSON. Stateless and web-friendly, but lacks NETCONF's transactions, locking, and candidate datastore. Unsuitable for complex multi-device workflows requiring atomicity.

gNMI -- newest entrant from the OpenConfig Working Group. Uses gRPC with Protocol Buffers over HTTP/2 (3x-10x smaller messages than NETCONF XML). Four RPCs: Capabilities, Get, Set, Subscribe. Subscribe supports STREAM, POLL, and ONCE modes for native streaming telemetry. Related protocols: gNOI (operational commands) and gRIBI (programmatic RIB injection).
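gNMI addresses data through structured paths derived from the YANG model. The helper below is a plain-Python illustration of the path convention only -- it does not use any gNMI library, and the interface and counter names follow typical OpenConfig models:

```python
# Illustrative only: render an OpenConfig-style gNMI path with list keys.
def gnmi_path(*elems):
    parts = []
    for elem in elems:
        if isinstance(elem, tuple):  # (list-name, {key: value}) pair
            name, keys = elem
            parts.append(name + "".join(f"[{k}={v}]" for k, v in keys.items()))
        else:
            parts.append(elem)
    return "/" + "/".join(parts)

path = gnmi_path("interfaces", ("interface", {"name": "Ethernet1"}),
                 "state", "counters", "in-octets")
print(path)
# /interfaces/interface[name=Ethernet1]/state/counters/in-octets
```

A Subscribe RPC in STREAM mode against such a path yields updates as they occur, replacing SNMP-style polling with push-based telemetry.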

| Aspect | SNMP | NETCONF | RESTCONF | gNMI |
| --- | --- | --- | --- | --- |
| Transport | UDP/TCP (TLS) | SSH/TLS | HTTP/HTTPS | gRPC over HTTP/2 |
| Encoding | ASN.1 BER | XML | JSON or XML | Protocol Buffers |
| Data Model | SMI (MIB) | YANG | YANG | YANG |
| Transactions | No | Yes | No | Yes |
| Streaming Telemetry | No | Limited | No | Yes (native) |
| Primary Strength | Monitoring | Config management | Developer access | Telemetry + automation |

YANG is the common thread connecting NETCONF, RESTCONF, and gNMI -- a structured, protocol-independent data modeling language defined in RFC 7950.
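As a small taste of the language, a minimal YANG module sketch (the module, namespace, and leaf names here are invented for illustration):

```
module example-interfaces {
  namespace "urn:example:interfaces";
  prefix exif;

  container interfaces {
    list interface {
      key "name";
      leaf name    { type string; }
      leaf enabled { type boolean; default "true"; }
      leaf mtu     { type uint16; }
    }
  }
}
```

Because the model is protocol-independent, the same tree can be retrieved as XML via NETCONF, JSON via RESTCONF, or Protocol Buffers via gNMI.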

Management Plane Security and Access Control

Key Points -- Management Plane Design

Post-Study Assessment

Now that you have studied the material, answer the same questions again. Compare your pre- and post-study scores to measure your learning.

Post-Quiz

1. An architect needs to select a data plane forwarding technology for a cost-sensitive data center leaf-spine fabric. Which technology best balances cost and performance?

Custom silicon ASICs designed by the switch vendor
Software forwarding on general-purpose CPUs
Merchant silicon ASICs from a third-party vendor like Broadcom
Discrete FPGAs on every line card

2. What is the primary purpose of Control Plane Policing (CoPP)?

To encrypt routing protocol traffic between peers
To classify and rate-limit traffic destined to the control plane CPU, protecting it from overload
To accelerate transit data plane traffic through QoS prioritization
To provide redundant paths for management traffic

3. Why do BFD and Graceful Restart fundamentally conflict with each other?

They both require excessive CPU resources and cannot run simultaneously
BFD detects forwarding failures to trigger rerouting, while GR masks control plane failures to continue forwarding -- opposite goals
BFD operates at Layer 2 while Graceful Restart operates at Layer 3
Graceful Restart requires SNMPv3, which is incompatible with BFD timers

4. Which management protocol provides native streaming telemetry using Protocol Buffers over HTTP/2?

SNMP with INFORM notifications
NETCONF with on-change subscriptions
RESTCONF with SSE (Server-Sent Events)
gNMI with Subscribe RPCs

5. A network engineer loses SSH access to a core router during a major outage. The organization uses in-band management. What design change would most directly address this problem?

Upgrading from SNMPv2c to SNMPv3
Deploying NETCONF instead of SSH for configuration
Implementing out-of-band management with dedicated management interfaces and a separate switch infrastructure
Increasing the CoPP rate limit for SSH traffic

6. During an OSPF Graceful Restart, what causes helper nodes to immediately terminate the GR procedure?

The restarting device sends an updated Grace LSA with a shorter wait period
Any relevant topology change occurs in the OSPF domain during the restart
The helper node's CPU utilization exceeds 80%
The restarting device's data plane drops below 50% forwarding capacity

7. What key capability does NETCONF provide that RESTCONF does not?

Support for JSON encoding
Use of YANG data models
Transaction support with candidate datastores and rollback on failure
Ability to retrieve operational state data

8. In a CoPP policy, which traffic class should receive the highest rate limit and priority?

ICMP and ARP traffic
Routing protocol traffic (BGP, OSPF, BFD)
Management traffic (SNMP, SSH)
The class-default catch-all

9. What is the relationship between SSO and NSF on a dual-supervisor platform?

NSF is a prerequisite for SSO to function
SSO provides state synchronization between supervisors, which is the foundation that enables NSF to continue forwarding during a control plane restart
They are independent mechanisms that serve unrelated purposes
SSO handles Layer 2 failover while NSF handles Layer 3 failover exclusively

10. A network architect is designing a programmable data plane that must support custom packet headers and in-network telemetry at hardware speed. Which technology is most appropriate?

DPDK on commodity x86 servers
P4 on programmable ASICs
Standard merchant silicon with fixed pipelines
Software forwarding with kernel bypass

11. What distinguishes Non-Stop Routing (NSR) from Graceful Restart (GR)?

NSR requires neighbor helper support while GR does not
NSR transparently fails over routing state without neighbor awareness, while GR requires cooperative helper nodes
GR is faster than NSR because it uses BFD for detection
NSR works only with BGP while GR works with all routing protocols

12. Why should TACACS+ be preferred over RADIUS for network device management plane authentication?

TACACS+ uses UDP which is faster than RADIUS over TCP
TACACS+ supports per-command authorization granularity, enabling fine-grained access control
TACACS+ encrypts only the password while RADIUS encrypts the entire packet
TACACS+ is open-source while RADIUS is proprietary

13. When a forwarding table on a data center leaf switch exceeds the TCAM capacity of its ASIC, what happens to traffic for entries that do not fit?

Traffic is silently dropped at wire speed
The ASIC automatically compresses entries using route summarization
Entries overflow to slower software lookup paths, degrading performance
The device redistributes excess routes to neighboring switches

14. Which three phases make up network convergence after a link failure?

Authentication, authorization, and accounting
Encapsulation, forwarding, and decapsulation
Detection, propagation, and computation
Classification, queuing, and scheduling

15. In a spine-leaf data center fabric, what is the recommended approach for control plane resilience on spine switches?

Enable NSF and Graceful Restart with aggressive BFD timers for maximum protection
Use a simple non-redundant control plane with BFD, relying on path diversity for redundancy
Deploy SSO with NSR and disable BFD entirely
Use in-band management with SNMP polling for failure detection

Your Progress

Answer Explanations