Chapter 5: Data, Control, and Management Plane Technologies
Learning Objectives
Differentiate data plane, control plane, and management plane functions and their design implications
Design control plane protection mechanisms to ensure network stability
Evaluate management plane architectures for scalability and security
Pre-Study Assessment
Answer these questions before studying to gauge your current understanding. You will see the same questions again after studying.
Pre-Quiz
1. An architect needs to select a data plane forwarding technology for a cost-sensitive data center leaf-spine fabric. Which technology best balances cost and performance?
Custom silicon ASICs designed by the switch vendor
Software forwarding on general-purpose CPUs
Merchant silicon ASICs from a third-party vendor like Broadcom
Discrete FPGAs on every line card
2. What is the primary purpose of Control Plane Policing (CoPP)?
To encrypt routing protocol traffic between peers
To classify and rate-limit traffic destined to the control plane CPU, protecting it from overload
To accelerate transit data plane traffic through QoS prioritization
To provide redundant paths for management traffic
3. Why do BFD and Graceful Restart fundamentally conflict with each other?
They both require excessive CPU resources and cannot run simultaneously
BFD detects forwarding failures to trigger rerouting, while GR masks control plane failures to continue forwarding -- opposite goals
BFD operates at Layer 2 while Graceful Restart operates at Layer 3
Graceful Restart requires SNMPv3, which is incompatible with BFD timers
4. Which management protocol provides native streaming telemetry using Protocol Buffers over HTTP/2?
SNMP with INFORM notifications
NETCONF with on-change subscriptions
RESTCONF with SSE (Server-Sent Events)
gNMI with Subscribe RPCs
5. A network engineer loses SSH access to a core router during a major outage. The organization uses in-band management. What design change would most directly address this problem?
Upgrading from SNMPv2c to SNMPv3
Deploying NETCONF instead of SSH for configuration
Implementing out-of-band management with dedicated management interfaces and a separate switch infrastructure
Increasing the CoPP rate limit for SSH traffic
6. During an OSPF Graceful Restart, what causes helper nodes to immediately terminate the GR procedure?
The restarting device sends an updated Grace LSA with a shorter wait period
Any relevant topology change occurs in the OSPF domain during the restart
The helper node's CPU utilization exceeds 80%
The restarting device's data plane drops below 50% forwarding capacity
7. What key capability does NETCONF provide that RESTCONF does not?
Support for JSON encoding
Use of YANG data models
Transaction support with candidate datastores and rollback on failure
Ability to retrieve operational state data
8. In a CoPP policy, which traffic class should receive the highest rate limit and priority?
ICMP and ARP traffic
Routing protocol traffic (BGP, OSPF, BFD)
Management traffic (SNMP, SSH)
The class-default catch-all
9. What is the relationship between SSO and NSF on a dual-supervisor platform?
NSF is a prerequisite for SSO to function
SSO provides state synchronization between supervisors, which is the foundation that enables NSF to continue forwarding during a control plane restart
They are independent mechanisms that serve unrelated purposes
10. A network architect is designing a programmable data plane that must support custom packet headers and in-network telemetry at hardware speed. Which technology is most appropriate?
DPDK on commodity x86 servers
P4 on programmable ASICs
Standard merchant silicon with fixed pipelines
Software forwarding with kernel bypass
11. What distinguishes Non-Stop Routing (NSR) from Graceful Restart (GR)?
NSR requires neighbor helper support while GR does not
NSR transparently fails over routing state without neighbor awareness, while GR requires cooperative helper nodes
GR is faster than NSR because it uses BFD for detection
NSR works only with BGP while GR works with all routing protocols
12. Why should TACACS+ be preferred over RADIUS for network device management plane authentication?
TACACS+ uses UDP which is faster than RADIUS over TCP
TACACS+ supports per-command authorization granularity, enabling fine-grained access control
TACACS+ encrypts only the password while RADIUS encrypts the entire packet
TACACS+ is open-source while RADIUS is proprietary
13. When a forwarding table on a data center leaf switch exceeds the TCAM capacity of its ASIC, what happens to traffic for entries that do not fit?
Traffic is silently dropped at wire speed
The ASIC automatically compresses entries using route summarization
Entries overflow to slower software lookup paths, degrading performance
The device redistributes excess routes to neighboring switches
14. Which three phases make up network convergence after a link failure?
Authentication, authorization, and accounting
Encapsulation, forwarding, and decapsulation
Detection, propagation, and computation
Classification, queuing, and scheduling
15. In a spine-leaf data center fabric, what is the recommended approach for control plane resilience on spine switches?
Enable NSF and Graceful Restart with aggressive BFD timers for maximum protection
Use a simple non-redundant control plane with BFD, relying on path diversity for redundancy
Deploy SSO with NSR and disable BFD entirely
Use in-band management with SNMP polling for failure detection
Section 1: Data Plane Design
The data plane -- also called the forwarding plane -- is where the actual work of moving packets happens. Receiving a packet on an ingress interface, looking it up against forwarding tables, and transmitting it on an egress interface are all data plane operations. Its performance directly determines the throughput, latency, and scalability of the entire network.
The Three-Plane Model
Every network device organizes its internal functions into three planes: the data plane (forwarding), the control plane (routing decisions), and the management plane (configuration and monitoring). Think of a commercial airport: the data plane is the runway system moving aircraft; the control plane is air traffic control making routing decisions; the management plane is the administration office handling scheduling and compliance.
Software Data Planes process packets using general-purpose CPUs. They offer maximum flexibility -- any forwarding behavior can be implemented or modified through software updates -- but are orders of magnitude slower than hardware alternatives. Suitable for low-throughput applications, virtual network functions, or scenarios where programmability outweighs raw performance.
Hardware Data Planes use specialized silicon -- typically ASICs or FPGAs -- to forward packets at wire speed. ASICs are purpose-built chips that are 100 to 1,000 times faster than software solutions for packet forwarding.
Animation: Packet traversal through a hardware ASIC pipeline vs. software CPU forwarding, showing the latency and throughput difference at each stage
FPGAs as a Middle Ground: Some vendors deploy FPGAs where merchant silicon cannot deliver the required performance. Embedding FPGA technology into ASICs can reduce cost by roughly 90% and power consumption by roughly 85% compared to discrete FPGAs.
P4 (Programming Protocol-Independent Packet Processors) is a domain-specific language that lets architects define custom headers, match-action tables, and forwarding logic at compile time on programmable ASICs. This enables use cases like in-network telemetry and custom encapsulations at hardware speeds.
DPDK (Data Plane Development Kit) optimizes software-based packet processing on commodity x86 hardware by bypassing the kernel networking stack. Widely used in NFV environments where virtual routers, firewalls, and load balancers run on standard servers.
Performance and Scalability Considerations
Forwarding table capacity: ASICs have finite TCAM. Exceeding table capacity forces entries into slower software lookup paths.
Pipeline depth vs. latency: More features mean more pipeline stages and more forwarding latency.
Buffer capacity: Shallow-buffered merchant silicon works for lossless DC fabrics but struggles with bursty WAN/campus traffic. Deep-buffered custom silicon addresses this at higher cost.
Key Points -- Data Plane Design
Data plane design is a multi-dimensional trade-off: performance, programmability, cost, power, and flexibility
Merchant silicon suits cost-effective data center fabrics; custom silicon suits differentiated service provider edge functions; software data planes suit agile NFV deployments
P4 enables hardware-speed custom forwarding logic; DPDK enables near-line-rate software forwarding by bypassing the kernel
TCAM overflow forces entries to slow software paths -- capacity planning is critical for data plane design
FPGAs provide a middle ground between full programmability and wire-speed ASIC performance
Section 2: Control Plane Architecture
The control plane is the brain of the network. It runs protocols -- BGP, OSPF, IS-IS, STP, BFD, LACP -- that discover topology, compute paths, and program the data plane's forwarding tables. While the data plane handles millions of packets per second, the control plane processes hundreds or thousands of protocol messages that shape how every subsequent packet is forwarded.
Routing Protocol Convergence
Network convergence -- the time for all routers to agree on a consistent topology view after a change -- involves three phases:
Detection: Recognizing failure (interface down events, hello timer expiry, or BFD)
Propagation: Distributing failure information (LSAs in OSPF, UPDATEs in BGP)
Computation: Recalculating paths and reprogramming the data plane (SPF in OSPF, best-path in BGP)
A design using BFD for sub-second detection (e.g., 50 ms intervals), prefix-independent convergence (PIC), and tuned SPF timers can achieve sub-second failover. With default OSPF hello/dead timers (10s/40s), convergence may take 40+ seconds.
graph TD
F["Link or Node Failure"] --> DET["1. Detection\n(BFD: ~50ms | OSPF Dead Timer: ~40s)"]
DET --> PROP["2. Propagation\n(LSA Flooding / BGP UPDATE)"]
PROP --> COMP["3. Computation\n(SPF Recalculation / Best-Path Selection)"]
COMP --> PROG["4. Data Plane Reprogramming\n(FIB / LFIB Update)"]
PROG --> CONV["Convergence Complete\n(Traffic on New Path)"]
style F fill:#c0392b,color:#fff
style CONV fill:#27ae60,color:#fff
Design principle: Convergence speed must be balanced against control plane stability. Aggressive timers detect failures faster but increase the risk of false positives and protocol flapping.
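The detection phase dominates the timer math above, and it is easy to quantify. The sketch below compares BFD and hello-based detection; the timer values are common examples, not universal, and note that a BFD session with a 50 ms interval and the typical detect multiplier of 3 has a worst-case detection time of about 150 ms, not 50 ms.

```python
# Hedged sketch: failure-detection contribution to convergence time.
# Timer values are common defaults/examples, not universal requirements.

def bfd_detection_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """BFD declares a failure after `detect_multiplier` consecutive
    missed packets, so worst-case detection = interval * multiplier."""
    return tx_interval_ms * detect_multiplier

def ospf_detection_ms(dead_interval_s: int) -> int:
    """Hello-based detection can take up to the full dead interval."""
    return dead_interval_s * 1000

# Aggressive BFD profile: 50 ms interval, multiplier 3 -> 150 ms worst case
print(bfd_detection_ms(50, 3))   # 150
# Default OSPF timers: 40 s dead interval -> up to 40,000 ms
print(ospf_detection_ms(40))     # 40000
```

The two-orders-of-magnitude gap in detection alone explains why sub-second failover is unreachable with default hello timers, regardless of how fast propagation and SPF computation are.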
Control Plane Policing (CoPP)
The control plane CPU is a shared, finite resource. If overwhelmed by an attacker or traffic burst, routing adjacencies drop, the management plane becomes unreachable, and the network collapses. CoPP treats the control plane as a logical interface with QoS-based filters to classify, rate-limit, and prioritize control plane traffic.
Two primary attack vectors:
Overwhelming attacks: DoS attempts flooding the CPU with control packets (e.g., spoofed TCP SYN packets aimed at BGP's port 179)
Data corruption attacks: Malicious packets injecting false routing information
CoPP implementation follows three steps using MQC:
Traffic Classification: Define important traffic classes using class maps and ACLs
Policy Definition: Assign rate limits and actions per class
Application: Apply the policy to the control-plane interface
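The three MQC steps above can be sketched in Cisco IOS-style syntax. This is an illustrative fragment, not a validated production policy: the ACL numbers and class names are assumptions, the rates mirror the example diagram in this section, the referenced ACL definitions are omitted, and exact syntax varies by platform and software release.

```
! Step 1: Classification -- class-maps matching ACLs (ACL bodies omitted)
class-map match-all COPP-ROUTING
 match access-group 120        ! BGP/OSPF/BFD (assumed ACL number)
class-map match-all COPP-MGMT
 match access-group 121        ! SSH/SNMP/NETCONF (assumed ACL number)

! Step 2: Policy definition -- per-class rate limits (bps)
policy-map COPP-POLICY
 class COPP-ROUTING
  police 500000 conform-action transmit exceed-action drop
 class COPP-MGMT
  police 100000 conform-action transmit exceed-action drop
 class class-default
  police 50000 conform-action transmit exceed-action drop

! Step 3: Application to the logical control-plane interface
control-plane
 service-policy input COPP-POLICY
```

Note the ordering principle: routing protocols get the most generous rate, management traffic less, and the catch-all default the least, so an attack in one class cannot starve a higher-priority class.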
graph TD
INB["Inbound Traffic\nto Control Plane CPU"] --> CLASS["CoPP Classification\n(class-map + ACL)"]
CLASS --> P1["Priority 1: Routing Protocols\n(BGP, OSPF, BFD)\nPolice: 500 Kbps"]
CLASS --> P2["Priority 2: Management\n(SNMP, SSH, NETCONF)\nPolice: 100 Kbps"]
CLASS --> P3["Priority 3: General\n(ICMP, ARP)\nPolice: 64 Kbps"]
CLASS --> P4["class-default\n(All Other Traffic)\nPolice: 50 Kbps"]
P1 --> CPU["Control Plane CPU\n(Protected)"]
P2 --> CPU
P3 --> CPU
P4 --> CPU
style P1 fill:#27ae60,color:#fff
style P2 fill:#2980b9,color:#fff
style P3 fill:#e67e22,color:#fff
style P4 fill:#c0392b,color:#fff
Animation: Simulated CoPP in action -- traffic streams arriving at the control plane CPU, being classified and rate-limited, with excess traffic being dropped while critical routing protocol traffic passes through
Layer 2 Control Plane Protection is equally important:
BPDU Guard: Shuts down access ports receiving unexpected BPDUs
BPDU Filter: Suppresses BPDU transmission on specific ports
DTP Disablement: Prevents trunk negotiation attacks via switchport mode access and switchport nonegotiate
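The Layer 2 protections above translate to a short access-port template. This is an illustrative Cisco IOS-style fragment with a hypothetical interface name; command availability and defaults vary by platform.

```
! Illustrative access-port hardening (interface name is hypothetical)
interface GigabitEthernet1/0/10
 switchport mode access           ! static access port
 switchport nonegotiate           ! disable DTP trunk negotiation
 spanning-tree portfast           ! host-facing port, no STP delay
 spanning-tree bpduguard enable   ! err-disable the port if a BPDU arrives
```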
High Availability: SSO, NSF, NSR, and Graceful Restart
Dual-supervisor platforms face a key design question: when one supervisor fails, should the network react as if the device failed, or mask the failure and continue forwarding?
SSO (Stateful Switchover): Real-time state sync between supervisors -- the foundation for NSF and NSR
NSF (Non-Stop Forwarding): Data plane continues forwarding using existing tables while the control plane restarts
NSR (Non-Stop Routing): Transparently fails over routing state to a redundant processor without neighbor awareness
Graceful Restart (GR): Protocol-level mechanism where neighbors (helper nodes) maintain routes during a router's control plane restart. In OSPF, helpers immediately terminate the GR procedure if any relevant topology change occurs during the restart; BGP GR has no equivalent abort condition.
graph TD
SSO["SSO\n(Stateful Switchover)\nSyncs state between supervisors"] --> NSF["NSF\n(Non-Stop Forwarding)\nData plane continues during\ncontrol plane restart"]
SSO --> NSR["NSR\n(Non-Stop Routing)\nTransparent routing failover\nNo neighbor awareness needed"]
NSF --> GR["Graceful Restart\n(Protocol-level)\nNeighbors act as helpers"]
GR --> RESTART["Restarting Device\n(NSF-capable router)"]
GR --> HELPER["Helper Node\n(Adjacent router maintains routes)"]
BFD["BFD\n(Sub-second failure detection)"] -.->|"CONFLICTS WITH"| GR
style SSO fill:#2980b9,color:#fff
style BFD fill:#c0392b,color:#fff
style GR fill:#8e44ad,color:#fff
The BFD and Graceful Restart Tension
A critical design conflict: BFD detects forwarding failures rapidly (sub-second), assuming data and control planes share fate. GR/NSF/NSR/SSO mask control plane failures to preserve forwarding, assuming the planes are independent. These are fundamentally opposite goals.
| Device Role | Recommended Approach | Rationale |
| --- | --- | --- |
| Leaf switches | NSF + GR enabled | Control plane resilience during upgrades; hitless software updates |
| Spine switches | Simple control plane + BFD | Rapid failover via BFD; redundancy through path diversity |
| Alternative | Redundant paths + simple routers + BFD | Avoid NSF/NSR/SSO complexity; rely on topology redundancy |
Control Plane Scaling Challenges
Routing table size: Full Internet BGP tables exceed 1 million prefixes
Adjacency count: Every OSPF neighbor or BGP peer consumes CPU for keepalives
Convergence storms: A single link failure in a large OSPF area can trigger SPF computation on every router simultaneously
Resource contention: Shared CPU between control and management planes means heavy routing computation can lock out management access
Mitigations: route summarization, hierarchical OSPF areas, BGP route reflectors, prefix filtering, and dedicated control plane hardware.
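The route reflector mitigation is worth quantifying, since the iBGP full-mesh requirement scales quadratically. The sketch below compares session counts; the two-reflector topology is an assumed simple design and ignores hierarchical RR arrangements.

```python
# Sketch: how BGP route reflectors tame iBGP session growth.

def full_mesh_sessions(n: int) -> int:
    """An iBGP full mesh needs a session between every pair of routers."""
    return n * (n - 1) // 2

def route_reflector_sessions(n_clients: int, n_rrs: int = 2) -> int:
    """Each client peers only with the reflectors; the reflectors mesh
    among themselves. (Simplified: no hierarchical RR clusters.)"""
    return n_clients * n_rrs + full_mesh_sessions(n_rrs)

# 50 routers fully meshed vs. 48 clients behind 2 route reflectors
print(full_mesh_sessions(50))            # 1225 sessions
print(route_reflector_sessions(48, 2))   # 97 sessions
```

Cutting 1,225 sessions to 97 directly reduces keepalive load and per-peer CPU, which is exactly the adjacency-count pressure described above.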
Key Points -- Control Plane Architecture
CoPP is mandatory for production networks: classify and rate-limit control plane traffic with routing protocols at highest priority
Network convergence has three phases (detection, propagation, computation) -- each adds delay and must be tuned for the environment
BFD and Graceful Restart have fundamentally opposing goals: rapid failure detection vs. masking failures to preserve forwarding
SSO is the foundation for NSF and NSR; NSR does not require neighbor helper support unlike GR
OSPF GR terminates on topology changes; BGP GR does not -- a critical protocol-specific difference
Section 3: Management Plane Design
The management plane provides the operational interface to the network -- how engineers configure, monitor, collect telemetry, and respond to incidents. While it carries no revenue traffic, a well-designed management plane is the difference between a network that can be operated efficiently at scale and one that becomes an operational burden.
In-Band vs. Out-of-Band Management
In-Band Management routes management traffic across the same interfaces and links that carry production data. Simpler and less expensive, but management access is lost when the production network fails -- precisely when it is needed most.
Out-of-Band (OOB) Management provides a completely separate management path using dedicated interfaces, switches, and routers. The primary objective: ensuring authorized personnel can manage infrastructure even when the production network is disrupted.
flowchart LR
subgraph PROD["Production Network"]
R1["Router A"] <--> R2["Router B"]
R2 <--> R3["Router C"]
end
subgraph OOB["Out-of-Band Management Network"]
MS["Management\nStation"] --> OOBS["OOB Switch"]
OOBS --> R1M["Router A\nmgmt0"]
OOBS --> R2M["Router B\nmgmt0"]
OOBS --> R3M["Router C\nmgmt0"]
end
ENG["Network\nEngineer"] --> MS
ENG -.->|"In-Band Path\n(lost during outage)"| R1
style OOB fill:#d5f5e3,stroke:#27ae60
style PROD fill:#fadbd8,stroke:#c0392b
| Aspect | In-Band | Out-of-Band |
| --- | --- | --- |
| Cost | Lower -- uses existing infrastructure | Higher -- dedicated hardware and links |
| Availability during outages | Lost when production fails | Independent of production state |
| Security | Shares attack surface with production | Isolated attack surface |
| Best for | Small/non-critical networks | Data centers, SPs, critical infrastructure |
OOB Design Best Practices: physical isolation via dedicated mgmt interfaces, deliberate simplicity, ACLs and RBAC for access control, strong authentication via TACACS+/RADIUS, and explicit verification that no unauthorized cross-access exists.
Animation: Split-screen showing a production network outage -- in-band management session disconnects while the OOB path remains active, allowing the engineer to diagnose and resolve the issue
Management Protocols: SNMP, NETCONF, RESTCONF, and gNMI
SNMP has been the monitoring workhorse since 1988. It follows an agent-manager model using MIB hierarchies and OIDs, was designed for monitoring rather than configuration, uses ASN.1 BER encoding over UDP, and lacks transaction support. Only SNMPv3, which adds authentication and encryption, should be deployed in production.
NETCONF (RFC 6241) -- the most mature modern protocol. Uses XML over SSH/TLS. Provides transaction support, multiple datastores (running/candidate/startup), validation before application, and rollback on failure. Four-layer architecture: transport, messages, operations, content.
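A minimal NETCONF edit-config RPC targeting the candidate datastore might look like the following. This is an illustrative sketch: the interface name and description are hypothetical, the data model shown is the standard ietf-interfaces YANG module, and a real session would follow this message with validate and commit operations.

```xml
<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <!-- Write to the candidate datastore, not the running config -->
    <target><candidate/></target>
    <config>
      <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
        <interface>
          <!-- hypothetical interface for illustration -->
          <name>GigabitEthernet0/0</name>
          <description>Uplink to spine-1</description>
        </interface>
      </interfaces>
    </config>
  </edit-config>
</rpc>
```

If a subsequent &lt;validate&gt; fails, the candidate can be thrown away with &lt;discard-changes&gt;, leaving the running configuration untouched -- the transactional safety net that distinguishes NETCONF.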
RESTCONF (RFC 8040) -- brings YANG data to the web via HTTP/HTTPS. Uses standard HTTP methods and supports XML/JSON. Stateless and web-friendly, but lacks NETCONF's transactions, locking, and candidate datastore. Unsuitable for complex multi-device workflows requiring atomicity.
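For comparison, the same ietf-interfaces data is reachable through a plain HTTP request (the hostname below is hypothetical; the yang-data media type is defined by RFC 8040):

```http
GET /restconf/data/ietf-interfaces:interfaces HTTP/1.1
Host: router.example.net
Accept: application/yang-data+json
```

Each RESTCONF request is a standalone operation applied immediately to the running datastore -- there is no candidate to validate, lock, or roll back.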
gNMI -- newest entrant from the OpenConfig Working Group. Uses gRPC with Protocol Buffers over HTTP/2 (3x-10x smaller messages than NETCONF XML). Four RPCs: Capabilities, Get, Set, Subscribe. Subscribe supports STREAM, POLL, and ONCE modes for native streaming telemetry. Related protocols: gNOI (operational commands) and gRIBI (programmatic RIB injection).
| Aspect | SNMP | NETCONF | RESTCONF | gNMI |
| --- | --- | --- | --- | --- |
| Transport | UDP/TCP (TLS) | SSH/TLS | HTTP/HTTPS | gRPC over HTTP/2 |
| Encoding | ASN.1 BER | XML | JSON or XML | Protocol Buffers |
| Data Model | SMI (MIB) | YANG | YANG | YANG |
| Transactions | No | Yes | No | Yes |
| Streaming Telemetry | No | Limited | No | Yes (native) |
| Primary Strength | Monitoring | Config management | Developer access | Telemetry + automation |
YANG is the common thread connecting NETCONF, RESTCONF, and gNMI -- a structured, protocol-independent data modeling language defined in RFC 7950.
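A toy YANG module illustrates the modeling style that NETCONF, RESTCONF, and gNMI all consume. This module is invented purely for illustration -- it is not a published standard model:

```yang
// Toy module for illustration only (not a standard/published model)
module example-interfaces {
  yang-version 1.1;
  namespace "urn:example:interfaces";
  prefix exif;

  container interfaces {
    list interface {
      key "name";                 // each interface is identified by name
      leaf name    { type string; }
      leaf enabled { type boolean; default "true"; }
    }
  }
}
```

The same model drives all three protocols: NETCONF encodes it as XML, RESTCONF as JSON or XML over HTTP paths, and gNMI as Protocol Buffer paths and values.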
Management Plane Security and Access Control
Authentication: Centralized AAA using TACACS+ (preferred for per-command authorization) or RADIUS. Local accounts as fallback only.
Authorization: RBAC ensuring operators get appropriate access levels
Accounting: Complete audit trails of access and commands
Access restriction: ACLs on VTY lines and management interfaces
Key Points -- Management Plane Design
Out-of-band management should be the default for any environment where operational continuity is critical
NETCONF provides transactions and rollback; RESTCONF does not -- making NETCONF the choice for complex multi-device configuration
gNMI's native streaming telemetry eliminates SNMP polling overhead; devices push only changed data as it occurs
These protocols are complementary, not competitive: gNMI for telemetry, NETCONF for configuration, RESTCONF for web integration
TACACS+ is preferred over RADIUS for network device management due to per-command authorization granularity
Post-Study Assessment
Now that you have studied the material, answer the same questions again. Compare your pre and post scores to measure your learning.
Post-Quiz
1. An architect needs to select a data plane forwarding technology for a cost-sensitive data center leaf-spine fabric. Which technology best balances cost and performance?
Custom silicon ASICs designed by the switch vendor
Software forwarding on general-purpose CPUs
Merchant silicon ASICs from a third-party vendor like Broadcom
Discrete FPGAs on every line card
2. What is the primary purpose of Control Plane Policing (CoPP)?
To encrypt routing protocol traffic between peers
To classify and rate-limit traffic destined to the control plane CPU, protecting it from overload
To accelerate transit data plane traffic through QoS prioritization
To provide redundant paths for management traffic
3. Why do BFD and Graceful Restart fundamentally conflict with each other?
They both require excessive CPU resources and cannot run simultaneously
BFD detects forwarding failures to trigger rerouting, while GR masks control plane failures to continue forwarding -- opposite goals
BFD operates at Layer 2 while Graceful Restart operates at Layer 3
Graceful Restart requires SNMPv3, which is incompatible with BFD timers
4. Which management protocol provides native streaming telemetry using Protocol Buffers over HTTP/2?
SNMP with INFORM notifications
NETCONF with on-change subscriptions
RESTCONF with SSE (Server-Sent Events)
gNMI with Subscribe RPCs
5. A network engineer loses SSH access to a core router during a major outage. The organization uses in-band management. What design change would most directly address this problem?
Upgrading from SNMPv2c to SNMPv3
Deploying NETCONF instead of SSH for configuration
Implementing out-of-band management with dedicated management interfaces and a separate switch infrastructure
Increasing the CoPP rate limit for SSH traffic
6. During an OSPF Graceful Restart, what causes helper nodes to immediately terminate the GR procedure?
The restarting device sends an updated Grace LSA with a shorter wait period
Any relevant topology change occurs in the OSPF domain during the restart
The helper node's CPU utilization exceeds 80%
The restarting device's data plane drops below 50% forwarding capacity
7. What key capability does NETCONF provide that RESTCONF does not?
Support for JSON encoding
Use of YANG data models
Transaction support with candidate datastores and rollback on failure
Ability to retrieve operational state data
8. In a CoPP policy, which traffic class should receive the highest rate limit and priority?
ICMP and ARP traffic
Routing protocol traffic (BGP, OSPF, BFD)
Management traffic (SNMP, SSH)
The class-default catch-all
9. What is the relationship between SSO and NSF on a dual-supervisor platform?
NSF is a prerequisite for SSO to function
SSO provides state synchronization between supervisors, which is the foundation that enables NSF to continue forwarding during a control plane restart
They are independent mechanisms that serve unrelated purposes
10. A network architect is designing a programmable data plane that must support custom packet headers and in-network telemetry at hardware speed. Which technology is most appropriate?
DPDK on commodity x86 servers
P4 on programmable ASICs
Standard merchant silicon with fixed pipelines
Software forwarding with kernel bypass
11. What distinguishes Non-Stop Routing (NSR) from Graceful Restart (GR)?
NSR requires neighbor helper support while GR does not
NSR transparently fails over routing state without neighbor awareness, while GR requires cooperative helper nodes
GR is faster than NSR because it uses BFD for detection
NSR works only with BGP while GR works with all routing protocols
12. Why should TACACS+ be preferred over RADIUS for network device management plane authentication?
TACACS+ uses UDP which is faster than RADIUS over TCP
TACACS+ supports per-command authorization granularity, enabling fine-grained access control
TACACS+ encrypts only the password while RADIUS encrypts the entire packet
TACACS+ is open-source while RADIUS is proprietary
13. When a forwarding table on a data center leaf switch exceeds the TCAM capacity of its ASIC, what happens to traffic for entries that do not fit?
Traffic is silently dropped at wire speed
The ASIC automatically compresses entries using route summarization
Entries overflow to slower software lookup paths, degrading performance
The device redistributes excess routes to neighboring switches
14. Which three phases make up network convergence after a link failure?
Authentication, authorization, and accounting
Encapsulation, forwarding, and decapsulation
Detection, propagation, and computation
Classification, queuing, and scheduling
15. In a spine-leaf data center fabric, what is the recommended approach for control plane resilience on spine switches?
Enable NSF and Graceful Restart with aggressive BFD timers for maximum protection
Use a simple non-redundant control plane with BFD, relying on path diversity for redundancy
Deploy SSO with NSR and disable BFD entirely
Use in-band management with SNMP polling for failure detection