1. What is the primary purpose of a Requirements Traceability Matrix (RTM) in network design validation?
To document the project budget and resource allocation for the design team
To map every requirement to its corresponding design element, test case, and validation result
To create a timeline for phased network deployment across multiple sites
To list all hardware and software components required for the network build
2. In FMEA, the Risk Priority Number (RPN) is calculated as:
Impact x Probability x Cost
Severity x Occurrence x Detection
Likelihood x Consequence x Mitigation
Frequency x Duration x Scope
3. A network designer validates that under any single failure (N-1), no remaining link exceeds 80% utilization. Which capacity planning principle does this represent?
Burst absorption planning -- dimensioning for microburst traffic patterns
Failure-aware capacity planning -- ensuring surviving paths can absorb redistributed traffic
Steady-state optimization -- maintaining average utilization below operating thresholds
Growth projection planning -- reserving bandwidth for future traffic increases
4. Why is "bidirectional traceability" important in the validation process?
It ensures that traffic flows symmetrically through the network in both directions
It allows tracing forward from requirements to test results AND backward from results to requirements, ensuring no gaps
It verifies that redundant links carry traffic in both active-active and active-passive configurations
It confirms that both the primary and backup data centers meet the same design specifications
5. Which of the following is a network design anti-pattern?
Implementing spine-leaf topology with ECMP for east-west traffic optimization
Using EVPN-VXLAN overlay for multi-tenant data center segmentation
Relying on Spanning Tree Protocol for loop prevention in a modern data center fabric
Deploying redundant route reflectors in separate failure domains
6. What distinguishes a "rollback" from a "fallback" strategy in implementation planning?
Rollback is faster than fallback; fallback is more thorough
Rollback reverts all changes to the last-known-good state; fallback routes around the problem while leaving some changes in place
Rollback is for hardware changes; fallback is for software changes
Rollback requires vendor support; fallback can be done by the internal team
7. What is the primary benefit of spine-leaf architecture over traditional three-tier designs for data center east-west traffic?
Lower equipment cost due to fewer switches needed overall
Predictable latency with exactly two hops between any two servers and full ECMP utilization
Simpler management because a single control plane instance manages all switches
Better north-south traffic handling due to centralized gateway placement
8. During a phased deployment, what is the purpose of a "soak period" after the pilot phase?
To allow the vendor time to ship replacement hardware for the next phase
To observe the change under production traffic over an extended period and catch issues that only appear under sustained load
To retrain the operations team on the new configuration before expanding deployment
To wait for regulatory approval before deploying to additional sites
9. A silent route leak from a misconfigured peer scores Severity=8, Occurrence=5, Detection=8 in an FMEA analysis. Why does this warrant immediate design mitigation?
Because the occurrence score of 5 means the failure happens more than once per day
Because the high detection score means the issue is difficult to detect before it causes impact, combined with high severity, yielding an RPN of 320
Because any individual score above 7 automatically triggers mandatory mitigation
Because the severity score of 8 alone requires an immediate network redesign
10. In a Change Impact Assessment, what should a designer do FIRST after identifying a changed requirement?
Immediately begin modifying the network architecture to accommodate the change
Trace the impact through the RTM to identify every affected design element
Estimate the budget impact and submit a change request to management
Schedule a maintenance window for the required modifications
11. Which of the following represents a "shared fate" risk in failure domain analysis?
Two routers running different software versions in separate racks
Redundant links routed through the same fiber conduit that could be severed by a single dig event
Primary and backup DNS servers deployed in different availability zones
Active-active load balancers using independent power supplies from separate utility feeds
12. What is the "point of no return" in a backout plan?
The moment when the first configuration change is applied to a production device
The stage at which reversing the change becomes impractical or more disruptive than continuing forward
The deadline by which the change must be completed or it will be cancelled
The point at which the maintenance window expires and normal operations must resume
13. Why is "lift-and-shift to cloud" considered a design anti-pattern?
Because cloud infrastructure is inherently less reliable than on-premises equipment
Because it replicates on-premises architecture without leveraging cloud-native benefits like auto-scaling and managed services
Because cloud providers do not support traditional routing protocols like OSPF and BGP
Because regulatory requirements prohibit moving network functions to the cloud
14. Which design review type specifically evaluates whether a design can be built within available technology, budget, and timeline constraints?
Completeness Review
Consistency Review
Feasibility Review
Standards Compliance Review
15. A cost optimization effort replaces 100G-capable switches with 10G switches at access layer sites where only 10G is needed. Which cost optimization principle does this follow?
Consolidating functions onto multi-function platforms
Right-sizing equipment to match actual and projected requirements
Leveraging hierarchical design to concentrate expensive equipment at the core
Evaluating managed services versus owned infrastructure
Design validation is the systematic process of confirming that a proposed or existing network architecture meets all stated requirements -- functional, performance, security, and operational. It is not a single activity but a layered process spanning documentation review, analytical testing, lab verification, and staged deployment.
Requirements Traceability and the RTM
At the foundation of validation is requirements traceability -- mapping every business and technical requirement to the specific design elements that fulfill it. The Requirements Traceability Matrix (RTM) is the authoritative record proving that every agreed-upon requirement has been addressed.
An effective RTM includes these columns:
| RTM Column | Purpose | Example |
| --- | --- | --- |
| Requirement ID | Unique identifier for tracking | REQ-HA-003 |
| Requirement Description | What must be achieved | "Core routing must recover from single node failure within 500ms" |
| Design Element | Architecture component addressing the requirement | Dual-plane IS-IS topology with BFD (50ms timers) |
| Validation Method | How compliance will be verified | Lab failover test with traffic generators |
| Validation Status | Current state of verification | Passed / Failed / Pending |
| Risk if Unmet | Business impact of non-compliance | SLA breach, revenue loss during outage |
Bidirectional traceability is critical: you must trace forward from a requirement to its design element and test case, and backward from a test result to the requirement it validates. This ensures no requirement is orphaned and no test exists without a clear purpose.
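The bidirectional check can be sketched in a few lines of Python. This is a minimal illustration, not a real RTM tool; the field names (`req_id`, `design_element`, `test_case`, `status`) and the example rows are invented for the sketch.

```python
# Minimal RTM sketch: forward trace finds orphaned requirements,
# backward trace finds tests with no requirement behind them.

rtm = [
    {"req_id": "REQ-HA-003",
     "design_element": "Dual-plane IS-IS with BFD (50ms timers)",
     "test_case": "TC-017 lab failover with traffic generators",
     "status": "Passed"},
    {"req_id": "REQ-SEC-001",
     "design_element": "Prefix filters on all eBGP sessions",
     "test_case": None,   # no test case yet -> forward-trace gap
     "status": "Pending"},
]

def forward_gaps(rtm):
    """Requirements missing a design element or test case (orphaned)."""
    return [r["req_id"] for r in rtm
            if not r["design_element"] or not r["test_case"]]

def backward_gaps(rtm, executed_tests):
    """Executed tests that trace back to no requirement (no clear purpose)."""
    traced = {r["test_case"] for r in rtm if r["test_case"]}
    return sorted(set(executed_tests) - traced)

print(forward_gaps(rtm))    # REQ-SEC-001 has no test case
print(backward_gaps(rtm, ["TC-017 lab failover with traffic generators",
                          "TC-099 ad-hoc ping sweep"]))
```

An RTM audit that runs both functions and reports empty lists for each is the programmatic equivalent of "no requirement is orphaned and no test exists without a clear purpose."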
```mermaid
flowchart LR
A["Business/Technical\nRequirement"] --> B["Design Element\n(Architecture Component)"]
B --> C["Test Case\n(Validation Method)"]
C --> D["Validation Result\n(Pass / Fail / Pending)"]
D -->|"Backward Trace"| C
C -->|"Backward Trace"| B
B -->|"Backward Trace"| A
A -->|"Forward Trace"| D
```
Figure 19.1: Bidirectional Requirements Traceability -- forward tracing links requirements to validated results; backward tracing confirms every test maps to a requirement.
Design Review Process
Formal design reviews apply expert judgment to areas traceability alone cannot cover. A structured peer review includes four sequential gates:
- Completeness Review -- Does the design address every requirement in the RTM?
- Consistency Review -- Are there contradictions between design elements?
- Feasibility Review -- Can the design be implemented with available technology, budget, and timelines?
- Standards Compliance Review -- Does the design conform to organizational, vendor, and regulatory requirements?
```mermaid
flowchart TD
Start["Design Document\nSubmitted for Review"] --> R1["Completeness Review\nAll RTM requirements addressed?"]
R1 -->|Pass| R2["Consistency Review\nNo contradictions between elements?"]
R1 -->|Gaps Found| Fix1["Document gaps\nand return to designer"]
R2 -->|Pass| R3["Feasibility Review\nImplementable within constraints?"]
R2 -->|Conflicts Found| Fix2["Resolve contradictions\nand re-review"]
R3 -->|Pass| R4["Standards Compliance Review\nConforms to org/vendor/regulatory?"]
R3 -->|Infeasible| Fix3["Adjust scope, budget,\nor technology choices"]
R4 -->|Pass| Approved["Design Approved\nfor Implementation"]
R4 -->|Non-compliant| Fix4["Remediate compliance\ngaps and re-review"]
```
Figure 19.2: Design Review Process -- four sequential review gates with feedback loops for identified issues.
Failure Mode and Effects Analysis (FMEA)
FMEA systematically identifies potential failure modes, assesses their impact, and prioritizes mitigation. For each component, it asks: What can fail? What happens when it fails? How likely is it and how quickly can we detect it?
The Risk Priority Number (RPN) combines three factors:
RPN = Severity x Occurrence x Detection
| Factor | Scale | Meaning |
| --- | --- | --- |
| Severity | 1-10 | Impact on service if failure occurs (10 = total outage) |
| Occurrence | 1-10 | Likelihood of the failure happening (10 = near certain) |
| Detection | 1-10 | Difficulty of detecting the failure before impact (10 = undetectable) |
Example: A spine switch failure in a dual-spine data center might score: Severity=4, Occurrence=3, Detection=2. RPN=24 (low priority). A silent route leak from a misconfigured peer: Severity=8, Occurrence=5, Detection=8. RPN=320 (high priority requiring prefix filters and RPKI validation).
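The calculation is simple enough to sketch directly. This is an illustrative snippet only; the mitigation threshold of 125 is a common rule of thumb in FMEA practice, not a fixed standard, and organizations set their own cut-offs.

```python
# RPN sketch: Severity x Occurrence x Detection, each on a 1-10 scale.
# The threshold of 125 is an illustrative policy, not a standard value.

def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number for one failure mode."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("FMEA scores must be on a 1-10 scale")
    return severity * occurrence * detection

def action(rpn_value: int, threshold: int = 125) -> str:
    """Map an RPN to a coarse recommendation."""
    return "mitigate" if rpn_value >= threshold else "accept with monitoring"

spine_failure = rpn(4, 3, 2)   # dual-spine DC: contained impact, easy to detect
route_leak = rpn(8, 5, 8)      # silent route leak: severe and hard to detect

print(spine_failure, action(spine_failure))   # 24 accept with monitoring
print(route_leak, action(route_leak))         # 320 mitigate
```

Note how Detection dominates the route-leak score: a failure that is hard to see before it causes impact multiplies the risk even when Occurrence is moderate.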
```mermaid
flowchart TD
Start["Identify Component\nor Dependency"] --> FM["Define Failure Mode\nWhat can fail?"]
FM --> Effect["Assess Effect\nWhat happens when it fails?"]
Effect --> S["Rate Severity\n1-10 scale"]
Effect --> O["Rate Occurrence\n1-10 scale"]
Effect --> D["Rate Detection\n1-10 scale"]
S --> RPN["Calculate RPN\nSeverity x Occurrence x Detection"]
O --> RPN
D --> RPN
RPN --> Eval{RPN Threshold?}
Eval -->|"High RPN\n(e.g. 320)"| Mitigate["Immediate Design\nMitigation Required"]
Eval -->|"Low RPN\n(e.g. 24)"| Accept["Accept Risk\nwith Monitoring"]
```
Figure 19.3: FMEA Process Flow -- each component is evaluated for failure mode, effect, and three risk factors that combine into a Risk Priority Number.
Failure Domain Mapping
Beyond individual failures, designers must map failure domains -- the blast radius of any single failure event. Key questions: Does a control plane failure affect the entire network? Can a shared-fate event take down components intended to be independent? Do redundancy mechanisms actually provide independent failure domains?
| Failure Domain | Components Affected | Shared Fate Risks | Mitigation |
| --- | --- | --- | --- |
| Single rack | ToR switches, servers in rack | Power feed, rack PDU | Dual-homed servers to separate racks |
| Availability zone | All racks in zone | Building power, cooling, fiber entry | Workloads span multiple AZs |
| Control plane | All devices sharing routing instance | Route reflector, SDN controller | Redundant RRs in separate failure domains |
| Management plane | All managed devices | NMS server, AAA infrastructure | Out-of-band management network |
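Shared-fate analysis lends itself to a simple mechanical check: group nominally redundant components by the physical resources they depend on and flag any pair that overlaps. The sketch below is illustrative; the component and resource names are invented, and a real inventory would come from a CMDB or fiber-plant database.

```python
# Shared-fate sketch: any two components that depend on the same
# physical resource (conduit, PDU, AZ, ...) share fate with it.
from itertools import combinations

# component -> set of underlying physical resources (illustrative data)
dependencies = {
    "link-A-primary": {"conduit-7", "pdu-rack12"},
    "link-A-backup": {"conduit-7", "pdu-rack15"},   # same conduit as primary!
    "rr-1": {"az-east"},
    "rr-2": {"az-west"},
}

def shared_fate_pairs(dependencies):
    """Pairs of components that share at least one physical resource."""
    return [(a, b, sorted(dependencies[a] & dependencies[b]))
            for a, b in combinations(sorted(dependencies), 2)
            if dependencies[a] & dependencies[b]]

for a, b, shared in shared_fate_pairs(dependencies):
    print(f"{a} and {b} share fate via {shared}")
```

Here the "redundant" link pair is exposed immediately: both members of the bundle traverse conduit-7, so a single dig event severs both — exactly the risk in the failure-domain table above. The route reflectors, placed in separate availability zones, report no overlap.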
Capacity Planning and Scalability
Capacity validation ensures the design handles both current traffic and projected growth. A critical mistake is dimensioning based on steady-state averages instead of worst-case scenarios including failure rerouting, burst behavior, and reconvergence overhead.
The N-1 Rule: Under any single failure condition, no remaining link or node should exceed a defined utilization threshold (typically 70-80%). For critical infrastructure, N-2 validation may be required.
| Capacity Scenario | Validation Question | Threshold |
| --- | --- | --- |
| Steady state | Are all links below target utilization? | < 50% |
| N-1 failure | Can the network absorb one failure without congestion? | < 80% |
| N-2 failure | Can the network absorb two simultaneous failures? | < 95% |
| Peak + failure | Can the network handle peak traffic during a failure? | < 90% |
| Growth projection | Will the design accommodate 2-3 year traffic growth? | Varies |
Once a design has been validated and gaps identified, optimization addresses those gaps while improving efficiency, cost-effectiveness, and adaptability. Optimization is about making informed trade-offs that best serve business requirements.
Identifying Design Anti-Patterns
Anti-patterns are recurring design choices that appear reasonable but produce negative outcomes. Recognizing them is a core CCDE skill.
| Anti-Pattern | Description | Consequence | Resolution |
| --- | --- | --- | --- |
| Flat network | No segmentation or hierarchy | Poor scalability, large failure domains | Hierarchical design or spine-leaf topology |
| Nosy neighbor | Excessive polling instead of event-driven communication | Unnecessary traffic, tight coupling | Event-driven architectures, streaming telemetry |
| Lift-and-shift | Replicating on-prem architecture in cloud | Misses cloud-native benefits | Redesign for cloud-native patterns |
| Management plane bypass | Security on data plane but not management plane | Unprotected management interfaces | Consistent security across all planes; OOB with MFA |
| Over-engineering | Google-scale solutions for mid-size environments | Unnecessary complexity and cost | Right-size to actual requirements |
| STP dependency | STP for loop prevention in modern data centers | Wasted bandwidth, slow convergence | Spine-leaf with ECMP, EVPN-VXLAN |
Performance Optimization
Spine-Leaf Architecture is the standard for modern data centers because it addresses east-west traffic with predictable latency (exactly two hops), horizontal scalability, full ECMP utilization, and simplified troubleshooting.
QoS Optimization manages contention for existing bandwidth. It does not add bandwidth. The three QoS models:
| QoS Model | Mechanism | Use Case | Trade-off |
| --- | --- | --- | --- |
| Best Effort | No differentiation | General internet traffic | Simple but no guarantees |
| DiffServ | Per-hop behaviors (DSCP) | Enterprise WAN, campus | Scalable, good enough for most |
| IntServ (RSVP) | Per-flow reservation | Ultra-critical real-time flows | Precise but does not scale |
Traffic Engineering optimizes path selection beyond shortest-path routing: MPLS-TE / Segment Routing TE for specific path steering, traffic shaping for burst smoothing, and SD-WAN for dynamic path selection based on real-time metrics.
Cost Optimization
The key principle: optimize cost by eliminating waste, not by cutting capability. Strategies include:
- Right-size equipment -- Replace over-provisioned hardware with appropriately sized platforms
- Consolidate functions -- Use multi-function platforms where requirements permit
- Leverage hierarchical design -- Concentrate expensive equipment at core tiers, cost-effective equipment at the edge
- Evaluate managed services vs. owned infrastructure -- SD-WAN, SASE, NaaS can shift CAPEX to OPEX
- Optimize licensing -- Right-size software feature licenses to actual needs
Adapting Designs for Changed Specifications
Requirements change. A well-optimized design accommodates change gracefully using the Change Impact Assessment Framework:
- Identify the changed requirement
- Trace the impact through the RTM to identify every affected design element
- Assess design headroom -- can the current design absorb the change?
- Evaluate options with trade-off analysis
- Update the RTM and re-validate
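Step 2 of the framework is exactly an RTM query. The sketch below shows that query against a toy RTM; the rows, requirement IDs, and test-case IDs are invented for illustration.

```python
# Change-impact sketch: walk the RTM to find every design element and
# test case traced to a changed requirement. Data is illustrative.

rtm = [
    {"req_id": "REQ-HA-003", "design_element": "Dual-plane IS-IS with BFD",
     "test_case": "TC-017"},
    {"req_id": "REQ-HA-003", "design_element": "Diverse fiber paths",
     "test_case": "TC-021"},
    {"req_id": "REQ-PERF-010", "design_element": "QoS policy on WAN edge",
     "test_case": "TC-030"},
]

def impacted(rtm, changed_req_id):
    """All design elements touched by the change, and the tests to re-run."""
    rows = [r for r in rtm if r["req_id"] == changed_req_id]
    return {"design_elements": sorted({r["design_element"] for r in rows}),
            "tests_to_rerun": sorted({r["test_case"] for r in rows})}

print(impacted(rtm, "REQ-HA-003"))
```

The point is discipline, not tooling: because the RTM already links requirements to design elements and tests, the blast radius of a requirement change falls out mechanically instead of depending on someone's memory of the design.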
```mermaid
flowchart TD
Change["Changed Requirement\nIdentified"] --> Trace["Trace Impact via RTM\nIdentify affected design elements"]
Trace --> Headroom{"Design has\nheadroom?"}
Headroom -->|Yes| Absorb["Absorb change within\nexisting architecture"]
Headroom -->|No| Modify["Requires architectural\nmodification"]
Absorb --> Options["Evaluate options\nCost / Complexity / Risk"]
Modify --> Options
Options --> Update["Update RTM\nTrace to original + new requirements"]
Update --> Revalidate["Re-validate\nmodified design"]
```
Figure 19.6: Change Impact Assessment Framework -- from requirement change through impact tracing, headroom evaluation, and RTM update to re-validation.
A validated and optimized design is worthless if it cannot be implemented safely. Implementation planning translates design decisions into executable, risk-managed action plans.
High-Level Implementation Plans
Implementation plans define the sequence, dependencies, and responsibilities for bringing a design change into production. Key elements:
| Plan Element | Description | Example |
| --- | --- | --- |
| Scope | What is being changed and what is out of scope | "Upgrade core from OSPF to IS-IS in Building A; Building B is Phase 2" |
| Prerequisites | Conditions that must be met before execution | Hardware staged, configs reviewed, maintenance window approved |
| Step Sequence | Ordered actions with dependencies | 1. Backup configs, 2. Apply IS-IS config, 3. Verify adjacency... |
| Responsible Party | Named individual for each step | Network Engineer: J. Smith |
| Expected Duration | Time estimate per step and total window | Step 2: 15 min; Total: 4 hours |
| Success Criteria | Measurable outcomes confirming success | "IS-IS adjacency established, all prefixes in RIB" |
| Communication Plan | Who to notify at each phase | Stakeholder update at each phase gate |
Dependency types:
- Hard dependencies -- Step B cannot start until Step A succeeds (e.g., cannot verify adjacency before applying config)
- Soft dependencies -- Step B benefits from Step A but can proceed independently
- External dependencies -- Require action from teams outside the implementer's control
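Hard dependencies define a partial order over the steps, which means the step sequence can be derived rather than hand-maintained. A minimal sketch using the standard library's `graphlib` (the step names are illustrative, and soft and external dependencies are not modeled here):

```python
# Sequencing sketch: derive a valid step order from hard dependencies.
# graphlib is in the Python standard library (3.9+).
from graphlib import TopologicalSorter

# step -> set of hard prerequisites (illustrative OSPF-to-IS-IS migration)
hard_deps = {
    "backup-configs": set(),
    "apply-isis-config": {"backup-configs"},
    "verify-adjacency": {"apply-isis-config"},
    "remove-ospf": {"verify-adjacency"},
}

# static_order() raises CycleError if the dependencies are contradictory,
# which catches an impossible plan before the maintenance window opens.
order = list(TopologicalSorter(hard_deps).static_order())
print(order)
```

Encoding dependencies this way also gives an early sanity check: a circular hard dependency (step A needs B, B needs A) surfaces as an error at planning time instead of as a stalled change at 2 a.m.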
Risk Mitigation and Rollback Planning
Every implementation plan must include a rollback strategy developed in parallel with the implementation steps.
| Strategy | Action | When to Use | Limitation |
| --- | --- | --- | --- |
| Rollback | Revert all changes to last-known-good state | Implementation fails; partial state worse than original | Requires original state was captured and is restorable |
| Fallback | Route around the problem using feature flags or alternate paths | Partial failure where some components work | May leave mixed state requiring follow-up |
A Backout Plan (called for by ITIL change management and ISO/IEC 20000) must specify:
- Trigger criteria -- What conditions initiate a backout?
- Point of no return -- When does backout become more disruptive than pressing forward?
- Backout steps -- Step-by-step reversal mirroring implementation in reverse
- Backout duration -- Must fit within remaining maintenance window
- Verification after backout -- Confirm environment restored to pre-change state
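The point-of-no-return and backout-duration items interact: backout is only viable while the remaining window can still absorb the full reversal plus post-backout verification. A hedged sketch of that check (the safety margin and durations are invented for the example, not prescribed values):

```python
# Point-of-no-return sketch: backout must fit in the remaining window,
# with margin for verification and surprises. Numbers are illustrative.

def backout_viable(minutes_remaining, backout_minutes, verify_minutes,
                   safety_margin=1.25):
    """True while a full backout plus verification still fits the window."""
    needed = (backout_minutes + verify_minutes) * safety_margin
    return minutes_remaining >= needed

# 4-hour window, 90 min elapsed: 150 min left; backout 60 + verify 30,
# padded to 112.5 min -> still viable.
print(backout_viable(150, 60, 30))
# Only 60 min left -> past the point of no return; pressing forward
# (or invoking a fallback) is now less disruptive than reversing.
print(backout_viable(60, 60, 30))
```

Evaluating this continuously during the change, rather than once at the start, is what turns "point of no return" from a vague phrase into an actionable gate.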
```mermaid
flowchart TD
Issue["Issue Detected\nDuring Implementation"] --> Impact{"Is service\nimpacted?"}
Impact -->|No| Monitor["Continue with\ncaution, monitor"]
Impact -->|Yes| Severity{"Severity level?"}
Severity -->|High| Rollback["Rollback\nimmediately"]
Severity -->|Low| Fixable{"Can issue be\nresolved in window?"}
Fixable -->|Yes| Fix["Fix and\ncontinue"]
Fixable -->|No| RollbackReschedule["Rollback and\nreschedule"]
```
Figure 19.4: Risk Mitigation Decision Tree -- structured decision flow from issue detection through severity assessment to rollback or resolution.
Staged Deployment and Validation Checkpoints
Large-scale changes should never be a single monolithic event. Staged deployment limits the blast radius of problems:
| Phase | Scope | Purpose | Gate Criteria |
| --- | --- | --- | --- |
| Lab validation | Simulated environment | Verify configuration correctness | All test cases pass |
| Pilot deployment | Single non-critical site | Validate in production with limited exposure | Metrics stable for 48-72 hour soak |
| Limited production | Small subset of production sites | Build operational confidence | No incidents during soak |
| Full production | Remaining sites | Complete the rollout | Incremental deployment with monitoring |
```mermaid
stateDiagram-v2
[*] --> LabValidation: Design approved
LabValidation: Lab Validation
LabValidation --> PilotDeployment: All test cases pass
PilotDeployment: Pilot Deployment (single non-critical site)
PilotDeployment --> LimitedProduction: Metrics stable after 48-72hr soak
LimitedProduction: Limited Production (subset of sites)
LimitedProduction --> FullProduction: No incidents during soak period
FullProduction: Full Production Rollout
FullProduction --> [*]: Rollout complete
LabValidation --> Remediate: Unexpected behavior
PilotDeployment --> Remediate: Service degradation
LimitedProduction --> Remediate: Incidents detected
Remediate: Remediate and Re-validate
Remediate --> LabValidation: Fix applied
```
Figure 19.5: Staged Deployment Lifecycle -- four phases with gate criteria and remediation loops that return to lab validation when issues arise.
Validation Checkpoints at each phase gate:
- Smoke tests -- Quick verification of basic functionality (Can traffic pass? Do adjacencies form?)
- Functional tests -- Deeper validation of specific requirements (Does failover complete within convergence time?)
- Performance baseline comparison -- Compare post-change metrics against pre-change baseline
- Monitoring soak period -- Extended observation under production traffic
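The checkpoints above combine naturally into a single go/no-go decision at each phase gate. The sketch below is illustrative only: the smoke-test names and the 10% latency-regression tolerance are invented for the example, not standard values.

```python
# Phase-gate sketch: advance only if every smoke test passes AND the
# post-change baseline has not regressed beyond tolerance.
# The 10% tolerance is an illustrative policy, not a standard.

def gate_passes(smoke_results, baseline_ms, current_ms, tolerance=0.10):
    """Go/no-go for the next deployment phase."""
    if not all(smoke_results.values()):
        return False                      # any smoke-test failure blocks the gate
    return current_ms <= baseline_ms * (1 + tolerance)

smoke = {"traffic-passes": True, "adjacencies-up": True}
print(gate_passes(smoke, baseline_ms=12.0, current_ms=12.8))   # within 10%
print(gate_passes(smoke, baseline_ms=12.0, current_ms=14.0))   # regression
```

The soak period supplies the `current_ms` input: a baseline comparison taken five minutes after the change proves little, while one taken after 48-72 hours of production traffic catches the slow-burn issues that staged deployment exists to find.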
Now that you have studied the material, answer the same questions again. Compare your results to measure what you learned.
1. What is the primary purpose of a Requirements Traceability Matrix (RTM) in network design validation?
To document the project budget and resource allocation for the design team
To map every requirement to its corresponding design element, test case, and validation result
To create a timeline for phased network deployment across multiple sites
To list all hardware and software components required for the network build
2. In FMEA, the Risk Priority Number (RPN) is calculated as:
Impact x Probability x Cost
Severity x Occurrence x Detection
Likelihood x Consequence x Mitigation
Frequency x Duration x Scope
3. A network designer validates that under any single failure (N-1), no remaining link exceeds 80% utilization. Which capacity planning principle does this represent?
Burst absorption planning -- dimensioning for microburst traffic patterns
Failure-aware capacity planning -- ensuring surviving paths can absorb redistributed traffic
Steady-state optimization -- maintaining average utilization below operating thresholds
Growth projection planning -- reserving bandwidth for future traffic increases
4. Why is "bidirectional traceability" important in the validation process?
It ensures that traffic flows symmetrically through the network in both directions
It allows tracing forward from requirements to test results AND backward from results to requirements, ensuring no gaps
It verifies that redundant links carry traffic in both active-active and active-passive configurations
It confirms that both the primary and backup data centers meet the same design specifications
5. Which of the following is a network design anti-pattern?
Implementing spine-leaf topology with ECMP for east-west traffic optimization
Using EVPN-VXLAN overlay for multi-tenant data center segmentation
Relying on Spanning Tree Protocol for loop prevention in a modern data center fabric
Deploying redundant route reflectors in separate failure domains
6. What distinguishes a "rollback" from a "fallback" strategy in implementation planning?
Rollback is faster than fallback; fallback is more thorough
Rollback reverts all changes to the last-known-good state; fallback routes around the problem while leaving some changes in place
Rollback is for hardware changes; fallback is for software changes
Rollback requires vendor support; fallback can be done by the internal team
7. What is the primary benefit of spine-leaf architecture over traditional three-tier designs for data center east-west traffic?
Lower equipment cost due to fewer switches needed overall
Predictable latency with exactly two hops between any two servers and full ECMP utilization
Simpler management because a single control plane instance manages all switches
Better north-south traffic handling due to centralized gateway placement
8. During a phased deployment, what is the purpose of a "soak period" after the pilot phase?
To allow the vendor time to ship replacement hardware for the next phase
To observe the change under production traffic over an extended period and catch issues that only appear under sustained load
To retrain the operations team on the new configuration before expanding deployment
To wait for regulatory approval before deploying to additional sites
9. A silent route leak from a misconfigured peer scores Severity=8, Occurrence=5, Detection=8 in an FMEA analysis. Why does this warrant immediate design mitigation?
Because the occurrence score of 5 means the failure happens more than once per day
Because the high detection score means the issue is difficult to detect before it causes impact, combined with high severity, yielding an RPN of 320
Because any individual score above 7 automatically triggers mandatory mitigation
Because the severity score of 8 alone requires an immediate network redesign
10. In a Change Impact Assessment, what should a designer do FIRST after identifying a changed requirement?
Immediately begin modifying the network architecture to accommodate the change
Trace the impact through the RTM to identify every affected design element
Estimate the budget impact and submit a change request to management
Schedule a maintenance window for the required modifications
11. Which of the following represents a "shared fate" risk in failure domain analysis?
Two routers running different software versions in separate racks
Redundant links routed through the same fiber conduit that could be severed by a single dig event
Primary and backup DNS servers deployed in different availability zones
Active-active load balancers using independent power supplies from separate utility feeds
12. What is the "point of no return" in a backout plan?
The moment when the first configuration change is applied to a production device
The stage at which reversing the change becomes impractical or more disruptive than continuing forward
The deadline by which the change must be completed or it will be cancelled
The point at which the maintenance window expires and normal operations must resume
13. Why is "lift-and-shift to cloud" considered a design anti-pattern?
Because cloud infrastructure is inherently less reliable than on-premises equipment
Because it replicates on-premises architecture without leveraging cloud-native benefits like auto-scaling and managed services
Because cloud providers do not support traditional routing protocols like OSPF and BGP
Because regulatory requirements prohibit moving network functions to the cloud
14. Which design review type specifically evaluates whether a design can be built within available technology, budget, and timeline constraints?
Completeness Review
Consistency Review
Feasibility Review
Standards Compliance Review
15. A cost optimization effort replaces 100G-capable switches with 10G switches at access layer sites where only 10G is needed. Which cost optimization principle does this follow?
Consolidating functions onto multi-function platforms
Right-sizing equipment to match actual and projected requirements
Leveraging hierarchical design to concentrate expensive equipment at the core
Evaluating managed services versus owned infrastructure