Your CFO asks why Kubernetes costs doubled last quarter. You open OpenCost and show exactly which team and application drove the increase. Finance schedules a meeting to discuss optimization.
A junior developer accidentally deletes the production namespace. Instead of panicking, you restore from yesterday's Velero backup. The database comes back in a consistent state because your pre-backup hooks ran pg_dump before the snapshot. Production is back in 25 minutes.
During a Game Day, you kill one Task API pod while monitoring latency. Kubernetes spawns a replacement in 4 seconds. No requests return 5xx errors. You document the RTO: 4 seconds. That's 26 seconds better than your 30-second target.
This is operational excellence. Not a checklist of installed tools, but a system that answers real questions, recovers from real disasters, and proves itself under controlled failure.
This capstone integrates everything from Chapter 89: VPA for right-sizing, cost allocation labels for visibility, Velero for backup, and Chaos Mesh for resilience validation. By the end, your Task API deployment demonstrates production-ready operational excellence.
Every production system needs explicit requirements. Before deploying anything, write a specification that defines success.
Create a file called task-api-opex-spec.md:
Without a specification, you deploy components and hope they work. With a specification:
Now implement each component from previous lessons, composed into a unified deployment.
Label placement is critical: labels must appear in both metadata.labels (on the Deployment) and spec.template.metadata.labels (on the pods). OpenCost reads pod labels, not Deployment labels, so a label set only on the Deployment never reaches the cost reports.
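The placement rule above can be checked mechanically before you apply a manifest. The following is a minimal sketch, assuming the manifest has already been loaded into a Python dict. The four label keys are hypothetical placeholders, since the actual keys come from your specification.

```python
# Hypothetical label keys: replace with the four keys your specification requires.
REQUIRED_LABELS = frozenset({"team", "app", "environment", "cost-center"})


def missing_labels(deployment: dict) -> dict:
    """Return the required label keys missing at each level of a Deployment."""
    deployment_labels = deployment.get("metadata", {}).get("labels", {})
    pod_labels = (
        deployment.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("labels", {})
    )
    return {
        "deployment": sorted(REQUIRED_LABELS - set(deployment_labels)),
        "pod_template": sorted(REQUIRED_LABELS - set(pod_labels)),
    }


# Example manifest (hypothetical values): the pod template lacks cost-center,
# which is the level OpenCost actually reads.
example = {
    "metadata": {
        "labels": {"team": "platform", "app": "task-api",
                   "environment": "production", "cost-center": "eng-42"},
    },
    "spec": {
        "template": {
            "metadata": {
                "labels": {"team": "platform", "app": "task-api",
                           "environment": "production"},
            }
        }
    },
}

print(missing_labels(example))
```

A non-empty pod_template list is the case that matters most here, because the Deployment-level labels alone would pass a casual review while the cost data stays unattributed.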
Why Off mode: Per the specification, we want recommendations without automatic restarts. After validating recommendations for 1-2 weeks, you can change to Initial or Recreate mode.
Hook purpose: The pre-backup hook runs CHECKPOINT to flush PostgreSQL's dirty buffers to disk and pg_dump to capture a logical dump for application-level recovery. This satisfies SC-006: database consistency during backup.
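One way to express such a hook is through Velero's pod annotations. The sketch below builds that annotation set in Python so the command can be validated as JSON before it reaches a manifest. It assumes the annotation-based hook style; the lesson's own manifest may define hooks in the Backup or Schedule spec instead. The container name, database name, user, and dump path are hypothetical.

```python
import json


def pre_backup_hook_annotations(container, command, timeout="120s", on_error="Fail"):
    """Build Velero pre-backup hook annotations for a pod template."""
    if on_error not in ("Fail", "Continue"):
        raise ValueError("on_error must be 'Fail' or 'Continue'")
    return {
        "pre.hook.backup.velero.io/container": container,
        # Velero expects the command as a JSON array encoded in a string.
        "pre.hook.backup.velero.io/command": json.dumps(command),
        "pre.hook.backup.velero.io/timeout": timeout,
        "pre.hook.backup.velero.io/on-error": on_error,
    }


# Hypothetical command: flush buffers, then write a logical dump to a mounted path.
hook_command = [
    "/bin/sh",
    "-c",
    "psql -U postgres -d tasks -c 'CHECKPOINT;' "
    "&& pg_dump -U postgres -d tasks -f /backup/tasks.sql",
]

annotations = pre_backup_hook_annotations("postgres", hook_command)
for key, value in annotations.items():
    print(f"{key}: {value}")
```

Setting on-error to Fail is a deliberate choice in this sketch: a backup whose hook failed is marked as failed rather than silently stored with a possibly inconsistent database.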
Why staging only: The specification requires chaos experiments in staging, not production. After validating behavior in staging, you can create a production experiment for scheduled Game Days.
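The staging-only rule can be enforced with a small guard that runs before an experiment is applied. This is a sketch under assumptions: the experiment is loaded as a dict, the namespace names are hypothetical, and the guard deliberately requires an explicit spec.selector.namespaces list rather than relying on any default.

```python
# Hypothetical namespace name: use whatever your staging namespace is called.
ALLOWED_CHAOS_NAMESPACES = frozenset({"staging"})


def is_staging_only(experiment: dict, allowed=ALLOWED_CHAOS_NAMESPACES) -> bool:
    """True only if the experiment explicitly targets allowed namespaces and nothing else."""
    selector = experiment.get("spec", {}).get("selector", {})
    namespaces = selector.get("namespaces") or []
    # An absent or empty list is rejected: the target must be stated explicitly.
    return bool(namespaces) and set(namespaces) <= set(allowed)


# Example PodChaos experiment (hypothetical names and labels).
pod_kill = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "task-api-pod-kill", "namespace": "staging"},
    "spec": {
        "action": "pod-kill",
        "mode": "one",
        "selector": {
            "namespaces": ["staging"],
            "labelSelectors": {"app": "task-api"},
        },
    },
}

print(is_staging_only(pod_kill))
```

When you later create a production experiment for a scheduled Game Day, passing a different allowed set makes that decision visible in code review instead of hiding it in a manifest diff.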
You've written a specification and prepared individual components. Now use your operational-excellence skill to verify everything integrates correctly.
AI reviews your manifests against the operational-excellence skill patterns:
Cost Labels Check:
VPA Safety Check:
Velero Completeness Check:
Chaos Safety Check:
If AI identifies issues, it suggests corrections:
You provide context:
AI confirms:
Deploy all components and verify each success criterion.
Output:
Work through each success criterion systematically.
Output:
All four labels present on pods.
Output:
OpenCost aggregates costs by team label.
Output:
VPA recommends 180m CPU and 450Mi memory (lower than your 250m/512Mi requests).
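To put that gap in proportion, the sketch below converts the Kubernetes quantity strings and computes the relative reduction. It handles only the suffixes that appear in this lesson (m for CPU; Ki, Mi, Gi for memory) and is not a full quantity parser.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_mib(quantity: str) -> float:
    """Convert a memory quantity with a Ki, Mi, or Gi suffix to MiB."""
    factors = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in factors.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity}")


def reduction_pct(current: float, recommended: float) -> float:
    """Percentage by which the recommendation undercuts the current request."""
    return round(100 * (current - recommended) / current, 1)


cpu_saving = reduction_pct(cpu_millicores("250m"), cpu_millicores("180m"))
mem_saving = reduction_pct(memory_mib("512Mi"), memory_mib("450Mi"))
print(f"CPU request could drop by {cpu_saving}%")      # 28.0%
print(f"Memory request could drop by {mem_saving}%")   # 12.1%
```

The CPU headroom is noticeably larger than the memory headroom, which is worth keeping in mind when you decide which request to adjust first.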
Output:
Schedule is enabled with correct timing and retention.
Check the most recent backup for hook execution:
Output:
Hooks executed successfully.
Test restore in staging (never test restores against production):
Output:
Restore completed in 2 minutes 15 seconds. Well under the 30-minute RTO.
Run the PodChaos experiment and measure recovery:
Output:
Recovery time: 4 seconds. Well under the 30-second target.
During the chaos experiment, check for errors:
Output:
Zero 5xx errors during pod kill.
Output:
P95 latency: 245ms. Under the 500ms target.
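If you want to cross-check a P95 figure against raw request timings, a nearest-rank percentile is enough. This sketch is an assumption about method: the lesson's figure most likely comes from the monitoring stack, which may interpolate differently, so small discrepancies are expected.

```python
import math


def percentile_nearest_rank(samples, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_target(samples_ms, target_ms: float = 500, p: int = 95) -> bool:
    """Compare the chosen percentile against the latency target from the specification."""
    return percentile_nearest_rank(samples_ms, p) <= target_ms
```

Called with the latencies collected during the chaos window, meets_latency_target answers the same question as the dashboard check, but from data you control.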
All success criteria verified.
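The measurements from this verification pass can be recorded as data and compared against their targets in one place. The sketch below uses only the figures reported above; the check names are descriptive labels rather than the specification's criterion IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    measured: float
    limit: float
    unit: str

    @property
    def passed(self) -> bool:
        return self.measured <= self.limit

    @property
    def margin(self) -> float:
        """How far the measurement sits below its limit."""
        return self.limit - self.measured


checks = [
    Check("pod recovery time", measured=4, limit=30, unit="s"),
    Check("restore time", measured=2 * 60 + 15, limit=30 * 60, unit="s"),
    Check("P95 latency", measured=245, limit=500, unit="ms"),
    Check("5xx errors during pod kill", measured=0, limit=0, unit="errors"),
]

for check in checks:
    status = "PASS" if check.passed else "FAIL"
    print(f"{status}  {check.name}: {check.measured} {check.unit} "
          f"(limit {check.limit}, margin {check.margin})")

print("all criteria met:", all(check.passed for check in checks))
```

Keeping the margins alongside the pass/fail result makes regressions visible early: a recovery time that drifts from 4 to 20 seconds still passes, but the shrinking margin is a signal worth investigating.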
By completing this capstone, you should have created:
You built an operational-excellence skill in Lesson 0. Throughout this chapter, you've tested and improved it. Now finalize it for production use.
Your skill should produce:
If your skill missed any component:
Before adding your skill to your portfolio:
Your operational-excellence skill is now a Digital FTE component. When you build AI agents or cloud-native applications:
This is the Digital FTE model: build once, deploy everywhere. Your operational excellence patterns are encoded, tested, and ready to manufacture into any production deployment.
What you're learning: Adapting operational excellence patterns to different service characteristics. Stateless services need different backup strategies than databases. GPU workloads need different cost monitoring than CPU-bound services.
What you're learning: Turning manual verification into automated compliance checks. Production systems need continuous validation, not just initial deployment testing.
What you're learning: Debugging operational performance when targets are missed. Chaos engineering often reveals issues that weren't visible during normal operation. Your skill should help diagnose and fix these issues.
Safety note: This capstone deploys real infrastructure that interacts with production systems. Always verify namespace targeting before applying Velero restores or Chaos Mesh experiments. The manifests in this lesson target specific namespaces, but a single typo in includedNamespaces can affect unintended resources. Review manifests carefully, and always test in staging before production.
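A pre-apply guard can catch the includedNamespaces typo described above. This is a minimal sketch, assuming the Velero Restore or Backup manifest is loaded as a dict; the namespace and backup names are hypothetical. An absent or empty list is treated as the wildcard, since Velero then includes every namespace.

```python
def unexpected_namespaces(manifest: dict, expected: set) -> list:
    """Return namespaces the manifest would touch that are not in the expected set."""
    included = manifest.get("spec", {}).get("includedNamespaces") or ["*"]
    return sorted(ns for ns in included if ns not in expected)


# Hypothetical restore that was meant for staging but names production.
restore = {
    "apiVersion": "velero.io/v1",
    "kind": "Restore",
    "metadata": {"name": "task-api-restore-test", "namespace": "velero"},
    "spec": {
        "backupName": "task-api-daily",
        "includedNamespaces": ["task-api"],
    },
}

problems = unexpected_namespaces(restore, expected={"task-api-staging"})
if problems:
    print("refusing to apply, unexpected namespaces:", problems)
```

Running a check like this in CI turns the manual review into a gate, which matters most for restores because they overwrite state rather than add to it.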