Your Task API runs in Kubernetes. You provisioned 2 CPU cores and 4 GB of memory per pod because "that seemed reasonable." Three months later, you're getting invoices you can't explain. Finance asks which team is driving costs. You have no idea.
Then disaster strikes: a developer accidentally deletes the production namespace. You scramble to restore from... backups you never tested. The database comes back, but half the configuration is missing. Downtime stretches from hours to days.
This chapter teaches you to see where money goes, protect against failures, and prove your systems can survive chaos. But you won't start by reading documentation and hoping you remember which VPA mode is safe for production. You'll start by owning a skill that encodes production patterns from OpenCost, VPA, Velero, and Chaos Mesh.
When your CFO asks why Kubernetes costs doubled, you won't search Stack Overflow. You'll invoke your skill, and it will generate the exact cost allocation queries, VPA configurations, and optimization recommendations you need. That's the difference between learning FinOps and owning operational excellence.
Clone a fresh skills lab to ensure a clean starting point:
Before building the skill, define what you want to accomplish. Create a file called LEARNING-SPEC.md:
This spec guides both you and Claude on what the skill should cover.
Use the /fetching-library-docs skill (or Context7 directly) to gather authoritative sources:
Claude fetches production-relevant patterns from official sources, not outdated blog posts.
Now build your operational excellence skill with everything grounded in what you just fetched:
Claude will reference the official docs, ask clarifying questions, and create the complete skill with tested patterns at .claude/skills/operational-excellence/.
Verify the skill works by generating a valid manifest:
Test 1: VPA Manifest
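When reviewing what the skill generates for this test, it helps to know roughly what a recommendation-only VPA looks like. The following is a minimal sketch, not the skill's actual output: the function names `build_vpa_manifest` and `vpa_problems` are hypothetical, and the Deployment name `task-api` is assumed for illustration. The field names follow the upstream `autoscaling.k8s.io/v1` API.

```python
import json

# Update modes from the upstream VPA API; newer releases may add others.
KNOWN_UPDATE_MODES = {"Off", "Initial", "Recreate", "Auto"}


def build_vpa_manifest(deployment, namespace="default", update_mode="Off"):
    """Return a VerticalPodAutoscaler manifest as a plain dict."""
    if update_mode not in KNOWN_UPDATE_MODES:
        raise ValueError(f"unknown updateMode: {update_mode!r}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{deployment}-vpa", "namespace": namespace},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "updatePolicy": {"updateMode": update_mode},
        },
    }


def vpa_problems(manifest):
    """List reasons a manifest is not a recommendation-only VPA."""
    problems = []
    if manifest.get("apiVersion") != "autoscaling.k8s.io/v1":
        problems.append("unexpected apiVersion")
    if manifest.get("kind") != "VerticalPodAutoscaler":
        problems.append("kind is not VerticalPodAutoscaler")
    spec = manifest.get("spec", {})
    if not spec.get("targetRef", {}).get("name"):
        problems.append("targetRef.name is missing")
    # When updatePolicy is omitted, VPA is assumed to default to Auto.
    mode = spec.get("updatePolicy", {}).get("updateMode", "Auto")
    if mode != "Off":
        problems.append(f"updateMode is {mode}, not Off")
    return problems


if __name__ == "__main__":
    print(json.dumps(build_vpa_manifest("task-api"), indent=2))
```

An empty list from `vpa_problems` means the manifest only produces recommendations and will not evict pods. JSON is valid YAML, so the printed output can be saved to a file and applied with `kubectl apply -f` in a staging cluster.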
Test 2: Velero Schedule
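For comparison with whatever the skill produces here, the sketch below builds a Velero `Schedule` resource and runs a few basic checks on it. It is an illustration under assumptions: the helper names, the `task-api` namespace, the 02:00 daily cron expression, and the 720-hour retention are all hypothetical choices, not values from this chapter. The field names follow the `velero.io/v1` API.

```python
def build_velero_schedule(name, namespaces, cron="0 2 * * *", ttl_hours=720):
    """Return a Velero Schedule manifest as a plain dict."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # Velero watches its own install namespace, assumed here to be "velero".
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,
            "template": {
                "includedNamespaces": list(namespaces),
                "ttl": f"{ttl_hours}h0m0s",
            },
        },
    }


def schedule_problems(manifest):
    """List basic problems with a Schedule manifest."""
    problems = []
    if manifest.get("kind") != "Schedule":
        problems.append("kind is not Schedule")
    spec = manifest.get("spec", {})
    # This sketch only accepts five-field cron expressions.
    if len(spec.get("schedule", "").split()) != 5:
        problems.append("schedule is not a five-field cron expression")
    if not spec.get("template", {}).get("ttl"):
        problems.append("template.ttl is missing")
    return problems
```

A passing check says only that the manifest is well formed. It says nothing about whether a restore from the resulting backups works, which is why restores should be rehearsed separately.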
Each lesson in this chapter tests and improves your skill:
By the end of the chapter, your skill contains production-tested patterns for the entire operational excellence lifecycle.
Test your ability to design and prioritize operational excellence strategies.
Prompt 1 (Assess Gaps):
Prompt 2 (Validate Skill):
Prompt 3 (Prioritize Sprint):
Operational excellence tools interact with production infrastructure in powerful ways. VPA in Recreate mode evicts and recreates your pods to apply new resource requests. Velero restores can modify existing resources if you opt into an update policy. Chaos Mesh will intentionally break things. The patterns in your skill include safety guardrails, but always start in a staging environment and in Off or dry-run modes before touching production.
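One way to make the "staging first" rule concrete is a pre-flight check that refuses chaos experiments aimed outside an allowed set of namespaces. The sketch below is a hypothetical guardrail, not part of Chaos Mesh or of the skill: the function names, the `staging` allow-list, and the policy of requiring an explicit namespace list and `mode: one` are assumptions chosen for illustration. The manifest fields follow the Chaos Mesh `PodChaos` resource.

```python
def build_pod_kill(name, target_namespace, labels):
    """Return a PodChaos manifest that kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": target_namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # affect a single pod, not every match
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": dict(labels),
            },
        },
    }


def scope_problems(manifest, allowed_namespaces=("staging",)):
    """List reasons an experiment is scoped too broadly (assumed policy)."""
    problems = []
    spec = manifest.get("spec", {})
    selector = spec.get("selector", {})
    namespaces = selector.get("namespaces", [])
    if not namespaces:
        problems.append("selector.namespaces is not set explicitly")
    for ns in namespaces:
        if ns not in allowed_namespaces:
            problems.append(f"namespace {ns} is outside the allowed set")
    if not selector.get("labelSelectors"):
        problems.append("selector.labelSelectors is empty")
    if spec.get("mode") != "one":
        problems.append(f"mode is {spec.get('mode')}, not one")
    return problems
```

Running a check like this in CI before any experiment is applied keeps a mistyped namespace from turning a staging drill into a production outage.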