Your Task API is running in production with three replicas. Kubernetes promises self-healing: if a pod dies, ReplicaSet spawns a replacement. But have you seen it happen? Do you know how long recovery takes? What happens to in-flight requests during the transition?
Most teams discover the answers during a real outage. At 3am. With customers waiting.
Chaos engineering inverts this pattern. Instead of waiting for failures to find you, you inject failures intentionally. You break things on purpose, in controlled conditions, during business hours, with monitoring ready. When the real outage comes, you've already seen it. You've already fixed the gaps. You sleep through the night because you've proven your system recovers.
This lesson teaches the "break things on purpose" philosophy and gives you the tools to practice it safely.
Chaos engineering isn't random destruction. It's hypothesis-driven experimentation with clear methodology.
Every chaos experiment begins with a specific, falsifiable prediction:
"If I kill one Task API pod, Kubernetes will spawn a replacement within 30 seconds, and no requests will return 5xx errors during the transition."
This is not "let's see what happens." You're stating what you expect, then testing whether reality matches expectation. If it doesn't, you've learned something valuable.
Good hypotheses are specific: "a replacement pod reaches Running within 30 seconds," "zero 5xx responses during the transition," "p99 latency stays under 500 ms."
Bad hypotheses are vague: "the system should be fine," "nothing bad will happen," "users probably won't notice."
Inject failures that actually happen in production: pod crashes, node failures, network latency and partitions, DNS outages, dependency timeouts, resource exhaustion.
Don't waste time on failures that never occur. Focus on what your monitoring has shown you (or what similar systems experience).
Start small. Always. Kill one pod in staging before killing several, and expand the blast radius only after each smaller experiment passes.
Never start with "kill all pods in production." That's not chaos engineering; that's chaos.
A single successful test proves nothing. Systems change. Dependencies update. New code deploys. Last month's passing test may fail today.
Schedule regular chaos experiments: weekly pod-kill runs in staging, a monthly Game Day, and a targeted experiment after every major release or dependency upgrade.
Chaos Mesh is a CNCF project for Kubernetes-native chaos engineering. It provides CRDs (Custom Resource Definitions) that let you describe experiments declaratively, the same way you describe deployments.
Install Chaos Mesh with namespace filtering enabled. This ensures experiments only run against namespaces you explicitly allow:
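A typical installation uses Helm with the namespace filter enabled via the `controllerManager.enableFilterNamespace` value (the release name and `chaos-mesh` namespace are the conventional defaults, not requirements):

```shell
# Add the Chaos Mesh chart repository and install with namespace filtering on.
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh --create-namespace \
  --set controllerManager.enableFilterNamespace=true
```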
By default, Chaos Mesh won't inject failures into any namespace. You must explicitly annotate namespaces where experiments are allowed:
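With filtering enabled, a namespace is opted in with the `chaos-mesh.org/inject` annotation (using `staging` as the target namespace follows this lesson's setup):

```shell
# Allow chaos experiments only in the staging namespace.
kubectl annotate namespace staging chaos-mesh.org/inject=enabled
```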
This is a safety mechanism. Even if someone accidentally applies a chaos experiment, it won't affect production unless production is annotated (which it shouldn't be until you've built confidence in staging).
Chaos Mesh provides several experiment types, including PodChaos (pod failures), NetworkChaos (latency, loss, partitions), StressChaos (CPU and memory pressure), IOChaos (filesystem faults), and TimeChaos (clock skew). This lesson focuses on PodChaos; future lessons cover the advanced types.
PodChaos is the starting point for chaos engineering. It answers the fundamental question: "Does my application recover when pods die?"
Here's a complete PodChaos experiment for the Task API:
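A minimal manifest consistent with the fields described below (the `task-api` label and `staging` namespace are assumptions carried over from this lesson's setup):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: task-api-pod-kill
  namespace: staging
spec:
  action: pod-kill        # terminate the pod outright
  mode: one               # affect a single matching pod
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: task-api       # only pods carrying this label are candidates
  duration: "30s"         # the experiment's lifetime
```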
- `action`: what to do to the pod (`pod-kill` terminates it)
- `mode`: how many pods to affect (`one` picks a single matching pod)
- `selector`: which pods to target
- `duration`: how long the experiment runs
Apply the experiment and observe what happens:
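Assuming the manifest is saved as `pod-kill.yaml` (the filename is an assumption), apply it and watch the pods in a second terminal:

```shell
# Start the experiment.
kubectl apply -f pod-kill.yaml

# Watch the targeted pods get killed and replaced in real time.
kubectl get pods -n staging -l app=task-api --watch
```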
The sequence shows the kill and the self-healing response: the targeted pod is terminated, the ReplicaSet immediately creates a replacement, and the new pod moves through Pending and ContainerCreating to Running.
Total recovery time: approximately 3-5 seconds (varies by image size and cluster state).
Chaos Mesh has multiple safety layers:
1. Namespace filtering: as configured above, only annotated namespaces allow experiments. Production isn't annotated.
2. Selector scoping: experiments only affect pods matching the selector. A typo in `labelSelectors` means no pods match, and the experiment does nothing.
3. Bounded duration: experiments stop automatically when `duration` elapses. Even if you forget about one, it ends.
4. RBAC: Chaos Mesh respects Kubernetes RBAC, so users without the appropriate ClusterRole can't create experiments.
5. Limited mode: using `mode: one` ensures only one pod is affected. Never use `mode: all` in your first experiments.
A Game Day is a structured resilience validation exercise. It's the chaos engineering equivalent of a fire drill: planned, observed, and documented.
Phase 1: Define Hypothesis. Write your specific, falsifiable prediction: what you will break, how fast the system should recover, and what users should (and should not) see.
Phase 2: Set Up Monitoring. Before injecting failures, ensure you can see what happens: pod status, error rates, and latency dashboards should be open before the experiment starts.
Phase 3: Run in Staging First. Never skip staging. This is where you find surprises while no customer is watching.
Phase 4: Observe and Document. Record what actually happened versus what you predicted: recovery time, error counts, and anything unexpected.
Phase 5: Iterate. Address the gaps found in your observations, then re-run the experiment until the hypothesis holds.
Phase 6: Graduate to Production (Off-Peak). Only after staging passes multiple times: run during a low-traffic window, with the team aware and a rollback command ready.
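Once an experiment has passed repeatedly by hand, Chaos Mesh's Schedule resource can run it on a recurring cron during an off-peak window. A sketch (all names and the cron expression are assumptions to adapt):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: task-api-weekly-pod-kill
  namespace: staging
spec:
  schedule: "0 3 * * 1"        # every Monday at 03:00 (assumed off-peak)
  type: PodChaos               # which experiment kind this schedule creates
  historyLimit: 5              # keep the last five finished runs
  concurrencyPolicy: Forbid    # never overlap two runs
  podChaos:                    # the same spec as a standalone PodChaos
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: task-api
    duration: "30s"
```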
Use this checklist for every Game Day:

- Hypothesis written down and shared with the team
- Dashboards and alerts visible before injection
- Rollback command ready (`kubectl delete podchaos <name>`)
- Experiment passed in staging first
- Results and surprises documented
- Follow-up fixes filed and scheduled
Here's the end-to-end workflow from chaos engineering theory to validated resilience: state a hypothesis, install and scope Chaos Mesh, run a one-pod experiment in staging, observe and document, fix the gaps, repeat on a schedule, and only then graduate to production off-peak.
This section demonstrates how you and AI collaborate to design and execute chaos experiments.
You ask AI to help you set up chaos engineering for your Task API: you describe the three-replica deployment and ask how to prove it actually self-heals.
AI suggests a minimal-blast-radius approach: a single pod-kill experiment scoped to the staging namespace, `mode: one`, a short duration, and namespace filtering enabled before anything runs.
AI also notes: "Consider adding a Game Day checklist so you can repeat this experiment systematically."
You review AI's suggestion and add your specific requirement: the experiment should run while a load test is generating traffic, so you can measure the user-visible impact, not just pod status.
AI updates with your load testing requirement: the same PodChaos experiment, now paired with a load generator hitting the API while the pod dies.
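One lightweight way to generate and measure that traffic from a shell (the endpoint URL, request rate, and log filename below are all assumptions to substitute for your own):

```shell
# Probe the API twice a second while the chaos experiment runs,
# logging a timestamp and HTTP status per request.
# curl reports "000" when the connection itself fails.
URL="http://task-api.staging.svc.cluster.local/tasks"   # assumed endpoint
for _ in $(seq 1 120); do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$URL" || true)
  echo "$(date +%T) $code"
  sleep 0.5
done > chaos-probe.log

# Afterward, count failed probes (anything that isn't a 2xx, including 000).
awk '{ total++; if ($2 !~ /^2/) bad++ } END { printf "%d/%d probes failed\n", bad, total }' chaos-probe.log
```

If any probes fail during the kill, the hypothesis "no requests will return 5xx errors" is falsified, which is exactly the kind of gap a Game Day is meant to surface.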
The result is a chaos engineering approach that neither of you had alone: AI's minimal-blast-radius experiment design, combined with your insistence that resilience be measured under real traffic.
These prompts help you apply chaos engineering to your own deployments.
Prompt 1: Design Your First Hypothesis
"My service runs on Kubernetes with [N] replicas. Help me write a specific, falsifiable hypothesis for a single pod-kill experiment: what recovery time and error behavior should I predict, and how would I measure both?"
What you're learning: How to think hypothesis-first before running experiments. AI helps you articulate what success looks like before you break anything.
Prompt 2: Expand Experiment Types
"I've run a basic pod-kill experiment successfully. Walk me through Chaos Mesh's other experiment types, such as NetworkChaos and StressChaos, and which real dependency failures in my system each one simulates."
What you're learning: How to progress from simple (pod kill) to complex (network chaos). AI demonstrates how each experiment type applies to real dependencies.
Prompt 3: Build Your Game Day Runbook
"Help me turn my pod-kill experiment into a Game Day runbook my team can repeat: a hypothesis template, a monitoring checklist, execution steps, a rollback command, and a results log format."
What you're learning: How to systematize chaos engineering for your team. A runbook ensures experiments are repeatable and well-documented.
Safety note: Always start chaos experiments in staging. Never run `mode: all` or skip namespace filtering. Have a rollback command ready (`kubectl delete podchaos [name]`). Schedule production experiments during low-traffic windows with team awareness.
You built an operational-excellence skill in Lesson 0. Test and improve it based on what you learned about chaos engineering.
Ask yourself: Does the skill say to state a hypothesis before breaking anything? Does it cover blast radius, staging-first, and a rollback command? Would it have guided you through today's pod-kill experiment?
If you found gaps: update the skill with what you learned here, including the hypothesis-first method, the safety layers, and the Game Day phases, so your next experiment starts from a stronger baseline.