Chaos Engineering Basics

Your Task API is running in production with three replicas. Kubernetes promises self-healing: if a pod dies, the ReplicaSet spawns a replacement. But have you seen it happen? Do you know how long recovery takes? What happens to in-flight requests during the transition?

Most teams discover the answers during a real outage. At 3am. With customers waiting.

Chaos engineering inverts this pattern. Instead of waiting for failures to find you, you inject failures intentionally. You break things on purpose, in controlled conditions, during business hours, with monitoring ready. When the real outage comes, you've already seen it. You've already fixed the gaps. You sleep through the night because you've proven your system recovers.

This lesson teaches the "break things on purpose" philosophy and gives you the tools to practice it safely.


The Four Principles of Chaos Engineering

Chaos engineering isn't random destruction. It's hypothesis-driven experimentation with clear methodology.

Principle 1: Start with a Hypothesis

Every chaos experiment begins with a specific, falsifiable prediction:

"If I kill one Task API pod, Kubernetes will spawn a replacement within 30 seconds, and no requests will return 5xx errors during the transition."

This is not "let's see what happens." You're stating what you expect, then testing whether reality matches expectation. If it doesn't, you've learned something valuable.

Good hypotheses are specific:

  • "Recovery within 30 seconds" (measurable)
  • "No 5xx errors" (observable)
  • "One pod failure" (scoped)

Bad hypotheses are vague:

  • "The system should handle failures" (how?)
  • "Things should be fine" (what does "fine" mean?)

Principle 2: Vary Real-World Events

Inject failures that actually happen in production:

  • Pods crash (OOMKilled, process failure)
  • Network partitions (cloud provider issues)
  • Disk fills up (log accumulation)
  • CPU spikes (noisy neighbors)

Don't waste time on failures that never occur. Focus on what your monitoring has shown you (or what similar systems experience).

Principle 3: Minimize Blast Radius

Start small. Always.

Phase                 Blast Radius   Environment
First test            One pod        Staging
Confidence building   50% of pods    Staging
Pre-production        One pod        Production (off-peak)
Ongoing validation    One pod        Production (scheduled)

Never start with "kill all pods in production." That's not chaos engineering; that's chaos.

Principle 4: Run Experiments Continuously

A single successful test proves nothing. Systems change. Dependencies update. New code deploys. Last month's passing test may fail today.

Schedule regular chaos experiments (see the sketch after this list):

  • Weekly: Core services (Task API pod kill)
  • Monthly: Cross-service failures (database unavailable)
  • Quarterly: Full Game Day exercises
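
Chaos Mesh can run the recurring cases for you through its Schedule resource, which wraps an experiment in a cron expression. A minimal sketch for the weekly pod kill, assuming the same staging namespace and app=task-api label used throughout this lesson:

bash
# Weekly pod-kill schedule (a sketch; adjust the cron expression to your window)
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: task-api-weekly-pod-kill
  namespace: chaos-mesh
spec:
  schedule: "0 10 * * 1"        # every Monday at 10:00
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid     # never let experiments overlap
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: task-api
EOF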

Chaos Mesh: CNCF Chaos Engineering Platform

Chaos Mesh is a CNCF project for Kubernetes-native chaos engineering. It provides CRDs (Custom Resource Definitions) that let you describe experiments declaratively, the same way you describe deployments.

Installing Chaos Mesh

Install Chaos Mesh with namespace filtering enabled. This ensures experiments only run against namespaces you explicitly allow:

bash
# Add Chaos Mesh Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install with namespace filtering (experiments require annotation)
helm install chaos-mesh chaos-mesh/chaos-mesh \
  -n chaos-mesh \
  --create-namespace \
  --set controllerManager.enableFilterNamespace=true

Output:

text
NAME: chaos-mesh
NAMESPACE: chaos-mesh
STATUS: deployed
REVISION: 1
NOTES:
Chaos Mesh is now running.
To enable chaos experiments in a namespace, add annotation:
  chaos-mesh.org/inject: enabled

Enable Namespace for Chaos Experiments

By default, Chaos Mesh won't inject failures into any namespace. You must explicitly annotate namespaces where experiments are allowed:

bash
# Enable chaos experiments in staging namespace
kubectl annotate namespace staging chaos-mesh.org/inject=enabled

# Verify annotation
kubectl get namespace staging -o jsonpath='{.metadata.annotations.chaos-mesh\.org/inject}'

Output:

text
enabled

This is a safety mechanism. Even if someone accidentally applies a chaos experiment, it won't affect production unless production is annotated (which it shouldn't be until you've built confidence in staging).


Experiment Types Overview

Chaos Mesh provides several experiment types. This lesson focuses on PodChaos; future lessons cover advanced types.

Experiment Type   What It Does                        Use Case
PodChaos          Kill pods, container failures       Test ReplicaSet self-healing
NetworkChaos      Latency, packet loss, partition     Test timeout handling, circuit breakers
IOChaos           Disk latency, read/write failures   Test database resilience
StressChaos       CPU/memory pressure                 Test resource limits, OOM handling

PodChaos: Testing Pod Failure Recovery

PodChaos is the starting point for chaos engineering. It answers the fundamental question: "Does my application recover when pods die?"

PodChaos CRD Structure

Here's a complete PodChaos experiment for the Task API:

yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: task-api-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill          # Kill the pod (not just the container)
  mode: one                 # Kill exactly one matching pod
  selector:
    namespaces:
      - staging             # Only target the staging namespace
    labelSelectors:
      app: task-api         # Pods with this label
  duration: "30s"           # Experiment runs for 30 seconds

Key Fields Explained

action: What to do to the pod

  • pod-kill: Terminate the pod (Kubernetes respawns it)
  • pod-failure: Make pod unavailable without killing it
  • container-kill: Kill specific container (pod survives)

mode: How many pods to affect

  • one: Exactly one random matching pod
  • all: Every matching pod (dangerous!)
  • fixed: Specific count (e.g., value: "2")
  • fixed-percent: Percentage (e.g., value: "50")

selector: Which pods to target

  • namespaces: Must include namespace explicitly
  • labelSelectors: Standard Kubernetes label matching
  • fieldSelectors: Pod spec fields (node, phase)

duration: How long the experiment runs

  • Experiment automatically stops after duration
  • Killed pods are not restored (Kubernetes handles recovery)
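
Putting the mode and value fields together, here is a hedged sketch (not from the lesson) of the "50% of pods" experiment from the blast-radius table, still scoped to staging:

bash
# A sketch of the "confidence building" phase: kill half of the matching pods
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: task-api-pod-kill-half
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: fixed-percent
  value: "50"               # percentage of matching pods to kill
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: task-api
  duration: "30s"
EOF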

Running Your First PodChaos Experiment

Apply the experiment and observe what happens:

bash
# Apply the chaos experiment
kubectl apply -f task-api-pod-kill.yaml

# Watch pods in the staging namespace
kubectl get pods -n staging -w

Output:

text
NAME                        READY   STATUS              RESTARTS   AGE
task-api-7d8f9b6c4d-abc12   1/1     Running             0          2h
task-api-7d8f9b6c4d-def34   1/1     Running             0          2h
task-api-7d8f9b6c4d-ghi56   1/1     Running             0          2h
task-api-7d8f9b6c4d-abc12   1/1     Terminating         0          2h
task-api-7d8f9b6c4d-jkl78   0/1     Pending             0          0s
task-api-7d8f9b6c4d-jkl78   0/1     ContainerCreating   0          1s
task-api-7d8f9b6c4d-jkl78   1/1     Running             0          3s

The sequence shows:

  1. Pod abc12 is killed (Terminating)
  2. ReplicaSet immediately creates jkl78 (Pending)
  3. Container image pulls (ContainerCreating)
  4. New pod is ready (Running)

Total recovery time: approximately 3-5 seconds (varies by image size and cluster state).
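
Watching the pod list gives a feel for recovery, but you can also time it directly. A rough sketch, assuming the Task API runs as a Deployment named task-api with 3 replicas in staging:

bash
# Time how long it takes for all replicas to become Ready again after the kill
# (assumes a Deployment named "task-api" with 3 replicas in the staging namespace)
start=$(date +%s)
until [ "$(kubectl get deployment task-api -n staging \
  -o jsonpath='{.status.readyReplicas}')" = "3" ]; do
  sleep 1
done
echo "All replicas ready again after $(( $(date +%s) - start ))s"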

Check Experiment Status

bash
# View experiment status
kubectl describe podchaos task-api-pod-kill -n chaos-mesh

Output:

text
Name:         task-api-pod-kill
Namespace:    chaos-mesh
Status:
  Experiment:
    Phase: Finished
  Conditions:
    - Type:   Applied
      Status: True
  Records:
    - Id:             abc12
      Phase:          Injected
      InjectedCount:  1

Safety Features: Why Chaos Mesh Won't Destroy Production

Chaos Mesh has multiple safety layers:

Namespace Filtering

As configured above, only annotated namespaces allow experiments. Production isn't annotated.

Selectors

Experiments only affect pods matching the selector. A typo in labelSelectors means no pods match (experiment does nothing).
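
A cheap way to catch that typo before running anything is to reuse the experiment's label selector with kubectl:

bash
# Sanity check: confirm the selector actually matches the pods you expect
kubectl get pods -n staging -l app=task-api
# Empty output means the experiment would silently do nothing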

Duration Limits

Experiments automatically stop after duration. Even if you forget, the experiment ends.

RBAC Authorization

Chaos Mesh respects Kubernetes RBAC. Users without appropriate ClusterRole can't create experiments.

yaml
# Example RBAC: only the SRE team can run chaos
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-engineer
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["podchaos", "networkchaos"]
    verbs: ["create", "delete", "get", "list"]
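
The ClusterRole grants nothing on its own; it has to be bound to the people allowed to run experiments. A minimal sketch, assuming a hypothetical sre-team group:

bash
# Bind the chaos-engineer role to a hypothetical "sre-team" group (sketch only)
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-engineer-sre
subjects:
  - kind: Group
    name: sre-team                        # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: chaos-engineer
  apiGroup: rbac.authorization.k8s.io
EOF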

Mode Selection

Using mode: one ensures only one pod is affected. Never use mode: all in your first experiments.


The Game Day Pattern

A Game Day is a structured resilience validation exercise. It's the chaos engineering equivalent of a fire drill: planned, observed, and documented.

The Six-Phase Game Day

Phase 1: Define Hypothesis
Write your specific, falsifiable prediction:

text
## Game Day: Task API Pod Failure Recovery

**Date**: 2025-12-30
**Participants**: SRE team, Backend team

**Hypothesis**: If one Task API pod is killed:
- Kubernetes spawns replacement within 30 seconds
- No requests return 5xx during transition
- P95 latency stays under 500ms
- All in-flight requests complete or are retried by client

Phase 2: Set Up Monitoring
Before injecting failures, ensure you can see what happens:

text
**Monitoring Checklist**:
- [ ] Grafana dashboard open showing Task API latency
- [ ] Prometheus alerts configured for error rate
- [ ] kubectl watch running on staging pods
- [ ] Log streaming from Task API pods
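
The last two checklist items are plain kubectl. One possible setup, assuming the app=task-api label and the staging URL used elsewhere in this lesson:

bash
# Terminal 1: watch pods during the experiment
kubectl get pods -n staging -l app=task-api -w

# Terminal 2: stream logs from all Task API pods
kubectl logs -n staging -l app=task-api -f --prefix

# Terminal 3 (optional): probe the endpoint once per second to spot 5xx responses
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" https://staging.example.com/tasks
  sleep 1
done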

Phase 3: Run in Staging First
Never skip staging. This is where you find surprises:

bash
# Staging experiment
kubectl apply -f task-api-pod-kill.yaml

Phase 4: Observe and Document
Record what actually happened versus what you predicted:

text
**Observations**:
- Pod killed at 14:32:15
- New pod running at 14:32:19 (4 seconds, PASS)
- Error rate: 0% (PASS)
- P95 latency: 320ms during transition (PASS)

**Unexpected**: No requests in flight during the kill. Need to run during a load test to verify handling.

Phase 5: Iterate
Address gaps found in observations:

text
**Action Items**:
1. Re-run during load test (100 RPS)
2. Add NetworkChaos to test graceful degradation
3. Document RTO as 5 seconds (observed)

Phase 6: Graduate to Production (Off-Peak)
Only after staging passes multiple times:

bash
# Enable chaos in production (carefully!)
kubectl annotate namespace production chaos-mesh.org/inject=enabled

# Run during a maintenance window (e.g., Sunday 3am)
kubectl apply -f task-api-pod-kill-prod.yaml

Game Day Checklist Template

Use this checklist for every Game Day:

text
## Game Day Checklist

### Pre-Game
- [ ] Hypothesis documented with measurable success criteria
- [ ] Monitoring dashboards open and working
- [ ] Staging namespace annotated for chaos
- [ ] Rollback procedure documented (kubectl delete podchaos)
- [ ] Communication channel open (Slack, Teams)
- [ ] All participants briefed on their role

### During Game
- [ ] Announce experiment start
- [ ] Apply chaos experiment
- [ ] Record observations in real time
- [ ] Compare to hypothesis
- [ ] Stop early if unexpected behavior occurs

### Post-Game
- [ ] Document results (pass/fail for each criterion)
- [ ] Create action items for gaps
- [ ] Update runbooks with learnings
- [ ] Schedule follow-up Game Day

Putting It Together: Complete Workflow

Here's the end-to-end workflow from chaos engineering theory to validated resilience:

text
1. Write Hypothesis
   "Task API recovers from pod failure in 30 seconds"
        ↓
2. Prepare Monitoring
   Open Grafana, enable alerts, start log streaming
        ↓
3. Enable Staging Namespace
   kubectl annotate namespace staging chaos-mesh.org/inject=enabled
        ↓
4. Create PodChaos Experiment
   apiVersion: chaos-mesh.org/v1alpha1
   kind: PodChaos
   ...
        ↓
5. Apply Experiment
   kubectl apply -f task-api-pod-kill.yaml
        ↓
6. Observe and Record
   Pod killed → New pod running → Recovery time
        ↓
7. Compare to Hypothesis
   30 seconds expected → 5 seconds actual (PASS)
        ↓
8. Document and Iterate
   Update RTO, schedule next Game Day

Collaborative Chaos Design

This section demonstrates how you and AI collaborate to design and execute chaos experiments.

Your Initial Request

You ask AI to help you set up chaos engineering for your Task API:

text
My Task API has 3 replicas in staging. I want to test that it recovers from pod failures. Help me write a PodChaos experiment that's safe for staging.

AI's Recommendation

AI suggests a minimal-blast-radius approach:

yaml
# Suggested by AI: start with mode: one and a short duration
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: task-api-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one                  # Safe: only one pod
  selector:
    namespaces: [staging]    # Safe: only staging
    labelSelectors:
      app: task-api
  duration: "30s"            # Safe: auto-stops

AI also notes: "Consider adding a Game Day checklist so you can repeat this experiment systematically."

Your Refinement

You review AI's suggestion and add your specific requirements:

text
Good start, but I need to test recovery during actual load. Our production traffic is 100 RPS. Can you add a hypothesis template that includes latency and error rate thresholds?

AI's Adaptation

AI updates with your load testing requirement:

text
## Hypothesis Template (Updated)

**Pre-condition**: Task API under 100 RPS load (use hey or k6)

**Hypothesis**: If one Task API pod is killed during 100 RPS load:
- Recovery time: < 30 seconds
- Error rate: < 1% during transition
- P95 latency: < 500ms during recovery
- No cascading failures to downstream services

**Load Test Command**:
hey -z 5m -c 10 -q 10 https://staging.example.com/tasks   (10 workers x 10 req/s ≈ 100 RPS)
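
One way to act on that pre-condition is to start the load generator first, then inject the failure mid-run. A sketch, assuming hey is installed and the staging URL from the template above:

bash
# Drive ~100 RPS (10 workers x 10 req/s) for 5 minutes, kill a pod mid-run
hey -z 5m -c 10 -q 10 https://staging.example.com/tasks &
LOAD_PID=$!

sleep 60                                  # let traffic stabilize first
kubectl apply -f task-api-pod-kill.yaml   # inject the failure under load

wait $LOAD_PID                            # hey prints its latency/error summary on exit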

What Emerged

A chaos engineering approach that neither of you would have produced alone:

  • AI suggested the safe PodChaos defaults (mode: one, namespace filtering)
  • You refined with load testing requirements (100 RPS, latency thresholds)
  • Together you produced a complete hypothesis template with measurable success criteria

Try With AI

These prompts help you apply chaos engineering to your own deployments.

Prompt 1: Design Your First Hypothesis

text
I have a [YOUR SERVICE] deployed in Kubernetes with [N] replicas.
Help me write a chaos engineering hypothesis that includes:
- Specific failure scenario (pod kill, network delay, etc.)
- Measurable success criteria (recovery time, error rate)
- Blast radius limits (which namespace, how many pods)

What you're learning: How to think hypothesis-first before running experiments. AI helps you articulate what success looks like before you break anything.

Prompt 2: Expand Experiment Types

text
I've successfully run PodChaos experiments. Now I want to test network resilience. My Task API depends on a PostgreSQL database. What NetworkChaos experiment would test graceful degradation if database latency increases to 500ms?

What you're learning: How to progress from simple (pod kill) to complex (network chaos). AI demonstrates how each experiment type applies to real dependencies.
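
For orientation, a delay experiment of that shape might look roughly like this (a sketch, assuming the PostgreSQL pods in staging carry an app=postgres label; adjust to your own labels):

bash
# A sketch: add 500ms of latency to traffic reaching the database pods in staging
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: postgres-latency
  namespace: chaos-mesh
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: postgres            # hypothetical label for the database pods
  delay:
    latency: "500ms"
    jitter: "50ms"
  duration: "2m"
EOF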

Prompt 3: Build Your Game Day Runbook

text
Help me create a Game Day runbook for our production environment. Include:
- Pre-game safety checks
- Communication templates (announcing start/end)
- Observation checklists
- Post-mortem template for documenting learnings

What you're learning: How to systematize chaos engineering for your team. A runbook ensures experiments are repeatable and well-documented.

Safety note: Always start chaos experiments in staging. Never run mode: all or skip namespace filtering. Have a rollback command ready (kubectl delete podchaos [name]). Schedule production experiments during low-traffic windows with team awareness.
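
Keep that rollback command ready in a terminal before you start; it is a one-liner:

bash
# Emergency stop: deleting the experiment stops further injection immediately
# (pods already killed are recovered by the ReplicaSet, not by Chaos Mesh)
kubectl delete podchaos task-api-pod-kill -n chaos-mesh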


Reflect on Your Skill

You built an operational-excellence skill in Lesson 0. Test and improve it based on what you learned about chaos engineering.

Test Your Skill

text
Using my operational-excellence skill, generate a PodChaos experiment for my Task API that kills one pod in staging for 30 seconds. Does my skill include the correct Chaos Mesh CRD structure? Does it enforce namespace filtering for safety?

Identify Gaps

Ask yourself:

  • Did my skill include the Game Day pattern with the six phases?
  • Did it explain the four chaos engineering principles?
  • Did it include the mode options (one, all, fixed, fixed-percent)?

Improve Your Skill

If you found gaps:

text
My operational-excellence skill is missing the Game Day pattern. Update it to include:
1. Six-phase Game Day structure (hypothesis, monitoring, staging, observe, iterate, graduate)
2. Game Day checklist template
3. Post-experiment documentation template