USMAN’S INSIGHTS
AI ARCHITECT
A Maintenance Window Should Never Cause a Production Outage


Resilience Patterns

Module 7 takes the agent you built in Module 6 and turns it into a production cloud service. You'll containerize the stack, orchestrate it on Kubernetes, automate delivery, and operate it with observability, security, and cost controls. The goal: a reliable Digital FTE that runs 24/7 for real users.

Prerequisites: Modules 4-6. You need a working agent service to deploy.

Your service is healthy. All pods are running, health checks pass, metrics look green. Then a network glitch causes a database connection to drop for 200 milliseconds. Without retry logic, that request fails permanently. A user sees an error. They refresh, it works—but trust erodes. Meanwhile, during a Kubernetes node upgrade, all your pods get evicted simultaneously because you forgot to create a PodDisruptionBudget. Your service goes down for 90 seconds while new pods start. Both failures were preventable.

Resilience patterns prepare your services for the failures that will happen. Networks are unreliable. Pods get evicted. Dependencies slow down. The question is not whether failures occur, but whether your system handles them gracefully. This lesson teaches production-grade patterns: retry policies that recover from transient failures, timeouts that prevent resource exhaustion, PodDisruptionBudgets that protect during maintenance, and graceful shutdown that completes in-flight requests.

By the end, you will configure retry policies with exponential backoff, set request and connection timeouts, create PDBs that guarantee minimum availability, and implement graceful shutdown with preStop hooks.


The Resilience Stack

Production resilience operates at multiple layers. Each layer handles different failure modes:

| Layer | Pattern | What It Protects Against |
| --- | --- | --- |
| Request | Retries | Transient failures (network blips, temporary overload) |
| Request | Timeouts | Slow dependencies (prevent thread exhaustion) |
| Connection | Outlier detection | Unhealthy backends (exclude failing pods) |
| Pod | Liveness/Readiness probes | Application failures (restart unhealthy pods) |
| Pod | Graceful shutdown | Termination (complete in-flight work) |
| Deployment | PDB | Maintenance disruptions (guarantee availability) |

The Resilience Stack

text
        ┌────────────────────────────────────────────────┐
        │                Resilience Stack                │
        └────────────────────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
        ▼                       ▼                       ▼
┌──────────────┐        ┌──────────────┐        ┌──────────────┐
│   Request    │        │     Pod      │        │  Deployment  │
│    Layer     │        │    Layer     │        │    Layer     │
│              │        │              │        │              │
│ • Retries    │        │ • Probes    │        │ • PDB        │
│ • Timeouts   │        │ • preStop   │        │ • Rolling    │
│ • Outliers   │        │ • Grace pd  │        │   updates    │
└──────────────┘        └──────────────┘        └──────────────┘

Retry Policies

Retries automatically resend failed requests. A 500ms network timeout does not mean the operation failed—the server may have completed successfully. Retry policies distinguish between failures worth retrying (transient) and failures that should not be retried (permanent).

Understanding Retry Behavior

| Failure Type | Should Retry? | Example |
| --- | --- | --- |
| Network timeout | Yes | TCP connection dropped |
| 502 Bad Gateway | Yes | Upstream temporarily unavailable |
| 503 Service Unavailable | Yes | Server overloaded |
| 504 Gateway Timeout | Yes | Upstream too slow |
| 500 Internal Server Error | Maybe | Depends on idempotency |
| 400 Bad Request | No | Client error, won't fix on retry |
| 401 Unauthorized | No | Auth failure, won't fix on retry |
| 404 Not Found | No | Resource doesn't exist |
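The decision rules above can be sketched as a small helper. This is a hypothetical illustration; `should_retry` is not part of any gateway API:

```python
# Hypothetical helper mirroring the table above: decide whether a
# failed HTTP response is worth retrying.
RETRYABLE = {502, 503, 504}      # transient server-side conditions
NON_RETRYABLE = {400, 401, 404}  # client errors; retrying won't help

def should_retry(status_code: int, idempotent: bool = False) -> bool:
    if status_code in RETRYABLE:
        return True
    if status_code == 500:
        # Only safe to replay if the request has no side effects
        return idempotent
    return False

print(should_retry(503))                   # True
print(should_retry(400))                   # False
print(should_retry(500, idempotent=True))  # True
```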

Configuring Retry Policy with BackendTrafficPolicy

Configure retries using Envoy Gateway's BackendTrafficPolicy:

yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-retry
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  retry:
    numRetries: 3
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 2s
      timeout: 500ms
    retryOn:
      httpStatusCodes:
        - 502
        - 503
        - 504
      triggers:
        - connect-failure
        - retriable-status-codes
        - reset

Apply and verify:

bash
kubectl apply -f task-api-retry.yaml
kubectl get backendtrafficpolicy -n task-api

Output:

text
NAME             AGE
task-api-retry   5s

Understanding Retry Fields

| Field | Purpose | Recommended Values |
| --- | --- | --- |
| numRetries | Maximum retry attempts | 2-5 (higher = more latency) |
| perRetry.timeout | Timeout per individual attempt | 100ms-1s (based on expected latency) |
| perRetry.backOff.baseInterval | Initial delay between retries | 25ms-200ms |
| perRetry.backOff.maxInterval | Maximum delay (exponential caps here) | 1s-5s |
| retryOn.httpStatusCodes | Status codes that trigger retry | 502, 503, 504 (server errors) |
| retryOn.triggers | Failure conditions that trigger retry | connect-failure, reset |

Exponential Backoff

Exponential backoff increases delay between retries to avoid overwhelming a recovering service:

text
Attempt 1: Immediate
Attempt 2: Wait baseInterval (100ms)
Attempt 3: Wait 2 × baseInterval (200ms)
Attempt 4: Wait 4 × baseInterval (400ms), capped at maxInterval

Why exponential? If the server is overloaded, retrying immediately adds more load. Exponential backoff gives the server time to recover.
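The schedule above can be computed directly. A quick sketch (simplified: Envoy also applies random jitter to each delay):

```python
# Compute the exponential backoff schedule implied by backOff settings:
# each retry doubles the delay, capped at max_ms.
def backoff_delays(base_ms: int, max_ms: int, retries: int) -> list[int]:
    return [min(base_ms * 2**i, max_ms) for i in range(retries)]

# baseInterval: 100ms, maxInterval: 2s, 5 retries
print(backoff_delays(100, 2000, 5))  # [100, 200, 400, 800, 1600]
```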

Testing Retry Behavior

Create a service that fails intermittently:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flaky-app
  namespace: task-api
data:
  main.py: |
    from flask import Flask
    import random

    app = Flask(__name__)

    @app.route('/api/tasks')
    def tasks():
        if random.random() < 0.5:  # 50% failure rate
            return "Service temporarily unavailable", 503
        return "Success", 200

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)

Test without retries:

bash
for i in {1..10}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/tasks
done | sort | uniq -c

Output (no retries):

text
      5 200
      5 503

With retry policy applied, same requests:

bash
for i in {1..10}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/tasks
done | sort | uniq -c

Output (with retries):

text
      9 200
      1 503

Most failures recovered through automatic retry.
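A rough model shows why. Assuming each attempt fails independently at the flaky app's 50% rate, an error only reaches the user if the original attempt and all 3 retries fail:

```python
# Back-of-the-envelope: probability that a user still sees an error
# when every attempt fails independently with probability p_fail.
p_fail = 0.5      # per-attempt failure rate of the flaky service
attempts = 4      # 1 original attempt + numRetries: 3

p_user_sees_error = p_fail ** attempts
print(f"User-visible failure rate: {p_user_sees_error:.2%}")  # 6.25%
```

The observed 1-in-10 failure in the sample run is in line with this expectation for so few requests.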


Timeout Configuration

Timeouts prevent slow dependencies from exhausting resources. Without timeouts, a thread waiting for a response that never comes holds resources indefinitely. Multiply by concurrent requests, and your service runs out of threads.

Types of Timeouts

| Timeout Type | What It Controls | When to Set |
| --- | --- | --- |
| Request timeout | Total time for complete request | Always |
| Per-retry timeout | Time for single retry attempt | With retry policy |
| Idle timeout | Time connection can be idle | Long-lived connections |
| Connection timeout | Time to establish TCP connection | Always |

Configuring Timeouts

Configure timeouts in BackendTrafficPolicy:

yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-timeouts
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  timeout:
    http:
      requestTimeout: 30s
      idleTimeout: 5m
      connectionIdleTimeout: 1h

Apply and verify:

bash
kubectl apply -f task-api-timeouts.yaml

Output:

text
backendtrafficpolicy.gateway.envoyproxy.io/task-api-timeouts created

Timeout Field Meanings

| Field | Description | Recommended |
| --- | --- | --- |
| requestTimeout | Max time from first byte to last byte | 10-60s for APIs |
| idleTimeout | Max time with no data on connection | 30s-5m |
| connectionIdleTimeout | Max time to keep idle connection in pool | 1h-24h |

Testing Timeout Behavior

Simulate slow responses:

bash
# Call a slow endpoint (server sleeps 45 seconds)
curl http://localhost:8080/api/tasks?delay=45s

Output (with 30s timeout):

text
upstream request timeout

The request times out after 30 seconds instead of waiting 45 seconds.

Combining Retries and Timeouts

When using retries, set per-retry timeout shorter than request timeout:

yaml
retry:
  numRetries: 3
  perRetry:
    timeout: 5s          # Each attempt gets 5s
timeout:
  http:
    requestTimeout: 20s  # Total request gets 20s

Why? If each retry takes the full request timeout, 3 retries × 30s = 90s total wait. With 5s per-retry timeout: 3 retries × 5s + backoff = ~20s maximum.
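The arithmetic is worth checking explicitly. A quick sketch of the worst case implied by these settings (simplified: Envoy adds jitter, and requestTimeout still caps the total):

```python
# Worst case: every attempt runs to its full per-retry timeout and
# every scheduled backoff delay is served in full.
per_retry_timeout_s = 5.0
num_attempts = 4                 # 1 original attempt + numRetries: 3
backoff_s = [0.1, 0.2, 0.4]      # 100ms base interval, doubling

worst_case = num_attempts * per_retry_timeout_s + sum(backoff_s)
print(f"Worst case before requestTimeout applies: {worst_case:.1f}s")
```

The ~20.7s result confirms that a 20s requestTimeout is the effective ceiling here, rather than 90s without per-retry timeouts.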


PodDisruptionBudget

Kubernetes may evict pods for many reasons: node upgrades, scaling down, resource pressure. Without protection, Kubernetes can evict all your pods simultaneously, causing an outage. PodDisruptionBudget (PDB) guarantees minimum availability during voluntary disruptions.

What PDB Protects

| Disruption Type | PDB Applies? | Examples |
| --- | --- | --- |
| Voluntary | Yes | Node drain, cluster upgrade, kubectl delete |
| Involuntary | No | Node crash, OOM kill, hardware failure |

PDB does not protect against node failures—it protects against planned maintenance.

Creating a PodDisruptionBudget

Guarantee at least 2 pods available during disruptions:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: task-api-pdb
  namespace: task-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: task-api

Apply and verify:

bash
kubectl apply -f task-api-pdb.yaml
kubectl get pdb -n task-api

Output:

text
NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
task-api-pdb   2               N/A               1                     10s

PDB Field Options

| Field | Purpose | When to Use |
| --- | --- | --- |
| minAvailable | Minimum pods that must be available | When you know minimum needed for traffic |
| maxUnavailable | Maximum pods that can be unavailable | When you know how many can fail |

Either minAvailable OR maxUnavailable, not both.

Examples:

yaml
# Option 1: At least 2 pods always available
spec:
  minAvailable: 2

# Option 2: At most 1 pod unavailable at a time
spec:
  maxUnavailable: 1

# Option 3: Percentage (at least 50% available)
spec:
  minAvailable: "50%"

Testing PDB Behavior

Try to drain a node when PDB would be violated:

bash
# With 3 replicas and minAvailable: 2
kubectl drain node-1 --ignore-daemonsets

Output (if 2 pods on node-1):

text
error when evicting pod "task-api-xxx": Cannot evict pod as it would violate the pod's disruption budget.

Kubernetes refuses to drain the node because it would leave fewer than 2 pods.

PDB Best Practices

| Scenario | Recommended PDB |
| --- | --- |
| 3 replicas | minAvailable: 2 or maxUnavailable: 1 |
| 5+ replicas | minAvailable: "60%" or maxUnavailable: "40%" |
| Single replica | No PDB (maxUnavailable: 0 blocks all drains) |
| Stateful workload | maxUnavailable: 1 (sequential drains) |

Liveness and Readiness Probes

Probes tell Kubernetes whether your pod is healthy. Without probes, Kubernetes cannot detect application failures—a crashed process inside a running container looks healthy from outside.

Probe Types

| Probe | Purpose | Action on Failure |
| --- | --- | --- |
| Liveness | Is the container alive? | Restart container |
| Readiness | Is the container ready for traffic? | Remove from Service endpoints |
| Startup | Is the container starting up? | Delay liveness/readiness checks |

Configuring Probes

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: task-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
    spec:
      containers:
        - name: task-api
          image: task-api:v1
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            failureThreshold: 30  # 30 × 5s = 150s startup time

Apply and verify:

bash
kubectl apply -f task-api-deployment.yaml
kubectl get pods -n task-api

Output:

text
NAME                       READY   STATUS    RESTARTS   AGE
task-api-7b9c4d6f5-abc12   1/1     Running   0          30s
task-api-7b9c4d6f5-def34   1/1     Running   0          30s
task-api-7b9c4d6f5-ghi56   1/1     Running   0          30s

Understanding Probe Fields

| Field | Purpose | Guidance |
| --- | --- | --- |
| initialDelaySeconds | Wait before first probe | App startup time |
| periodSeconds | Time between probes | 5-10s typical |
| timeoutSeconds | Max time for probe response | 1-5s |
| failureThreshold | Failures before action | 2-5 (avoid flapping) |
| successThreshold | Successes to recover | 1 (for liveness), 1-3 (for readiness) |

Liveness vs Readiness

Liveness probe failure → Container restarts. Use for detecting deadlocks, infinite loops, or crashed applications.

Readiness probe failure → Traffic stops. Use for temporary unavailability (database reconnecting, cache warming).

Common mistake: Using liveness probe for dependency health. If your database is down, restarting your app won't fix it—and creates restart loops.
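One way to keep the two concerns separate, sketched framework-agnostically (each handler returns a body and status code; `db_ok()` is a hypothetical stand-in for a real dependency check):

```python
# Sketch: liveness and readiness handlers with distinct responsibilities.
def db_ok() -> bool:
    return True  # imagine a cheap "SELECT 1" against the database here

def healthz():
    # Liveness: only "is this process alive and responsive?"
    # No dependency checks here, or a database outage becomes a
    # container restart loop.
    return "ok", 200

def ready():
    # Readiness: "can this pod serve traffic right now?" Dependency
    # checks belong here; a failure just removes the pod from
    # Service endpoints until the dependency recovers.
    return ("ready", 200) if db_ok() else ("not ready", 503)

print(healthz())  # ('ok', 200)
print(ready())    # ('ready', 200)
```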


Graceful Shutdown

When Kubernetes terminates a pod, it sends SIGTERM. If your application does not handle SIGTERM, in-flight requests fail. Graceful shutdown completes ongoing work before exiting.

The Termination Sequence

text
1. Pod marked for termination
2. Pod removed from Service endpoints (new traffic stops)
3. preStop hook executes (if configured)
4. SIGTERM sent to container
5. Wait terminationGracePeriodSeconds
6. SIGKILL sent (forced termination)

The race condition: Steps 2-4 happen nearly simultaneously. Traffic may still arrive while preStop is running.

Configuring Graceful Shutdown

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: task-api
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: task-api
          image: task-api:v1
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "sleep 10"  # Wait for endpoints to propagate

Why sleep 10? Kubernetes endpoints propagation is not instant. The preStop delay ensures traffic stops arriving before your app begins shutdown.

Application-Level Graceful Shutdown

Your application should handle SIGTERM:

python
# Python example
import signal
import sys
import time

# Global flag for shutdown
shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    print("Received SIGTERM, starting graceful shutdown")
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# In your request handler
def handle_request(request):
    if shutting_down:
        return Response("Service shutting down", status=503)
    # Process request...

# In your main loop
while not shutting_down:
    # Accept and process requests
    pass

# Wait for in-flight requests to complete
time.sleep(5)
print("Graceful shutdown complete")
sys.exit(0)

Output (during termination):

text
Received SIGTERM, starting graceful shutdown
Completing 3 in-flight requests...
Graceful shutdown complete

Graceful Shutdown Timing

| Setting | Purpose | Recommended |
| --- | --- | --- |
| terminationGracePeriodSeconds | Total time for graceful shutdown | 30-60s |
| preStop sleep | Wait for endpoints propagation | 5-15s |
| Application drain | Complete in-flight work | Remaining time |

Formula: terminationGracePeriodSeconds = preStop sleep + max request duration + buffer
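The formula can be captured as a small helper (the 5-second buffer default is a judgment call, not a Kubernetes default):

```python
# Hypothetical sizing helper for terminationGracePeriodSeconds:
# preStop sleep + longest expected request + safety buffer.
def grace_period_s(prestop_sleep: int, max_request: int, buffer: int = 5) -> int:
    return prestop_sleep + max_request + buffer

# 10s preStop + 30s longest request + 5s buffer = the 45s used elsewhere
# in this chapter.
print(grace_period_s(prestop_sleep=10, max_request=30))  # 45
```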


Outlier Detection

Outlier detection automatically excludes unhealthy backends from the load balancer. If one pod starts returning errors while others are healthy, outlier detection removes it from rotation without waiting for probes.

Configuring Outlier Detection

yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-outlier
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  healthCheck:
    passive:
      consecutiveGatewayErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Apply and verify:

bash
kubectl apply -f task-api-outlier.yaml

Output:

text
backendtrafficpolicy.gateway.envoyproxy.io/task-api-outlier created

Outlier Detection Fields

| Field | Purpose | Recommended |
| --- | --- | --- |
| consecutiveGatewayErrors | Errors before ejection | 3-10 |
| interval | Time between error checks | 5-30s |
| baseEjectionTime | Initial ejection duration | 30s-5m |
| maxEjectionPercent | Max backends that can be ejected | 10-50% |

How Outlier Detection Works

text
Normal State
─────────────
Pod A: Serving traffic ✓
Pod B: Serving traffic ✓
Pod C: Serving traffic ✓

Pod B Returns 5 Consecutive Errors
──────────────────────────────────
Pod A: Serving traffic ✓
Pod B: EJECTED (30s timeout) ✗
Pod C: Serving traffic ✓

After 30s, Pod B Allowed Back
─────────────────────────────
Pod A: Serving traffic ✓
Pod B: Serving traffic (on probation) ✓
Pod C: Serving traffic ✓

Combining Outlier Detection with Retries

Outlier detection works well with retries:

  1. Request to Pod B fails (error #1)
  2. Retry policy retries to Pod A (success)
  3. After 5 errors, Pod B ejected
  4. All traffic goes to healthy pods
  5. Pod B recovers, rejoins rotation

Exercises

Exercise 1: Configure Retry Policy

Create retry policy with exponential backoff:

bash
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: exercise-retry
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  retry:
    numRetries: 3
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 1s
      timeout: 500ms
    retryOn:
      httpStatusCodes:
        - 503
      triggers:
        - connect-failure
EOF

Verify:

bash
kubectl get backendtrafficpolicy exercise-retry

Expected Output:

text
NAME             AGE
exercise-retry   5s

Exercise 2: Configure Timeouts

Add timeout configuration:

bash
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: exercise-timeout
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  timeout:
    http:
      requestTimeout: 15s
      idleTimeout: 60s
EOF

Test with slow request:

bash
# This should timeout after 15s
time curl http://localhost:8080/api/tasks?delay=20s

Expected Output:

text
upstream request timeout
real    0m15.xxx

Exercise 3: Create PodDisruptionBudget

Protect your deployment with PDB:

bash
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: exercise-pdb
  namespace: default
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: task-api
EOF

Verify:

bash
kubectl get pdb exercise-pdb

Expected Output:

text
NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
exercise-pdb   2               N/A               1                     5s

Exercise 4: Configure Graceful Shutdown

Add preStop hook to your deployment:

bash
kubectl patch deployment task-api -n default --patch '
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: task-api
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]'

Verify:

bash
kubectl get deployment task-api -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'

Expected Output:

text
45

Reflect on Your Skill

You built a traffic-engineer skill in Lesson 0. Based on what you learned about resilience patterns:

Which Patterns Should ALWAYS Be Included?

| Pattern | Always Include? | Reason |
| --- | --- | --- |
| PDB | Yes | Zero cost, prevents maintenance outages |
| Graceful shutdown | Yes | Prevents dropped requests during deploy |
| Readiness probe | Yes | Prevents traffic to unready pods |
| Request timeout | Yes | Prevents resource exhaustion |
| Retries | Almost always | Handles transient failures automatically |

Which Patterns Are Situational?

| Pattern | When to Include | When to Skip |
| --- | --- | --- |
| Liveness probe | Complex apps that can deadlock | Simple stateless apps |
| Outlier detection | Multi-replica deployments | Single replica |
| Startup probe | Slow-starting apps (>30s) | Fast-starting apps |
| Per-retry timeout | With retry policy | Without retries |

Add Resilience Templates to Your Skill

Retry policy template:

yaml
# Template: retry-policy
retry:
  numRetries: {{ retries | default(3) }}
  perRetry:
    backOff:
      baseInterval: {{ base_interval | default("100ms") }}
      maxInterval: {{ max_interval | default("2s") }}
    timeout: {{ per_retry_timeout | default("500ms") }}
  retryOn:
    httpStatusCodes: [502, 503, 504]
    triggers: [connect-failure, retriable-status-codes]

PDB template:

yaml
# Template: pdb
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ service }}-pdb
  namespace: {{ namespace }}
spec:
  minAvailable: {{ min_available | default(2) }}
  selector:
    matchLabels:
      app: {{ service }}

Graceful shutdown template:

yaml
# Template: graceful-shutdown
terminationGracePeriodSeconds: {{ grace_period | default(45) }}
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep {{ prestop_sleep | default(10) }}"]

Update Troubleshooting Guidance

| Symptom | Check | Likely Cause |
| --- | --- | --- |
| Requests fail during deploy | preStop hook configured? | Missing graceful shutdown |
| All pods evicted at once | PDB exists? | Missing PodDisruptionBudget |
| Slow requests never complete | Request timeout set? | No timeout configured |
| Transient errors reach users | Retry policy configured? | Missing retry configuration |
| Traffic to unready pods | Readiness probe configured? | Missing readiness probe |

Try With AI

Generate Resilience Configuration

Ask your traffic-engineer skill:

text
Using my traffic-engineer skill, generate a complete resilience configuration
for my Task API that includes:
- Retry policy: 3 retries with 100ms-2s exponential backoff
- Request timeout: 30 seconds
- PDB: minimum 2 pods available
- Graceful shutdown: 45 second grace period with 10 second preStop

What you're learning: AI generates multiple resilience patterns. Review the output—did AI include all four components? Are the retry and timeout values consistent (per-retry timeout < request timeout)?

Evaluate and Refine

Check AI's output for common issues:

  • Does retry policy include retryOn conditions?
  • Is terminationGracePeriodSeconds greater than preStop sleep + expected request duration?
  • Does PDB selector match your deployment labels?

If something is missing:

text
The retry policy should only retry on 502, 503, 504 status codes—not 500.
Please update retryOn.httpStatusCodes.

Add Health Probes

Extend to include health probes:

text
Now add liveness and readiness probe configuration to my deployment:
- Liveness: HTTP GET /healthz, check every 10s, 3 failures to restart
- Readiness: HTTP GET /ready, check every 5s, 2 failures to remove from rotation
- Startup: HTTP GET /healthz, allow 2 minutes for startup

What you're learning: AI generates probe configuration. Verify the timing makes sense—startup probe should allow enough time, liveness should not be too aggressive.

Validate Complete Configuration

Before applying:

bash
# Validate all resources
kubectl apply --dry-run=client -f resilience-config.yaml

# Check for conflicts
kubectl get pdb -A | grep task-api
kubectl get backendtrafficpolicy -n task-api

This iteration—specifying requirements, evaluating output, validating before apply—produces production-ready resilience configurations.

Safety Note

Resilience patterns interact with each other. Aggressive retries with long timeouts can cause request amplification during outages. Start with conservative settings: lower retry counts (2-3), shorter timeouts (10-30s), and longer backoff intervals (100ms-2s). Monitor your services during incident simulations before production deployment.