Module 7 takes the agent you built in Module 6 and turns it into a production cloud service. You'll containerize the stack, orchestrate it on Kubernetes, automate delivery, and operate it with observability, security, and cost controls. The goal: a reliable Digital FTE that runs 24/7 for real users.
Prerequisites: Modules 4-6. You need a working agent service to deploy.
Your service is healthy. All pods are running, health checks pass, metrics look green. Then a network glitch causes a database connection to drop for 200 milliseconds. Without retry logic, that request fails permanently. A user sees an error. They refresh, it works—but trust erodes. Meanwhile, during a Kubernetes node upgrade, all your pods get evicted simultaneously because you forgot to create a PodDisruptionBudget. Your service goes down for 90 seconds while new pods start. Both failures were preventable.
Resilience patterns prepare your services for the failures that will happen. Networks are unreliable. Pods get evicted. Dependencies slow down. The question is not whether failures occur, but whether your system handles them gracefully. This lesson teaches production-grade patterns: retry policies that recover from transient failures, timeouts that prevent resource exhaustion, PodDisruptionBudgets that protect during maintenance, and graceful shutdown that completes in-flight requests.
By the end, you will configure retry policies with exponential backoff, set request and connection timeouts, create PDBs that guarantee minimum availability, and implement graceful shutdown with preStop hooks.
Production resilience operates at multiple layers, each handling a different failure mode: retries and timeouts at the traffic layer, PodDisruptionBudgets and health probes at the orchestration layer, and graceful shutdown inside the application itself.
Retries automatically resend failed requests. A 500ms network timeout does not mean the operation failed—the server may have completed successfully. Retry policies distinguish between failures worth retrying (transient) and failures that should not be retried (permanent).
Configure retries using Envoy Gateway's BackendTrafficPolicy:
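A policy along these lines does the job. This is a sketch against Envoy Gateway's v1alpha1 API; the resource names and the target HTTPRoute are placeholders for your own, and you should check the fields against the API version you have installed:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: agent-retry-policy
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: agent-route           # your route from Module 6
  retry:
    numRetries: 3                 # give up after 3 attempts
    perRetry:
      backOff:
        baseInterval: 100ms       # first delay; doubles each attempt
        maxInterval: 2s           # cap on any single delay
    retryOn:
      triggers: ["connect-failure", "retriable-status-codes"]
      httpStatusCodes: [503]      # retry only transient server errors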
Apply and verify:
Output:
Exponential backoff increases delay between retries to avoid overwhelming a recovering service:
Why exponential? If the server is overloaded, retrying immediately adds more load. Exponential backoff gives the server time to recover.
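The schedule itself is easy to sketch. A hypothetical Python illustration of exponential backoff with full jitter (not Envoy's exact implementation, but the same shape):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Delay before retry `attempt` (0-based): ceiling doubles each attempt,
    capped at `cap`, with full jitter so clients don't retry in lockstep."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

# Delay ceilings: attempt 0 -> 0.1s, 1 -> 0.2s, 2 -> 0.4s, ... capped at 2.0s.
for attempt in range(5):
    print(f"attempt {attempt}: sleep up to {min(2.0, 0.1 * 2 ** attempt):.1f}s")
```

The jitter matters as much as the doubling: without it, every client that failed at the same moment retries at the same moment, producing synchronized load spikes.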
Create a service that fails intermittently:
Test without retries:
Output (no retries):
With retry policy applied, same requests:
Output (with retries):
Most failures recovered through automatic retry.
Timeouts prevent slow dependencies from exhausting resources. Without timeouts, a thread waiting for a response that never comes holds resources indefinitely. Multiply by concurrent requests, and your service runs out of threads.
Configure timeouts in BackendTrafficPolicy:
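A sketch of the timeout fields, again against Envoy Gateway's v1alpha1 API (names are placeholders; verify the fields for your installed version):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: agent-timeout-policy
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: agent-route
  timeout:
    tcp:
      connectTimeout: 10s     # fail fast on unreachable backends
    http:
      requestTimeout: 30s     # upper bound on the entire request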
Apply and verify:
Output:
Simulate slow responses:
Output (with 30s timeout):
The request times out after 30 seconds instead of waiting 45 seconds.
When using retries, set per-retry timeout shorter than request timeout:
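In the policy, that means setting the perRetry timeout below the HTTP request timeout, roughly like this (field names per Envoy Gateway's v1alpha1 API):

```yaml
spec:
  timeout:
    http:
      requestTimeout: 30s   # total budget for the request
  retry:
    numRetries: 3
    perRetry:
      timeout: 5s           # each individual attempt gets at most 5s
      backOff:
        baseInterval: 100ms
        maxInterval: 2s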
Why? If each retry takes the full request timeout, 3 retries × 30s = 90s total wait. With 5s per-retry timeout: 3 retries × 5s + backoff = ~20s maximum.
Kubernetes may evict pods for many reasons: node upgrades, scaling down, resource pressure. Without protection, Kubernetes can evict all your pods simultaneously, causing an outage. PodDisruptionBudget (PDB) guarantees minimum availability during voluntary disruptions.
PDB does not protect against node failures—it protects against planned maintenance.
Guarantee at least 2 pods available during disruptions:
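A minimal PDB, assuming your Deployment's pods carry the label app: agent (adjust the selector to match your own labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb
spec:
  minAvailable: 2            # voluntary evictions must leave at least 2 pods
  selector:
    matchLabels:
      app: agent             # must match your Deployment's pod labels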
Apply and verify:
Output:
A PDB spec accepts either minAvailable or maxUnavailable, never both.
Examples:
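Selector aside, the availability guarantee can be expressed in several ways; pick exactly one field per PDB:

```yaml
# At least 2 pods must remain available:
spec:
  minAvailable: 2
---
# At least half the pods must remain available (rounded up):
spec:
  minAvailable: "50%"
---
# At most 1 pod may be disrupted at a time:
spec:
  maxUnavailable: 1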
Try to drain a node when PDB would be violated:
Output (if 2 pods on node-1):
Kubernetes refuses to drain the node because it would leave fewer than 2 pods.
Probes tell Kubernetes whether your pod is healthy. Without probes, Kubernetes cannot detect application failures—a crashed process inside a running container looks healthy from outside.
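A typical probe block on the container spec. The endpoint paths and port here are hypothetical; use whatever health endpoints your agent actually exposes:

```yaml
containers:
  - name: agent
    # ...
    livenessProbe:
      httpGet:
        path: /healthz       # hypothetical; should check only the process itself
        port: 8000
      periodSeconds: 10
      failureThreshold: 3    # restart after ~30s of consecutive failures
    readinessProbe:
      httpGet:
        path: /ready         # hypothetical; may check dependencies
        port: 8000
      periodSeconds: 5
      failureThreshold: 2    # stop routing traffic after ~10s of failures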
Apply and verify:
Output:
Liveness probe failure → Container restarts. Use for detecting deadlocks, infinite loops, or crashed applications.
Readiness probe failure → Traffic stops: the pod is removed from Service endpoints but not restarted. Use for temporary unavailability (database reconnecting, cache warming).
Common mistake: Using liveness probe for dependency health. If your database is down, restarting your app won't fix it—and creates restart loops.
When Kubernetes terminates a pod, it sends SIGTERM. If your application does not handle SIGTERM, in-flight requests fail. Graceful shutdown completes ongoing work before exiting.
The race condition: endpoint removal and pod shutdown proceed in parallel, nearly simultaneously. Traffic may still arrive while preStop is running.
Why sleep 10? Kubernetes endpoints propagation is not instant. The preStop delay ensures traffic stops arriving before your app begins shutdown.
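Putting the pieces together on the pod spec (the 45s grace period follows the formula below; adjust for your own request durations):

```yaml
spec:
  terminationGracePeriodSeconds: 45   # preStop (10s) + max request (30s) + buffer
  containers:
    - name: agent
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]  # let endpoint removal propagate first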
Your application should handle SIGTERM:
Output (during termination):
Formula: terminationGracePeriodSeconds = preStop sleep + max request duration + buffer
Outlier detection automatically excludes unhealthy backends from the load balancer. If one pod starts returning errors while others are healthy, outlier detection removes it from rotation without waiting for probes.
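In Envoy Gateway this is configured as a passive health check on the BackendTrafficPolicy. A sketch against the v1alpha1 API (verify field names for your installed version):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: agent-outlier-policy
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: agent-route
  healthCheck:
    passive:
      consecutive5XxErrors: 5    # eject after 5 consecutive 5xx responses
      interval: 2s               # how often ejection decisions are made
      baseEjectionTime: 30s      # how long an ejected pod stays out
      maxEjectionPercent: 50     # never eject more than half the backends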
Apply and verify:
Output:
Outlier detection works well with retries: a failed request can be retried against a different backend, while repeated failures eject the bad pod from rotation so later requests never reach it at all.
Create retry policy with exponential backoff:
Verify:
Expected Output:
Add timeout configuration:
Test with slow request:
Expected Output:
Protect your deployment with PDB:
Verify:
Expected Output:
Add preStop hook to your deployment:
Verify:
Expected Output:
You built a traffic-engineer skill in Lesson 0. Based on what you learned about resilience patterns:
Retry policy template:
PDB template:
Graceful shutdown template:
Ask your traffic-engineer skill:
What you're learning: AI generates multiple resilience patterns. Review the output—did AI include all four components? Are the retry and timeout values consistent (per-retry timeout < request timeout)?
Check AI's output for common issues:
If something is missing:
Extend to include health probes:
What you're learning: AI generates probe configuration. Verify the timing makes sense—startup probe should allow enough time, liveness should not be too aggressive.
Before applying:
This iteration—specifying requirements, evaluating output, validating before apply—produces production-ready resilience configurations.
Resilience patterns interact with each other. Aggressive retries with long timeouts can cause request amplification during outages. Start with conservative settings: lower retry counts (2-3), shorter timeouts (10-30s), and longer backoff intervals (100ms-2s). Monitor your services during incident simulations before production deployment.