USMAN’S INSIGHTS
AI ARCHITECT
  • Home
  • About
  • Thought Leadership
  • Book
Press / Contact
USMAN’S INSIGHTS
AI ARCHITECT
⌘F
HomeBook
HomeBookCapstone: Resilient, Cost-Aware Task API
Previous Chapter
Data Sovereignty and Compliance
Next Chapter
Real Cloud Deployment
AI NOTICE: This is the table of contents for the SPECIFIC CHAPTER only. It is NOT the global sidebar. For all chapters, look at the main navigation.

On this page

45 sections

Progress0%
1 / 45

Muhammad Usman Akbar Entity Profile

Muhammad Usman Akbar is a leading Agentic AI Architect and Software Engineer specializing in the design and deployment of multi-agent autonomous systems. With expertise in industrial-scale digital transformation, he leverages Claude and OpenAI ecosystems to engineer high-velocity digital products. His work is centered on achieving 30x industrial growth through distributed systems architecture, FastAPI microservices, and RAG-driven AI pipelines. Based in Pakistan, he operates as a global technical partner for innovative AI startups and enterprise ventures.

USMAN’S INSIGHTS
AI ARCHITECT

Transforming businesses into autonomous AI ecosystems. Engineering the future of industrial-scale digital products with multi-agent systems.

30X Growth
AI-First
Innovation

Navigation

  • Home
  • Book
  • About
  • Contact
Let's Collaborate

Have a Project in Mind?

Let's build something extraordinary together. Transform your vision into autonomous AI reality.

Start Your Transformation

© 2026 Muhammad Usman Akbar. All rights reserved.

Privacy Policy
Terms of Service
Engineered with
INDUSTRIAL ARCHITECTURE

Capstone: Resilient, Cost-Aware Task API

Your CFO asks why Kubernetes costs doubled last quarter. You open OpenCost and show exactly which team and application drove the increase. Finance schedules a meeting to discuss optimization.

A junior developer accidentally deletes the production namespace. Instead of panic, you restore from yesterday's Velero backup. The database comes back with consistent state because your pre-backup hooks ran pg_dump before the snapshot. Production is back in 25 minutes.

During a Game Day, you kill one Task API pod while monitoring latency. Kubernetes spawns a replacement in 4 seconds. No requests return 5xx errors. You document the RTO: 4 seconds. That's 26 seconds better than your 30-second target.

This is operational excellence. Not a checklist of installed tools, but a system that answers real questions, recovers from real disasters, and proves itself under controlled failure.

This capstone integrates everything from Chapter 89: VPA for right-sizing, cost allocation labels for visibility, Velero for backup, and Chaos Mesh for resilience validation. By the end, your Task API deployment demonstrates production-ready operational excellence.


Phase 1: Write the Specification

Every production system needs explicit requirements. Before deploying anything, write a specification that defines success.

Operational Excellence Spec Template

Create a file called task-api-opex-spec.md:

markdown
# Task API Operational Excellence Specification ## 1. Cost Visibility Requirements **Goal**: Answer "What does Task API cost, and who pays for it?" **Success Criteria**: - SC-001: All Task API pods carry four cost allocation labels (`team`, `app`, `environment`, `cost-center`) - SC-002: OpenCost returns cost data when queried by team label - SC-003: VPA generates right-sizing recommendations within 24 hours ## 2. Disaster Recovery Requirements **Goal**: Recover from namespace deletion within 30 minutes **RTO (Recovery Time Objective)**: 30 minutes **RPO (Recovery Point Objective)**: 24 hours (daily backups acceptable) **Success Criteria**: - SC-004: Velero Schedule runs daily at 2 AM UTC - SC-005: Backups retain for 30 days (720 hours) - SC-006: Pre-backup hook runs pg_dump for database consistency - SC-007: Restore completes within RTO (test in staging) ## 3. Resilience Requirements **Goal**: Prove Task API survives pod failures **Success Criteria**: - SC-008: Pod kill recovery under 30 seconds - SC-009: No 5xx errors during single pod failure - SC-010: P95 latency stays under 500ms during recovery ## 4. Constraints - VPA in Off mode only (recommendations, no automatic restarts) - Chaos experiments in staging only (not production) - Cost labels must not conflict with existing selectors

Why Spec First?

Without a specification, you deploy components and hope they work. With a specification:

  • You know what "done" looks like (10 success criteria)
  • You can verify each requirement independently
  • AI can implement components that meet specific criteria
  • Post-implementation validation has clear targets

Phase 2: Component Composition

Now implement each component from previous lessons, composed into a unified deployment.

Complete Deployment with Cost Labels (from L04)

yaml
# task-api-deployment.yaml apiVersion: apps/v1 kind: Deployment metadata: name: task-api namespace: production labels: team: product-team app: task-api environment: production cost-center: engineering spec: replicas: 3 selector: matchLabels: app: task-api template: metadata: labels: # Critical: Labels here go strictly to pods for OpenCost team: product-team app: task-api environment: production cost-center: engineering spec: containers: - name: task-api image: ghcr.io/FISTA Solution/task-api:1.0.0 ports: - containerPort: 8000 name: http resources: requests: cpu: 250m memory: 512Mi limits: cpu: 500m memory: 1Gi livenessProbe: httpGet: path: /health port: 8000 initialDelaySeconds: 10 periodSeconds: 10 readinessProbe: httpGet: path: /health port: 8000 initialDelaySeconds: 5 periodSeconds: 5 --- apiVersion: v1 kind: Service metadata: name: task-api namespace: production labels: team: product-team app: task-api environment: production cost-center: engineering spec: selector: app: task-api ports: - port: 8000 targetPort: 8000 name: http

Labels placement is critical: Labels appear in both metadata.labels (for Deployment) and spec.template.metadata.labels (for pods). OpenCost reads pod labels, not Deployment labels.

VPA Configuration (from L02)

yaml
# task-api-vpa.yaml apiVersion: autoscaling.k8s.io/v1 kind: VerticalPodAutoscaler metadata: name: task-api-vpa namespace: production labels: team: product-team app: task-api environment: production cost-center: engineering spec: targetRef: apiVersion: apps/v1 kind: Deployment name: task-api updatePolicy: updateMode: "Off" # SC-003: Recommendations only resourcePolicy: containerPolicies: - containerName: task-api minAllowed: cpu: 100m memory: 128Mi maxAllowed: cpu: 2000m memory: 2Gi controlledResources: ["cpu", "memory"]

Why Off mode: Per the specification, we want recommendations without automatic restarts. After validating recommendations for 1-2 weeks, you can change to Initial or Recreate mode.

Velero Backup Schedule (from L06)

yaml
# task-api-velero-schedule.yaml apiVersion: velero.io/v1 kind: Schedule metadata: name: task-api-production-daily namespace: velero labels: team: product-team app: task-api environment: production spec: schedule: "0 2 * * *" # SC-004: 2 AM UTC daily useOwnerReferencesInBackup: false template: includedNamespaces: - production includedResources: - "*" excludedResources: - events - pods - replicasets snapshotVolumes: true storageLocation: default ttl: 720h # SC-005: 30-day retention metadata: labels: team: product-team app: task-api backup-type: daily hooks: resources: - name: postgres-consistency includedNamespaces: - production labelSelector: matchLabels: app: postgres pre: - exec: container: postgres command: - /bin/bash - -c - | echo "Pre-backup hook started at $(date)" psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "CHECKPOINT;" pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB" > /var/lib/postgresql/data/backup.sql echo "Pre-backup hook completed at $(date)" onError: Fail # SC-006: Fail backup if hook fails timeout: 120s post: - exec: container: postgres command: - /bin/bash - -c - | rm -f /var/lib/postgresql/data/backup.sql echo "Post-backup cleanup completed at $(date)" onError: Continue timeout: 30s

Hook purpose: The pre-backup hook runs CHECKPOINT to flush PostgreSQL buffers and pg_dump for application-level recovery. This ensures SC-006: database consistency during backup.

PodChaos Experiment (from L07)

yaml
# task-api-podchaos.yaml apiVersion: chaos-mesh.org/v1alpha1 kind: PodChaos metadata: name: task-api-pod-kill namespace: chaos-mesh labels: team: product-team app: task-api experiment-type: resilience spec: action: pod-kill mode: one # Kill exactly one pod selector: namespaces: - staging # Per spec: staging only labelSelectors: app: task-api duration: "60s"

Why staging only: The specification requires chaos experiments in staging, not production. After validating behavior in staging, you can create a production experiment for scheduled Game Days.


Phase 3: AI Orchestration

You've written a specification and prepared individual components. Now use your operational-excellence skill to verify everything integrates correctly.

Prompt for AI Implementation Verification

text
I have an Operational Excellence Specification for Task API with these success criteria: - SC-001 to SC-003: Cost visibility (labels, OpenCost query, VPA recommendations) - SC-004 to SC-007: Disaster recovery (Velero schedule, hooks, 30-min RTO) - SC-008 to SC-010: Resilience (pod kill recovery under 30s, no 5xx, latency) Using my operational-excellence skill, review these manifests and tell me: 1. Do the cost labels appear in the correct places for OpenCost to read them? 2. Is the VPA configuration safe for production (Off mode only)? 3. Will the Velero hooks provide database consistency? 4. Is the PodChaos experiment scoped correctly for safety? [Paste the manifests from Phase 2]

What AI Verifies

AI reviews your manifests against the operational-excellence skill patterns:

Cost Labels Check:

  • Labels in spec.template.metadata.labels (where pods get them)
  • All four required labels present: team, app, environment, cost-center
  • Labels consistent between Deployment metadata and pod template

VPA Safety Check:

  • updateMode: "Off" confirms recommendations-only
  • minAllowed and maxAllowed set reasonable bounds
  • No conflict with HPA (none defined)

Velero Completeness Check:

  • Schedule cron matches requirement (2 AM daily)
  • TTL matches requirement (720h = 30 days)
  • Hooks include onError: Fail for data consistency
  • Timeout appropriate for pg_dump (120s)

Chaos Safety Check:

  • Namespace targeting staging, not production
  • Mode is one (not all)
  • Duration limited (60s)

Iterating with AI

If AI identifies issues, it suggests corrections:

text
Your VPA is correctly in Off mode. However, I notice your Velero Schedule uses labelSelector on app: postgres, but your Deployment labels show app: task-api. If your PostgreSQL database has different labels than task-api, the hook is correct. But if your PostgreSQL is part of the task-api deployment with the same labels, the hook won't match any pods. Please verify: What labels does your PostgreSQL pod have?

You provide context:

text
PostgreSQL is a separate StatefulSet with labels: app: postgres team: product-team It's not part of the task-api Deployment.

AI confirms:

text
Good. Your Velero hook selector is correct then. The labelSelector: matchLabels: app: postgres will match the PostgreSQL StatefulSet pods. Your manifests look complete. Apply them in this order: 1. Deployment and Service (task-api-deployment.yaml) 2. VPA (task-api-vpa.yaml) 3. Velero Schedule (task-api-velero-schedule.yaml) 4. PodChaos experiment (task-api-podchaos.yaml) - in staging only

Phase 4: Convergence and Validation

Deploy all components and verify each success criterion.

Apply the Manifests

bash
# Create production namespace if needed kubectl create namespace production --dry-run=client -o yaml | kubectl apply -f - # Deploy Task API with cost labels kubectl apply -f task-api-deployment.yaml # Apply VPA kubectl apply -f task-api-vpa.yaml # Apply Velero Schedule kubectl apply -f task-api-velero-schedule.yaml # Apply PodChaos (only in staging) kubectl annotate namespace staging chaos-mesh.org/inject=enabled --overwrite kubectl apply -f task-api-podchaos.yaml

Output:

text
namespace/production configured deployment.apps/task-api created service/task-api created verticalpodautoscaler.autoscaling.k8s.io/task-api-vpa created schedule.velero.io/task-api-production-daily created podchaos.chaos-mesh.org/task-api-pod-kill created

Verification Checklist

Work through each success criterion systematically.

SC-001: Cost Allocation Labels Present

bash
kubectl get pods -n production -l app=task-api -o jsonpath='{.items[0].metadata.labels}' | jq

Output:

json
{ "app": "task-api", "cost-center": "engineering", "environment": "production", "team": "product-team", "pod-template-hash": "7d8f9c4b5f"}

All four labels present on pods.

SC-002: OpenCost Returns Cost Data by Team

bash
kubectl port-forward -n monitoring svc/opencost 9003:9003 & curl -s "localhost:9003/allocation/compute?window=1d&aggregate=label:team" | jq '.data[0]'

Output:

json
{ "product-team": { "name": "product-team", "cpu Cost": 0.0432, "memory Cost": 0.0216, "total Cost": 0.0648 }}

OpenCost aggregates costs by team label.

SC-003: VPA Generates Recommendations

bash
kubectl describe vpa task-api-vpa -n production | grep -A 20 "Recommendation:"

Output:

text
Recommendation: Container Recommendations: Container Name: task-api Lower Bound: Cpu: 100m Memory: 200Mi Target: Cpu: 180m Memory: 450Mi Upper Bound: Cpu: 500m Memory: 900Mi

VPA recommends 180m CPU and 450Mi memory (lower than your 250m/512Mi requests).

SC-004 & SC-005: Velero Schedule Active with 30-Day Retention

bash
velero schedule describe task-api-production-daily

Output:

text
Name: task-api-production-daily Namespace: velero Phase: Enabled Schedule: 0 2 * * * TTL: 720h0m0s Last Backup: 2025-12-30 02:00:00 +0000 UTC

Schedule is enabled with correct timing and retention.

SC-006: Backup Hook Runs Successfully

Check the most recent backup for hook execution:

bash
velero backup describe $(velero backup get | grep task-api | head -1 | awk '{print $1}') --details | grep -A 10 "Hooks:"

Output:

text
Hooks: Resources: Name: postgres-consistency Pre Exec Hook: Container: postgres Command: /bin/bash -c echo "Pre-backup hook started..." On Error: Fail Timeout: 2m0s Execution: Succeeded

Hooks executed successfully.

SC-007: Restore Completes Within RTO (Staging Test)

Test restore in staging (never test restores against production):

bash
# Create a test restore to staging namespace velero restore create test-restore-$(date +%s) \ --from-backup task-api-production-daily-$(date +%Y%m%d)020000 \ --namespace-mappings production:staging-restore-test \ --include-namespaces production # Time the restore time velero restore wait test-restore-*

Output:

text
Restore completed with status: Completed. Duration: 2m15s. real 2m15.234s

Restore completed in 2 minutes 15 seconds. Well under the 30-minute RTO.

SC-008: Pod Kill Recovery Under 30 Seconds

Run the PodChaos experiment and measure recovery:

bash
# Watch pods in staging kubectl get pods -n staging -l app=task-api -w & # Apply chaos experiment kubectl apply -f task-api-podchaos.yaml # Record timestamps echo "Experiment started at: $(date)"

Output:

text
NAME READY STATUS RESTARTS AGE task-api-7d8f9c4b5f-abc12 1/1 Running 0 2h task-api-7d8f9c4b5f-def34 1/1 Running 0 2h task-api-7d8f9c4b5f-ghi56 1/1 Running 0 2h task-api-7d8f9c4b5f-abc12 1/1 Terminating 0 2h task-api-7d8f9c4b5f-jkl78 0/1 Pending 0 0s task-api-7d8f9c4b5f-jkl78 0/1 ContainerCreating 0 1s task-api-7d8f9c4b5f-jkl78 1/1 Running 0 4s Experiment started at: Mon Dec 30 10:45:00 UTC 2025

Recovery time: 4 seconds. Well under the 30-second target.

SC-009 & SC-010: No 5xx Errors, Latency Within Target

During the chaos experiment, check for errors:

bash
# Query Prometheus for error rate during experiment window curl -s "localhost:9090/api/v1/query?query=sum(rate(task_api_requests_total{status=~\"5..\"}[5m]))" | jq '.data.result[0].value[1]'

Output:

json
"0"

Zero 5xx errors during pod kill.

bash
# Query P95 latency during experiment curl -s "localhost:9090/api/v1/query?query=histogram_quantile(0.95,sum(rate(task_api_request_duration_seconds_bucket[5m]))by(le))*1000" | jq '.data.result[0].value[1]'

Output:

json
"245.3"

P95 latency: 245ms. Under the 500ms target.

Complete Verification Summary

Success CriterionTargetActualStatus
SC-001: Cost labels present4 labels4 labelsPASS
SC-002: OpenCost query by teamReturns dataReturns dataPASS
SC-003: VPA recommendationsWithin 24hAvailablePASS
SC-004: Velero schedule daily2 AM UTC2 AM UTCPASS
SC-005: 30-day retention720h720hPASS
SC-006: Pre-backup hook successRunsSucceededPASS
SC-007: Restore within RTO< 30 min2m 15sPASS
SC-008: Pod recovery time< 30s4sPASS
SC-009: No 5xx during failure00PASS
SC-010: P95 latency during failure< 500ms245msPASS

All success criteria verified.

Capstone Artifacts

By completing this capstone, you should have created:

ArtifactLocationPurpose
Operational Excellence Spectask-api-opex-spec.mdDefines success criteria
Deployment with labelstask-api-deployment.yamlTask API with cost allocation
VPA configurationtask-api-vpa.yamlRight-sizing recommendations
Velero Scheduletask-api-velero-schedule.yamlDaily backup with hooks
PodChaos experimenttask-api-podchaos.yamlResilience validation
Verification results(documented)Proof of compliance

Reflect on Your Skill (Final)

You built an operational-excellence skill in Lesson 0. Throughout this chapter, you've tested and improved it. Now finalize it for production use.

Final Skill Test

text
Using my operational-excellence skill, generate a complete operational excellence implementation for a new service called "order-service" with: Requirements: - 99.9% availability SLO - RTO: 15 minutes, RPO: 4 hours - Cost allocation: team=commerce, cost-center=revenue - Must survive single pod failure without errors Generate all manifests and a verification checklist.

Verify Complete Coverage

Your skill should produce:

  • Deployment with all four cost allocation labels
  • VPA in Off mode with appropriate bounds
  • Velero Schedule with 6-hour backup frequency (RPO: 4 hours)
  • Pre-backup hooks for database consistency
  • PodChaos experiment for staging
  • Verification commands for each requirement

Identify and Fill Gaps

If your skill missed any component:

text
My operational-excellence skill generated the Deployment and VPA but forgot the Velero hooks for database consistency. Update it to always include: 1. Pre-backup hook with pg_dump (or equivalent for the database type) 2. onError: Fail to ensure consistency 3. Appropriate timeout based on database size Also add a decision tree: "Is this a stateful service with a database?" If yes, include hooks. If no, skip hooks.

Skill Finalization Checklist

Before adding your skill to your portfolio:

  • Covers all four areas: cost (VPA + labels), backup (Velero), resilience (Chaos Mesh), and visibility (OpenCost)
  • Includes safety guardrails (Off mode for VPA, staging for chaos, onError: Fail for hooks)
  • Generates correct YAML syntax for all CRDs
  • Includes verification commands for each component
  • Decision tree helps select appropriate components based on requirements

Add to Your Portfolio

Your operational-excellence skill is now a Digital FTE component. When you build AI agents or cloud-native applications:

  1. Invoke the skill with your service name and requirements
  2. Get production-ready manifests
  3. Apply and verify with the generated checklist
  4. Iterate if requirements change

This is the Digital Fte's model: build once, deploy everywhere. Your operational excellence patterns are encoded, tested, and ready to manufacture into any production deployment.

Try With AI

Prompt 1: Extend to Multi-Service Architecture

text
I've completed the Task API capstone with full operational excellence. Now I need to extend this to a microservices architecture with: - task-api (FastAPI + PostgreSQL) - notification-service (stateless, sends emails) - inference-service (GPU workloads, high cost) How should I structure cost allocation labels, backup schedules, and chaos experiments differently for each service type?

What you're learning: Adapting operational excellence patterns to different service characteristics. Stateless services need different backup strategies than databases. GPU workloads need different cost monitoring than CPU-bound services.

Prompt 2: Automate Verification

text
The verification checklist from the capstone is manual. I want to automate it. Help me create a script that: 1. Checks all pods have the four cost allocation labels 2. Verifies VPA is in Off mode 3. Confirms Velero Schedule is enabled 4. Runs a chaos experiment and measures recovery time Output should be a pass/fail report suitable for CI/CD.

What you're learning: Turning manual verification into automated compliance checks. Production systems need continuous validation, not just initial deployment testing.

Prompt 3: Handle Failure Scenarios

text
During a Game Day, my Task API took 45 seconds to recover instead of the target 30 seconds. The pod was killed, but the new pod spent 40 seconds in ContainerCreating state. Help me diagnose: 1. What could cause slow container creation? 2. How do I identify the bottleneck (image pull, resource scheduling, init containers)? 3. What changes would reduce recovery time to under 30 seconds?

What you're learning: Debugging operational performance when targets are missed. Chaos engineering often reveals issues that weren't visible during normal operation. Your skill should help diagnose and fix these issues.

Safety note: This capstone deploys real infrastructure that interacts with production systems. Always verify namespace targeting before applying Velero restores or Chaos Mesh experiments. The manifests in this lesson target specific namespaces, but a single typo in includedNamespaces can affect unintended resources. Review manifests carefully, and always test in staging before production.