USMAN’S INSIGHTS
AI ARCHITECT
Your Deployment Is Running But It Isn't Ready: The 10-Point Production Checklist That Catches What You Miss
© 2026 Muhammad Usman Akbar. All rights reserved.


Production Checklist & Verification

You've deployed your Task API to a cloud Kubernetes cluster. But deployed doesn't mean production-ready. A deployment can run without being resilient, observable, or secure.

This chapter gives you a systematic approach: a 10-point production readiness checklist that separates "it works on my cluster" from "it's ready for real traffic."

The pattern you'll learn here applies to any Kubernetes deployment—not just DigitalOcean, not just Task API. Once you internalize this checklist, you can verify any deployment on any cloud.

Why Checklists Matter in Production

Airplane pilots use pre-flight checklists despite thousands of hours of experience. Surgeons use surgical checklists despite years of training. The reason? Humans forget things under pressure, and production deployments happen under pressure.

A deployment might fail silently in ways that only manifest under load:

  • Missing health probes mean Kubernetes can't restart failing pods
  • Missing resource limits mean one pod can starve the others on its node
  • A missing HPA means traffic spikes cause outages instead of scale-ups

The checklist catches these issues before customers do.

The 10-Point Production Readiness Checklist

| # | Check | Command | Pass Criteria |
|---|-------|---------|---------------|
| 1 | Health endpoint responds | curl https://domain/health | HTTP 200 |
| 2 | Resource limits set | kubectl describe pod <pod> | Limits visible |
| 3 | Replicas >= 2 | kubectl get deploy | READY shows 2+ |
| 4 | Liveness probe configured | kubectl get deploy -o yaml | livenessProbe present |
| 5 | Readiness probe configured | kubectl get deploy -o yaml | readinessProbe present |
| 6 | TLS certificate valid | curl -v https://domain | Certificate OK |
| 7 | Secrets not in env vars | kubectl describe pod | No sensitive values |
| 8 | Pod disruption budget | kubectl get pdb | PDB exists |
| 9 | HPA configured (if needed) | kubectl get hpa | HPA exists |
| 10 | Cost estimate documented | Provider dashboard | Monthly cost known |

Let's verify each item systematically.

Check 1: Health Endpoint Responds

The health endpoint is your deployment's vital sign. If it doesn't respond, nothing else matters.

bash
curl -s -o /dev/null -w "%{http_code}" https://your-domain.com/health

Output (Pass):

text
200

Output (Fail):

text
000

A 000 response typically means DNS isn't resolving or the service isn't reachable. Check your Ingress and DNS configuration.

For more detail:

bash
curl -v https://your-domain.com/health

Output (Pass):

text
< HTTP/2 200
< content-type: application/json
{"status": "healthy", "database": "connected"}

What you're verifying: The entire path works—DNS resolves, Load Balancer routes, Ingress matches, Service forwards, Pod responds.
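When the health check fails, the status code itself narrows the diagnosis. As a rough sketch (a hypothetical helper, not part of the Task API), you can map the code captured by `curl -w "%{http_code}"` to its most likely cause:

```shell
# Hypothetical helper: map the three-digit code from
# `curl -s -o /dev/null -w "%{http_code}"` to a likely cause.
diagnose_health() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "unreachable: check DNS and Ingress" ;;
    503) echo "pod not ready: check readiness probe and pod logs" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

diagnose_health 200   # healthy
diagnose_health 000   # unreachable: check DNS and Ingress
```

Wiring a helper like this into a deploy pipeline turns a cryptic `000` into an actionable hint.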

Check 2: Resource Limits Set

Without resource limits, a single misbehaving pod can consume all node resources, crashing other workloads.

bash
kubectl describe pod -l app=task-api | grep -A 5 "Limits:"

Output (Pass):

text
Limits:
  cpu:     500m
  memory:  512Mi
Requests:
  cpu:     100m
  memory:  256Mi

Output (Fail):

text
Limits:   <none>
Requests: <none>

If you see <none>, add resource specifications to your deployment:

yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

What you're verifying: Kubernetes knows how much CPU and memory your pods need, enabling proper scheduling and preventing resource starvation.
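Requests also determine how many pods the scheduler can pack onto a node. A back-of-envelope sketch, using the request values above and illustrative node capacity (the 2000m/4096Mi figures are assumptions, not your cluster's):

```shell
# Back-of-envelope scheduling math (illustrative numbers):
# how many pods fit on one node, given allocatable capacity
# and the per-pod requests from the manifest above.
NODE_CPU_M=2000        # allocatable CPU in millicores (assumed)
NODE_MEM_MI=4096       # allocatable memory in Mi (assumed)
POD_CPU_M=100          # per-pod CPU request
POD_MEM_MI=256         # per-pod memory request

BY_CPU=$(( NODE_CPU_M / POD_CPU_M ))
BY_MEM=$(( NODE_MEM_MI / POD_MEM_MI ))

# The scheduler is constrained by whichever resource runs out first.
MAX_PODS=$(( BY_CPU < BY_MEM ? BY_CPU : BY_MEM ))
echo "fits by CPU: $BY_CPU, by memory: $BY_MEM, max pods: $MAX_PODS"
```

Here memory is the binding constraint: 16 pods fit, even though CPU alone would allow 20.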

Check 3: Replicas >= 2

A single replica means zero redundancy. If that pod crashes or its node goes down, your service is unavailable.

bash
kubectl get deploy task-api

Output (Pass):

text
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
task-api   2/2     2            2           1h

Output (Fail):

text
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
task-api   1/1     1            1           1h

Scale up if needed:

bash
kubectl scale deploy task-api --replicas=2

What you're verifying: Your service survives the loss of any single pod or node.

Check 4: Liveness Probe Configured

Liveness probes tell Kubernetes when to restart a stuck container. Without them, a deadlocked process runs forever.

bash
kubectl get deploy task-api -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}' | jq .

Output (Pass):

json
{
  "httpGet": {
    "path": "/health",
    "port": 8000,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 10,
  "periodSeconds": 30,
  "timeoutSeconds": 5,
  "failureThreshold": 3
}

Output (Fail):

text
null

If missing, add to your deployment spec:

yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3

What you're verifying: Kubernetes will automatically restart containers that become unresponsive.
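The probe values also bound how long a hung container survives before the restart. A rough worst-case estimate with the numbers above (a sketch; real timing also includes scheduling jitter):

```shell
# Rough worst-case time a hung container runs before Kubernetes
# restarts it, using the probe values configured above:
# failureThreshold consecutive probes must fail, and each failing
# probe can take up to periodSeconds + timeoutSeconds.
PERIOD=30
TIMEOUT=5
FAILURE_THRESHOLD=3

WORST_CASE=$(( (PERIOD + TIMEOUT) * FAILURE_THRESHOLD ))
echo "up to ~${WORST_CASE}s of hang before a restart kicks in"
```

If ~105 seconds of unresponsiveness is too long for your service, tighten periodSeconds or failureThreshold, but leave enough slack that slow-but-healthy requests don't trigger restarts.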

Check 5: Readiness Probe Configured

Readiness probes tell Kubernetes when a pod is ready to receive traffic. Without them, traffic routes to pods still initializing.

bash
kubectl get deploy task-api -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}' | jq .

Output (Pass):

json
{
  "httpGet": {
    "path": "/health",
    "port": 8000,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 5,
  "periodSeconds": 10,
  "timeoutSeconds": 3,
  "successThreshold": 1,
  "failureThreshold": 3
}

What you're verifying: Traffic only routes to pods that are fully initialized and ready to handle requests.

Check 6: TLS Certificate Valid

HTTPS requires a valid, non-expired certificate. An invalid certificate breaks trust for browsers and API clients.

bash
curl -v https://your-domain.com/health 2>&1 | grep -E "SSL|certificate"

Output (Pass):

text
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* Server certificate:
*  subject: CN=your-domain.com
*  start date: Dec 30 00:00:00 2025 GMT
*  expire date: Mar 30 23:59:59 2026 GMT
*  issuer: C=US; O=Let's Encrypt; CN=R11

Output (Fail):

text
* SSL certificate problem: certificate has expired
* Closing connection

If using cert-manager, check certificate status:

bash
kubectl get certificate

Output:

text
NAME           READY   SECRET         AGE
task-api-tls   True    task-api-tls   1h

What you're verifying: Your HTTPS endpoint is secure and trusted by clients.
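Beyond "valid right now," it's worth knowing how long the certificate has left. The date arithmetic is simple once you have epoch timestamps; obtaining the expiry epoch from a live endpoint is platform-dependent (one commonly used pipeline is sketched in the comment below — treat it as an assumption and adapt it to your `date` implementation):

```shell
# Sketch: days remaining, given two epoch timestamps.
# In practice you might obtain the expiry epoch with something like
#   openssl s_client -connect your-domain.com:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate
# and convert the date to epoch seconds (assumed pipeline; adjust per platform).
days_until_expiry() {  # args: expiry_epoch now_epoch
  echo $(( ($1 - $2) / 86400 ))
}

# Example with fixed timestamps: expiry 90 days from "now".
NOW=1767052800
EXPIRY=$(( NOW + 90 * 86400 ))
days_until_expiry "$EXPIRY" "$NOW"   # 90
```

A common practice is to alert when fewer than ~30 days remain, well inside Let's Encrypt's 90-day lifetime, so a broken cert-manager renewal is caught before expiry.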

Check 7: Secrets Not in Environment Variables

Sensitive values should never appear in plain text when describing pods.

bash
kubectl describe pod -l app=task-api | grep -E "(OPENAI|API_KEY|PASSWORD|SECRET)"

Output (Pass):

text
OPENAI_API_KEY: <set to the key 'openai-api-key' in secret 'task-api-secrets'> Optional: false

(The variable name can still match the grep, but the value must appear as a Secret reference, never as a literal.)

Output (Fail):

Specification
OPENAI_API_KEY: sk-proj-abc123def456...

If secrets appear in plain text, refactor to use Kubernetes Secrets:

yaml
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: task-api-secrets
        key: openai-api-key

What you're verifying: Sensitive values aren't exposed in logs, kubectl output, or memory dumps.
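The distinction is visible in the shape of each environment line: with secretKeyRef, describe typically renders the value as a reference rather than the literal. A hypothetical line-classifier (run here on canned strings, not a live cluster) makes the check mechanical:

```shell
# Hypothetical audit helper: given one "NAME: value" line from
# `kubectl describe pod`, flag values not sourced from a Secret.
# With secretKeyRef, the value renders as "<set to the key ...>".
audit_line() {
  case "$1" in
    *"<set to the key"*) echo "PASS (secretKeyRef)" ;;
    *)                   echo "FAIL (literal value)" ;;
  esac
}

audit_line "OPENAI_API_KEY: <set to the key openai-api-key in secret task-api-secrets>"
audit_line "OPENAI_API_KEY: sk-proj-abc123..."
```

The first call prints PASS, the second FAIL; piping real describe output through a loop like this catches a leaked literal before an auditor does.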

Check 8: Pod Disruption Budget Exists

PodDisruptionBudgets (PDBs) prevent Kubernetes from terminating too many pods during node maintenance.

bash
kubectl get pdb

Output (Pass):

text
NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
task-api-pdb   1               N/A               1                     1h

Output (Fail):

text
No resources found in default namespace.

Create a PDB if missing:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: task-api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: task-api

What you're verifying: Your service remains available during cluster upgrades and node maintenance.
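The ALLOWED DISRUPTIONS column in the output above is just the gap between healthy replicas and minAvailable, which you can sanity-check yourself:

```shell
# Sanity check (illustrative numbers matching the manifests above):
# voluntary disruptions allowed = healthy replicas - minAvailable.
REPLICAS=2
MIN_AVAILABLE=1

ALLOWED=$(( REPLICAS - MIN_AVAILABLE ))
echo "allowed disruptions: $ALLOWED"
```

With 2 replicas and minAvailable: 1, exactly one pod can be evicted at a time during node drains. Note the corollary: if replicas equals minAvailable, allowed disruptions is 0 and node maintenance can stall.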

Check 9: HPA Configured (If Needed)

HorizontalPodAutoscaler (HPA) scales pods based on CPU or memory usage.

bash
kubectl get hpa

Output (Pass):

text
NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
task-api   Deployment/task-api   45%/80%   2         10        2          1h

Output (Fail for traffic-receiving services):

text
No resources found in default namespace.

For services expecting variable traffic, create an HPA:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

What you're verifying: Your service scales automatically under load instead of becoming unresponsive.
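The HPA's scaling decision follows a documented formula: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A quick worked example with illustrative numbers (120% observed utilization is assumed for the demonstration):

```shell
# The HPA scaling rule (from the Kubernetes HPA documentation):
#   desired = ceil(current_replicas * current_utilization / target_utilization)
# Integer ceiling via (a + b - 1) / b.
CURRENT_REPLICAS=2
CURRENT_UTIL=120    # observed CPU utilization, percent (assumed)
TARGET_UTIL=80      # averageUtilization from the HPA spec above

DESIRED=$(( (CURRENT_REPLICAS * CURRENT_UTIL + TARGET_UTIL - 1) / TARGET_UTIL ))
echo "desired replicas: $DESIRED"   # ceil(2 * 120 / 80) = 3
```

Running this math by hand before a load test tells you whether maxReplicas: 10 leaves enough headroom for your worst expected spike.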

Check 10: Cost Estimate Documented

Production readiness includes knowing what you're paying.

Access your cloud provider's dashboard:

  • DigitalOcean: cloud.digitalocean.com > Billing
  • Hetzner: console.hetzner.cloud > Cloud > Servers > monthly costs
  • AWS: Cost Explorer
  • GCP: Billing dashboard

Document:

  • Current monthly cost
  • Cost per component (nodes, load balancer, storage)
  • Projected cost at 2x scale

What you're verifying: No surprises on your cloud bill.
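The worksheet itself is simple arithmetic. The prices below are made up for illustration; substitute the figures from your provider's dashboard:

```shell
# Illustrative cost worksheet (all prices assumed, in USD/month):
NODE_COST=24        # per node
NODE_COUNT=2
LB_COST=12          # load balancer
STORAGE_COST=5      # volumes

MONTHLY=$(( NODE_COST * NODE_COUNT + LB_COST + STORAGE_COST ))
# At 2x scale, node count roughly doubles; the load balancer
# and baseline storage usually do not.
PROJECTED_2X=$(( NODE_COST * NODE_COUNT * 2 + LB_COST + STORAGE_COST ))
echo "current: \$$MONTHLY/mo, at 2x scale: ~\$$PROJECTED_2X/mo"
```

Keeping this in a script next to your manifests means the estimate gets updated when the node pool does.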

Common Failures and Fixes

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Health endpoint returns 503 | Pod not ready | Check readiness probe, pod logs |
| curl returns 000 | DNS/Ingress misconfigured | Verify DNS propagation, Ingress rules |
| Pods keep restarting | Liveness probe failing | Increase initialDelaySeconds, check app startup |
| Deployment stuck at 0/2 | Image pull failed | Check image name, pull secret |
| HPA shows <unknown> targets | Metrics server missing | Install metrics-server on cluster |
| Certificate shows "Not Ready" | cert-manager challenge failing | Check Ingress, DNS, cert-manager logs |

Debugging Pod Restarts

bash
kubectl describe pod -l app=task-api | grep -A 10 "Last State:"

Output:

text
Last State:  Terminated
  Reason:    Error
  Exit Code: 1
  Started:   Mon, 30 Dec 2025 10:00:00 +0000
  Finished:  Mon, 30 Dec 2025 10:00:05 +0000

Check logs for the crash reason:

bash
kubectl logs -l app=task-api --previous

Debugging Image Pull Failures

bash
kubectl describe pod -l app=task-api | grep -A 5 "Events:"

Output (Fail):

text
Events:
  Warning  Failed  1m  kubelet  Failed to pull image "ghcr.io/myorg/task-api:v1.0.0": unauthorized

Fix by creating or updating image pull secret:

bash
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_PAT
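Creating the secret alone is not enough: the pod spec must reference it, or the kubelet will keep pulling anonymously. A minimal sketch of the Deployment fragment, assuming the secret name "ghcr-secret" from the command above:

```yaml
# Deployment excerpt (sketch): reference the pull secret in the pod spec.
spec:
  template:
    spec:
      imagePullSecrets:
        - name: ghcr-secret
```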

Running the Full Checklist

Here's a script that runs all checks:

bash
#!/bin/bash
# production-checklist.sh
DOMAIN="your-domain.com"
DEPLOY="task-api"

echo "=== Production Readiness Checklist ==="

echo -n " 1. Health endpoint: "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://$DOMAIN/health)
[ "$STATUS" == "200" ] && echo "PASS (HTTP $STATUS)" || echo "FAIL (HTTP $STATUS)"

echo -n " 2. Resource limits: "
kubectl describe pod -l app=$DEPLOY | grep -q "Limits:" && echo "PASS" || echo "FAIL"

echo -n " 3. Replicas >= 2: "
REPLICAS=$(kubectl get deploy $DEPLOY -o jsonpath='{.status.readyReplicas}')
[ "$REPLICAS" -ge 2 ] && echo "PASS ($REPLICAS replicas)" || echo "FAIL ($REPLICAS replica)"

echo -n " 4. Liveness probe: "
kubectl get deploy $DEPLOY -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}' | grep -q "httpGet" && echo "PASS" || echo "FAIL"

echo -n " 5. Readiness probe: "
kubectl get deploy $DEPLOY -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}' | grep -q "httpGet" && echo "PASS" || echo "FAIL"

echo -n " 6. TLS certificate: "
curl -v https://$DOMAIN/health 2>&1 | grep -q "SSL certificate verify ok" && echo "PASS" || echo "CHECK MANUALLY"

echo -n " 7. Secrets in env: "
kubectl describe pod -l app=$DEPLOY | grep -qE "(API_KEY|PASSWORD|SECRET).*=" && echo "FAIL (secrets visible)" || echo "PASS"

echo -n " 8. Pod disruption budget: "
kubectl get pdb | grep -q $DEPLOY && echo "PASS" || echo "FAIL (no PDB)"

echo -n " 9. HPA configured: "
kubectl get hpa | grep -q $DEPLOY && echo "PASS" || echo "N/A (check if needed)"

echo " 10. Cost estimate: CHECK PROVIDER DASHBOARD"
echo "=== Checklist Complete ==="

Output:

text
=== Production Readiness Checklist ===
 1. Health endpoint: PASS (HTTP 200)
 2. Resource limits: PASS
 3. Replicas >= 2: PASS (2 replicas)
 4. Liveness probe: PASS
 5. Readiness probe: PASS
 6. TLS certificate: PASS
 7. Secrets in env: PASS
 8. Pod disruption budget: PASS
 9. HPA configured: PASS
 10. Cost estimate: CHECK PROVIDER DASHBOARD
=== Checklist Complete ===

Try With AI

Use your AI companion to verify your production deployment collaboratively.

Prompt 1: Checklist Review

text
I'm running a production checklist on my Kubernetes deployment. Here's the output from kubectl describe pod for my task-api:

[paste your kubectl describe pod output]

Review this against production best practices. What's configured correctly? What's missing? For anything missing, show me the exact YAML to add.

What you're learning: Pattern recognition—AI helps you spot configuration gaps you might overlook and generates correct fixes faster than manual YAML writing.

Prompt 2: Failure Diagnosis

text
My production checklist shows these failures:
- Health endpoint returns 503
- Pods showing 1/2 ready
- HPA shows <unknown> for current metrics

Here are my logs and events:

[paste kubectl logs and kubectl describe pod output]

Diagnose these failures in order of priority. What's the root cause of each? What's the fastest path to fixing all three?

What you're learning: Systematic debugging—AI helps you prioritize issues and identify root causes when multiple things fail simultaneously.

Prompt 3: Checklist Customization

text
The 10-point checklist I learned covers general production readiness. My Task API has specific requirements:
- It connects to a PostgreSQL database
- It calls OpenAI API for inference
- It needs to handle 100 requests/second peak

What additional checks should I add to my production checklist for these specific requirements? Create kubectl commands for each check.

What you're learning: Checklist adaptation—production checklists should be customized for your application's specific dependencies and requirements.

Safety Note

Always run verification commands on your actual deployment, not just in theory. AI can generate perfect-looking commands, but only execution against real infrastructure confirms your deployment is truly production-ready.


Reflect on Your Skill

Test your multi-cloud-deployer skill:

  • Does it include a production readiness checklist?
  • Can it generate verification commands for any deployment?
  • Does it know the common failures and fixes for each check?

If gaps exist, update your skill with the 10-point checklist pattern and debugging procedures from this chapter. A deployment skill isn't complete without verification capability.