USMAN’S INSIGHTS
AI ARCHITECT
Your Deployment Is Running But It Isn't Ready: The 10-Point Production Checklist That Catches What You Miss
© 2026 Muhammad Usman Akbar. All rights reserved.


Production Checklist & Verification

You've deployed your Task API to a cloud Kubernetes cluster. But deployed doesn't mean production-ready. A deployment can run without being resilient, observable, or secure.

This chapter gives you a systematic approach: a 10-point production readiness checklist that separates "it works on my cluster" from "it's ready for real traffic."

The pattern you'll learn here applies to any Kubernetes deployment—not just DigitalOcean, not just Task API. Once you internalize this checklist, you can verify any deployment on any cloud.

Why Checklists Matter in Production

Airplane pilots use pre-flight checklists despite thousands of hours of experience. Surgeons use surgical checklists despite years of training. The reason? Humans forget things under pressure, and production deployments happen under pressure.

A deployment might fail silently in ways that only manifest under load:

  • Missing health probes mean Kubernetes can't restart failing pods
  • Missing resource limits mean one pod can starve the others on its node
  • A missing HPA means traffic spikes cause outages instead of scale-ups

The checklist catches these issues before customers do.

The 10-Point Production Readiness Checklist

| # | Check | Command | Pass Criteria |
|---|-------|---------|---------------|
| 1 | Health endpoint responds | curl https://domain/health | HTTP 200 |
| 2 | Resource limits set | kubectl describe pod <pod> | Limits visible |
| 3 | Replicas >= 2 | kubectl get deploy | READY shows 2+ |
| 4 | Liveness probe configured | kubectl get deploy -o yaml | livenessProbe present |
| 5 | Readiness probe configured | kubectl get deploy -o yaml | readinessProbe present |
| 6 | TLS certificate valid | curl -v https://domain | Certificate OK |
| 7 | Secrets not in env vars | kubectl describe pod | No sensitive values |
| 8 | Pod disruption budget | kubectl get pdb | PDB exists |
| 9 | HPA configured (if needed) | kubectl get hpa | HPA exists |
| 10 | Cost estimate documented | Provider dashboard | Monthly cost known |

Let's verify each item systematically.

Check 1: Health Endpoint Responds

The health endpoint is your deployment's vital sign. If it doesn't respond, nothing else matters.

bash
curl -s -o /dev/null -w "%{http_code}" https://your-domain.com/health

Output (Pass):

text
200

Output (Fail):

text
000

A 000 response typically means DNS isn't resolving or the service isn't reachable. Check your Ingress and DNS configuration.

For more detail:

bash
curl -v https://your-domain.com/health

Output (Pass):

text
< HTTP/2 200
< content-type: application/json
{"status": "healthy", "database": "connected"}

What you're verifying: The entire path works—DNS resolves, Load Balancer routes, Ingress matches, Service forwards, Pod responds.
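When the health check fails, the status code itself narrows the diagnosis. As a rough sketch (a hypothetical helper, not part of the Task API), you can map the code captured by `curl -w "%{http_code}"` to its most likely cause:

```shell
# Hypothetical helper: map the three-digit code from
# `curl -s -o /dev/null -w "%{http_code}"` to a likely cause.
diagnose_health() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "unreachable: check DNS and Ingress" ;;
    503) echo "pod not ready: check readiness probe and pod logs" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

diagnose_health 200   # healthy
diagnose_health 000   # unreachable: check DNS and Ingress
```

Wiring a helper like this into a deploy pipeline turns a cryptic `000` into an actionable hint.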

Check 2: Resource Limits Set

Without resource limits, a single misbehaving pod can consume all node resources, crashing other workloads.

bash
kubectl describe pod -l app=task-api | grep -A 5 "Limits:"

Output (Pass):

text
Limits:
  cpu:     500m
  memory:  512Mi
Requests:
  cpu:     100m
  memory:  256Mi

Output (Fail):

text
Limits:   <none>
Requests: <none>

If you see <none>, add resource specifications to your deployment:

yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

What you're verifying: Kubernetes knows how much CPU and memory your pods need, enabling proper scheduling and preventing resource starvation.
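Requests also determine how many pods the scheduler can pack onto a node. A back-of-envelope sketch, using the request values above and illustrative node capacity (the 2000m/4096Mi figures are assumptions, not your cluster's):

```shell
# Back-of-envelope scheduling math (illustrative numbers):
# how many pods fit on one node, given allocatable capacity
# and the per-pod requests from the manifest above.
NODE_CPU_M=2000        # allocatable CPU in millicores (assumed)
NODE_MEM_MI=4096       # allocatable memory in Mi (assumed)
POD_CPU_M=100          # per-pod CPU request
POD_MEM_MI=256         # per-pod memory request

BY_CPU=$(( NODE_CPU_M / POD_CPU_M ))
BY_MEM=$(( NODE_MEM_MI / POD_MEM_MI ))

# The scheduler is constrained by whichever resource runs out first.
MAX_PODS=$(( BY_CPU < BY_MEM ? BY_CPU : BY_MEM ))
echo "fits by CPU: $BY_CPU, by memory: $BY_MEM, max pods: $MAX_PODS"
```

Here memory is the binding constraint: 16 pods fit, even though CPU alone would allow 20.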

Check 3: Replicas >= 2

A single replica means zero redundancy. If that pod crashes or its node goes down, your service is unavailable.

bash
kubectl get deploy task-api

Output (Pass):

text
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
task-api   2/2     2            2           1h

Output (Fail):

text
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
task-api   1/1     1            1           1h

Scale up if needed:

bash
kubectl scale deploy task-api --replicas=2

What you're verifying: Your service survives the loss of any single pod or node.

Check 4: Liveness Probe Configured

Liveness probes tell Kubernetes when to restart a stuck container. Without them, a deadlocked process runs forever.

bash
kubectl get deploy task-api -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}' | jq .

Output (Pass):

json
{
  "httpGet": {
    "path": "/health",
    "port": 8000,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 10,
  "periodSeconds": 30,
  "timeoutSeconds": 5,
  "failureThreshold": 3
}

Output (Fail):

text
null

If missing, add to your deployment spec:

yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3

What you're verifying: Kubernetes will automatically restart containers that become unresponsive.
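The probe values also bound how long a hung container survives before the restart. A rough worst-case estimate with the numbers above (a sketch; real timing also includes scheduling jitter):

```shell
# Rough worst-case time a hung container runs before Kubernetes
# restarts it, using the probe values configured above:
# failureThreshold consecutive probes must fail, and each failing
# probe can take up to periodSeconds + timeoutSeconds.
PERIOD=30
TIMEOUT=5
FAILURE_THRESHOLD=3

WORST_CASE=$(( (PERIOD + TIMEOUT) * FAILURE_THRESHOLD ))
echo "up to ~${WORST_CASE}s of hang before a restart kicks in"
```

If ~105 seconds of unresponsiveness is too long for your service, tighten periodSeconds or failureThreshold, but leave enough slack that slow-but-healthy requests don't trigger restarts.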

Check 5: Readiness Probe Configured

Readiness probes tell Kubernetes when a pod is ready to receive traffic. Without them, traffic routes to pods still initializing.

bash
kubectl get deploy task-api -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}' | jq .

Output (Pass):

json
{
  "httpGet": {
    "path": "/health",
    "port": 8000,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 5,
  "periodSeconds": 10,
  "timeoutSeconds": 3,
  "successThreshold": 1,
  "failureThreshold": 3
}

What you're verifying: Traffic only routes to pods that are fully initialized and ready to handle requests.

Check 6: TLS Certificate Valid

HTTPS requires a valid, non-expired certificate. An invalid certificate breaks trust for browsers and API clients.

bash
curl -v https://your-domain.com/health 2>&1 | grep -E "SSL|certificate"

Output (Pass):

text
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* Server certificate:
*  subject: CN=your-domain.com
*  start date: Dec 30 00:00:00 2025 GMT
*  expire date: Mar 30 23:59:59 2026 GMT
*  issuer: C=US; O=Let's Encrypt; CN=R11

Output (Fail):

text
* SSL certificate problem: certificate has expired
* Closing connection

If using cert-manager, check certificate status:

bash
kubectl get certificate

Output:

text
NAME           READY   SECRET         AGE
task-api-tls   True    task-api-tls   1h

What you're verifying: Your HTTPS endpoint is secure and trusted by clients.
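Beyond "valid right now," it's worth knowing how long the certificate has left. The date arithmetic is simple once you have epoch timestamps; obtaining the expiry epoch from a live endpoint is platform-dependent (one commonly used pipeline is sketched in the comment below — treat it as an assumption and adapt it to your `date` implementation):

```shell
# Sketch: days remaining, given two epoch timestamps.
# In practice you might obtain the expiry epoch with something like
#   openssl s_client -connect your-domain.com:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate
# and convert the date to epoch seconds (assumed pipeline; adjust per platform).
days_until_expiry() {  # args: expiry_epoch now_epoch
  echo $(( ($1 - $2) / 86400 ))
}

# Example with fixed timestamps: expiry 90 days from "now".
NOW=1767052800
EXPIRY=$(( NOW + 90 * 86400 ))
days_until_expiry "$EXPIRY" "$NOW"   # 90
```

A common practice is to alert when fewer than ~30 days remain, well inside Let's Encrypt's 90-day lifetime, so a broken cert-manager renewal is caught before expiry.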

Check 7: Secrets Not in Environment Variables

Sensitive values should never appear in plain text when describing pods.

bash
kubectl describe pod -l app=task-api | grep -E "(OPENAI|API_KEY|PASSWORD|SECRET)"

Output (Pass):

text
OPENAI_API_KEY: <set to the key 'openai-api-key' in secret 'task-api-secrets'> Optional: false

(The variable name can still match the grep, but the value must appear as a Secret reference, never as a literal.)

Output (Fail):

Specification
OPENAI_API_KEY: sk-proj-abc123def456...

If secrets appear in plain text, refactor to use Kubernetes Secrets:

yaml
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: task-api-secrets
        key: openai-api-key

What you're verifying: Sensitive values aren't exposed in logs, kubectl output, or memory dumps.
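The distinction is visible in the shape of each environment line: with secretKeyRef, describe typically renders the value as a reference rather than the literal. A hypothetical line-classifier (run here on canned strings, not a live cluster) makes the check mechanical:

```shell
# Hypothetical audit helper: given one "NAME: value" line from
# `kubectl describe pod`, flag values not sourced from a Secret.
# With secretKeyRef, the value renders as "<set to the key ...>".
audit_line() {
  case "$1" in
    *"<set to the key"*) echo "PASS (secretKeyRef)" ;;
    *)                   echo "FAIL (literal value)" ;;
  esac
}

audit_line "OPENAI_API_KEY: <set to the key openai-api-key in secret task-api-secrets>"
audit_line "OPENAI_API_KEY: sk-proj-abc123..."
```

The first call prints PASS, the second FAIL; piping real describe output through a loop like this catches a leaked literal before an auditor does.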

Check 8: Pod Disruption Budget Exists

PodDisruptionBudgets (PDBs) prevent Kubernetes from terminating too many pods during node maintenance.

bash
kubectl get pdb

Output (Pass):

text
NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
task-api-pdb   1               N/A               1                     1h

Output (Fail):

text
No resources found in default namespace.

Create a PDB if missing:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: task-api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: task-api

What you're verifying: Your service remains available during cluster upgrades and node maintenance.
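The ALLOWED DISRUPTIONS column in the output above is just the gap between healthy replicas and minAvailable, which you can sanity-check yourself:

```shell
# Sanity check (illustrative numbers matching the manifests above):
# voluntary disruptions allowed = healthy replicas - minAvailable.
REPLICAS=2
MIN_AVAILABLE=1

ALLOWED=$(( REPLICAS - MIN_AVAILABLE ))
echo "allowed disruptions: $ALLOWED"
```

With 2 replicas and minAvailable: 1, exactly one pod can be evicted at a time during node drains. Note the corollary: if replicas equals minAvailable, allowed disruptions is 0 and node maintenance can stall.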

Check 9: HPA Configured (If Needed)

HorizontalPodAutoscaler (HPA) scales pods based on CPU or memory usage.

bash
kubectl get hpa

Output (Pass):

text
NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
task-api   Deployment/task-api   45%/80%   2         10        2          1h

Output (Fail for traffic-receiving services):

text
No resources found in default namespace.

For services expecting variable traffic, create an HPA:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

What you're verifying: Your service scales automatically under load instead of becoming unresponsive.
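The HPA's scaling decision follows a documented formula: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A quick worked example with illustrative numbers (120% observed utilization is assumed for the demonstration):

```shell
# The HPA scaling rule (from the Kubernetes HPA documentation):
#   desired = ceil(current_replicas * current_utilization / target_utilization)
# Integer ceiling via (a + b - 1) / b.
CURRENT_REPLICAS=2
CURRENT_UTIL=120    # observed CPU utilization, percent (assumed)
TARGET_UTIL=80      # averageUtilization from the HPA spec above

DESIRED=$(( (CURRENT_REPLICAS * CURRENT_UTIL + TARGET_UTIL - 1) / TARGET_UTIL ))
echo "desired replicas: $DESIRED"   # ceil(2 * 120 / 80) = 3
```

Running this math by hand before a load test tells you whether maxReplicas: 10 leaves enough headroom for your worst expected spike.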

Check 10: Cost Estimate Documented

Production readiness includes knowing what you're paying.

Access your cloud provider's dashboard:

  • DigitalOcean: cloud.digitalocean.com > Billing
  • Hetzner: console.hetzner.cloud > Cloud > Servers > monthly costs
  • AWS: Cost Explorer
  • GCP: Billing dashboard

Document:

  • Current monthly cost
  • Cost per component (nodes, load balancer, storage)
  • Projected cost at 2x scale

What you're verifying: No surprises on your cloud bill.
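The worksheet itself is simple arithmetic. The prices below are made up for illustration; substitute the figures from your provider's dashboard:

```shell
# Illustrative cost worksheet (all prices assumed, in USD/month):
NODE_COST=24        # per node
NODE_COUNT=2
LB_COST=12          # load balancer
STORAGE_COST=5      # volumes

MONTHLY=$(( NODE_COST * NODE_COUNT + LB_COST + STORAGE_COST ))
# At 2x scale, node count roughly doubles; the load balancer
# and baseline storage usually do not.
PROJECTED_2X=$(( NODE_COST * NODE_COUNT * 2 + LB_COST + STORAGE_COST ))
echo "current: \$$MONTHLY/mo, at 2x scale: ~\$$PROJECTED_2X/mo"
```

Keeping this in a script next to your manifests means the estimate gets updated when the node pool does.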

Common Failures and Fixes

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Health endpoint returns 503 | Pod not ready | Check readiness probe, pod logs |
| curl returns 000 | DNS/Ingress misconfigured | Verify DNS propagation, Ingress rules |
| Pods keep restarting | Liveness probe failing | Increase initialDelaySeconds, check app startup |
| Deployment stuck at 0/2 | Image pull failed | Check image name, pull secret |
| HPA shows <unknown> targets | Metrics server missing | Install metrics-server on cluster |
| Certificate shows "Not Ready" | cert-manager challenge failing | Check Ingress, DNS, cert-manager logs |

Debugging Pod Restarts

bash
kubectl describe pod -l app=task-api | grep -A 10 "Last State:"

Output:

text
Last State:  Terminated
  Reason:    Error
  Exit Code: 1
  Started:   Mon, 30 Dec 2025 10:00:00 +0000
  Finished:  Mon, 30 Dec 2025 10:00:05 +0000

Check logs for the crash reason:

bash
kubectl logs -l app=task-api --previous

Debugging Image Pull Failures

bash
kubectl describe pod -l app=task-api | grep -A 5 "Events:"

Output (Fail):

text
Events:
  Warning  Failed  1m  kubelet  Failed to pull image "ghcr.io/myorg/task-api:v1.0.0": unauthorized

Fix by creating or updating image pull secret:

bash
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_PAT
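Creating the secret alone is not enough: the pod spec must reference it, or the kubelet will keep pulling anonymously. A minimal sketch of the Deployment fragment, assuming the secret name "ghcr-secret" from the command above:

```yaml
# Deployment excerpt (sketch): reference the pull secret in the pod spec.
spec:
  template:
    spec:
      imagePullSecrets:
        - name: ghcr-secret
```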

Running the Full Checklist

Here's a script that runs all checks:

bash
#!/bin/bash
# production-checklist.sh
DOMAIN="your-domain.com"
DEPLOY="task-api"

echo "=== Production Readiness Checklist ==="

echo -n " 1. Health endpoint: "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://$DOMAIN/health)
[ "$STATUS" == "200" ] && echo "PASS (HTTP $STATUS)" || echo "FAIL (HTTP $STATUS)"

echo -n " 2. Resource limits: "
kubectl describe pod -l app=$DEPLOY | grep -q "Limits:" && echo "PASS" || echo "FAIL"

echo -n " 3. Replicas >= 2: "
REPLICAS=$(kubectl get deploy $DEPLOY -o jsonpath='{.status.readyReplicas}')
[ "$REPLICAS" -ge 2 ] && echo "PASS ($REPLICAS replicas)" || echo "FAIL ($REPLICAS replica)"

echo -n " 4. Liveness probe: "
kubectl get deploy $DEPLOY -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}' | grep -q "httpGet" && echo "PASS" || echo "FAIL"

echo -n " 5. Readiness probe: "
kubectl get deploy $DEPLOY -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}' | grep -q "httpGet" && echo "PASS" || echo "FAIL"

echo -n " 6. TLS certificate: "
curl -v https://$DOMAIN/health 2>&1 | grep -q "SSL certificate verify ok" && echo "PASS" || echo "CHECK MANUALLY"

echo -n " 7. Secrets in env: "
kubectl describe pod -l app=$DEPLOY | grep -qE "(API_KEY|PASSWORD|SECRET).*=" && echo "FAIL (secrets visible)" || echo "PASS"

echo -n " 8. Pod disruption budget: "
kubectl get pdb | grep -q $DEPLOY && echo "PASS" || echo "FAIL (no PDB)"

echo -n " 9. HPA configured: "
kubectl get hpa | grep -q $DEPLOY && echo "PASS" || echo "N/A (check if needed)"

echo " 10. Cost estimate: CHECK PROVIDER DASHBOARD"
echo "=== Checklist Complete ==="

Output:

text
=== Production Readiness Checklist ===
 1. Health endpoint: PASS (HTTP 200)
 2. Resource limits: PASS
 3. Replicas >= 2: PASS (2 replicas)
 4. Liveness probe: PASS
 5. Readiness probe: PASS
 6. TLS certificate: PASS
 7. Secrets in env: PASS
 8. Pod disruption budget: PASS
 9. HPA configured: PASS
 10. Cost estimate: CHECK PROVIDER DASHBOARD
=== Checklist Complete ===

Try With AI

Use your AI companion to verify your production deployment collaboratively.

Prompt 1: Checklist Review

text
I'm running a production checklist on my Kubernetes deployment. Here's the output from kubectl describe pod for my task-api:

[paste your kubectl describe pod output]

Review this against production best practices. What's configured correctly? What's missing? For anything missing, show me the exact YAML to add.

What you're learning: Pattern recognition—AI helps you spot configuration gaps you might overlook and generates correct fixes faster than manual YAML writing.

Prompt 2: Failure Diagnosis

text
My production checklist shows these failures:
- Health endpoint returns 503
- Pods showing 1/2 ready
- HPA shows <unknown> for current metrics

Here are my logs and events:

[paste kubectl logs and kubectl describe pod output]

Diagnose these failures in order of priority. What's the root cause of each? What's the fastest path to fixing all three?

What you're learning: Systematic debugging—AI helps you prioritize issues and identify root causes when multiple things fail simultaneously.

Prompt 3: Checklist Customization

text
The 10-point checklist I learned covers general production readiness. My Task API has specific requirements:
- It connects to a PostgreSQL database
- It calls OpenAI API for inference
- It needs to handle 100 requests/second peak

What additional checks should I add to my production checklist for these specific requirements? Create kubectl commands for each check.

What you're learning: Checklist adaptation—production checklists should be customized for your application's specific dependencies and requirements.

Safety Note

Always run verification commands on your actual deployment, not just in theory. AI can generate perfect-looking commands, but only execution against real infrastructure confirms your deployment is truly production-ready.


Reflect on Your Skill

Test your multi-cloud-deployer skill:

  • Does it include a production readiness checklist?
  • Can it generate verification commands for any deployment?
  • Does it know the common failures and fixes for each check?

If gaps exist, update your skill with the 10-point checklist pattern and debugging procedures from this chapter. A deployment skill isn't complete without verification capability.