The Kubernetes First-Aid Kit: Diagnosing Production Failures


© 2026 Muhammad Usman Akbar. All rights reserved.

Resource Management and Debugging

Your Kubernetes cluster is running Pods. Everything works perfectly in development. Then you deploy to production.

Your Pod crashes immediately. Or it stays Pending forever. Or it consumes all memory and gets evicted. You don't know why—you just see error states and no explanation.

This lesson teaches you to read what the cluster is trying to tell you. Kubernetes provides signals about Pod failures: status fields, events, logs, and resource constraints. Learning to interpret these signals is the difference between a 5-minute fix and hours of frustration.


Concept 1: Resource Requests and Limits

Before diving into debugging, you need to understand how Kubernetes allocates resources.

The Mental Model: Requests vs Limits

Think of resource management like renting an apartment:

| Concept | Apartment Analogy | Kubernetes Reality |
|---|---|---|
| Request | "I need at least 2 bedrooms": the landlord won't rent to you if the unit has fewer. | Your guaranteed minimum. The node must have at least this much free for the Pod to be scheduled. |
| Limit | "My apartment has at most 3 bedrooms": you cannot use more than this. | The maximum allowed. If the Pod tries to use more, it gets throttled (CPU) or OOM-killed (memory). |

In Kubernetes:

```yaml
resources:
  requests:
    memory: "256Mi"   # Guaranteed minimum
    cpu: "100m"       # Used for scheduling decisions
  limits:
    memory: "512Mi"   # Maximum allowed
    cpu: "500m"       # Can't exceed this
```

Key Principle: A Pod cannot be scheduled on a node unless that node has at least the REQUESTED amount of free resources. Limits prevent a Pod from monopolizing node resources.

CPU and Memory Units

| Resource | Unit | Description |
|---|---|---|
| CPU | 1000m | 1 CPU core |
| CPU | 100m | 0.1 CPU core (100 millicores) |
| CPU | 0.5 | Half a CPU core (equivalent to 500m) |
| Memory | 1Mi | 1 mebibyte (1,048,576 bytes) |
| Memory | 1Gi | 1 gibibyte (~1.07 billion bytes) |
| Memory | 256Mi | Typical for small services |

Prefer the binary suffixes Mi and Gi over the decimal M and G in Kubernetes manifests. They are not interchangeable: 1Mi is 1,048,576 bytes, while 1M is 1,000,000 bytes.
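The gap between the two unit systems is easy to underestimate. A quick check with plain Python arithmetic (this is just unit math, not a Kubernetes API):

```python
# Binary (Mi/Gi) vs decimal (M/G) memory units
MI = 1024 ** 2          # 1 mebibyte = 1,048,576 bytes
MB = 1000 ** 2          # 1 megabyte = 1,000,000 bytes

request_mi = 256 * MI   # what "256Mi" means to Kubernetes
request_mb = 256 * MB   # what "256M" means

print(request_mi)                # 268435456
print(request_mb)                # 256000000
print(request_mi - request_mb)   # 12435456 bytes (~12 MB of difference)
```

For a 256Mi request, the difference is about 12 MB, enough to matter when limits are tight.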


Concept 2: Quality of Service (QoS) Classes

Kubernetes prioritizes which Pods to evict when a node runs out of resources. This priority is determined by the Pod's QoS class.

The Three QoS Classes

Guaranteed (Highest Priority)

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "100m"
```

When requests equal limits for both CPU and memory in every container, the Pod is Guaranteed. Kubernetes evicts Guaranteed Pods LAST. Use this for critical workloads.

Burstable (Medium Priority)

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```

When requests < limits, the Pod is Burstable. Kubernetes evicts Burstable Pods second. Use this for normal workloads (most agents).

BestEffort (Lowest Priority)

```yaml
resources: {}   # No requests or limits
```

When a Pod has no requests or limits, it's BestEffort. Kubernetes evicts these FIRST. Only use this for non-critical batch jobs.
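The three rules above can be sketched as a small function. This is an illustration of the classification logic, not the actual kubelet code, and it simplifies by considering a single container:

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Approximate Kubernetes QoS classification for a single-container Pod.

    requests/limits map resource names ("cpu", "memory") to quantity strings.
    Simplification: the real rules also default requests to limits when only
    limits are set, and apply across all containers in the Pod.
    """
    if not requests and not limits:
        return "BestEffort"            # nothing specified: evicted first
    if requests == limits and "cpu" in requests and "memory" in requests:
        return "Guaranteed"            # requests equal limits for CPU and memory
    return "Burstable"                 # anything in between

print(qos_class({}, {}))                               # BestEffort
print(qos_class({"cpu": "100m", "memory": "256Mi"},
                {"cpu": "100m", "memory": "256Mi"}))   # Guaranteed
print(qos_class({"cpu": "100m", "memory": "256Mi"},
                {"cpu": "500m", "memory": "512Mi"}))   # Burstable
```

You can verify the real class of a running Pod with `kubectl describe pod <name>` and looking for the `QoS Class:` field.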


Concept 3: Common Pod Failure States

CrashLoopBackOff

| Check | Details |
|---|---|
| What you see | `STATUS: CrashLoopBackOff`, `RESTARTS: 5+` |
| What it means | The container crashes soon after starting, and Kubernetes restarts it in a repeating cycle with increasing back-off delays. |
| Root causes | App bug, missing env var, missing config, port collision, OOM. |
| Fix pattern | 1. `kubectl logs`, 2. `kubectl describe` (check for OOM), 3. Fix the code or manifest. |

ImagePullBackOff

| Check | Details |
|---|---|
| What you see | `STATUS: ImagePullBackOff` |
| What it means | Kubernetes cannot download the container image. |
| Root causes | Image missing, wrong name or tag, missing registry credentials, network issues. |
| Fix pattern | 1. `kubectl describe` (read the pull error), 2. Verify the image name and tag, 3. `docker pull` the image locally. |

Pending

| Check | Details |
|---|---|
| What you see | `STATUS: Pending` |
| What it means | The scheduler cannot find a node with enough free resources. |
| Root causes | Requests too high, node affinity conflict, waiting for a volume. |
| Fix pattern | 1. `kubectl describe` (look for FailedScheduling), 2. Reduce requests, 3. Add nodes. |

OOMKilled

| Check | Details |
|---|---|
| What you see | `Reason: OOMKilled`, `Exit Code: 137` |
| What it means | The container exceeded its memory limit and was terminated. |
| Root causes | Memory leak, limit set too low, large dataset processed in memory. |
| Fix pattern | 1. Increase the limit, 2. Profile the app for leaks, 3. Process data in chunks. |

Concept 4: The Debugging Pattern

Signal 1: Pod Status

```bash
kubectl get pods
```

Output:

```text
NAME            READY   STATUS             RESTARTS   AGE
nginx-good      1/1     Running            0          5m
nginx-crash     0/1     CrashLoopBackOff   3          2m
nginx-pending   0/1     Pending            0          1m
```

Signal 2: Events

```bash
kubectl describe pod <pod-name>
```

Output (relevant section):

```text
Events:
  Type     Reason   Age    Message
  ----     ------   ---    -------
  Normal   Created  2m20s  Created container nginx
  Normal   Started  2m19s  Started container nginx
  Warning  BackOff  2m10s  Back-off restarting failed container
```

Signal 3: Logs

```bash
kubectl logs <pod-name>
```

Output:

```text
Traceback (most recent call last):
  File "app.py", line 5, in <module>
    connect_to_db()
Exception: Database not found
```

Signal 4: Interactive Access

```bash
kubectl exec -it <pod-name> -- /bin/bash
```

Inside the Pod:

```bash
# Investigation commands
env                     # Check environment variables
ls -la                  # Check the filesystem
ps aux                  # Check running processes
curl localhost:8080     # Test internal services
```

Putting It Together: The Debugging Workflow

| Step | Action | Command |
|---|---|---|
| 1 | Get status | `kubectl get pods` |
| 2 | Describe | `kubectl describe pod <name>` |
| 3 | Check logs | `kubectl logs <name>` |
| 4 | Investigate | `kubectl exec -it <name> -- /bin/bash` |
| 5 | Fix and apply | Edit the manifest, then `kubectl apply -f ...` |
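Steps 1 and 2 can be partially automated. The sketch below parses the text output of `kubectl get pods` and flags Pods worth describing. The sample output is hard-coded for illustration; in practice you would feed it the output of `kubectl get pods --no-headers`, and the column layout is assumed to match kubectl's default table:

```python
SAMPLE = """\
nginx-good      1/1   Running            0   5m
nginx-crash     0/1   CrashLoopBackOff   3   2m
nginx-pending   0/1   Pending            0   1m
"""

def unhealthy_pods(get_pods_output: str) -> list[tuple[str, str]]:
    """Return (name, status) pairs for Pods not in a healthy state."""
    bad = []
    for line in get_pods_output.strip().splitlines():
        # Default kubectl columns: NAME READY STATUS RESTARTS AGE
        name, ready, status, *_ = line.split()
        if status not in ("Running", "Completed"):
            bad.append((name, status))
    return bad

# Print the next command to run for each problem Pod
for name, status in unhealthy_pods(SAMPLE):
    print(f"kubectl describe pod {name}   # status: {status}")
```

This only covers the first two signals; logs and interactive access still need a human (or an AI assistant) reading the output.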

Practice 1: Diagnose CrashLoopBackOff

Manifest (crash-loop.yaml):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crash-loop-app
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["python", "-c"]
      args:
        - |
          import os
          db_url = os.environ['DATABASE_URL']
          print(f"Connecting to {db_url}")
      resources:
        requests:
          memory: "64Mi"
          cpu: "50m"
        limits:
          memory: "128Mi"
          cpu: "100m"
  restartPolicy: Always
```

Troubleshooting Steps:

```bash
kubectl apply -f crash-loop.yaml
kubectl get pods              # STATUS: CrashLoopBackOff
kubectl logs crash-loop-app   # KeyError: 'DATABASE_URL'
```

Fix: Add the environment variable to the manifest and re-apply.
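For example, the fix could look like the fragment below (the connection string shown is a placeholder; use your real database URL, ideally via a Secret as covered in the previous chapter):

```yaml
containers:
  - name: app
    image: python:3.11-slim
    env:
      - name: DATABASE_URL
        value: "postgres://db.example.internal:5432/app"   # placeholder value
```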


Practice 2: Diagnose Pending Pod

Manifest (pending-pod.yaml):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["sleep", "3600"]
      resources:
        requests:
          memory: "100Gi"   # Way too high
          cpu: "50"
        limits:
          memory: "100Gi"
          cpu: "50"
```

Troubleshooting Steps:

```bash
kubectl apply -f pending-pod.yaml
kubectl describe pod memory-hog
# Message: 0/1 nodes are available: 1 Insufficient memory.
```

Fix: Reduce memory/CPU requests to reasonable values.
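A corrected resources block might look like this (the exact numbers depend on your workload and node sizes; these are placeholder values in line with the Burstable pattern from Concept 2):

```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```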


Resource Management Best Practices

| Best Practice | Description |
|---|---|
| Set requests and limits | Define them for every production Pod. |
| Match QoS class to criticality | Set requests equal to limits for critical workloads (Guaranteed class). |
| Start conservative | Begin with low requests, then raise them based on `kubectl top` data. |
| Leave headroom | Set requests to baseline usage plus roughly 20%. |
| Enforce quotas | Apply a ResourceQuota per namespace to prevent resource exhaustion. |
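A minimal ResourceQuota sketch for the last practice. The namespace name and the numbers are placeholders; size them to your cluster:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: agents        # placeholder namespace
spec:
  hard:
    requests.cpu: "4"      # total CPU all Pods in the namespace may request
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

With a quota in place, Pods in the namespace must declare requests and limits, which also rules out the BestEffort class by accident.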

Try With AI

Collaborate with AI to troubleshoot a complex scenario.

Step 1: Deploy a broken Pod

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-app
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 8080
      resources:
        requests:
          memory: "64Mi"
    - name: sidecar
      image: curlimages/curl:latest
      command: ["sleep", "3600"]
```

Step 2: Ask AI for Analysis

Prompt AI: "I've deployed a multi-container Pod. Here is the `kubectl describe` output: [paste output]. What QoS class does this Pod get, and why? How should I set resources on the sidecar, which currently has none, so the Pod can qualify for a better class?"


Reflect on Your Skill

You built a kubernetes-deployment skill in Chapter 1. Test and improve it based on what you learned.

Identify Gaps

  • Does my skill include resource requests/limits and QoS classes?
  • Does it explain the status → describe → logs → exec workflow?
  • Does it cover failure states like OOMKilled and Pending?

Source: https://www.muhammadusmanakbar.com/book