The Kubernetes First-Aid Kit: Diagnosing Production Failures


© 2026 Muhammad Usman Akbar. All rights reserved.

Resource Management and Debugging

Your Kubernetes cluster is running Pods. Everything works perfectly in development. Then you deploy to production.

Your Pod crashes immediately. Or it stays Pending forever. Or it consumes all memory and gets evicted. You don't know why—you just see error states and no explanation.

This lesson teaches you to read what the cluster is trying to tell you. Kubernetes provides signals about Pod failures: status fields, events, logs, and resource constraints. Learning to interpret these signals is the difference between a 5-minute fix and hours of frustration.


Concept 1: Resource Requests and Limits

Before diving into debugging, you need to understand how Kubernetes allocates resources.

The Mental Model: Requests vs Limits

Think of resource management like renting an apartment:

| Concept | Apartment Analogy | Kubernetes Reality |
|---|---|---|
| Request | "I need at least 2 bedrooms": the landlord won't rent to you if the unit has fewer. | Your guaranteed minimum. The node must have at least this much free for the Pod to be scheduled. |
| Limit | "My apartment has at most 3 bedrooms": you cannot use more than this. | The maximum allowed. If the Pod tries to use more, it gets throttled (CPU) or OOM-killed (memory). |

In Kubernetes:

```yaml
resources:
  requests:
    memory: "256Mi"   # Guaranteed minimum
    cpu: "100m"       # Used for scheduling decisions
  limits:
    memory: "512Mi"   # Maximum allowed
    cpu: "500m"       # Can't exceed this
```

Key Principle: A Pod cannot be scheduled on a node unless that node has at least the REQUESTED amount of free resources. Limits prevent a Pod from monopolizing node resources.

CPU and Memory Units

| Resource | Unit | Description |
|---|---|---|
| CPU | 1000m | 1 CPU core |
| CPU | 100m | 0.1 CPU core (100 millicores) |
| CPU | 0.5 | Half a CPU core (equivalent to 500m) |
| Memory | 1Mi | 1 mebibyte (1,048,576 bytes) |
| Memory | 1Gi | 1 gibibyte (~1.07 billion bytes) |
| Memory | 256Mi | Typical for small services |

Prefer the binary suffixes Mi and Gi over the decimal M and G in Kubernetes manifests. They are not interchangeable: 1Mi is 1,048,576 bytes, while 1M is 1,000,000 bytes.
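The gap between the two unit systems is easy to underestimate. A quick check with plain Python arithmetic (this is just unit math, not a Kubernetes API):

```python
# Binary (Mi/Gi) vs decimal (M/G) memory units
MI = 1024 ** 2          # 1 mebibyte = 1,048,576 bytes
MB = 1000 ** 2          # 1 megabyte = 1,000,000 bytes

request_mi = 256 * MI   # what "256Mi" means to Kubernetes
request_mb = 256 * MB   # what "256M" means

print(request_mi)                # 268435456
print(request_mb)                # 256000000
print(request_mi - request_mb)   # 12435456 bytes (~12 MB of difference)
```

For a 256Mi request, the difference is about 12 MB, enough to matter when limits are tight.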


Concept 2: Quality of Service (QoS) Classes

Kubernetes prioritizes which Pods to evict when a node runs out of resources. This priority is determined by the Pod's QoS class.

The Three QoS Classes

Guaranteed (Highest Priority)

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "100m"
```

When requests equal limits for both CPU and memory in every container, the Pod is Guaranteed. Kubernetes evicts Guaranteed Pods LAST. Use this for critical workloads.

Burstable (Medium Priority)

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```

When requests < limits, the Pod is Burstable. Kubernetes evicts Burstable Pods second. Use this for normal workloads (most agents).

BestEffort (Lowest Priority)

```yaml
resources: {}   # No requests or limits
```

When a Pod has no requests or limits, it's BestEffort. Kubernetes evicts these FIRST. Only use this for non-critical batch jobs.
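The three rules above can be sketched as a small function. This is an illustration of the classification logic, not the actual kubelet code, and it simplifies by considering a single container:

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Approximate Kubernetes QoS classification for a single-container Pod.

    requests/limits map resource names ("cpu", "memory") to quantity strings.
    Simplification: the real rules also default requests to limits when only
    limits are set, and apply across all containers in the Pod.
    """
    if not requests and not limits:
        return "BestEffort"            # nothing specified: evicted first
    if requests == limits and "cpu" in requests and "memory" in requests:
        return "Guaranteed"            # requests equal limits for CPU and memory
    return "Burstable"                 # anything in between

print(qos_class({}, {}))                               # BestEffort
print(qos_class({"cpu": "100m", "memory": "256Mi"},
                {"cpu": "100m", "memory": "256Mi"}))   # Guaranteed
print(qos_class({"cpu": "100m", "memory": "256Mi"},
                {"cpu": "500m", "memory": "512Mi"}))   # Burstable
```

You can verify the real class of a running Pod with `kubectl describe pod <name>` and looking for the `QoS Class:` field.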


Concept 3: Common Pod Failure States

CrashLoopBackOff

| Check | Details |
|---|---|
| What you see | `STATUS: CrashLoopBackOff`, `RESTARTS: 5+` |
| What it means | The container crashes soon after starting, and Kubernetes restarts it in a repeating cycle with increasing back-off delays. |
| Root causes | App bug, missing env var, missing config, port collision, OOM. |
| Fix pattern | 1. `kubectl logs`, 2. `kubectl describe` (check for OOM), 3. Fix the code or manifest. |

ImagePullBackOff

| Check | Details |
|---|---|
| What you see | `STATUS: ImagePullBackOff` |
| What it means | Kubernetes cannot download the container image. |
| Root causes | Image missing, wrong name or tag, missing registry credentials, network issues. |
| Fix pattern | 1. `kubectl describe` (read the pull error), 2. Verify the image name and tag, 3. `docker pull` the image locally. |

Pending

| Check | Details |
|---|---|
| What you see | `STATUS: Pending` |
| What it means | The scheduler cannot find a node with enough free resources. |
| Root causes | Requests too high, node affinity conflict, waiting for a volume. |
| Fix pattern | 1. `kubectl describe` (look for FailedScheduling), 2. Reduce requests, 3. Add nodes. |

OOMKilled

| Check | Details |
|---|---|
| What you see | `Reason: OOMKilled`, `Exit Code: 137` |
| What it means | The container exceeded its memory limit and was terminated. |
| Root causes | Memory leak, limit set too low, large dataset processed in memory. |
| Fix pattern | 1. Increase the limit, 2. Profile the app for leaks, 3. Process data in chunks. |

Concept 4: The Debugging Pattern

Signal 1: Pod Status

```bash
kubectl get pods
```

Output:

```text
NAME            READY   STATUS             RESTARTS   AGE
nginx-good      1/1     Running            0          5m
nginx-crash     0/1     CrashLoopBackOff   3          2m
nginx-pending   0/1     Pending            0          1m
```

Signal 2: Events

```bash
kubectl describe pod <pod-name>
```

Output (relevant section):

```text
Events:
  Type     Reason   Age    Message
  ----     ------   ---    -------
  Normal   Created  2m20s  Created container nginx
  Normal   Started  2m19s  Started container nginx
  Warning  BackOff  2m10s  Back-off restarting failed container
```

Signal 3: Logs

```bash
kubectl logs <pod-name>
```

Output:

```text
Traceback (most recent call last):
  File "app.py", line 5, in <module>
    connect_to_db()
Exception: Database not found
```

Signal 4: Interactive Access

```bash
kubectl exec -it <pod-name> -- /bin/bash
```

Inside the Pod:

```bash
# Investigation commands
env                     # Check environment variables
ls -la                  # Check the filesystem
ps aux                  # Check running processes
curl localhost:8080     # Test internal services
```

Putting It Together: The Debugging Workflow

| Step | Action | Command |
|---|---|---|
| 1 | Get status | `kubectl get pods` |
| 2 | Describe | `kubectl describe pod <name>` |
| 3 | Check logs | `kubectl logs <name>` |
| 4 | Investigate | `kubectl exec -it <name> -- /bin/bash` |
| 5 | Fix and apply | Edit the manifest, then `kubectl apply -f ...` |
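Steps 1 and 2 can be partially automated. The sketch below parses the text output of `kubectl get pods` and flags Pods worth describing. The sample output is hard-coded for illustration; in practice you would feed it the output of `kubectl get pods --no-headers`, and the column layout is assumed to match kubectl's default table:

```python
SAMPLE = """\
nginx-good      1/1   Running            0   5m
nginx-crash     0/1   CrashLoopBackOff   3   2m
nginx-pending   0/1   Pending            0   1m
"""

def unhealthy_pods(get_pods_output: str) -> list[tuple[str, str]]:
    """Return (name, status) pairs for Pods not in a healthy state."""
    bad = []
    for line in get_pods_output.strip().splitlines():
        # Default kubectl columns: NAME READY STATUS RESTARTS AGE
        name, ready, status, *_ = line.split()
        if status not in ("Running", "Completed"):
            bad.append((name, status))
    return bad

# Print the next command to run for each problem Pod
for name, status in unhealthy_pods(SAMPLE):
    print(f"kubectl describe pod {name}   # status: {status}")
```

This only covers the first two signals; logs and interactive access still need a human (or an AI assistant) reading the output.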

Practice 1: Diagnose CrashLoopBackOff

Manifest (crash-loop.yaml):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crash-loop-app
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["python", "-c"]
      args:
        - |
          import os
          db_url = os.environ['DATABASE_URL']
          print(f"Connecting to {db_url}")
      resources:
        requests:
          memory: "64Mi"
          cpu: "50m"
        limits:
          memory: "128Mi"
          cpu: "100m"
  restartPolicy: Always
```

Troubleshooting Steps:

```bash
kubectl apply -f crash-loop.yaml
kubectl get pods              # STATUS: CrashLoopBackOff
kubectl logs crash-loop-app   # KeyError: 'DATABASE_URL'
```

Fix: Add the environment variable to the manifest and re-apply.
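For example, the fix could look like the fragment below (the connection string shown is a placeholder; use your real database URL, ideally via a Secret as covered in the previous chapter):

```yaml
containers:
  - name: app
    image: python:3.11-slim
    env:
      - name: DATABASE_URL
        value: "postgres://db.example.internal:5432/app"   # placeholder value
```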


Practice 2: Diagnose Pending Pod

Manifest (pending-pod.yaml):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
    - name: app
      image: python:3.11-slim
      command: ["sleep", "3600"]
      resources:
        requests:
          memory: "100Gi"   # Way too high
          cpu: "50"
        limits:
          memory: "100Gi"
          cpu: "50"
```

Troubleshooting Steps:

```bash
kubectl apply -f pending-pod.yaml
kubectl describe pod memory-hog
# Message: 0/1 nodes are available: 1 Insufficient memory.
```

Fix: Reduce memory/CPU requests to reasonable values.
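A corrected resources block might look like this (the exact numbers depend on your workload and node sizes; these are placeholder values in line with the Burstable pattern from Concept 2):

```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```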


Resource Management Best Practices

| Best Practice | Description |
|---|---|
| Set requests and limits | Define them for every production Pod. |
| Match QoS class to criticality | Set requests equal to limits for critical workloads (Guaranteed class). |
| Start conservative | Begin with low requests, then raise them based on `kubectl top` data. |
| Leave headroom | Set requests to baseline usage plus roughly 20%. |
| Enforce quotas | Apply a ResourceQuota per namespace to prevent resource exhaustion. |
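A minimal ResourceQuota sketch for the last practice. The namespace name and the numbers are placeholders; size them to your cluster:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: agents        # placeholder namespace
spec:
  hard:
    requests.cpu: "4"      # total CPU all Pods in the namespace may request
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

With a quota in place, Pods in the namespace must declare requests and limits, which also rules out the BestEffort class by accident.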

Try With AI

Collaborate with AI to troubleshoot a complex scenario.

Step 1: Deploy a broken Pod

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-app
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 8080
      resources:
        requests:
          memory: "64Mi"
    - name: sidecar
      image: curlimages/curl:latest
      command: ["sleep", "3600"]
```

Step 2: Ask AI for Analysis

Prompt AI: "I've deployed a multi-container Pod. Here is the `kubectl describe` output: [paste output]. What QoS class does this Pod get, and why? How should I set resources on the sidecar, which currently has none, so the Pod can qualify for a better class?"


Reflect on Your Skill

You built a kubernetes-deployment skill in Chapter 1. Test and improve it based on what you learned.

Identify Gaps

  • Does my skill include resource requests/limits and QoS classes?
  • Does it explain the status → describe → logs → exec workflow?
  • Does it cover failure states like OOMKilled and Pending?

Source: https://www.muhammadusmanakbar.com/book