Your agent deployment is running smoothly on Kubernetes. Traffic arrives in waves: quiet periods, then sudden bursts when multiple users send inference requests simultaneously.
If you set replicas to handle peak load, you're wasting money on idle Pods during quiet periods. If you set replicas for average load, you're rejecting requests during spikes.
Kubernetes solves this with the Horizontal Pod Autoscaler (HPA). It watches CPU usage (or memory, or custom metrics) and automatically scales Pods up when demand rises and down when demand falls. You specify min and max replicas; HPA keeps the cluster balanced between them.
This chapter teaches you how HPA works, why it matters for AI agents, and how to configure it on Docker Desktop Kubernetes.
AI inference workloads have unpredictable demand. Unlike web servers that handle requests with minimal CPU, inference requests are CPU-intensive: a single request can occupy most of a core for seconds, so even a modest burst of simultaneous users can saturate a fixed set of Pods.

HPA solves this by providing elasticity: replicas grow during bursts and shrink during quiet periods, automatically, within bounds you control.
The mental model: HPA is a feedback loop that keeps your cluster right-sized to current demand.
Before HPA can scale, it needs to measure Pod resource usage. This is the job of the metrics-server, a cluster component that collects CPU and memory metrics from every container.
Key insight: HPA without metrics-server cannot function. The cluster won't scale because it has nothing to measure.
On your Docker Desktop Kubernetes cluster:
If metrics-server is not listed, you need to install it.
An HPA resource is a declarative configuration that tells Kubernetes which workload to scale, the minimum and maximum replica counts, and the metric target to maintain (for example, 50% average CPU utilization).
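The shape of such a resource in the autoscaling/v2 API looks like this (the names and replica bounds here are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-hpa
spec:
  scaleTargetRef:          # which workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: agent
  minReplicas: 2           # never fewer than this
  maxReplicas: 10          # never more than this
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # keep average CPU near 50% of requests
```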
Breaking it down: scaleTargetRef names the Deployment to scale, minReplicas and maxReplicas bound the replica count, and the metrics block defines the average CPU utilization HPA tries to hold.
HPA scales in two directions. They have different behaviors to prevent problems.
When CPU exceeds the target (50% in our example), HPA adds replicas immediately; there is no delay by default, so spikes are absorbed before requests start queuing.

Example: three Pods averaging 90% CPU against a 50% target scale to six Pods.
When CPU falls below the target, HPA removes replicas cautiously: by default it waits out a 300-second stabilization window before acting.

Example: six Pods averaging 20% CPU against a 50% target eventually scale down to three, but only after the low CPU is sustained.
Key principle: Scale up fast (respond to spikes), scale down slow (don't over-react to dips).
A stabilization window prevents HPA from making rapid, contradictory scaling decisions.
Imagine no stabilization: CPU spikes for a few seconds and HPA adds Pods; CPU dips and HPA removes them; moments later the cycle repeats.

Replicas are constantly changing. This causes churn: Pods are terminated before they finish useful work, in-flight requests are dropped, and the scheduler wastes effort placing containers that will be deleted seconds later.
With 300-second scaleDown stabilization, HPA acts only on low CPU that has persisted for the full five-minute window.

In this scenario, the brief 30-second dip doesn't trigger scale down. Only sustained low CPU causes reduction.
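In the autoscaling/v2 API, these windows are configured under the HPA's behavior field; a minimal sketch:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # wait 5 minutes of sustained low CPU before removing Pods
  scaleUp:
    stabilizationWindowSeconds: 0    # react to spikes immediately (the default)
```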
Docker Desktop's Kubernetes includes metrics-server by default. Verify it's running:
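One way to check, assuming the standard installation in the kube-system namespace:

```bash
kubectl get deployment metrics-server -n kube-system
```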
Wait 30 seconds for metrics to accumulate, then check if metrics are available:
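A command that reports CPU and memory as percentages is:

```bash
kubectl top nodes
```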
If this shows CPU and memory percentages, metrics-server is working. If it shows <unknown>, wait another 30 seconds and retry.
Create a simple deployment that consumes CPU when stressed.
Manifest (save as agent-deployment.yaml):
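A sketch of such a manifest, assuming a Deployment named agent that runs a busybox busy loop (the image and names are illustrative; the 500m requests and limits match the figures used below):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
  template:
    metadata:
      labels:
        app: agent
    spec:
      containers:
      - name: agent
        image: busybox:1.36
        # Busy loop: consumes as much CPU as the limit allows
        command: ["sh", "-c", "while true; do :; done"]
        resources:
          requests:
            cpu: 500m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 64Mi
```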
Deploy it:
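Using the filename from the previous step:

```bash
kubectl apply -f agent-deployment.yaml
```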
Verify Pods are running:
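A standard way to list them:

```bash
kubectl get pods
```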
Check CPU usage:
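Per-Pod CPU in millicores comes from:

```bash
kubectl top pods
```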
Each Pod is using ~450m CPU (450 millicores). Since the limit is 500m, they're near their maximum.
Now create an HPA that scales this deployment based on CPU.
Manifest (save as agent-hpa.yaml):
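A sketch consistent with the values used in this chapter (50% CPU target, maximum of 10 replicas; the minimum of 2 and the resource names are assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # scale down slowly
```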
Deploy the HPA:
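Using the filename from the previous step:

```bash
kubectl apply -f agent-hpa.yaml
```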
Check the HPA status:
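Assuming the HPA is named agent-hpa as in the manifest filename:

```bash
kubectl get hpa agent-hpa
```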
The TARGETS column shows 90%/50%: the Pods' average CPU utilization is currently 90% of their requested CPU, while the target is 50%.
Since 90% > 50%, the HPA should decide to scale up. Let's watch it happen:
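One way to follow the HPA continuously (the resource name is the one assumed above):

```bash
kubectl get hpa agent-hpa --watch
```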
What happened: with utilization at 90% against a 50% target, HPA raised the desired replica count (ceil(3 × 90 / 50) = 6) and the Deployment created the additional Pods.
Check the Pods:
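As before:

```bash
kubectl get pods
```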
Six Pods are running. CPU is now distributed: each Pod uses less CPU because work is spread across more containers.
Now let's reduce CPU load and watch HPA scale down.
Delete the Deployment (which stops the busy loop):
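Assuming the Deployment is named agent:

```bash
kubectl delete deployment agent
```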
Check the HPA:
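Again with the assumed name:

```bash
kubectl get hpa agent-hpa
```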
The TARGETS column shows <unknown> because the Deployment no longer exists. HPA can't scale it.
Let's recreate the Deployment with a lower-CPU workload:
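One way to do this, reusing the same names and resources but replacing the busy loop with an idle loop (so CPU usage stays near zero while requests remain 500m):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
  template:
    metadata:
      labels:
        app: agent
    spec:
      containers:
      - name: agent
        image: busybox:1.36
        # Idle loop: sleeps instead of spinning, so CPU stays near 0%
        command: ["sh", "-c", "while true; do sleep 3600; done"]
        resources:
          requests:
            cpu: 500m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 64Mi
```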
Deploy it:
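Assuming the manifest was saved over agent-deployment.yaml:

```bash
kubectl apply -f agent-deployment.yaml
```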
Watch HPA:
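As before:

```bash
kubectl get hpa agent-hpa --watch
```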
What happened: the new workload uses almost no CPU, so utilization fell far below the 50% target; once the stabilization window elapsed, HPA reduced the replica count back toward the minimum.
The stabilization window prevented HPA from immediately thrashing. It waited 5 minutes of sustained low CPU before scaling down.
HPA calculates desired replicas with this formula: desiredReplicas = ceil(currentReplicas × currentMetricValue ÷ targetMetricValue).
Example 1: Scale up. Three replicas at 90% average CPU with a 50% target: ceil(3 × 90 / 50) = ceil(5.4) = 6 replicas.

Example 2: Scale down. Six replicas at 20% average CPU with a 50% target: ceil(6 × 20 / 50) = ceil(2.4) = 3 replicas.
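The arithmetic can be sketched with integer ceiling division (the replica counts and utilizations here are illustrative, using the chapter's 50% target):

```shell
# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
# Integer ceiling of a/c: (a + c - 1) / c
echo $(( (3 * 90 + 50 - 1) / 50 ))   # scale up:   ceil(5.4) = 6
echo $(( (6 * 20 + 50 - 1) / 50 ))   # scale down: ceil(2.4) = 3
```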
This explains why HPA scales up faster than down: the ceiling function always rounds the desired count upward, scale-up takes effect immediately, and scale-down must additionally wait out the stabilization window.
HPA always prefers to provision more Pods than fewer, ensuring requests don't get rejected.
1. Set requests and limits appropriately
HPA computes utilization as CPU used divided by CPU requested, so requests must reflect what your agent actually needs at baseline. A request set too low inflates the utilization percentage and triggers premature scaling; one set too high masks real load.
2. Use a target utilization between 50% and 80%. Lower targets leave headroom to absorb spikes while new Pods start up; higher targets save money but risk saturation during bursts.
3. Adjust stabilization windows based on workload
For bursty inference workloads: keep scale-up immediate and use a long scaleDown window (300 seconds or more) so brief lulls don't shed capacity you'll need again moments later.

For steady-state services: a shorter scaleDown window is fine, because sustained drops in load genuinely mean less capacity is needed.
4. Monitor scaling events
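The scaling history appears in the HPA's events (the resource name is the one assumed earlier):

```bash
kubectl describe hpa agent-hpa
```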
Events show every scaling decision and why it was made. Use this to verify HPA is responding correctly.
Problem: HPA shows <unknown> for TARGETS
Causes: metrics-server is not installed or not yet ready, metrics haven't had time to accumulate, or the target Pods don't declare CPU requests (utilization is computed against requests, so without them there is nothing to measure).

Fix: confirm metrics-server is running in kube-system, wait a minute and retry, and make sure every container in the Deployment sets resources.requests.cpu.
Problem: HPA not scaling despite high CPU
Cause: HPA calculated it should scale to 10, but Deployment can't create Pods fast enough.
Fix: check Pod events for scheduling failures; on Docker Desktop, insufficient CPU in the VM is a common cause. Lower the Pods' CPU requests or increase the resources allocated to Docker Desktop.
When done experimenting:
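Assuming the manifest filenames used above:

```bash
kubectl delete -f agent-hpa.yaml
kubectl delete -f agent-deployment.yaml
```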
You now understand how HPA scales based on CPU. Real-world systems often need more sophisticated scaling based on custom metrics — things like queue depth, inference latency, or model confidence scores.
Setup: You're deploying an agent that processes inference requests from a queue. Load varies wildly: sometimes empty, sometimes 100 requests queued.
Your challenge: Design an HPA configuration that scales based on queue depth (a custom metric) instead of CPU.
1. Research custom metrics. Ask AI: "How do I configure Kubernetes HPA to scale based on custom metrics like queue depth? What components are needed beyond the standard metrics-server?"

2. Design the configuration. Based on AI's explanation, design an HPA manifest that scales on queue depth rather than CPU, sets sensible minimum and maximum replica counts, and picks a threshold consistent with your latency goal.

3. Understand the components. Ask AI: "In a custom metrics setup, what's the relationship between Prometheus, custom-metrics-api, and HPA? How does the data flow?"

4. Iterate on thresholds. Discuss with AI: "If an agent takes 30 seconds to process one request, and I want maximum 2-second response time, what queue depth should trigger scaling to 10?"
Expected outcome: You'll understand that CPU-based scaling is simple but crude. Custom metrics enable precise control over system behavior. You don't scale "when CPU is high"—you scale "when queue depth exceeds healthy levels."
You built a kubernetes-deployment skill in Chapter 0. Test and improve it based on what you learned.
Ask yourself: does the skill account for autoscaling, or does it assume a fixed replica count? Does it mention metrics-server as a prerequisite and set CPU requests on every container?

If you found gaps: revise the skill to include an HPA manifest, appropriate requests and limits, and stabilization settings that fit your workload.