Module 7 takes the agent you built in Module 6 and turns it into a production cloud service. You'll containerize the stack, orchestrate it on Kubernetes, automate delivery, and operate it with observability, security, and cost controls. The goal: a reliable Digital FTE that runs 24/7 for real users.
Prerequisites: Modules 4-6. You need a working agent service to deploy.
You deployed your Task API with 3 replicas. At 2 AM, all three pods sit idle, consuming resources and costing money. At noon, traffic spikes and 3 replicas cannot keep up—users see latency, requests queue, and eventually fail. Fixed replica counts waste money during quiet periods and fail during busy ones.
Autoscaling matches capacity to demand automatically. Kubernetes provides Horizontal Pod Autoscaler (HPA) for scaling replica counts based on metrics. Vertical Pod Autoscaler (VPA) right-sizes individual pods. For event-driven workloads—like AI agents processing queue messages—KEDA enables scaling based on queue depth, Prometheus metrics, or even scaling to zero when idle.
This lesson teaches you to configure HPA for CPU-based scaling, understand VPA for resource optimization, install KEDA for event-driven autoscaling, and implement scale-to-zero for cost efficiency. By the end, your services will scale up when needed and scale down (or to zero) when idle.
Kubernetes autoscaling operates through a control loop. A controller periodically checks metrics, compares them to targets, and adjusts replicas or resources accordingly.
HPA scales the number of pod replicas based on observed metrics. When CPU utilization exceeds the target (say, 70%), HPA adds pods; when utilization drops well below it, HPA removes them.
HPA requires metrics-server, which supplies the CPU and memory metrics the autoscaler reads. Check whether it is already installed; if not, install it before creating an HPA.
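A sketch of the check and install, assuming the standard `kube-system` namespace and the upstream metrics-server release manifest:

```shell
# Check whether metrics-server is installed
kubectl get deployment metrics-server -n kube-system

# If not, install it from the upstream release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Confirm metrics are flowing (can take a minute after install)
kubectl top pods
```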
Define HPA for the Task API:
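A minimal manifest might look like the following; the Deployment name `task-api` and the 2–10 replica range are assumptions based on this lesson's examples:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that utilization is computed against the pods' CPU *requests*, so the Deployment must set resource requests for HPA to work.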
Apply the manifest and verify with `kubectl get hpa task-api`. The TARGETS column shows current utilization versus the target, for example `23%/70%`.
Scale on both CPU and memory:
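The HPA's `metrics` list can carry both resources; the 80% memory target here is an illustrative assumption:

```yaml
# Fragment of the HPA spec above
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```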
HPA scales based on whichever metric requires more replicas.
Generate load and watch scaling:
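One way to exercise the autoscaler; the busybox image and the in-cluster service URL are placeholders for your environment:

```shell
# Terminal 1: watch the HPA react
kubectl get hpa task-api --watch

# Terminal 2: generate sustained load against the service
kubectl run load-generator --rm -it --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://task-api; done"
```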
In terminal 1, watch the REPLICAS column: once CPU exceeds the 70% target, HPA scales from 2 to 4 replicas.
Control how aggressively HPA scales with the `behavior` section. Stabilization windows make HPA wait for metrics to hold steady before acting, and policies cap how many pods may be added or removed per period.
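A sketch of a `behavior` section under the HPA's `spec`; the specific windows and policy values are illustrative, not prescriptive:

```yaml
# Fragment of the HPA spec
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # require 60s of sustained high load before adding pods
    policies:
      - type: Pods
        value: 4                      # add at most 4 pods per period
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # require 5 minutes of low load before removing pods
    policies:
      - type: Percent
        value: 50                     # remove at most half the pods per period
        periodSeconds: 60
```

Fast, capped scale-up with slow scale-down is a common pattern: it absorbs spikes quickly while avoiding flapping when traffic briefly dips.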
VPA adjusts CPU and memory requests for pods based on historical usage. Instead of adding more pods, VPA makes existing pods bigger (or smaller).
VPA ships separately from core Kubernetes; install it from the kubernetes/autoscaler project before creating VerticalPodAutoscaler resources.
Update modes: `Off` generates recommendations without changing anything, `Initial` applies recommendations only when pods are created, and `Auto` evicts running pods and recreates them with the updated requests.
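A recommendation-only VPA for the Task API might look like this (the `task-api` Deployment name is an assumption):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: task-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  updatePolicy:
    updateMode: "Off"   # recommendations only; no pods are evicted
```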
Inspect recommendations with `kubectl describe vpa task-api`. In the recommendations section, `lowerBound` and `upperBound` bracket reasonable requests, and `target` is VPA's recommended setting; if it differs noticeably from the pod's current requests, the pod is over- or under-provisioned.
VPA cannot coexist with HPA on CPU/memory. Both try to control the same resources.
Solutions: run VPA in `Off` mode purely for recommendations while HPA owns the replica count, or let HPA scale on custom or external metrics while VPA manages CPU and memory requests.
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with support for many metric sources and with scale-to-zero capability. KEDA is essential for event-driven workloads: agents consuming queue messages (Kafka, RabbitMQ, SQS), scaling on Prometheus metrics such as request rate or latency, and shutting idle services down entirely.
KEDA creates and manages HPAs automatically based on ScaledObject definitions.
Install KEDA, then verify that its operator pods are running in the `keda` namespace.
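One documented install route is the official Helm chart:

```shell
# Add the KEDA chart repository and install into its own namespace
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Verify the operator and metrics apiserver pods are Running
kubectl get pods -n keda
```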
ScaledObject tells KEDA what to scale and based on which metrics:
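A minimal ScaledObject using KEDA's built-in `cpu` scaler, as a sketch (the `task-api` name and replica bounds are assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api
spec:
  scaleTargetRef:
    name: task-api        # Deployment to scale (apps/v1 Deployment is the default kind)
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: cpu           # built-in CPU scaler; pods must declare resource requests
      metricType: Utilization
      metadata:
        value: "70"
```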
Apply the manifest and check `kubectl get scaledobject`: the object should report `READY: True`, and `kubectl get hpa` shows the HPA KEDA created for it.
The Prometheus scaler queries your Prometheus server for custom metrics—request rate, queue depth, latency percentiles, or any metric your application exposes.
Scale based on requests per second:
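A sketch of a request-rate trigger; the Prometheus address, the `http_requests_total` metric name, and the threshold are assumptions for your environment:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api-rps
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(rate(http_requests_total{app="task-api"}[2m]))
        threshold: "100"   # target requests/sec per replica
```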
How it works: on each polling interval, KEDA runs the query against Prometheus, divides the result by the threshold, and sets the desired replica count accordingly. A result of 400 requests per second against a threshold of 100 yields 4 replicas.
Scale when response times degrade:
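A sketch of a latency-based trigger, assuming the service exposes a standard `http_request_duration_seconds` histogram:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api-latency
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        # p95 latency in seconds, computed from histogram buckets
        query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="task-api"}[5m])) by (le))
        threshold: "0.5"   # scale up when p95 exceeds 500ms
```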
When p95 latency exceeds 500ms, KEDA adds pods to reduce load per instance.
Generate sustained load (with `hey`, `ab`, or a simple request loop) and watch `kubectl get scaledobject,hpa,pods -w` as KEDA scales the deployment up.
For AI agents processing messages from Kafka, scale based on consumer lag—how many unprocessed messages are waiting.
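A sketch of a Kafka-lag trigger; the broker address, consumer group, topic, and threshold are assumptions for illustration:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-agent
spec:
  scaleTargetRef:
    name: task-agent
  minReplicaCount: 0       # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.default.svc:9092
        consumerGroup: task-agent
        topic: tasks
        lagThreshold: "50"   # target unprocessed messages per replica
```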
How it works: KEDA polls the consumer group's lag on each partition; when total lag exceeds `lagThreshold` times the current replica count, it adds consumers, capped at the number of partitions in the topic.
For production Kafka clusters requiring authentication:
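Credentials are supplied through a TriggerAuthentication resource rather than inline. A sketch, assuming a Secret named `kafka-credentials` holding SASL settings:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
spec:
  secretTargetRef:
    - parameter: sasl       # e.g. "plaintext" or "scram_sha256"
      name: kafka-credentials
      key: sasl
    - parameter: username
      name: kafka-credentials
      key: username
    - parameter: password
      name: kafka-credentials
      key: password
```

The Kafka trigger then references it with `authenticationRef: {name: kafka-auth}` alongside its metadata.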
Scale-to-zero is KEDA's defining feature. When no work exists, why pay for idle pods?
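On a ScaledObject this comes down to two fields; 300 seconds is KEDA's documented default cooldown:

```yaml
# Fragment of a ScaledObject spec
minReplicaCount: 0     # run no pods when the trigger reports no work
cooldownPeriod: 300    # seconds of inactivity before the last replica is removed
```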
Set `minReplicaCount: 0` on the ScaledObject. When the trigger reports no work for the cooldown period, KEDA scales the deployment to zero and the pod terminates. When new tasks arrive, KEDA scales back up.
Scale-to-zero introduces cold start latency. The first request waits for pod scheduling, image pull (unless cached on the node), container startup, and application initialization.
Mitigation strategies: keep `minReplicaCount: 1` for latency-sensitive paths, pre-pull images onto nodes, minimize container startup time, or front the service with a queue so callers never block on a cold start.
Create HPA for your Task API:
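One way to create it imperatively, assuming the Deployment is named `task-api`:

```shell
kubectl autoscale deployment task-api --cpu-percent=70 --min=2 --max=10
```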
Verify with `kubectl get hpa task-api`; the TARGETS column should show current CPU utilization against the 70% target.
Install KEDA in your cluster, then verify with `kubectl get pods -n keda` that the `keda-operator` and metrics apiserver pods are Running.
Configure KEDA to scale based on request rate with a `prometheus`-type ScaledObject, then verify with `kubectl get scaledobject` that it reports `READY: True` and that a backing HPA was created.
With no traffic, watch the deployment scale to zero: `kubectl get deployment task-api -w`. After the cooldown period, READY drops to `0/0`.
Generate traffic and watch pods scale up: within the polling interval (30 seconds by default) KEDA reactivates the deployment to at least one replica, and the HPA takes over from there.
You built a traffic-engineer skill in Lesson 0. Based on what you learned about autoscaling:
Your skill should ask: What is the traffic pattern (steady, spiky, periodic)? Which metric best reflects load: CPU, request rate, or queue depth? Is cold-start latency acceptable, or must at least one replica stay warm? What are the cost constraints?
Give the skill the two templates it should generate from: a Prometheus-based ScaledObject and an HPA with a `behavior` section.
Ask your traffic-engineer skill to generate an HPA for the Task API, including scale-up and scale-down behavior.
What you're learning: AI generates HPA with behavior configuration. Review the output—did AI include the behavior section with stabilization windows? Are the scaling policies correct for your requirements?
Check AI's output: does it target the right deployment, and do the stabilization windows suit your traffic pattern? If the scale-down is too aggressive, provide feedback such as: "Use a 300-second scale-down stabilization window and remove at most 50% of pods per minute."
Extend to event-driven scaling: ask the skill for a KEDA ScaledObject driven by a Prometheus request-rate query.
What you're learning: AI generates KEDA configuration. Verify the Prometheus query is correct and the ScaledObject references the right deployment.
Before applying: dry-run the manifest with `kubectl apply --dry-run=server -f`, confirm the target deployment exists, and check that the Prometheus query returns data.
This iteration—requirements, generation, validation, refinement—produces production-ready autoscaling configurations.
Autoscaling affects resource consumption and costs. Start with conservative settings (higher minReplicaCount, longer cooldownPeriod) and tune based on observed behavior. Monitor your cluster's node autoscaler to ensure it can provision nodes for scaled-up workloads. Test scale-to-zero behavior in staging before production—cold starts may impact user experience.