Autoscaling with HPA, VPA & KEDA

Module 7 takes the agent you built in Module 6 and turns it into a production cloud service. You'll containerize the stack, orchestrate it on Kubernetes, automate delivery, and operate it with observability, security, and cost controls. The goal: a reliable Digital FTE that runs 24/7 for real users.

Prerequisites: Modules 4-6. You need a working agent service to deploy.

You deployed your Task API with 3 replicas. At 2 AM, all three pods sit idle, consuming resources and costing money. At noon, traffic spikes and 3 replicas cannot keep up—users see latency, requests queue, and eventually fail. Fixed replica counts waste money during quiet periods and fail during busy ones.

Autoscaling matches capacity to demand automatically. Kubernetes provides Horizontal Pod Autoscaler (HPA) for scaling replica counts based on metrics. Vertical Pod Autoscaler (VPA) right-sizes individual pods. For event-driven workloads—like AI agents processing queue messages—KEDA enables scaling based on queue depth, Prometheus metrics, or even scaling to zero when idle.

This lesson teaches you to configure HPA for CPU-based scaling, understand VPA for resource optimization, install KEDA for event-driven autoscaling, and implement scale-to-zero for cost efficiency. By the end, your services will scale up when needed and scale down (or to zero) when idle.


How Autoscaling Works in Kubernetes

Kubernetes autoscaling operates through a control loop. A controller periodically checks metrics, compares them to targets, and adjusts replicas or resources accordingly.
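Concretely, the controller computes a desired replica count from the ratio of current to target metric values. This is the standard HPA formula from the Kubernetes documentation:

text
desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

For example, 2 replicas running at 90% CPU against a 70% target gives ceil[2 * (90/70)] = 3 replicas.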

The Three Autoscaling Approaches

| Approach | What It Scales | Based On | Best For |
| --- | --- | --- | --- |
| HPA | Replica count | CPU, memory, custom metrics | Request-based workloads |
| VPA | Pod resources (CPU/memory) | Historical usage | Right-sizing pods |
| KEDA | Replica count (including to zero) | Any metric source | Event-driven, queues, serverless |

When Each Approach Applies

text
┌─────────────────────────────────────────────┐
│              Scaling Decision               │
└─────────────────────────────────────────────┘
                      │
  ┌───────────────────┼───────────────────┐
  │                   │                   │
  ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│      HPA        │ │      VPA        │ │      KEDA       │
│  More replicas  │ │  Bigger pods    │ │  Event-driven   │
│  Same pod size  │ │  Same replica   │ │  Scale to zero  │
└─────────────────┘ └─────────────────┘ └─────────────────┘

Horizontal Pod Autoscaler (HPA)

HPA scales the number of pod replicas based on observed metrics. When CPU utilization exceeds 80%, HPA adds more pods. When utilization drops, HPA removes pods.

Prerequisites: Metrics Server

HPA requires metrics-server to provide CPU and memory metrics:

bash
# Check if metrics-server is installed
kubectl get deployment metrics-server -n kube-system

Output (if installed):

text
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           30d

If not installed:

bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Output:

text
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created

Creating an HPA

Define HPA for the Task API:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-api-hpa
  namespace: task-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Apply and verify:

bash
kubectl apply -f task-api-hpa.yaml
kubectl get hpa -n task-api

Output:

text
NAME           REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
task-api-hpa   Deployment/task-api   23%/70%   2         10        2          30s

The TARGETS column shows current utilization (23%) versus target (70%).
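To see the reasoning behind each scaling decision, kubectl describe shows the HPA's conditions and recent events:

bash
# Show HPA conditions, metric status, and scaling events
kubectl describe hpa task-api-hpa -n task-api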

Understanding HPA Fields

| Field | Purpose |
| --- | --- |
| scaleTargetRef | The Deployment (or other workload) to scale |
| minReplicas | Never scale below this count |
| maxReplicas | Never scale above this count |
| metrics | What to measure and target values |
| averageUtilization | Target percentage of the resource request |

Scaling Based on Multiple Metrics

Scale on both CPU and memory:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-api-hpa-multi
  namespace: task-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

HPA scales based on whichever metric requires more replicas.
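For example, suppose 2 replicas run at 90% CPU (target 70%) and 60% memory (target 80%). Applying the formula above per metric:

text
cpu:    ceil[ 2 * (90 / 70) ] = 3 replicas
memory: ceil[ 2 * (60 / 80) ] = 2 replicas
result: max(3, 2)             = 3 replicas

CPU demands more replicas, so HPA scales to 3.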

Observing HPA Behavior

Generate load and watch scaling:

bash
# In terminal 1: Watch HPA
kubectl get hpa -n task-api -w

# In terminal 2: Generate load
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://task-api.task-api.svc.cluster.local:8080/api/tasks; done"

Output (terminal 1):

text
NAME           REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
task-api-hpa   Deployment/task-api   23%/70%   2         10        2          5m
task-api-hpa   Deployment/task-api   68%/70%   2         10        2          6m
task-api-hpa   Deployment/task-api   85%/70%   2         10        3          7m
task-api-hpa   Deployment/task-api   72%/70%   2         10        4          8m

HPA detected CPU exceeding 70% and scaled from 2 to 4 replicas.

HPA Scaling Behavior Configuration

Control how aggressively HPA scales:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-api-hpa-tuned
  namespace: task-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60

Behavior settings:

| Setting | Effect |
| --- | --- |
| scaleUp.stabilizationWindowSeconds | Wait before scaling up again |
| scaleUp.policies (type: Percent) | Scale up by a percentage of current replicas |
| scaleDown.stabilizationWindowSeconds | Wait before scaling down (avoids flapping) |
| scaleDown.policies | Scale down gradually (50% per minute) |

Vertical Pod Autoscaler (VPA)

VPA adjusts CPU and memory requests for pods based on historical usage. Instead of adding more pods, VPA makes existing pods bigger (or smaller).

When VPA Helps

| Scenario | VPA Value |
| --- | --- |
| Pods frequently OOMKilled | VPA recommends higher memory |
| Pods use 10% of requested CPU | VPA recommends lower requests |
| Initial resource sizing unknown | VPA provides data-driven recommendations |

Installing VPA

bash
# Clone the VPA repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler

# Install VPA components
./hack/vpa-up.sh

Output:

text
customresourcedefinition.apiextensions.k8s.io/verticalpodautoscalers.autoscaling.k8s.io created
customresourcedefinition.apiextensions.k8s.io/verticalpodautoscalercheckpoints.autoscaling.k8s.io created
deployment.apps/vpa-recommender created
deployment.apps/vpa-updater created
deployment.apps/vpa-admission-controller created

Creating a VPA

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: task-api-vpa
  namespace: task-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: task-api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi

Update modes:

| Mode | Behavior |
| --- | --- |
| Off | Recommendations only (no changes) |
| Initial | Set resources on pod creation only |
| Auto | Update resources (requires pod restart) |

Viewing VPA Recommendations

bash
kubectl get vpa task-api-vpa -n task-api -o yaml

Output (recommendations section):

yaml
status:
  recommendation:
    containerRecommendations:
      - containerName: task-api
        lowerBound:
          cpu: 25m
          memory: 262144k
        target:
          cpu: 50m
          memory: 524288k
        upperBound:
          cpu: 200m
          memory: 1Gi

This tells you whether the pod currently requests too much or too little CPU and memory. The target is VPA's recommended setting.
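To pull just the recommendation without reading the full YAML, a jsonpath query works (assuming the single-container Deployment shown above):

bash
# Print only the recommended target for the first container
kubectl get vpa task-api-vpa -n task-api \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'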

VPA Limitations

VPA cannot coexist with HPA on CPU/memory. Both try to control the same resources.

Solutions:

  1. Use VPA for recommendations only (updateMode: Off)
  2. Use HPA for replica scaling + VPA for right-sizing during deployments (one variant is sketched below)
  3. Use KEDA (which can work alongside VPA)
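A minimal sketch of one way to implement option 2, assuming the Task API Deployment from earlier: VPA's controlledResources field restricts it to memory, so an HPA scaling on CPU no longer competes with it over the same resource.

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: task-api-vpa-memory
  namespace: task-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: task-api
        controlledResources: ["memory"]  # VPA adjusts memory only; CPU stays with HPA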

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with support for any metric source and scale-to-zero capability. KEDA is essential for:

  • Queue-based workers (Kafka, RabbitMQ, SQS)
  • Cron-based scaling (scale up at 9 AM, down at 6 PM)
  • Prometheus metrics (custom application metrics)
  • Cost optimization (scale to zero when idle)

How KEDA Works

text
┌─────────────────────────────────────────────────────────────────────┐
│                          KEDA Architecture                          │
└─────────────────────────────────────────────────────────────────────┘

┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐
│  Prometheus  │     │    Kafka     │     │     Cloud Queues     │
│   Metrics    │     │    Topics    │     │   (SQS, Pub/Sub)     │
└──────┬───────┘     └──────┬───────┘     └──────────┬───────────┘
       │                    │                        │
       └────────────────────┼────────────────────────┘
                            │
                            ▼
                  ┌──────────────────┐
                  │  KEDA Operator   │
                  │  (watches        │
                  │  ScaledObjects)  │
                  └────────┬─────────┘
                           │
             ┌─────────────┼─────────────┐
             │             │             │
             ▼             ▼             ▼
       ┌──────────┐  ┌──────────┐  ┌──────────┐
       │   HPA    │  │   HPA    │  │   HPA    │
       │  (KEDA-  │  │  (KEDA-  │  │  (KEDA-  │
       │ managed) │  │ managed) │  │ managed) │
       └──────────┘  └──────────┘  └──────────┘

KEDA creates and manages HPAs automatically based on ScaledObject definitions.
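You can observe this delegation directly: after applying a ScaledObject, an HPA appears alongside it with a keda-hpa- prefix (exact naming can vary across KEDA versions):

bash
# List the HPA that KEDA manages on your behalf
kubectl get hpa -n task-api
# Typically shows a name like: keda-hpa-task-api-scaledobject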

Installing KEDA

bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

Output:

text
NAME: keda
NAMESPACE: keda
STATUS: deployed
REVISION: 1

Verify installation:

bash
kubectl get pods -n keda

Output:

text
NAME                                              READY   STATUS    RESTARTS   AGE
keda-admission-webhooks-5f4c6d8f7-xxxxx           1/1     Running   0          60s
keda-operator-7b9c4d6f5-xxxxx                     1/1     Running   0          60s
keda-operator-metrics-apiserver-6c8f5d4b7-xxxxx   1/1     Running   0          60s

ScaledObject: The Core KEDA Resource

ScaledObject tells KEDA what to scale and based on which metrics:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api-scaledobject
  namespace: task-api
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 0     # Scale to zero!
  maxReplicaCount: 10
  pollingInterval: 15    # Check metrics every 15 seconds
  cooldownPeriod: 300    # Wait 5 minutes before scaling down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total{service="task-api"}[2m]))
        threshold: "100"

Apply and verify:

bash
kubectl apply -f task-api-scaledobject.yaml
kubectl get scaledobject -n task-api

Output:

text
NAME                    SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   AGE
task-api-scaledobject   apps/v1.Deployment   task-api          0     10    prometheus                    True    True     30s

Understanding ScaledObject Fields

| Field | Purpose |
| --- | --- |
| scaleTargetRef | Deployment to scale |
| minReplicaCount | Minimum pods (0 = scale to zero) |
| maxReplicaCount | Maximum pods |
| pollingInterval | How often to check metrics (seconds) |
| cooldownPeriod | How long to wait before scaling down |
| triggers | What metrics drive scaling |

Prometheus Scaler

The Prometheus scaler queries your Prometheus server for custom metrics—request rate, queue depth, latency percentiles, or any metric your application exposes.

Scaling Based on Request Rate

Scale based on requests per second:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api-request-rate
  namespace: task-api
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: task_api_requests_per_second
        query: |
          sum(rate(http_requests_total{service="task-api"}[1m]))
        threshold: "50"

How it works:

  • Query calculates requests per second over the last minute
  • When requests exceed 50/second, KEDA adds pods
  • Each additional pod handles ~50 requests/second
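At a sustained 350 requests/second, for instance, KEDA targets ceil(350 / 50) = 7 replicas, clamped to the configured bounds:

text
current rate: 350 req/s, threshold: 50 req/s per pod
desiredReplicas = ceil(350 / 50) = 7   (within the 1..20 bounds above)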

Scaling Based on Latency

Scale when response times degrade:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api-latency
  namespace: task-api
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 2
  maxReplicaCount: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: task_api_p95_latency
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="task-api"}[5m])) by (le))
        threshold: "0.5"

When p95 latency exceeds 500ms, KEDA adds pods to reduce load per instance.

Testing Prometheus Scaler

Generate load and observe scaling:

bash
# Watch ScaledObject status
kubectl get scaledobject task-api-request-rate -n task-api -w

# Generate traffic
hey -n 10000 -c 100 http://task-api.example.com/api/tasks

Output:

text
NAME                    SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   READY   ACTIVE   REPLICAS
task-api-request-rate   apps/v1.Deployment   task-api          1     20    True    True     1
task-api-request-rate   apps/v1.Deployment   task-api          1     20    True    True     3
task-api-request-rate   apps/v1.Deployment   task-api          1     20    True    True     7

Kafka Scaler for Event-Driven Workloads

For AI agents processing messages from Kafka, scale based on consumer lag—how many unprocessed messages are waiting.

Kafka Consumer Lag Scaling

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-agent-kafka-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: ai-agent-worker
  minReplicaCount: 0    # Scale to zero when no messages
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.kafka.svc.cluster.local:9092
        consumerGroup: ai-agent-consumers
        topic: ai-tasks
        lagThreshold: "10"

How it works:

  • KEDA queries Kafka for consumer group lag
  • When 10+ messages are waiting per partition, scale up
  • When queue is empty, scale to zero
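To cross-check what KEDA sees, Kafka's own CLI reports the same per-partition lag (the script ships with the Kafka distribution; broker address and group match the ScaledObject above):

bash
# Show current lag for the consumer group KEDA watches
kafka-consumer-groups.sh \
  --bootstrap-server kafka.kafka.svc.cluster.local:9092 \
  --describe --group ai-agent-consumers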

Kafka Scaler with Authentication

For production Kafka clusters requiring authentication:

yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
  namespace: ai-agents
spec:
  secretTargetRef:
    - parameter: sasl
      name: kafka-secrets
      key: sasl
    - parameter: username
      name: kafka-secrets
      key: username
    - parameter: password
      name: kafka-secrets
      key: password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-agent-kafka-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: ai-agent-worker
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: kafka
      authenticationRef:
        name: kafka-auth
      metadata:
        bootstrapServers: kafka.kafka.svc.cluster.local:9092
        consumerGroup: ai-agent-consumers
        topic: ai-tasks
        lagThreshold: "10"

Scale-to-Zero Pattern

Scale-to-zero is KEDA's defining feature. When no work exists, why pay for idle pods?

Configuring Scale-to-Zero

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-worker-scale-to-zero
  namespace: task-api
spec:
  scaleTargetRef:
    name: task-worker
  minReplicaCount: 0    # Key setting
  maxReplicaCount: 10
  cooldownPeriod: 300   # Wait 5 minutes before scaling to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: pending_tasks
        query: sum(task_queue_depth{service="task-worker"})
        threshold: "1"

Observing Scale-to-Zero

bash
# Watch pods
kubectl get pods -n task-api -w

# After 5 minutes of no activity...

Output:

text
NAME                          READY   STATUS        RESTARTS   AGE
task-worker-7b9c4d6f5-xxxxx   1/1     Running       0          10m
task-worker-7b9c4d6f5-xxxxx   1/1     Terminating   0          15m

The pod terminates when there's no work. When new tasks arrive, KEDA scales back up.

Cold Start Considerations

Scale-to-zero introduces cold start latency. The first request waits for:

  1. KEDA to detect the metric change
  2. Pod scheduling and startup
  3. Container initialization
  4. Application readiness

Mitigation strategies:

| Strategy | Implementation |
| --- | --- |
| Fast startup | Optimize container startup time |
| Readiness probes | Ensure pods are ready before receiving traffic |
| Minimum replicas | Keep minReplicaCount: 1 for latency-sensitive services |
| Pre-warming | Scale up before expected traffic (cron trigger, sketched below) |
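For the pre-warming row, KEDA's cron scaler can hold a warm floor of replicas during known busy hours. A minimal sketch, with the timezone and schedule as assumptions:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-api-prewarm
  namespace: task-api
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Karachi    # Assumed timezone
        start: 0 9 * * *          # Scale up at 9 AM
        end: 0 18 * * *           # Scale back down at 6 PM
        desiredReplicas: "2"      # Warm floor while the window is active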

Choosing the Right Autoscaler

| Workload Type | Recommended Approach |
| --- | --- |
| Web API (HTTP requests) | HPA on CPU, or KEDA with Prometheus |
| Background workers | KEDA with queue scaler |
| AI inference endpoints | KEDA with scale-to-zero |
| Batch processing | KEDA with Kafka/queue scaler |
| Cost optimization needed | KEDA (scale-to-zero capability) |
| Resource right-sizing | VPA (recommendations mode) |

Exercises

Exercise 1: Configure HPA for CPU Scaling

Create HPA for your Task API:

bash
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: exercise-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
EOF

Verify:

bash
kubectl get hpa exercise-hpa

Expected Output:

text
NAME           REFERENCE             TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
exercise-hpa   Deployment/task-api   <unknown>/50%   1         5         1          30s

Exercise 2: Install KEDA

Install KEDA in your cluster:

bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

Verify:

bash
kubectl get pods -n keda

Expected Output:

text
NAME                                    READY   STATUS    RESTARTS   AGE
keda-admission-webhooks-xxxxx           1/1     Running   0          60s
keda-operator-xxxxx                     1/1     Running   0          60s
keda-operator-metrics-apiserver-xxxxx   1/1     Running   0          60s

Exercise 3: Create ScaledObject with Prometheus

Configure KEDA to scale based on request rate:

bash
kubectl apply -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: exercise-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: task-api
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 60
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: http_requests
        query: sum(task_queue_depth{service="task-api"})
        threshold: "10"
EOF

Verify:

bash
kubectl get scaledobject exercise-scaledobject

Expected Output:

text
NAME                    SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   READY   ACTIVE
exercise-scaledobject   apps/v1.Deployment   task-api          0     10    True    False

Exercise 4: Observe Scale-to-Zero

With no traffic, watch the deployment scale to zero:

bash
# Watch pods (wait for the cooldownPeriod to pass)
kubectl get pods -l app=task-api -w

Expected Output (after cooldown):

text
No resources found in default namespace.

Generate traffic and watch pods scale up:

bash
curl http://localhost:8080/api/tasks
kubectl get pods -l app=task-api

Expected Output:

text
NAME                       READY   STATUS    RESTARTS   AGE
task-api-7b9c4d6f5-xxxxx   1/1     Running   0          10s

Reflect on Your Skill

You built a traffic-engineer skill in Lesson 0. Based on what you learned about autoscaling:

Add Autoscaling Decision Logic

Your skill should ask:

| Question | If Yes | If No |
| --- | --- | --- |
| Need scale-to-zero? | Use KEDA | HPA may suffice |
| Event-driven workload (queues)? | Use KEDA with queue scaler | Use HPA or KEDA with Prometheus |
| Unknown resource requirements? | Add VPA in recommendation mode | Use established limits |
| Cost-sensitive environment? | KEDA with aggressive scale-down | Higher minReplicas for stability |

Add ScaledObject Templates

Prometheus-based ScaledObject:

yaml
# Template: prometheus-scaledobject
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: {{ service }}-scaledobject
  namespace: {{ namespace }}
spec:
  scaleTargetRef:
    name: {{ deployment }}
  minReplicaCount: {{ min_replicas }}
  maxReplicaCount: {{ max_replicas }}
  cooldownPeriod: {{ cooldown_seconds }}
  triggers:
    - type: prometheus
      metadata:
        serverAddress: {{ prometheus_url }}
        metricName: {{ metric_name }}
        query: {{ query }}
        threshold: "{{ threshold }}"

HPA template:

yaml
# Template: cpu-hpa
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ service }}-hpa
  namespace: {{ namespace }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ deployment }}
  minReplicas: {{ min_replicas }}
  maxReplicas: {{ max_replicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ cpu_target }}

Update Troubleshooting Guidance

| Symptom | Check | Likely Cause |
| --- | --- | --- |
| HPA shows <unknown> targets | Is metrics-server running? | Install metrics-server |
| KEDA not scaling | ScaledObject READY status | Check trigger configuration |
| Pods not scaling to zero | cooldownPeriod too long? | Reduce cooldownPeriod |
| Cold start too slow | Container startup time | Optimize image, add readiness probe |
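When a ScaledObject stays unready or inactive, the KEDA operator log usually names the failing trigger:

bash
# Inspect recent operator activity for trigger errors
kubectl logs -n keda deployment/keda-operator --tail=50

# And check the ScaledObject's own events
kubectl describe scaledobject task-api-scaledobject -n task-api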

Try With AI

Generate HPA Configuration

Ask your traffic-engineer skill:

text
Using my traffic-engineer skill, generate HPA configuration for my Task API:
- Target 60% CPU utilization
- Minimum 2 replicas (always available)
- Maximum 20 replicas
- Scale up quickly (double every 15 seconds)
- Scale down slowly (25% reduction per minute)

What you're learning: AI generates HPA with behavior configuration. Review the output—did AI include the behavior section with stabilization windows? Are the scaling policies correct for your requirements?

Evaluate and Refine

Check AI's output:

  • Does it use autoscaling/v2 (not v1)?
  • Is behavior.scaleUp configured for fast scaling?
  • Is behavior.scaleDown configured for gradual reduction?
  • Are stabilization windows appropriate?

If the scale-down is too aggressive, provide feedback:

text
The scale-down policy removes pods too quickly. Change to:
- 300 second stabilization window
- Maximum 25% reduction per 60 seconds

Add KEDA ScaledObject

Extend to event-driven scaling:

text
Now create a KEDA ScaledObject for the same Task API that:
- Scales based on the Prometheus metric: sum(rate(http_requests_total{app="task-api"}[2m]))
- Threshold: 100 requests per second
- Enables scale-to-zero with a 5-minute cooldown
- Maximum 20 replicas

What you're learning: AI generates KEDA configuration. Verify the Prometheus query is correct and the ScaledObject references the right deployment.

Validate Configuration

Before applying:

bash
# Validate YAML
kubectl apply --dry-run=client -f hpa.yaml
kubectl apply --dry-run=client -f scaledobject.yaml

# Check for conflicts (HPA and KEDA shouldn't target the same deployment)
kubectl get hpa -A
kubectl get scaledobject -A

This iteration—requirements, generation, validation, refinement—produces production-ready autoscaling configurations.

Safety Note

Autoscaling affects resource consumption and costs. Start with conservative settings (higher minReplicaCount, longer cooldownPeriod) and tune based on observed behavior. Monitor your cluster's node autoscaler to ensure it can provision nodes for scaled-up workloads. Test scale-to-zero behavior in staging before production—cold starts may impact user experience.