Rate Limiting & Circuit Breaking

Your API is open to the world. Without rate limiting, one bad actor can ruin it for everyone: a single user hammering your endpoints exhausts database connections, starves other users, and eventually brings the service down. For AI agents, the stakes are higher because LLM calls cost money. A runaway loop hitting your GPT-4 endpoint can generate a $10,000 surprise bill in hours.

BackendTrafficPolicy is Envoy Gateway's extension for protecting services. It controls how traffic flows to your backends—limiting request rates, breaking circuits when services fail, and retrying transient errors. This lesson teaches you to configure these protections so your Task API survives abuse, controls costs, and recovers gracefully from failures.

By the end, you will protect your services with rate limits, configure per-user quotas using headers, implement circuit breakers that exclude failing backends, and understand when to use local versus global rate limiting.


Understanding BackendTrafficPolicy

BackendTrafficPolicy is an Envoy Gateway extension CRD that configures traffic behavior between Envoy proxies and your backend services. While HTTPRoute controls where traffic goes, BackendTrafficPolicy controls how that traffic behaves.

What BackendTrafficPolicy Controls

| Feature | Purpose |
|---|---|
| Rate Limiting | Limit requests per time unit |
| Circuit Breaker | Stop sending to failing backends |
| Retry Policy | Automatically retry failed requests |
| Timeouts | Fail requests that take too long |
| Load Balancing | Control backend selection strategy |

How Policy Targeting Works

BackendTrafficPolicy uses targetRefs to specify which resources it applies to:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: my-policy
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
```

Target options:

| Target Kind | Effect |
|---|---|
| HTTPRoute | Policy applies to that specific route |
| Gateway | Policy applies to all routes through that Gateway |
| GRPCRoute | Policy applies to gRPC traffic |

Important constraint: A BackendTrafficPolicy can only target resources in the same namespace as the policy itself.
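
After applying a policy, confirm that Envoy Gateway accepted it before testing behavior. A minimal check, assuming the policy reports standard Gateway API status conditions (the exact status layout can vary by Envoy Gateway version):

```bash
# Look for an Accepted=True condition in the policy status
kubectl get backendtrafficpolicy my-policy -n task-api -o yaml | grep -B 2 -A 4 "conditions:"
```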


Local Rate Limiting

Local rate limiting applies limits per Envoy proxy instance. If you have 3 proxy replicas and set a limit of 100 requests/minute, each replica allows 100 requests/minute independently—total cluster capacity is 300 requests/minute.

Basic Local Rate Limit

Apply a simple rate limit to all requests:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-ratelimit
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    local:
      rules:
        - limit:
            requests: 100
            unit: Minute
```

Apply and test:

```bash
kubectl apply -f task-api-ratelimit.yaml
```

Output:

```text
backendtrafficpolicy.gateway.envoyproxy.io/task-api-ratelimit created
```

Generate load to test the limit:

```bash
for i in {1..120}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/tasks
done | sort | uniq -c
```

Output:

```text
    100 200
     20 429
```

The first 100 requests succeed (200). Requests 101-120 are rate limited (429 Too Many Requests).
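
To see more than the status code, inspect a single over-limit response directly. Whether Envoy includes rate-limit headers in the 429 depends on your configuration, so treat the exact headers as version-specific:

```bash
# With the quota already exhausted, show the status line and response headers
curl -i http://localhost:8080/api/tasks
```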

Understanding Rate Limit Units

The unit field accepts these values:

| Unit | Meaning |
|---|---|
| Second | Requests per second |
| Minute | Requests per minute |
| Hour | Requests per hour |

Choose based on your use case:

  • API endpoints: Minute (smooth traffic over time)
  • Health checks: Second (low volume, quick reset)
  • Expensive operations: Hour (strict hourly quotas)

Per-User Rate Limits

Global rate limits protect your infrastructure, but they punish all users equally. When one user hits the limit, everyone gets blocked. Per-user rate limits give each user their own quota.

Using Header-Based Descriptors

Rate limit based on the x-user-id header:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-per-user
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    local:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "*"
          limit:
            requests: 50
            unit: Minute
```

> [!NOTE]
> The value: "*" matches any value of the header. Each unique header value gets its own rate limit bucket.

Test with different users:

```bash
# User "alice" makes 60 requests
for i in {1..60}; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -H "x-user-id: alice" http://localhost:8080/api/tasks
done | tail -15

# User "bob" makes 60 requests (independent quota)
for i in {1..60}; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -H "x-user-id: bob" http://localhost:8080/api/tasks
done | tail -15
```

Output for alice:

```text
200
200
200
429
429
...
```

Output for bob:

```text
200
200
200
200
200
...
```

Alice hits her 50-request limit. Bob's quota is independent—he can still make requests.

Matching Specific Header Values

Rate limit a specific user more aggressively:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: heavy-user-limit
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    local:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "heavy-user-123"
          limit:
            requests: 10
            unit: Minute
```

This limits user heavy-user-123 to 10 requests/minute while other users get the default limit.
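
In practice you would usually carry both rules in one policy rather than two competing policies. A sketch under that assumption (the policy name is hypothetical; verify rule precedence against your Envoy Gateway version before relying on it):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-mixed-limits   # hypothetical name
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    local:
      rules:
        # Specific heavy user: stricter limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "heavy-user-123"
          limit:
            requests: 10
            unit: Minute
        # Any other user id: default limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "*"
          limit:
            requests: 50
            unit: Minute
```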


Anonymous vs Authenticated Users

Anonymous users (no x-user-id header) should get lower limits than authenticated users. Use the invert field to match requests without a header.

Different Limits by Authentication Status

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-tiered
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    local:
      rules:
        # Authenticated users: 100 requests/minute
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "*"
          limit:
            requests: 100
            unit: Minute
        # Anonymous users (no x-user-id header): 10 requests/minute
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "*"
                  invert: true
          limit:
            requests: 10
            unit: Minute
```

Test anonymous access:

```bash
# No x-user-id header
for i in {1..15}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/tasks
done
```

Output:

```text
200
200
200
200
200
200
200
200
200
200
429
429
429
429
429
```

Anonymous users hit the limit at 10 requests. Authenticated users get 100.


Global Rate Limiting

Local rate limiting has a limitation: limits are per proxy instance. With 5 replicas at 100 requests/minute each, your actual cluster limit is 500 requests/minute. If you need strict organization-wide quotas, use global rate limiting.

How Global Rate Limiting Works

(Diagram: several Envoy proxy replicas all checking and incrementing a shared rate-limit counter in Redis.)

All proxies query the same Redis instance. When one proxy increments the counter, all proxies see the updated value.

Configuring Global Rate Limiting

First, configure Envoy Gateway to use Redis:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-gateway-config
  namespace: envoy-gateway-system
data:
  envoy-gateway.yaml: |
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyGateway
    rateLimit:
      backend:
        type: Redis
        redis:
          url: redis.redis-system.svc.cluster.local:6379
```
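
The url above assumes a Redis instance is already reachable in the cluster. If you need one for experimentation, a minimal non-production sketch (single replica, no persistence or authentication; names chosen to match the URL in the ConfigMap):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: redis-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: redis-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: redis-system
spec:
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
```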

Then create a BackendTrafficPolicy with global rate limiting:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-global
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "*"
          limit:
            requests: 100
            unit: Minute
```

> [!IMPORTANT]
> Key difference: Use rateLimit.global instead of rateLimit.local.

When to Use Global vs Local

| Scenario | Recommendation |
|---|---|
| Development/testing | Local (simpler, no Redis needed) |
| Strict API quotas | Global (exact limits across cluster) |
| High availability priority | Local (no Redis dependency) |
| Billing-based limits | Global (accurate usage tracking) |
| Cost protection for LLM APIs | Global (prevent runaway spending) |

Circuit Breaker Pattern

Rate limiting protects against too many requests. Circuit breakers protect against failing backends. When a backend becomes unhealthy, the circuit breaker stops sending traffic—preventing cascade failures and giving the backend time to recover.

How Circuit Breakers Work

| State | Condition | Action |
|---|---|---|
| Circuit Closed | Normal operations; backend healthy | Requests forwarded to backend |
| Circuit Open | Failure threshold exceeded | Requests fail fast with 503; backend excluded |
| Recovery | Timeout passed; probing health | Limited requests allowed to verify health |

Configuring Circuit Breaker

Limit concurrent connections and pending requests:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-circuitbreaker
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  circuitBreaker:
    maxConnections: 100
    maxParallelRequests: 50
    maxPendingRequests: 10
```

Field meanings:

| Field | Description | When Exceeded |
|---|---|---|
| maxConnections | Maximum TCP connections to backend | New connections rejected |
| maxParallelRequests | Maximum concurrent HTTP requests | Requests wait or fail |
| maxPendingRequests | Maximum queued requests | Returns 503 immediately |

Testing Circuit Breaker Behavior

Use hey to generate concurrent load:

```bash
# Install hey load testing tool
go install github.com/rakyll/hey@latest

# Generate 100 concurrent requests with 10 second delay
hey -n 100 -c 100 -host "www.example.com" \
  http://localhost:8080/api/tasks?delay=10s
```

Output (with circuit breaker at maxParallelRequests=10):

```text
Summary:
  Total:    10.5 secs
  Slowest:  10.2 secs
  Fastest:  0.001 secs
  Average:  1.0 secs

Status code distribution:
  [200] 10 responses
  [503] 90 responses
```

Only 10 requests reached the backend (matching maxParallelRequests). The other 90 failed fast with 503—protecting your backend from overload.

Circuit Breaker Sizing

Envoy's default thresholds (1024 connections, 1024 pending requests) rarely match a specific workload. Size based on your backend's capacity:

| Backend Type | Recommended maxConnections | Recommended maxParallelRequests |
|---|---|---|
| Single container | 50-100 | 25-50 |
| Replicated (3 pods) | 150-300 | 75-150 |
| Database-backed API | 20-50 | 10-25 |
| LLM inference endpoint | 5-20 | 5-10 |
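
Applying the last row, a sketch of a conservative policy for an expensive inference route (the policy name is hypothetical; llm-inference-route reappears later in this lesson):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-circuitbreaker   # hypothetical name
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-inference-route
  circuitBreaker:
    maxConnections: 10        # within the 5-20 band above
    maxParallelRequests: 5    # within the 5-10 band above
    maxPendingRequests: 2     # fail fast rather than queue expensive calls
```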

Retry Policy

Transient failures happen—network blips, brief pod restarts, temporary overload. Retry policies automatically retry failed requests so clients do not see every hiccup.

Configuring Automatic Retries

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-retry
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  retry:
    numRetries: 3
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 1s
      timeout: 500ms
    retryOn:
      httpStatusCodes:
        - 500
        - 502
        - 503
        - 504
      triggers:
        - connect-failure
        - retriable-status-codes
```

Field meanings:

| Field | Description |
|---|---|
| numRetries | Maximum retry attempts |
| perRetry.backOff.baseInterval | Initial delay between retries |
| perRetry.backOff.maxInterval | Maximum delay (exponential backoff caps here) |
| perRetry.timeout | Timeout for each individual retry attempt |
| retryOn.httpStatusCodes | Which status codes trigger a retry |
| retryOn.triggers | Which failure types trigger a retry |

Retry Triggers

| Trigger | Meaning |
|---|---|
| connect-failure | TCP connection failed |
| retriable-status-codes | Status code matched the httpStatusCodes list |
| reset | Connection reset by backend |
| refused-stream | HTTP/2 stream refused |

Retry Safety

> [!CAUTION]
> Only retry idempotent operations. Retrying a POST that creates a record may create duplicates. Configure retry policies on read-heavy routes, not write routes.

| HTTP Method | Safe to Retry | Reason |
|---|---|---|
| GET | Yes | Read-only; no state change |
| HEAD | Yes | Read-only |
| PUT | Yes | Idempotent full replacement |
| POST | No | May create duplicate records |
| DELETE | No | May fail on second attempt |
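
One way to honor the caution above is to give read traffic its own route and attach the retry policy only there. A minimal sketch, assuming your Gateway API version supports method matching; the route, gateway, and backend names and port are illustrative:

```yaml
# GET-only route so retries never touch writes (names are hypothetical)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: task-api-read-route
  namespace: task-api
spec:
  parentRefs:
    - name: task-api-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api/tasks
          method: GET
      backendRefs:
        - name: task-api
          port: 8000
---
# Retry policy scoped to the read-only route
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-read-retry
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-read-route
  retry:
    numRetries: 3
    retryOn:
      triggers:
        - connect-failure
```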

Combining Policies

You can combine rate limiting, circuit breaking, and retries in a single BackendTrafficPolicy:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: task-api-comprehensive
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  # Rate limiting
  rateLimit:
    local:
      rules:
        - limit:
            requests: 100
            unit: Minute
  # Circuit breaker
  circuitBreaker:
    maxConnections: 100
    maxParallelRequests: 50
    maxPendingRequests: 10
  # Retry policy
  retry:
    numRetries: 2
    perRetry:
      backOff:
        baseInterval: 50ms
        maxInterval: 500ms
      timeout: 250ms
    retryOn:
      httpStatusCodes:
        - 503
      triggers:
        - connect-failure
```

Order of evaluation:

  1. Rate limit check (429 if exceeded)
  2. Circuit breaker check (503 if circuit open)
  3. Forward to backend
  4. Retry on failure (if policy matches)

Policy Merging

When policies target different levels (Gateway vs HTTPRoute), they merge with specific precedence.

Merging Behavior

| Policy Level | Precedence | Use Case |
|---|---|---|
| HTTPRoute | Higher (overrides) | Route-specific settings |
| Gateway | Lower (defaults) | Cluster-wide defaults |

Example: Default plus override

Gateway-level default (applies to all routes):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: gateway-defaults
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: task-api-gateway
  rateLimit:
    local:
      rules:
        - limit:
            requests: 100
            unit: Minute
```

HTTPRoute-level override (applies only to expensive operations):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-route-override
  namespace: task-api
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-inference-route
  rateLimit:
    local:
      rules:
        - limit:
            requests: 10
            unit: Minute
```

The LLM inference route gets 10 requests/minute (override). All other routes get 100 requests/minute (default).


Observing Rate Limiting

Monitor rate limiting effectiveness with Prometheus metrics.

Key Metrics

| Metric | Meaning |
|---|---|
| envoy_http_ratelimit_over_limit | Requests that exceeded the rate limit |
| envoy_cluster_circuit_breakers_cx_open | Circuit breaker open (connections) |
| envoy_cluster_circuit_breakers_rq_pending_open | Circuit breaker open (pending requests) |

Prometheus Query Examples

Rate limited requests per route:

```promql
sum(rate(envoy_http_ratelimit_over_limit[5m])) by (route_name)
```

Circuit breaker activations:

```promql
sum(rate(envoy_cluster_circuit_breakers_cx_open[5m])) by (envoy_cluster_name)
```

Grafana Dashboard Panel

Create a panel showing rate limit vs successful requests:

```promql
# Successful requests
sum(rate(envoy_http_downstream_rq_completed{response_code="200"}[5m]))

# Rate limited requests
sum(rate(envoy_http_ratelimit_over_limit[5m]))
```
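
A useful derived panel is the share of traffic being rejected. A sketch, assuming envoy_http_downstream_rq_total is exported in your setup:

```promql
# Fraction of requests rejected by rate limiting (0 to 1)
sum(rate(envoy_http_ratelimit_over_limit[5m]))
/
sum(rate(envoy_http_downstream_rq_total[5m]))
```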

Exercises

Exercise 1: Apply Basic Rate Limit

Apply a rate limit and observe 429 responses:

```bash
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: exercise-ratelimit
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    local:
      rules:
        - limit:
            requests: 5
            unit: Minute
EOF
```

Test:

```bash
for i in {1..10}; do
  curl -s -o /dev/null -w "%{http_code} " http://localhost:8080/api/tasks
done
echo
```

Expected Output:

```text
200 200 200 200 200 429 429 429 429 429
```

Exercise 2: Per-User Rate Limits

Add per-user limits using x-user-id header:

```bash
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: exercise-peruser
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  rateLimit:
    local:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  value: "*"
          limit:
            requests: 3
            unit: Minute
EOF
```

Test with different users:

```bash
# User alice
for i in {1..5}; do
  curl -s -o /dev/null -w "%{http_code} " \
    -H "x-user-id: alice" http://localhost:8080/api/tasks
done
echo "alice"

# User bob (independent quota)
for i in {1..5}; do
  curl -s -o /dev/null -w "%{http_code} " \
    -H "x-user-id: bob" http://localhost:8080/api/tasks
done
echo "bob"
```

Expected Output:

```text
200 200 200 429 429 alice
200 200 200 429 429 bob
```

Exercise 3: Circuit Breaker Under Load

Configure circuit breaker and observe 503 responses under load:

```bash
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: exercise-circuit
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: task-api-route
  circuitBreaker:
    maxParallelRequests: 5
    maxPendingRequests: 0
EOF
```

Generate concurrent load:

```bash
# Run 20 concurrent requests
for i in {1..20}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/tasks?delay=1s &
done
wait
```

Expected: Some 200 responses (up to maxParallelRequests), rest 503.

Exercise 4: View Rate Limit Metrics

Query Prometheus for rate limiting data:

```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &

# Query rate limited requests
curl -s "http://localhost:9090/api/v1/query?query=envoy_http_ratelimit_over_limit" | jq
```

Reflect on Your Skill

You built a traffic-engineer skill in Lesson 0. Based on what you learned about rate limiting and circuit breaking:

Add Rate Limiting Decision Logic

Your skill should ask:

| Question | If Yes | If No |
|---|---|---|
| Need strict cluster-wide quota? | Use global (Redis) | Use local |
| Different user tiers? | Add per-user limits with headers | Use a uniform limit |
| Protecting an expensive LLM endpoint? | Aggressive limits (10/min) | Standard limits (100/min) |

Add Policy Templates

Local rate limiting template:

```yaml
# Template: local-ratelimit
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: {{ service }}-ratelimit
  namespace: {{ namespace }}
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: {{ route }}
  rateLimit:
    local:
      rules:
        - limit:
            requests: {{ requests }}
            unit: {{ unit }}
```

Circuit breaker template:

```yaml
# Template: circuit-breaker
circuitBreaker:
  maxConnections: {{ max_connections }}
  maxParallelRequests: {{ max_parallel }}
  maxPendingRequests: {{ max_pending }}
```
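
Since this lesson also covered retries, the skill could carry a matching snippet. A sketch in the same template style (placeholder names are illustrative):

```yaml
# Template: retry-policy
retry:
  numRetries: {{ num_retries }}
  perRetry:
    backOff:
      baseInterval: {{ base_interval }}
      maxInterval: {{ max_interval }}
    timeout: {{ per_retry_timeout }}
  retryOn:
    httpStatusCodes:
      - 503
    triggers:
      - connect-failure
```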

Update Troubleshooting Guidance

| Symptom | Likely Cause | Fix |
|---|---|---|
| 429 on all requests | Rate limit too low | Increase the limit |
| 503 under normal load | Circuit breaker too aggressive | Increase maxParallelRequests |
| No rate limiting | Policy not attached | Check targetRefs |
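
For the last row, a quick cross-check is to compare the policy's targetRefs against the routes that actually exist:

```bash
# targetRefs.name in the policy must match an HTTPRoute in the same namespace
kubectl get backendtrafficpolicy -n task-api
kubectl get httproute -n task-api

# Inspect targetRefs and status conditions on a specific policy
kubectl describe backendtrafficpolicy task-api-ratelimit -n task-api
```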

Try With AI

Generate Rate Limiting Configuration

Ask your traffic-engineer skill to generate configuration:

```text
Using my traffic-engineer skill, generate BackendTrafficPolicy for my Task API:
- 100 requests/minute per user (based on x-user-id header)
- 10 requests/minute for anonymous users (no header)
- Apply to the task-api-route HTTPRoute in task-api namespace
```

What you're learning: AI generates multi-rule rate limiting. Review the output—did AI use invert: true correctly for anonymous users? Are both rules in the same policy?

Evaluate the Configuration

Check AI's output:

  • Does it use clientSelectors with headers?
  • Is the invert: true field on the anonymous user rule?
  • Are the limits in the correct order (authenticated first, anonymous second)?

If something is missing, provide feedback:

```text
The anonymous user rule needs invert: true on the x-user-id header to match requests WITHOUT that header. Please fix.
```

Add Circuit Breaker Protection

Extend the configuration:

```text
Add circuit breaker protection to this policy:
- Maximum 50 concurrent connections
- Maximum 25 parallel requests
- Maximum 5 pending requests (fail fast after that)
```

What you're learning: AI adapts existing configurations. Verify the circuit breaker fields are correct and added to the same BackendTrafficPolicy resource.

Validate Before Applying

Before applying AI's configuration:

```bash
# Validate YAML syntax
kubectl apply --dry-run=client -f policy.yaml

# Check for common errors
kubectl apply --dry-run=server -f policy.yaml
```

This iteration—specifying requirements, evaluating output, refining with constraints—builds production configurations safely.

Safety Note

Rate limiting and circuit breakers affect all traffic to your service. Test in development before production. Start with generous rate limits and circuit breaker thresholds; you can always tighten them based on observed behavior. Monitor the envoy_http_ratelimit_over_limit metric to ensure you are not blocking legitimate traffic.