Envoy AI Gateway for LLM Traffic

Module 7 takes the agent you built in Module 6 and turns it into a production cloud service. You'll containerize the stack, orchestrate it on Kubernetes, automate delivery, and operate it with observability, security, and cost controls. The goal: a reliable Digital FTE that runs 24/7 for real users.

Prerequisites: Modules 4-6. You need a working agent service to deploy.

Your rate limiter allows 100 requests per minute. User A sends 100 requests, each with a simple "Hello" prompt consuming 10 tokens. User B sends 100 requests, each asking GPT-4 to "Write a comprehensive business plan with financial projections"—consuming 8,000 tokens per request. Both users stay within your request limit. But User A cost you $0.03 while User B cost you $240. Traditional rate limiting treats all requests equally. LLM traffic is not equal.

Envoy AI Gateway is purpose-built for this problem. Released as open source by Tetrate and Bloomberg in February 2025 and built on the CNCF's Envoy Gateway, it provides token-based rate limiting, provider fallback, and a unified API across LLM providers. This lesson teaches you to protect your AI services from cost overruns using the currency that actually matters: tokens.

By the end, you will configure token-based rate limits that enforce daily budgets, implement provider fallback chains that route to Anthropic when OpenAI rate limits are hit, and design cost control patterns that give each user and team their own token budget.


Why Traditional Gateways Fail for LLM Traffic

Standard API gateways count requests. LLM services charge tokens. This mismatch creates three problems:

| Problem | Traditional Gateway | AI Gateway |
| :--- | :--- | :--- |
| Cost unpredictability | 100 requests = 100 requests | 100 requests = 1,000 to 800,000 tokens |
| Fairness | All users get equal request quota | Heavy prompts consume disproportionate budget |
| Provider lock-in | Single backend per route | Automatic failover across providers |

The Token Economy

LLM pricing operates on tokens, not requests:

| Model | Input Token Cost | Output Token Cost | 100 Requests Cost Range |
| :--- | :--- | :--- | :--- |
| GPT-4o | $2.50/1M tokens | $10.00/1M tokens | $0.05 - $50 |
| Claude Sonnet 4 | $3.00/1M tokens | $15.00/1M tokens | $0.06 - $60 |
| GPT-4o-mini | $0.15/1M tokens | $0.60/1M tokens | $0.002 - $2 |

Key insight: A single GPT-4 request can cost 100x more than another. Request counting cannot capture this variance.
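
To make that concrete, here is a quick back-of-the-envelope check using the GPT-4o prices from the table above (the token counts are illustrative):

```python
# Cost of a single request at GPT-4o list prices
# ($2.50 input / $10.00 output per 1M tokens).
INPUT_PER_1M, OUTPUT_PER_1M = 2.50, 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, given its input and output token counts."""
    return input_tokens / 1e6 * INPUT_PER_1M + output_tokens / 1e6 * OUTPUT_PER_1M

cheap = request_cost(10, 20)        # a "Hello"-sized exchange
heavy = request_cost(1_000, 2_000)  # a long, generation-heavy request
print(f"${cheap:.6f} vs ${heavy:.4f} -> {heavy / cheap:.0f}x difference")
# $0.000225 vs $0.0225 -> 100x difference
```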


Envoy AI Gateway Architecture

Envoy AI Gateway extends Envoy Gateway with AI-specific capabilities. It sits between your applications and LLM providers, providing a unified API regardless of which provider handles the request.

Architecture Overview

```text
┌────────────────────────────────────────────────┐
│              Envoy AI Gateway                  │
│  ┌─────────────────────────────────────────┐   │
│  │ • Token counting                        │   │
│  │ • Rate limiting (tokens, not requests)  │   │
│  │ • Provider abstraction                  │   │
│  │ • Fallback routing                      │   │
│  └─────────────────────────────────────────┘   │
└────────────────────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        │               │               │
        ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│    OpenAI    │ │  Anthropic   │ │ AWS Bedrock  │
│     API      │ │     API      │ │     API      │
└──────────────┘ └──────────────┘ └──────────────┘
```

Core Components

| Component | Purpose |
| :--- | :--- |
| AIGatewayRoute | Defines routing rules to AI backends |
| LLMRequestCost | Configures token extraction and cost calculation |
| BackendTrafficPolicy | Applies token-based rate limits |
| AIBackend | Configures provider credentials and endpoints |

Unified API

Applications send requests to a single endpoint. AI Gateway translates between provider formats:

```bash
# Same request format works for any provider
curl -X POST $GATEWAY_URL/v1/chat/completions \
  -H "x-user-id: user123" \
  -H "x-ai-eg-model: gpt-4o" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

The x-ai-eg-model header specifies which model to use. AI Gateway routes to the appropriate provider and handles format translation.
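
Because the endpoint is OpenAI-compatible, standard client libraries can target the gateway directly. A minimal sketch using the openai Python SDK, assuming GATEWAY_URL is set in the environment and the gateway injects provider credentials (the client-side api_key is just a placeholder):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['GATEWAY_URL']}/v1",
    api_key="gateway-managed",  # placeholder: real provider keys live in the gateway
    default_headers={
        "x-user-id": "user123",     # drives per-user token budgets
        "x-ai-eg-model": "gpt-4o",  # tells AI Gateway which model/provider to route to
    },
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```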


Token-Based Rate Limiting

AI Gateway extracts token counts from LLM responses and uses them for rate limiting. The system supports four token types:

| Token Type | What It Counts | Use Case |
| :--- | :--- | :--- |
| InputToken | Prompt tokens | Control input costs |
| OutputToken | Response tokens | Control output costs |
| TotalToken | Input + Output | Overall budget control |
| CEL | Custom calculation | Weighted pricing models |

Configuring Token Extraction

First, configure AI Gateway to extract token usage from responses:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: LLMRequestCost
metadata:
  name: token-tracking
  namespace: ai-services
spec:
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
```

Apply the configuration:

```bash
kubectl apply -f token-tracking.yaml
```

Output:

```text
llmrequestcost.aigateway.envoyproxy.io/token-tracking created
```

AI Gateway automatically parses token counts from responses following the OpenAI schema format. For providers like AWS Bedrock, the gateway handles format translation automatically.
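
For reference, the usage block the gateway parses from an OpenAI-schema response looks like this (values illustrative):

```json
{
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 34,
    "total_tokens": 46
  }
}
```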

Custom Cost Calculation with CEL

Different models have different pricing. Use CEL expressions for accurate cost tracking:

```yaml
spec:
  llmRequestCosts:
    - metadataKey: llm_cost_cents
      type: CEL
      cel: "input_tokens * 0.25 + output_tokens * 1.0"
```

This weights output tokens at 4x input tokens, matching GPT-4o's pricing ratio. The multipliers themselves are illustrative: to report real cents, scale them to per-token prices (GPT-4o's $2.50/1M input is 0.00025 cents per token; its $10.00/1M output is 0.001 cents per token).
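
A quick Python check of the same weighting, scaled to those per-token cent prices (numbers derived from the pricing table above, not from the gateway):

```python
# GPT-4o list prices in cents per token:
#   $2.50/1M input   -> 0.00025 cents per input token
#   $10.00/1M output -> 0.001   cents per output token
def cost_cents(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * 0.00025 + output_tokens * 0.001

print(cost_cents(1_000_000, 0))  # 250.0 cents  = $2.50 for 1M input tokens
print(cost_cents(0, 1_000_000))  # 1000.0 cents = $10.00 for 1M output tokens
```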


Configuring Token Budgets Per User

Unlike request-based limits, token budgets reflect actual usage. A user who sends concise prompts consumes less budget than one who sends verbose requests.
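
Conceptually, the accounting works like this simplified sketch (a mental model in Python, not the gateway's actual implementation): each user's counter grows by tokens consumed, not by requests sent.

```python
from collections import defaultdict

class TokenBudget:
    """Simplified per-user token budget for one rate-limit window."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = defaultdict(int)  # user id -> tokens consumed this window

    def allow(self, user_id: str) -> bool:
        return self.used[user_id] < self.limit

    def record(self, user_id: str, total_tokens: int) -> None:
        # Charged after the response, once the real token count is known
        self.used[user_id] += total_tokens

budget = TokenBudget(100_000)
budget.record("user-a", 250)   # one small request
print(budget.allow("user-a"))  # True: 250 of 100,000 tokens used
```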

Basic Token Limit

Limit each user to 100,000 tokens per hour:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: token-budget-per-user
  namespace: ai-services
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 100000
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```

Key configuration points:

| Field | Value | Purpose |
| :--- | :--- | :--- |
| x-user-id: Distinct | Each user tracked separately | Per-user budgets |
| cost.request.number: 0 | Zero request cost | Only tokens count |
| cost.response.from: Metadata | Read token count from response | Actual usage tracking |

Apply and verify:

```bash
kubectl apply -f token-budget-per-user.yaml
kubectl get backendtrafficpolicy -n ai-services
```

Output:

```text
NAME                    AGE
token-budget-per-user   5s
```

Testing Token Limits

Send requests until budget exhausted:

```bash
# Each request consumes approximately 100 tokens
for i in {1..1500}; do
  response=$(curl -s -w "\n%{http_code}" \
    -H "x-user-id: test-user" \
    -H "x-ai-eg-model: gpt-4o-mini" \
    $GATEWAY_URL/v1/chat/completions \
    -d '{"messages": [{"role": "user", "content": "Say hello"}]}')
  status=$(echo "$response" | tail -1)
  if [ "$status" = "429" ]; then
    echo "Rate limited at request $i"
    break
  fi
done
```

Output:

```text
Rate limited at request 1024
```

The user hit their 100,000 token budget (approximately 100 tokens × 1,000 requests).


Model-Specific Token Limits

Different models have different costs. Apply stricter limits to expensive models:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: model-specific-limits
  namespace: ai-services
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
        # GPT-4o: Expensive, strict limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-4o
          limit:
            requests: 50000
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
        # GPT-4o-mini: Cheaper, higher limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-4o-mini
          limit:
            requests: 500000
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```

Result: Users get 50K tokens/hour for GPT-4o but 500K tokens/hour for GPT-4o-mini—reflecting the 10x price difference.


Provider Fallback Chains

When one provider hits rate limits or experiences downtime, AI Gateway can automatically route to alternatives. This provides resilience and cost optimization.

Configuring Multi-Provider Fallback

Route primarily to OpenAI, fall back to Anthropic:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-with-fallback
  namespace: ai-services
spec:
  rules:
    - matches:
        - headers:
            - name: x-ai-eg-model
              value: gpt-4o
      backendRefs:
        - name: openai-backend
          weight: 100
          priority: 1
        - name: anthropic-backend
          weight: 100
          priority: 2
```

How priority works:

```text
Request arrives with model: gpt-4o
        │
        ▼
Priority 1: Try OpenAI
        │
        ├── Success → Return response
        └── Failure (rate limit, timeout, error)
                │
                ▼
        Priority 2: Try Anthropic
                │
                ├── Success → Return response
                └── Failure → Return error to client
```
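
The gateway performs this failover for you; the sketch below merely restates the diagram's control flow in Python:

```python
def call_with_fallback(backends, send):
    """Try backends in priority order (non-empty list);
    `send` raises on rate limit, timeout, or error."""
    last_error = None
    for backend in backends:      # e.g., ["openai", "anthropic"]
        try:
            return send(backend)  # success: return the response
        except Exception as err:
            last_error = err      # failure: fall through to next priority
    raise last_error              # every backend failed
```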

Backend Configuration

Define credentials and endpoints for each provider:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIBackend
metadata:
  name: openai-backend
  namespace: ai-services
spec:
  provider: OpenAI
  auth:
    apiKeySecretRef:
      name: openai-credentials
      key: api-key
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIBackend
metadata:
  name: anthropic-backend
  namespace: ai-services
spec:
  provider: Anthropic
  auth:
    apiKeySecretRef:
      name: anthropic-credentials
      key: api-key
```

Store credentials securely:

```bash
kubectl create secret generic openai-credentials \
  --from-literal=api-key=$OPENAI_API_KEY \
  -n ai-services

kubectl create secret generic anthropic-credentials \
  --from-literal=api-key=$ANTHROPIC_API_KEY \
  -n ai-services
```

Output:

```text
secret/openai-credentials created
secret/anthropic-credentials created
```

Testing Fallback Behavior

Simulate OpenAI rate limiting:

```bash
# Exhaust OpenAI quota
for i in {1..100}; do
  curl -s -H "x-user-id: fallback-test" \
    -H "x-ai-eg-model: gpt-4o" \
    $GATEWAY_URL/v1/chat/completions \
    -d '{"messages": [{"role": "user", "content": "Test fallback"}]}'
done

# Check headers for routing info
curl -v -H "x-user-id: fallback-test" \
  -H "x-ai-eg-model: gpt-4o" \
  $GATEWAY_URL/v1/chat/completions \
  -d '{"messages": [{"role": "user", "content": "Which provider?"}]}' 2>&1 | grep x-ai-provider
```

Output (after fallback):

```text
< x-ai-provider: anthropic
```

The request was served by Anthropic after OpenAI reached its limit.


Cost Engineering Patterns

Effective AI cost control requires organizational-level policies, not just per-user limits.

Pattern 1: Team Budgets

Allocate monthly token budgets per team:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: team-budgets
  namespace: ai-services
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
        # Engineering team: 10M tokens/day
        - clientSelectors:
            - headers:
                - name: x-team-id
                  type: Exact
                  value: engineering
          limit:
            requests: 10000000
            unit: Day
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
        # Marketing team: 2M tokens/day
        - clientSelectors:
            - headers:
                - name: x-team-id
                  type: Exact
                  value: marketing
          limit:
            requests: 2000000
            unit: Day
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```

Pattern 2: Cost Tiers with Fallback

Route expensive requests to cheaper models when budget runs low:

| Budget Remaining | Routing Strategy |
| :--- | :--- |
| > 50% | GPT-4o (highest quality) |
| 20-50% | GPT-4o-mini (cost-efficient) |
| < 20% | Reject or queue |

This requires application-level logic to check remaining budget and adjust the x-ai-eg-model header accordingly.
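
Here is a hypothetical sketch of that logic; the choose_model function and its thresholds mirror the table above, and how you meter tokens_used is left to your own accounting:

```python
def choose_model(tokens_used: int, daily_budget_tokens: int) -> str | None:
    """Map remaining budget fraction to a model tier per the table above."""
    remaining = 1 - tokens_used / daily_budget_tokens
    if remaining > 0.50:
        return "gpt-4o"       # > 50% left: highest quality
    if remaining >= 0.20:
        return "gpt-4o-mini"  # 20-50% left: cost-efficient
    return None               # < 20% left: reject or queue

model = choose_model(tokens_used=5_000_000, daily_budget_tokens=8_000_000)
print(model)  # gpt-4o-mini (37.5% of the budget remains)
```

The returned model name is what the application places in the x-ai-eg-model header.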

Pattern 3: Daily Spending Caps

Convert token limits to dollar amounts:

| Daily Budget | GPT-4o Tokens | GPT-4o-mini Tokens |
| :--- | :--- | :--- |
| $10/day | ~800,000 | ~13,000,000 |
| $100/day | ~8,000,000 | ~130,000,000 |
| $1,000/day | ~80,000,000 | ~1,300,000,000 |

Set token limits that match your dollar budget.
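
The table's figures come from simple division; here is a small converter sketch (the blended per-million prices are rough assumptions, not official rates):

```python
def tokens_for_budget(daily_budget_usd: float, blended_usd_per_1m: float) -> int:
    """Token limit that spends roughly the given daily dollar budget."""
    return int(daily_budget_usd / blended_usd_per_1m * 1_000_000)

print(tokens_for_budget(10, 12.50))  # 800,000 tokens (GPT-4o-ish blend)
print(tokens_for_budget(10, 0.75))   # ~13,300,000 tokens (GPT-4o-mini-ish blend)
```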


Exercises

Exercise 1: Configure Token Tracking

Set up token extraction for your AI Gateway:

```bash
kubectl apply -f - <<EOF
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: LLMRequestCost
metadata:
  name: exercise-tokens
  namespace: default
spec:
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
EOF
```

Verify:

```bash
kubectl get llmrequestcost -n default
```

Expected Output:

```text
NAME              AGE
exercise-tokens   5s
```

Exercise 2: Create Per-User Token Budget

Limit users to 10,000 tokens per hour:

```bash
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: exercise-token-budget
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 10000
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
EOF
```

Test with requests:

```bash
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "x-user-id: exercise-user" \
  $GATEWAY_URL/v1/chat/completions \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

Expected Output:

```text
200
```

Exercise 3: Model-Specific Limits

Apply different limits for different models:

```bash
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: exercise-model-limits
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-4o
          limit:
            requests: 5000
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-4o-mini
          limit:
            requests: 100000
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
EOF
```

Verify:

```bash
kubectl get backendtrafficpolicy exercise-model-limits -o yaml | grep -A 5 "limit:"
```

Expected Output:

```text
limit:
  requests: 5000
  unit: Hour
...
limit:
  requests: 100000
  unit: Hour
```

Exercise 4: Configure Provider Fallback

Set up fallback from OpenAI to Anthropic:

```bash
kubectl apply -f - <<EOF
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: exercise-fallback
  namespace: default
spec:
  rules:
    - matches:
        - headers:
            - name: x-ai-eg-model
              value: gpt-4o
      backendRefs:
        - name: openai-backend
          priority: 1
        - name: anthropic-backend
          priority: 2
EOF
```

Verify:

```bash
kubectl get aigatewayroute exercise-fallback -o yaml | grep -A 10 "backendRefs:"
```

Expected Output:

```text
backendRefs:
- name: openai-backend
  priority: 1
- name: anthropic-backend
  priority: 2
```

Reflect on Your Skill

You built a traffic-engineer skill in Lesson 0. Based on what you learned about LLM traffic patterns:

Add AI Gateway Decision Logic

Your skill should now include:

| Question | If Yes | If No |
| :--- | :--- | :--- |
| Managing LLM/AI traffic? | Use Envoy AI Gateway | Use standard Envoy Gateway |
| Need token-based limits? | Configure LLMRequestCost + BackendTrafficPolicy | Use request-based limits |
| Multiple LLM providers? | Configure AIGatewayRoute with fallback | Single backend |
| Per-user cost control? | Add x-user-id header + Distinct selector | Global limits |

Add LLM Traffic Templates

Token budget template:

```yaml
# Template: token-budget
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: {{ service }}-token-budget
  namespace: {{ namespace }}
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: {{ route }}
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: {{ token_limit | default(100000) }}
            unit: {{ unit | default("Hour") }}
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```

Provider fallback template:

```yaml
# Template: provider-fallback
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: {{ route }}-fallback
  namespace: {{ namespace }}
spec:
  rules:
    - matches:
        - headers:
            - name: x-ai-eg-model
              value: {{ model }}
      backendRefs:
        - name: {{ primary_provider }}-backend
          priority: 1
        - name: {{ fallback_provider }}-backend
          priority: 2
```

Update Cost Calculation Guidance

| Model | Input Cost (per 1M) | Output Cost (per 1M) | Suggested Daily Limit ($10 budget) |
| :--- | :--- | :--- | :--- |
| GPT-4o | $2.50 | $10.00 | 800K tokens |
| GPT-4o-mini | $0.15 | $0.60 | 13M tokens |
| Claude Sonnet 4 | $3.00 | $15.00 | 650K tokens |

Try With AI

You want to configure AI Gateway for your Task API's LLM features. The API uses GPT-4o for complex reasoning and GPT-4o-mini for simple tasks. You have a $100/day budget to protect.

Ask your traffic-engineer skill:

Specification

Using my traffic-engineer skill, configure Envoy AI Gateway for my Task API:
  • Daily budget: $100 across all users
  • Per-user limit: 100,000 tokens/hour for GPT-4o, 500,000 for GPT-4o-mini
  • Fallback: Route to Anthropic when OpenAI rate limits hit
  • Track input and output tokens separately

Review AI's configuration. Check these specifics:

  • Does the LLMRequestCost resource extract both input and output tokens?
  • Are the BackendTrafficPolicy limits set with cost.request.number: 0 to count only tokens?
  • Does the AIGatewayRoute have proper priority settings for fallback?
  • Are the token limits realistic for your $100 budget?

If the token math seems off, provide your constraint:

Specification
$100/day with GPT-4o pricing ($2.50 input, $10 output per million) means roughly 8M total tokens. Please recalculate the per-user limits so that 10 users sharing equally get 800K tokens each.

Now extend to include model-specific routing:

Specification

Add routing logic:
  • Requests with "priority: high" header go to GPT-4o
  • All other requests go to GPT-4o-mini
  • Both models should fall back to Anthropic on failure

Verify the complete configuration before applying:

```bash
# Validate all resources
kubectl apply --dry-run=client -f ai-gateway-config.yaml

# Check for missing secrets
kubectl get secrets -n ai-services | grep credentials

# Verify route priorities
kubectl get aigatewayroute -o yaml | grep priority
```

Compare your first request to the final configuration. The initial approach likely missed either the cost calculation details or the proper header matching. Through iteration, you specified the budget constraint, the token-to-dollar conversion, and the routing requirements—producing a configuration that actually protects your $100 daily budget rather than just counting requests.

Safety Note

Token-based rate limiting requires the AI Gateway to parse LLM responses. Ensure your gateway has sufficient resources to handle this processing overhead. Start with conservative limits (lower than calculated) and adjust based on observed usage. Monitor x-ai-gateway-tokens-used response headers to verify token counting accuracy before enforcing strict limits.

Core Concept

Envoy AI Gateway provides token-based rate limiting for LLM traffic where request counting fails (100 requests can cost $0.03 or $240), along with unified API access across providers and automatic fallback when one provider is rate-limited.

Key Mental Models

  • Token vs Request Economics: GPT-4o request costs vary 100x based on prompt length; request counting cannot capture this
  • LLMRequestCost Extraction: Gateway parses token counts (InputToken, OutputToken, TotalToken) from LLM responses
  • Priority-Based Fallback: Route to OpenAI (priority 1), fall back to Anthropic (priority 2) on failure
  • CEL Cost Calculation: Custom expressions such as input_tokens * 0.25 + output_tokens * 1.0 weight tokens by pricing ratio (scale the multipliers to your provider's per-token prices)

Critical Patterns

  • Set cost.request.number: 0 so only tokens count, not requests
  • Use x-user-id: Distinct header selector for per-user token budgets
  • Apply stricter limits to expensive models (50K tokens for GPT-4o vs 500K for GPT-4o-mini)
  • Store provider credentials in Kubernetes Secrets referenced by AIBackend

AI Collaboration Keys

  • Configure LLMRequestCost for token extraction from LLM responses
  • Design BackendTrafficPolicy with token-based rate limits per user and model
  • Set up AIGatewayRoute with provider fallback chain

Common Mistakes

  • Using request-based limits for LLM traffic (fails to control costs)
  • Forgetting to configure LLMRequestCost before token-based limiting
  • Setting same token budget for expensive and cheap models

Connections

  • Builds on: Lesson 10 (Resilience Patterns) for general traffic protection
  • Leads to: Lesson 12 (Capstone) for complete traffic engineering integration
