Module 7 takes the agent you built in Module 6 and turns it into a production cloud service. You'll containerize the stack, orchestrate it on Kubernetes, automate delivery, and operate it with observability, security, and cost controls. The goal: a reliable Digital FTE that runs 24/7 for real users.
Prerequisites: Modules 4-6. You need a working agent service to deploy.
Your rate limiter allows 100 requests per minute. User A sends 100 requests, each with a simple "Hello" prompt consuming 10 tokens. User B sends 100 requests, each asking GPT-4 to "Write a comprehensive business plan with financial projections"—consuming 8,000 tokens per request. Both users stay within your request limit. But User A cost you $0.03 while User B cost you $240. Traditional rate limiting treats all requests equally. LLM traffic is not equal.
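The arithmetic behind that gap is worth making explicit. A minimal sketch using an illustrative flat per-token price (real pricing charges more for output tokens than input tokens, which is how the example above reaches $240 rather than $24):

```python
# Illustrative only: per-token prices vary by model and change over time.
INPUT_PRICE_PER_1K = 0.03   # assumed GPT-4-class input price, $/1K tokens

user_a_tokens = 100 * 10      # 100 requests x 10 tokens each
user_b_tokens = 100 * 8_000   # 100 requests x 8,000 tokens each

cost_a = user_a_tokens / 1_000 * INPUT_PRICE_PER_1K
cost_b = user_b_tokens / 1_000 * INPUT_PRICE_PER_1K

print(f"User A: {user_a_tokens:>7} tokens -> ${cost_a:.2f}")
print(f"User B: {user_b_tokens:>7} tokens -> ${cost_b:.2f}")
print(f"Cost ratio: {cost_b / cost_a:.0f}x, under identical request counts")
```

Even at a flat price the cost ratio is 800x; weighting output tokens higher, as real pricing does, widens it further.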
Envoy AI Gateway is purpose-built for this problem. Released as open source by Tetrate and Bloomberg in February 2025 and backed by the CNCF, it provides token-based rate limiting, provider fallback, and unified access across LLM providers. This lesson teaches you to protect your AI services from cost overruns using the currency that actually matters: tokens.
By the end, you will configure token-based rate limits that enforce daily budgets, implement provider fallback chains that route to Anthropic when OpenAI rate limits are hit, and design cost control patterns that give each user and team their own token budget.
Standard API gateways count requests. LLM services charge tokens. This mismatch creates three problems:
LLM pricing operates on tokens, not requests:
Key insight: A single GPT-4 request can cost hundreds or even thousands of times more than another. Request counting cannot capture this variance.
Envoy AI Gateway extends Envoy Gateway with AI-specific capabilities. It sits between your applications and LLM providers, providing a unified API regardless of which provider handles the request.
| Component | Purpose |
| :--- | :--- |
| AIGatewayRoute | Defines routing rules to AI backends |
| LLMRequestCost | Configures token extraction and cost calculation |
| BackendTrafficPolicy | Applies token-based rate limits |
| AIBackend | Configures provider credentials and endpoints |
Applications send requests to a single endpoint. AI Gateway translates between provider formats:
The `x-ai-eg-model` header specifies which model to use. AI Gateway routes to the appropriate provider and handles format translation.
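For example, an OpenAI-style chat completion sent through the gateway might look like this (`GATEWAY_HOST` is a placeholder for your gateway's external address):

```shell
curl -s http://$GATEWAY_HOST/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-ai-eg-model: gpt-4o-mini" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

The same request body works whether the model header routes to OpenAI, Anthropic, or Bedrock; the gateway translates formats behind the scenes.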
AI Gateway extracts token counts from LLM responses and uses them for rate limiting. The system supports four token types:
First, configure AI Gateway to extract token usage from responses:
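A sketch of that configuration against Envoy AI Gateway's `v1alpha1` API (resource and gateway names are placeholders; verify field names against the release you run):

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
  namespace: default
spec:
  targetRefs:
    - name: envoy-ai-gateway        # placeholder Gateway name
      kind: Gateway
      group: gateway.networking.k8s.io
  schema:
    name: OpenAI                    # clients speak the OpenAI-compatible API
  llmRequestCosts:
    - metadataKey: llm_total_tokens # key other policies use to read the count
      type: TotalToken              # other types: InputToken, OutputToken, CEL
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: gpt-4o
      backendRefs:
        - name: openai-backend      # placeholder AIBackend name
```

The `metadataKey` is the handle that rate-limit policies reference later, so keep it consistent across resources.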
Apply the configuration:
Output:
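Assuming the manifest is saved as `token-costs.yaml` (a placeholder name):

```shell
kubectl apply -f token-costs.yaml
# Typical confirmation (resource name depends on your manifest):
# aigatewayroute.aigateway.envoyproxy.io/llm-route created
```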
AI Gateway automatically parses token counts from responses following the OpenAI schema format. For providers like AWS Bedrock, the gateway handles format translation automatically.
Different models have different pricing. Use CEL expressions for accurate cost tracking:
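One possible shape for a CEL-based cost, producing cents with output weighted 4x input and assuming GPT-4o-style prices of $2.50/M input and $10/M output (the exposed variable names, here `input_tokens` and `output_tokens`, are assumptions to confirm against your gateway's LLMRequestCost documentation):

```yaml
llmRequestCosts:
  - metadataKey: llm_cost_cents
    type: CEL
    cel: "(input_tokens * 25 + output_tokens * 100) / 100000"
    # 1M input tokens  -> 250 cents ($2.50)
    # 1M output tokens -> 1000 cents ($10.00), i.e. 4x the input weight
    # Integer CEL math truncates small requests; scale the weights up for finer resolution.
```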
This calculates cost in cents where output tokens cost 4x input tokens—matching GPT-4o pricing ratios.
Unlike request-based limits, token budgets reflect actual usage. A user who sends concise prompts consumes less budget than one who sends verbose requests.
Limit each user to 100,000 tokens per hour:
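A sketch using Envoy Gateway's BackendTrafficPolicy with a response-metadata cost (field shapes follow recent Envoy Gateway APIs; the `x-user-id` header and the metadata namespace and key are assumptions to adapt):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: per-user-token-budget
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-route                # placeholder: the route serving LLM traffic
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id    # assumed identity header; Distinct = one bucket per value
                  type: Distinct
          limit:
            requests: 100000         # interpreted as a token budget via the cost block
            unit: Hour
          cost:
            response:
              from: Metadata         # charge tokens counted from the LLM response
              metadata:
                namespace: io.envoy.ai_gateway   # assumed dynamic-metadata namespace
                key: llm_total_tokens            # must match your LLMRequestCost metadataKey
```

The `limit.requests` number is the budget; the `cost` block makes each request consume as many units as its response's token count rather than one.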
Key configuration points:
Apply and verify:
Output:
Send requests until budget exhausted:
Output:
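One way to exercise the limit (`GATEWAY_HOST` and the identity header are placeholders):

```shell
# Each request in this scenario consumes roughly 100 tokens.
for i in $(seq 1 1100); do
  curl -s -o /dev/null -w "%{http_code}\n" http://$GATEWAY_HOST/v1/chat/completions \
    -H "x-user-id: user-a" \
    -H "x-ai-eg-model: gpt-4o-mini" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}'
done
# Expect 200s until the 100,000-token budget is spent, then 429 Too Many Requests.
```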
The user hit their 100,000 token budget (approximately 100 tokens × 1,000 requests).
Different models have different costs. Apply stricter limits to expensive models:
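A sketch of one policy with two rules keyed on the model header (same assumed field shapes and metadata key as the per-user policy above):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: model-token-budgets
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: llm-route                  # placeholder route name
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-4o
                - name: x-user-id      # assumed identity header
                  type: Distinct
          limit:
            requests: 50000            # 50K tokens/hour for the expensive model
            unit: Hour
          cost:
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_tokens
        - clientSelectors:
            - headers:
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-4o-mini
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 500000           # 500K tokens/hour for the cheaper model
            unit: Hour
          cost:
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_tokens
```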
Result: Users get 50K tokens/hour for GPT-4o but 500K tokens/hour for GPT-4o-mini—reflecting the 10x price difference.
When one provider hits rate limits or experiences downtime, AI Gateway can automatically route to alternatives. This provides resilience and cost optimization.
Route primarily to OpenAI, fall back to Anthropic:
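A sketch of the routing rule (priority semantics assumed: lower values are preferred, higher values serve as fallback; confirm against your gateway version):

```yaml
rules:
  - matches:
      - headers:
          - type: Exact
            name: x-ai-eg-model
            value: gpt-4o
    backendRefs:
      - name: openai-backend       # primary: tried first
        priority: 0
      - name: anthropic-backend    # fallback: used when the primary is unavailable
        priority: 1
```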
How priority works: backends with the lowest priority value receive traffic first; the gateway shifts to the next priority tier only when the preferred backend is unavailable or rate-limited.
Define credentials and endpoints for each provider:
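A sketch of the two backends. In recent Envoy AI Gateway releases the resource is named AIServiceBackend, corresponding to the AIBackend component listed earlier; schema names and the `backendRef` wiring should be checked against your version:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: openai-backend
spec:
  schema:
    name: OpenAI
  backendRef:
    name: openai                  # placeholder Backend pointing at api.openai.com
    kind: Backend
    group: gateway.envoyproxy.io
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: anthropic-backend
spec:
  schema:
    name: Anthropic               # schema name assumed; check supported values
  backendRef:
    name: anthropic               # placeholder Backend pointing at api.anthropic.com
    kind: Backend
    group: gateway.envoyproxy.io
```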
Store credentials securely:
Output:
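For example, with API keys already exported in your shell (the secret and key names are assumptions; in Envoy AI Gateway the secrets are typically referenced from a BackendSecurityPolicy):

```shell
kubectl create secret generic openai-credentials \
  --from-literal=apiKey="$OPENAI_API_KEY"
# secret/openai-credentials created
kubectl create secret generic anthropic-credentials \
  --from-literal=apiKey="$ANTHROPIC_API_KEY"
# secret/anthropic-credentials created
```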
Simulate OpenAI rate limiting:
Output (after fallback):
The request was served by Anthropic after OpenAI reached its limit.
Effective AI cost control requires organizational-level policies, not just per-user limits.
Allocate monthly token budgets per team:
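Envoy Gateway's rate-limit units go up to Day (no Month unit at the time of writing), so a monthly team budget is usually enforced as a daily slice, e.g. 30M tokens/month as roughly 1M tokens/day. A sketch (same assumed cost fields as earlier; `x-team-id` is an assumed header):

```yaml
rules:
  - clientSelectors:
      - headers:
          - name: x-team-id        # assumed team-identity header
            type: Distinct
    limit:
      requests: 1000000            # ~1M tokens/day per team (~30M/month)
      unit: Day
    cost:
      response:
        from: Metadata
        metadata:
          namespace: io.envoy.ai_gateway
          key: llm_total_tokens
```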
Route expensive requests to cheaper models when budget runs low:
This requires application-level logic to check the remaining budget and adjust the `x-ai-eg-model` header accordingly.
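A hypothetical sketch of that application-level logic; the function name, threshold, and budget source are all assumptions:

```python
def choose_model(tokens_remaining: int, premium_threshold: int = 100_000) -> str:
    """Route to the expensive model while the budget is healthy,
    degrade to the cheaper model as the budget nears exhaustion."""
    return "gpt-4o" if tokens_remaining > premium_threshold else "gpt-4o-mini"

# Budget running low: the request is tagged for the cheaper model.
headers = {"x-ai-eg-model": choose_model(tokens_remaining=50_000)}
print(headers)
```

In practice `tokens_remaining` would come from your own usage tracking, since the gateway enforces budgets but does not expose per-user balances to clients by default.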
Convert token limits to dollar amounts:
Set token limits that match your dollar budget.
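The conversion can be sketched as simple arithmetic. Prices below are illustrative GPT-4o figures in USD per million tokens; real prices vary and change, so substitute your own:

```python
# Illustrative prices (USD per 1M tokens); check your provider's current pricing.
PRICE_PER_M = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def tokens_for_budget(budget_usd: float, model: str, output_ratio: float = 0.5) -> int:
    """How many total tokens a dollar budget buys, given an assumed input/output mix."""
    p = PRICE_PER_M[model]
    blended = p["input"] * (1 - output_ratio) + p["output"] * output_ratio  # $/1M tokens
    return int(budget_usd / blended * 1_000_000)

print(tokens_for_budget(100.0, "gpt-4o"))  # $100/day at a 50/50 mix -> 16,000,000 tokens
```

The `output_ratio` matters: output-heavy workloads buy far fewer tokens per dollar, so measure your real mix before fixing the limit.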
Set up token extraction for your AI Gateway:
Verify:
Expected Output:
Limit users to 10,000 tokens per hour:
Test with requests:
Expected Output:
Apply different limits for different models:
Verify:
Expected Output:
Set up fallback from OpenAI to Anthropic:
Verify:
Expected Output:
You built a traffic-engineer skill in Lesson 0. Based on what you learned about LLM traffic patterns:
Your skill should now include:
Token budget template:
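A minimal template your skill can reuse (shape follows Envoy Gateway's BackendTrafficPolicy as sketched earlier; angle-bracket values are for you to fill in):

```yaml
rules:
  - clientSelectors:
      - headers:
          - name: <identity-header>
            type: Distinct
    limit:
      requests: <token-budget>
      unit: Hour
    cost:
      response:
        from: Metadata
        metadata:
          namespace: io.envoy.ai_gateway
          key: <llm-request-cost-metadata-key>
```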
Provider fallback template:
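A minimal fallback template (priority semantics assumed: the lowest value is tried first):

```yaml
backendRefs:
  - name: <primary-backend>
    priority: 0
  - name: <fallback-backend>
    priority: 1
```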
You want to configure AI Gateway for your Task API's LLM features. The API uses GPT-4o for complex reasoning and GPT-4o-mini for simple tasks. You have a $100/day budget to protect.
Ask your traffic-engineer skill:
Review AI's configuration. Check these specifics:
If the token math seems off, provide your constraint:
Now extend to include model-specific routing:
Verify the complete configuration before applying:
Compare your first request to the final configuration. The initial approach likely missed either the cost calculation details or the proper header matching. Through iteration, you specified the budget constraint, the token-to-dollar conversion, and the routing requirements—producing a configuration that actually protects your $100 daily budget rather than just counting requests.
Token-based rate limiting requires the AI Gateway to parse LLM responses. Ensure your gateway has sufficient resources to handle this processing overhead. Start with conservative limits (lower than calculated) and adjust based on observed usage. Monitor the `x-ai-gateway-tokens-used` response header to verify token-counting accuracy before enforcing strict limits.
Envoy AI Gateway provides token-based rate limiting for LLM traffic where request counting fails (100 requests can cost $0.03 or $240), along with unified API access across providers and automatic fallback when one provider is rate-limited.