It is 3am. Your phone buzzes. "ALERT: Task API Error Rate High." You roll out of bed, open your laptop, and spend 20 minutes diagnosing... a 30-second traffic spike that already resolved itself. Back to sleep. At 3:47am, another alert. Same story. By morning, you have gotten 90 minutes of broken sleep across 6 false alarms.
This is alert fatigue, and it destroys on-call engineers. When every alert is treated as urgent, nothing is actually urgent. Teams start ignoring alerts, and when a real incident happens, nobody responds because they have been conditioned to expect false positives.
The solution is SLO-based alerting. Instead of alerting on instantaneous metrics ("error rate exceeded 1%"), you alert on error budget consumption ("we are burning our monthly budget at 14x the sustainable rate"). This approach, documented in Google's SRE Workbook, reduces alert noise while catching real problems faster.
In this lesson, you will implement multi-window, multi-burn-rate alerting for your Task API, configure Alertmanager to route alerts appropriately, and create runbooks that make 3am incidents manageable.
Before understanding the solution, you need to understand the problem with traditional alerting.
Direct threshold alerting looks like this:
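A sketch of such a rule (the metric name, labels, and 1% threshold are illustrative, not from your actual instrumentation):

```yaml
# Naive threshold alert: fires whenever the instantaneous 5m error rate
# crosses 1%, regardless of how much error budget actually remains.
- alert: TaskAPIErrorRateHigh
  expr: |
    sum(rate(http_requests_total{job="task-api", code=~"5.."}[5m]))
    / sum(rate(http_requests_total{job="task-api"}[5m])) > 0.01
  for: 1m
  labels:
    severity: critical
```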
This has two failure modes: brief spikes page you for problems that resolve themselves before you can even open a dashboard, and slow, sustained error rates just below the threshold burn through your budget without ever alerting at all.
You cannot fix this by tuning thresholds. If you set the threshold high, you miss real incidents. If you set it low, you get noise. The fundamental approach is wrong.
Burn rate alerting asks a different question: "How fast are we consuming our error budget?"
A 0.5% error rate might be fine if your SLO is 99% — you are consuming budget at only half the sustainable rate. But 0.5% is a crisis if your SLO is 99.9%: you are burning budget at 5x the sustainable rate and will exhaust a 30-day budget in 6 days instead of 30.
Burn rate measures how fast you consume error budget relative to the sustainable rate.
The insight: a 14.4x burn rate consumes 2% of your monthly budget in 1 hour. A 6x burn rate consumes 5% in 6 hours. These numbers translate directly into urgency levels.
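The arithmetic above as a small worked example. The 99.9% SLO and 30-day window match this lesson; the function names are just for illustration, not any real API:

```python
SLO = 0.999                # 99.9% availability target
ERROR_BUDGET = 1 - SLO     # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24     # 30-day budget window = 720 hours

def burn_rate(observed_error_rate: float) -> float:
    """How fast we consume budget relative to the sustainable rate."""
    return observed_error_rate / ERROR_BUDGET

def hours_to_exhaustion(observed_error_rate: float) -> float:
    """Hours until the 30-day budget is fully spent at this rate."""
    return WINDOW_HOURS / burn_rate(observed_error_rate)

# 0.5% errors against a 99.9% SLO: 5x burn, budget gone in 6 days
print(burn_rate(0.005), hours_to_exhaustion(0.005))

# A 14.4x burn consumes 14.4/720 = 2% of the monthly budget every hour
print(burn_rate(0.0144) / WINDOW_HOURS)
```

Note that 144 hours is exactly the "6 days instead of 30" from the example above: the window shrinks in proportion to the burn rate.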
Google's SRE Workbook recommends using two time windows for each burn rate threshold: a long window (such as 1 hour) that confirms the burn is sustained enough to matter, and a short window (such as 5 minutes) that confirms the problem is still happening right now.
Both conditions must be true to fire an alert. This eliminates false positives from brief spikes while still catching real incidents quickly.
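A PrometheusRule sketch implementing this for a 99.9% SLO. The metric and label names (`http_requests_total`, `job="task-api"`) and the `release` label are assumptions — substitute whatever your instrumentation and Prometheus Operator setup actually use:

```yaml
# task-api-alerts.yaml -- multi-window, multi-burn-rate rules (99.9% SLO)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-alerts
  labels:
    release: prometheus   # must match your Prometheus ruleSelector
spec:
  groups:
    - name: task-api-slo
      rules:
        - alert: TaskAPIHighErrorBudgetBurn
          # 14.4x burn consumes 2% of a 30-day budget per hour: page someone.
          expr: |
            (
              sum(rate(http_requests_total{job="task-api", code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="task-api"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{job="task-api", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="task-api"}[5m]))
            ) > (14.4 * 0.001)
          labels:
            severity: critical
          annotations:
            summary: "Task API is burning error budget at >14.4x the sustainable rate"
        - alert: TaskAPIElevatedErrorRate
          # 6x burn consumes 5% of the budget in 6 hours: a ticket, not a page.
          expr: |
            (
              sum(rate(http_requests_total{job="task-api", code=~"5.."}[6h]))
              / sum(rate(http_requests_total{job="task-api"}[6h]))
            ) > (6 * 0.001)
            and
            (
              sum(rate(http_requests_total{job="task-api", code=~"5.."}[30m]))
              / sum(rate(http_requests_total{job="task-api"}[30m]))
            ) > (6 * 0.001)
          labels:
            severity: warning
          annotations:
            summary: "Task API error rate is elevated (>6x budget burn)"
```

The `and` between the long-window and short-window expressions is what enforces the "both conditions must be true" rule.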
Apply the PrometheusRule:
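Assuming your rules are saved in `task-api-alerts.yaml` (the filename is a placeholder):

```shell
kubectl apply -f task-api-alerts.yaml
```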
Output:
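With the Prometheus Operator's CRDs installed, a successful apply typically echoes a confirmation like this (the resource name comes from your manifest's `metadata.name`):

```
prometheusrule.monitoring.coreos.com/task-api-alerts created
```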
Verify Prometheus loaded the rules:
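One way to reach the Prometheus UI locally; the service name and namespace are assumptions — adjust to your installation:

```shell
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
```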
Navigate to http://localhost:9090/alerts. You should see TaskAPIHighErrorBudgetBurn and TaskAPIElevatedErrorRate listed (inactive if your error rate is healthy).
Alertmanager receives alerts from Prometheus and routes them to notification channels. The routing tree determines which alerts go where based on labels.
Key configuration elements: the `route` tree, which matches alerts by label and picks a receiver; the `receivers`, which define notification targets such as PagerDuty or Slack; and the grouping timers (`group_by`, `group_wait`, `repeat_interval`), which batch related alerts and control how often you are re-notified.
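A routing sketch. Receiver names, channel names, and the integration-key placeholders are all stand-ins for your own values:

```yaml
# alertmanager.yaml -- route critical alerts to a pager, warnings to chat
route:
  receiver: default-slack            # fallback for anything unmatched
  group_by: ["alertname", "service"]
  group_wait: 30s                    # batch related alerts before first notify
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall     # only critical alerts page a human
    - matchers:
        - severity = "warning"
      receiver: default-slack        # warnings reviewed during working hours
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
  - name: default-slack
    slack_configs:
      - channel: "#task-api-alerts"
        api_url: "<your-slack-webhook-url>"
```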
Apply the configuration:
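With an operator-managed Alertmanager, the configuration usually lives in a Secret. The Secret name `alertmanager-main` and the `monitoring` namespace follow common prometheus-operator conventions — adjust for your installation:

```shell
kubectl -n monitoring create secret generic alertmanager-main \
  --from-file=alertmanager.yaml \
  --dry-run=client -o yaml | kubectl apply -f -
```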
Output:
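`kubectl` reports `created` on first apply and `configured` on subsequent updates, for example:

```
secret/alertmanager-main configured
```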
Alertmanager reloads configuration automatically within 30 seconds.
Not every alert should wake someone up at 3am. Define clear severity levels with concrete response expectations.
The actionability test: before creating any paging alert, ask: Is this urgent enough to interrupt someone right now? Is it user-visible, or about to be? Can the responder actually do something about it?
If any answer is "no," the alert should not page. Recategorize as warning or info, or remove entirely.
Common anti-patterns to avoid: cause-based alerts that page on internal metrics (CPU, queue depth) instead of user-visible symptoms; "FYI" alerts that page but require no action; and per-instance alerts that fire in floods when a single shared dependency fails.
When an alert fires, the on-call engineer needs to diagnose and mitigate quickly. Runbooks provide step-by-step guidance.
Runbook template for TaskAPIHighErrorBudgetBurn:
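A minimal template sketch. The dashboard URL, namespace, escalation contact, and exact commands are placeholders to replace with your own:

```markdown
# Runbook: TaskAPIHighErrorBudgetBurn

## What this alert means
Task API is burning its 30-day error budget at more than 14.4x the
sustainable rate. At this pace the entire budget is gone in roughly 2 days.

## Impact
Users are seeing elevated 5xx responses on task operations.

## Diagnose
1. Check the error-rate dashboard: <dashboard URL>
2. Recent deploys: `kubectl -n tasks rollout history deployment/task-api`
3. Pod health: `kubectl -n tasks get pods -l app=task-api`
4. Logs: `kubectl -n tasks logs -l app=task-api --tail=100 | grep -i error`

## Mitigate
- Bad deploy: `kubectl -n tasks rollout undo deployment/task-api`
- Dependency outage: enable degraded mode / feature flag, if available
- Overload: `kubectl -n tasks scale deployment/task-api --replicas=6`

## Escalate
If not mitigated within 15 minutes, page the service owner: <contact>
```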
Link runbooks to alerts via the runbook_url annotation:
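Inside the alert rule, this is a single annotation; the URL is a placeholder for wherever your runbooks actually live:

```yaml
annotations:
  summary: "Task API is burning error budget at >14.4x the sustainable rate"
  runbook_url: "https://wiki.example.com/runbooks/task-api-high-burn"
```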
When an engineer receives a page, they click the runbook link and have immediate context.
After every significant incident, conduct a blameless post-incident review. The goal is learning, not punishment.
Post-incident review template:
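A template sketch — sections and wording are one common shape, not a mandated format:

```markdown
# Post-Incident Review: <title>

- **Date / duration:** <start> – <end> (<minutes> min)
- **Severity:** <critical / warning>
- **Authors:** <names> (blameless: no individual is "at fault")

## Summary
One paragraph: what users experienced and what was done.

## Timeline
- <time> — alert fired (<alert name>)
- <time> — responder acknowledged
- <time> — mitigation applied
- <time> — resolved

## Root cause
What failed, phrased as a systems problem, not a person.

## Error budget impact
<failed requests> / <total requests> — <percent> of the monthly budget.

## Action items
- [ ] <fix> — owner, due date
- [ ] <detection or runbook improvement> — owner, due date
```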
SLO impact calculation:
After an incident, calculate the error budget impact:
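A sketch of that arithmetic in Python, assuming roughly uniform traffic across the 30-day window (the function name is illustrative):

```python
SLO = 0.999                # 99.9% availability target
ERROR_BUDGET = 1 - SLO     # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24     # 30-day budget window

def budget_consumed(incident_error_rate: float, incident_hours: float) -> float:
    """Fraction of the monthly error budget spent by one incident."""
    burn = incident_error_rate / ERROR_BUDGET   # burn rate during the incident
    return burn * incident_hours / WINDOW_HOURS

# A 2-hour incident at a 7.2% error rate is a 72x burn:
# it consumes a fifth of the entire month's budget.
print(budget_consumed(0.072, 2))
```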
If this value exceeds your remaining monthly budget, you are now in "budget exhausted" mode. According to SRE principles, this means feature releases pause and engineering effort shifts to reliability work until the budget recovers.
Now that you understand alerting patterns, test your observability skill:
Ask your skill to generate multi-burn-rate alerting rules:
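An example prompt — the wording is a suggestion, not the only valid phrasing:

```
Generate a Prometheus multi-window, multi-burn-rate alerting rule for a
service with a 99.9% availability SLO over 30 days. Use a 14.4x burn rate
with 1h/5m windows for paging and 6x with 6h/30m windows for ticketing.
```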
Verify your skill produces rules similar to what you learned. Check whether it correctly calculates the threshold (14.4 * 0.001 = 0.0144) and includes both time windows joined with the `and` operator.
Share your alerting configuration with AI:
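For example — adapt the details to your own SLO and tolerance for downtime:

```
Here is my PrometheusRule for the Task API: [paste YAML]. My SLO is 99.9%
over 30 days. Are these burn-rate thresholds appropriate for a service
where brief outages are tolerable but sustained degradation is not?
```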
What you're learning: SLO alerting strategy. AI can suggest whether your burn rate thresholds match your business needs and propose additional alert types.
Ask AI to help create a runbook:
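A sample prompt along these lines:

```
Help me write a runbook for an alert named TaskAPIHighErrorBudgetBurn on a
Kubernetes-deployed API. Include diagnosis steps (kubectl commands),
mitigation options, and escalation criteria.
```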
What you're learning: Runbook structure and diagnostic methodology. AI suggests commands you might not know, while you validate they work in your specific environment.
Work through a routing scenario:
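For instance — the team label here is invented for the exercise:

```
I have Alertmanager routes for severity="critical" -> PagerDuty and
severity="warning" -> Slack. A new alert carries severity="critical" and
team="payments". Which route matches first, and how do I add a
team-specific override without breaking the existing defaults?
```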
What you're learning: Alertmanager route hierarchy. AI explains the evaluation order (most specific first) and helps you avoid common mistakes like overly broad matchers.
Safety note: When configuring alerting for production systems, always test alert routing in a staging environment first. Misconfigured routing can result in pages going to the wrong team or notification floods that violate rate limits on PagerDuty/Slack.