USMAN’S INSIGHTS
AI ARCHITECT
Your On-Call Engineer Has Learned to Ignore Your Alerts

© 2026 Muhammad Usman Akbar. All rights reserved.


Alerting and Incident Response

It is 3am. Your phone buzzes. "ALERT: Task API Error Rate High." You roll out of bed, open your laptop, and spend 20 minutes diagnosing... a 30-second traffic spike that already resolved itself. Back to sleep. At 3:47am, another alert. Same story. By morning, you have gotten 90 minutes of broken sleep across 6 false alarms.

This is alert fatigue, and it destroys on-call engineers. When every alert is treated as urgent, nothing is actually urgent. Teams start ignoring alerts, and when a real incident happens, nobody responds because they have been conditioned to expect false positives.

The solution is SLO-based alerting. Instead of alerting on instantaneous metrics ("error rate exceeded 1%"), you alert on error budget consumption ("we are burning our monthly budget at 14x the sustainable rate"). This approach, documented in Google's SRE Workbook, reduces alert noise while catching real problems faster.

In this lesson, you will implement multi-window, multi-burn-rate alerting for your Task API, configure Alertmanager to route alerts appropriately, and create runbooks that make 3am incidents manageable.

Why Threshold Alerting Fails

Before understanding the solution, you need to understand the problem with traditional alerting.

Direct threshold alerting looks like this:

yaml
# Alert whenever the instantaneous error rate exceeds 0.1%
- alert: TaskAPIHighErrors
  expr: task_api:error_ratio:rate5m > 0.001
  for: 1m

This has two failure modes:

| Problem | What Happens | Impact |
| --- | --- | --- |
| Too sensitive | Brief spikes (30 seconds) trigger alerts | False positives, alert fatigue, ignored alerts |
| Too slow | Waiting 5 minutes misses fast incidents | Real problems go unnoticed |

You cannot fix this by tuning thresholds. If you set the threshold high, you miss real incidents. If you set it low, you get noise. The fundamental approach is wrong.

Burn rate alerting asks a different question: "How fast are we consuming our error budget?"

| Approach | Question Asked | SLO Awareness |
| --- | --- | --- |
| Threshold | "Is error rate above X right now?" | None |
| Burn rate | "At this rate, when do we exhaust our budget?" | SLO-aware, budget-aware |

A 0.5% error rate might be fine if your SLO is 99% (only half of your 1% budget, a 0.5x burn rate). But 0.5% is a crisis if your SLO is 99.9%: you are burning budget at 5x the sustainable rate and will exhaust it in 6 days instead of 30.

Burn Rate Formula

Burn rate measures how fast you consume error budget relative to the sustainable rate.

Specification
Burn Rate = Error Rate / Error Budget

For a 99.9% SLO:
  Error Budget = 100% - 99.9% = 0.1%
  Sustainable Burn Rate = 1x (budget lasts exactly 30 days)

If error rate = 0.1%:   Burn Rate = 0.1% / 0.1%  = 1x     (normal)
If error rate = 0.6%:   Burn Rate = 0.6% / 0.1%  = 6x     (elevated)
If error rate = 1.44%:  Burn Rate = 1.44% / 0.1% = 14.4x  (critical)

The insight: a 14.4x burn rate consumes 2% of your monthly budget in 1 hour. A 6x burn rate consumes 5% in 6 hours. These numbers translate directly into urgency levels.

| Burn Rate | Budget Consumed | Time Window | Response |
| --- | --- | --- | --- |
| 14.4x | 2% | 1 hour | Page immediately (critical) |
| 6x | 5% | 6 hours | Create ticket (warning) |
| 1x | 100% | 30 days | Normal operation |
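The burn-rate arithmetic in the table above can be checked with a few lines of Python (the helper names are illustrative, not part of any monitoring library):

```python
# Burn-rate arithmetic for a request-availability SLO.

def burn_rate(error_rate, slo):
    """Burn rate = error rate / error budget, where budget = 1 - SLO."""
    return error_rate / (1.0 - slo)

def budget_consumed(rate, hours, window_days=30):
    """Fraction of the monthly budget burned by sustaining `rate` for `hours`."""
    return rate * hours / (window_days * 24)

slo = 0.999                                      # 99.9% availability target
print(f"{burn_rate(0.0144, slo):.1f}x")          # 14.4x -> critical
print(f"{budget_consumed(14.4, 1):.0%}")         # 2% of budget gone in 1 hour
print(f"{budget_consumed(6, 6):.0%}")            # 5% of budget gone in 6 hours
```

The 720 hours in a 30-day window is what makes the thresholds line up: a 14.4x burn rate for 1 hour is 14.4 / 720 = 2% of the budget.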

Multi-Window, Multi-Burn-Rate Alerting

Google's SRE Workbook recommends using two time windows for each burn rate threshold:

  • Short window (5 minutes): Detects the problem quickly
  • Long window (1 hour): Confirms it is sustained, not a spike

Both conditions must be true to fire an alert. This eliminates false positives from brief spikes while still catching real incidents quickly.
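To make the two-window condition concrete, here is a minimal Python sketch of what Prometheus evaluates (names and numbers are illustrative; the real check is the PromQL expression below):

```python
# Sketch of the multi-window, multi-burn-rate condition.

def should_fire(short_ratio, long_ratio, burn_threshold, budget=0.001):
    """Fire only when BOTH the short and long windows exceed the threshold.

    The short window detects problems fast; the long window confirms the
    problem is sustained rather than a brief spike.
    """
    limit = burn_threshold * budget      # e.g. 14.4 * 0.001 = 0.0144 (1.44%)
    return short_ratio > limit and long_ratio > limit

# 30-second spike: the 5m window is hot but the 1h window is not -> no page.
print(should_fire(short_ratio=0.05, long_ratio=0.002, burn_threshold=14.4))  # False
# Sustained outage: both windows exceed 1.44% -> page.
print(should_fire(short_ratio=0.05, long_ratio=0.03, burn_threshold=14.4))   # True
```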

yaml
# task-api-slo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo-alerts
  namespace: monitoring
  labels:
    release: prometheus  # Must match your Helm release label
spec:
  groups:
    - name: task-api.slo.recording
      interval: 30s
      rules:
        # Recording rules: compute error ratios for clean queries
        - record: task_api:error_ratio:rate5m
          expr: |
            sum(rate(task_api_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(task_api_requests_total[5m]))
        - record: task_api:error_ratio:rate30m
          expr: |
            sum(rate(task_api_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(task_api_requests_total[30m]))
        - record: task_api:error_ratio:rate1h
          expr: |
            sum(rate(task_api_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(task_api_requests_total[1h]))
        - record: task_api:error_ratio:rate6h
          expr: |
            sum(rate(task_api_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(task_api_requests_total[6h]))
    - name: task-api.slo.alerts
      rules:
        # CRITICAL: 14.4x burn rate (2% budget in 1 hour) - page immediately
        - alert: TaskAPIHighErrorBudgetBurn
          expr: |
            (
              task_api:error_ratio:rate5m > (14.4 * 0.001)
              and
              task_api:error_ratio:rate1h > (14.4 * 0.001)
            )
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "Task API burning error budget rapidly"
            description: "Error rate {{ $value | humanizePercentage }} exceeds 14.4x burn rate."
            runbook_url: "https://runbooks.example.com/task-api-high-error-rate"
            dashboard_url: "https://grafana.example.com/d/task-api-slo"
        # WARNING: 6x burn rate (5% budget in 6 hours) - create ticket
        - alert: TaskAPIElevatedErrorRate
          expr: |
            (
              task_api:error_ratio:rate30m > (6 * 0.001)
              and
              task_api:error_ratio:rate6h > (6 * 0.001)
            )
          for: 5m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "Task API error rate elevated"
            description: "Error rate {{ $value | humanizePercentage }} exceeds 6x burn rate."
            runbook_url: "https://runbooks.example.com/task-api-elevated-errors"
        # LATENCY SLO: P95 above 200ms target
        - alert: TaskAPIHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)
            ) > 0.2
          for: 5m
          labels:
            severity: warning
            slo: latency
          annotations:
            summary: "Task API P95 latency above SLO"
            description: "P95 latency is {{ $value | humanizeDuration }}, exceeds 200ms target"

Apply the PrometheusRule:

bash
kubectl apply -f task-api-slo-alerts.yaml

Output:

Specification
prometheusrule.monitoring.coreos.com/task-api-slo-alerts created

Verify Prometheus loaded the rules:

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

Navigate to http://localhost:9090/alerts. You should see TaskAPIHighErrorBudgetBurn and TaskAPIElevatedErrorRate listed (inactive if your error rate is healthy).

Alertmanager Configuration

Alertmanager receives alerts from Prometheus and routes them to notification channels. The routing tree determines which alerts go where based on labels.

yaml
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    route:
      receiver: 'default-receiver'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical SLO alerts: page immediately via PagerDuty
        - match:
            severity: critical
            slo: availability
          receiver: 'pagerduty-critical'
          group_wait: 10s
          repeat_interval: 1h
        # Warning alerts: send to Slack
        - match:
            severity: warning
          receiver: 'slack-warnings'
          group_wait: 5m
          repeat_interval: 4h
    receivers:
      - name: 'default-receiver'
        email_configs:
          - to: 'team@example.com'
            send_resolved: true
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
            severity: critical
            description: '{{ .CommonAnnotations.summary }}'
            details:
              firing: '{{ .Alerts.Firing | len }}'
              dashboard: '{{ .CommonAnnotations.dashboard_url }}'
              runbook: '{{ .CommonAnnotations.runbook_url }}'
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .CommonLabels.alertname }}'
            text: '{{ .CommonAnnotations.description }}'
            actions:
              - type: button
                text: 'Runbook'
                url: '{{ .CommonAnnotations.runbook_url }}'
              - type: button
                text: 'Dashboard'
                url: '{{ .CommonAnnotations.dashboard_url }}'

Key configuration elements:

| Element | Purpose |
| --- | --- |
| group_by | Combine alerts with the same labels into one notification |
| group_wait | How long to wait before sending the first notification (collect related alerts) |
| group_interval | Minimum time between notifications for the same group |
| repeat_interval | How often to resend unresolved alerts |
| routes | Match alerts to receivers based on labels |
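Real Alertmanager routing is a depth-first tree walk with regex matchers and a continue flag; the simplified Python model below (illustrative names, not an Alertmanager API) captures just the first-match-by-labels behavior this configuration relies on:

```python
# Simplified model of Alertmanager label routing: the first child route whose
# match labels are all present on the alert wins; otherwise the root receiver.

def route(alert_labels, routes, default):
    """Return the receiver name for an alert's label set."""
    for r in routes:
        if all(alert_labels.get(k) == v for k, v in r["match"].items()):
            return r["receiver"]
    return default

routes = [
    {"match": {"severity": "critical", "slo": "availability"},
     "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

print(route({"severity": "critical", "slo": "availability"}, routes, "default-receiver"))
# -> pagerduty-critical
print(route({"severity": "warning", "slo": "latency"}, routes, "default-receiver"))
# -> slack-warnings
print(route({"severity": "info"}, routes, "default-receiver"))
# -> default-receiver
```

Note that route order matters: if the warning route came first, a critical availability alert carrying severity: warning would never reach PagerDuty.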

Apply the configuration:

bash
kubectl apply -f alertmanager-config.yaml

Output:

Specification
secret/alertmanager-prometheus-kube-prometheus-alertmanager configured

Alertmanager reloads configuration automatically within 30 seconds.

Alert Hygiene: Severity Levels and Actionability

Not every alert should wake someone up at 3am. Define clear severity levels with concrete response expectations.

| Severity | Response Time | Who Responds | Channel |
| --- | --- | --- | --- |
| critical | Immediate | On-call engineer | PagerDuty page |
| warning | Next business day | Owning team | Slack channel |
| info | Best effort | Anyone interested | Dashboard only |

The actionability test: Before creating any alert, ask:

  1. Is this actionable? Can someone do something about it right now?
  2. Is this urgent? Does it need attention immediately or can it wait?
  3. Is there a runbook? Does the responder know what to do?

If any answer is "no," the alert should not page. Recategorize as warning or info, or remove entirely.
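The three-question test can be encoded as a tiny decision function (a sketch with made-up names, just to show that "any no means no page"):

```python
# Illustrative triage helper encoding the three-question actionability test.

def alert_disposition(actionable, urgent, has_runbook):
    """Decide how an alert should notify, per the actionability test."""
    if actionable and urgent and has_runbook:
        return "page"           # all three answers are yes: safe to wake someone
    if actionable:
        return "ticket"         # useful, but any "no" means it must not page
    return "dashboard-only"     # not actionable: demote to info or delete

print(alert_disposition(True, True, True))     # page
print(alert_disposition(True, True, False))    # ticket (no runbook, so no page)
print(alert_disposition(False, False, False))  # dashboard-only
```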

Common anti-patterns to avoid:

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| Alerting on symptoms and causes | Double alerts for same incident | Alert on customer impact only |
| Page on every 5xx | Brief spikes wake people up | Use burn rate, not instantaneous rate |
| No runbook | Responder wastes time diagnosing | Every page needs a runbook URL |
| Vague descriptions | "Error rate high" tells nothing | Include current value, threshold, impact |

Incident Response: Runbooks and Escalation

When an alert fires, the on-call engineer needs to diagnose and mitigate quickly. Runbooks provide step-by-step guidance.

Runbook template for TaskAPIHighErrorBudgetBurn:

markdown
# Runbook: TaskAPIHighErrorBudgetBurn

## Alert Meaning
Task API is returning 5xx errors at a rate that will exhaust the monthly
error budget in approximately 2 days if sustained.

## Immediate Diagnostics (first 5 minutes)

### Step 1: Check current error rate
    kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
Query: task_api:error_ratio:rate5m
Expected: Value should be above 0.0144 (14.4x burn rate for 99.9% SLO)

### Step 2: Identify error sources
Query: sum(rate(task_api_requests_total{status=~"5.."}[5m])) by (endpoint, status)
Look for: Which endpoints are failing? Is it one or all?

### Step 3: Check recent deployments
    kubectl rollout history deployment/task-api -n production
Look for: Was there a deployment in the last hour?

### Step 4: Check pod health
    kubectl get pods -n production -l app=task-api
    kubectl logs -n production -l app=task-api --tail=50
Look for: CrashLoopBackOff, OOMKilled, connection errors

## Decision Tree
- If single endpoint failing → Check database connections for that route
- If all endpoints failing → Check shared dependencies (database, cache)
- If after recent deployment → Rollback immediately
- If external dependency → Check status page, implement circuit breaker

## Mitigation Actions

### Rollback deployment
    kubectl rollout undo deployment/task-api -n production

### Scale up (if load-related)
    kubectl scale deployment/task-api -n production --replicas=5

### Enable circuit breaker (if external dependency)
    kubectl patch configmap task-api-config -n production \
      --patch '{"data":{"CIRCUIT_BREAKER_ENABLED":"true"}}'

## Escalation Path
- 15 minutes unresolved: Escalate to senior on-call
- 30 minutes unresolved: Escalate to engineering lead
- Customer impact confirmed: Initiate incident communication

## Post-Incident
- Create post-incident review ticket
- Update this runbook with new learnings
- Adjust SLO if appropriate

Link runbooks to alerts via the runbook_url annotation:

yaml
annotations:
  runbook_url: "https://runbooks.example.com/task-api-high-error-rate"

When an engineer receives a page, they click the runbook link and have immediate context.

Post-Incident Reviews and SLO Impact

After every significant incident, conduct a blameless post-incident review. The goal is learning, not punishment.

Post-incident review template:

| Section | Content |
| --- | --- |
| Timeline | When did the issue start, detect, mitigate, resolve? |
| Impact | Error budget consumed, customers affected, duration |
| Root Cause | What actually broke? (technical, not human) |
| Detection | Did alerting work? How long until we knew? |
| Mitigation | What actions resolved the issue? |
| Prevention | What changes prevent recurrence? |
| Action Items | Specific tasks with owners and deadlines |

SLO impact calculation:

After an incident, calculate the error budget impact:

promql
# Burn rate over the incident's 1-hour window: error ratio divided by the 0.1% budget
(
  1 - (
    sum(increase(task_api_requests_total{status!~"5.."}[1h]))
    /
    sum(increase(task_api_requests_total[1h]))
  )
) / 0.001

This query returns the burn rate over the window, not the budget fraction itself; multiply it by the window's share of the 30-day period (here 1 hour / 720 hours) to get the budget consumed. If cumulative consumption exceeds your monthly budget, you are now in "budget exhausted" mode. According to SRE principles, this means:

  • No feature releases until budget replenishes
  • Focus engineering effort on reliability improvements
  • Communicate SLO status to stakeholders
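For a request-based SLO, the same accounting can be done directly from request counts. The sketch below uses made-up example numbers (50,000 failed requests, 100M monthly requests, 30% of the budget already spent):

```python
# Request-based error budget accounting for a 99.9% monthly SLO.

def budget_consumed_fraction(bad_requests, monthly_requests, slo=0.999):
    """Fraction of the monthly error budget burned by `bad_requests` failures."""
    allowed_failures = monthly_requests * (1.0 - slo)   # budgeted failures/month
    return bad_requests / allowed_failures

# Incident: 50,000 failed requests against 100M expected monthly requests.
consumed = budget_consumed_fraction(50_000, 100_000_000)
print(f"incident consumed {consumed:.0%} of the monthly budget")  # 50%

remaining = 1.0 - 0.30 - consumed   # assume 30% was already spent this month
print("budget exhausted" if remaining < 0 else f"{remaining:.0%} remaining")
```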

Reflect on Your Skill

Now that you understand alerting patterns, test your observability skill:

Ask your skill to generate multi-burn-rate alerting rules:

Specification
Generate PrometheusRule YAML for multi-burn-rate alerting on my Task API.

SLO: 99.9% availability
Metrics: task_api_requests_total with status label

I need:
1. Critical alert at 14.4x burn rate (short: 5m, long: 1h)
2. Warning alert at 6x burn rate (short: 30m, long: 6h)
3. Recording rules for clean queries
4. Annotations with summary, description, runbook_url

Verify your skill produces rules similar to what you learned. Check whether it correctly calculates the threshold (14.4 * 0.001 = 0.0144) and joins both time windows with the PromQL "and" operator.

Try With AI

Step 1: Alert Design Review

Share your alerting configuration with AI:

Specification
Review my SLO alerting setup:

SLO: 99.9% availability (43 minutes error budget per month)

Current alerts:
- Critical at 14.4x burn rate (5m AND 1h windows)
- Warning at 6x burn rate (30m AND 6h windows)

Questions:
1. Are these thresholds appropriate for a user-facing API?
2. What about latency SLOs - how should I alert on P95 latency?
3. Should I add a low-severity alert for 1x burn rate (sustained degradation)?

What you're learning: SLO alerting strategy. AI can suggest whether your burn rate thresholds match your business needs and propose additional alert types.

Step 2: Runbook Generation

Ask AI to help create a runbook:

Specification
Help me create an incident response runbook for this alert:

Alert: TaskAPIHighLatency
Condition: P95 latency > 200ms for 5 minutes
Service: FastAPI application with PostgreSQL and Redis dependencies

Generate a runbook with:
1. Immediate diagnostic commands (kubectl, promql)
2. Decision tree for common causes
3. Mitigation actions for each cause
4. Escalation path

What you're learning: Runbook structure and diagnostic methodology. AI suggests commands you might not know, while you validate they work in your specific environment.

Step 3: Alertmanager Routing

Work through a routing scenario:

Specification
I want to configure Alertmanager routing:
- Critical SLO alerts → PagerDuty, page immediately
- Warning SLO alerts → Slack #sre-alerts
- Info alerts about capacity → Slack #capacity-planning
- All alerts should also go to email as backup

Help me design the routing tree with correct match conditions and receivers.

What you're learning: Alertmanager route hierarchy. AI explains the evaluation order (most specific first) and helps you avoid common mistakes like overly broad matchers.

Safety note: When configuring alerting for production systems, always test alert routing in a staging environment first. Misconfigured routing can result in pages going to the wrong team or notification floods that violate rate limits on PagerDuty/Slack.