Stop Arguing About Reliability. Start Measuring It.

SRE Foundations: SLIs, SLOs, and Error Budgets

Every engineering team faces the same fundamental tension: ship features faster or keep the system stable. Product managers want new functionality yesterday. Operations teams want zero incidents ever. Both goals seem reasonable. Both cannot be achieved simultaneously.

Google's Site Reliability Engineering (SRE) practice resolved this tension with a powerful insight: reliability is a feature you can dial. Instead of arguing about whether a deployment is "safe enough," you measure exactly how much reliability you have, how much you've used, and how much you can afford to spend on velocity.

This dial has three components that work together:

| Component | What It Is | Example |
|---|---|---|
| SLI (Service Level Indicator) | What you measure | 99.95% of requests succeeded |
| SLO (Service Level Objective) | Your target | We aim for 99.9% success |
| SLA (Service Level Agreement) | Your promise to customers | If we drop below 99.5%, you get credits |

The gap between your SLO and 100% is your error budget — the amount of unreliability you can afford while still meeting your target. This budget becomes currency: you spend it on deployments, experiments, and infrastructure changes. When it runs low, you slow down. When it's healthy, you move fast.

By the end of this lesson, you'll define SLIs for Task API, set realistic SLO targets, calculate error budgets, and implement recording rules in Prometheus to track it all.

SLI vs SLO vs SLA: The Reliability Stack

These three terms are often confused, but they represent distinct layers of reliability thinking:

Service Level Indicators (SLIs): What You Measure

An SLI is a quantitative measure of some aspect of the service level you provide. It's raw measurement, not judgment. Common SLIs include:

| SLI Type | What It Measures | Calculation |
|---|---|---|
| Availability | Did the service respond successfully? | successful_requests / total_requests |
| Latency | How fast did it respond? | requests_under_threshold / total_requests |
| Correctness | Was the response right? | correct_responses / total_responses |

For Task API, your primary SLI is availability:

promql
# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{service="task-api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="task-api"}[5m]))

Output:

Specification
0.9997 # 99.97% of requests succeeded in last 5 minutes

Service Level Objectives (SLOs): Your Target

An SLO is the target value for an SLI. It's your internal quality bar — the line between "we're healthy" and "we need to investigate." SLOs are strategic decisions, not technical ones.

| Statement | SLI or SLO? |
|---|---|
| "99.97% of requests succeeded last hour" | SLI (measurement) |
| "We aim for 99.9% success rate" | SLO (target) |
| "P95 latency was 180ms" | SLI (measurement) |
| "P95 latency should be under 200ms" | SLO (target) |

Key insight: Your SLO should be slightly below what you typically achieve. If you measure 99.97% regularly, setting a 99.9% SLO gives you headroom. Setting 99.99% means you're always failing.

Service Level Agreements (SLAs): Your Promise

An SLA is a contractual commitment with consequences — typically financial penalties or service credits when you fail to meet it. SLAs are business decisions owned by legal and product, not engineering.

The critical relationship: SLA < SLO < typical performance

| Layer | Target | Buffer |
|---|---|---|
| Typical performance | 99.97% | — |
| SLO (internal target) | 99.9% | 0.07% buffer from typical |
| SLA (customer promise) | 99.5% | 0.4% buffer from SLO |

Why the buffer? When you breach your SLO, engineering investigates and fixes. When you breach your SLA, customers get money back. You want early warning before financial consequences.

Choosing Good SLIs: The Four Golden Signals

Not all metrics make good SLIs. A good SLI must be:

  • User-centric: Reflects what users experience, not internal implementation
  • Measurable: Can be computed from available telemetry
  • Sensitive: Changes when user experience degrades
  • Independent: Doesn't conflate unrelated failure modes

Google's SRE book recommends focusing on the Four Golden Signals:

| Signal | Question It Answers | Good SLI? |
|---|---|---|
| Latency | How fast are we? | Yes — users feel slow responses |
| Traffic | How much demand? | No — high traffic isn't a problem |
| Errors | How many failures? | Yes — users experience failures |
| Saturation | How full are we? | Leading indicator, not an SLI directly |

Choosing SLIs for Different Service Types

| Service Type | Primary SLI | Secondary SLI | Why |
|---|---|---|---|
| User-facing API | Availability | Latency (P99) | Users abandon slow pages |
| Data pipeline | Correctness | Freshness | Wrong data is worse than slow data |
| Background job | Completion rate | Duration | Jobs must finish, speed matters less |
| Real-time stream | Latency (P50) | Availability | Every message must be fast |

For Task API (user-facing REST API), we focus on:

  1. Availability: 99.9% of requests return 2xx or 4xx (not 5xx)
  2. Latency: 95% of requests complete under 200ms
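
For the latency SLI, the calculation mirrors the availability query: count the requests that completed under the threshold and divide by all requests. A minimal PromQL sketch, assuming Task API exposes a standard Prometheus histogram named http_request_duration_seconds with a 0.2-second bucket (both the metric name and the bucket boundary are assumptions):

promql
# Latency SLI: fraction of requests completing under 200ms
# Assumes an http_request_duration_seconds histogram with an le="0.2" bucket
sum(rate(http_request_duration_seconds_bucket{service="task-api",le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="task-api"}[5m]))

Comparing this ratio against 0.95 tells you whether the "95% under 200ms" target is currently being met.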

Setting Realistic SLO Targets: The Downtime Table

The difference between 99.9% and 99.99% seems small — just 0.09%. But in operational terms, it's massive:

| SLO | Error Budget | Monthly Downtime | Yearly Downtime |
|---|---|---|---|
| 99% | 1% | 7.2 hours | 3.65 days |
| 99.5% | 0.5% | 3.6 hours | 1.83 days |
| 99.9% | 0.1% | 43.2 minutes | 8.76 hours |
| 99.95% | 0.05% | 21.6 minutes | 4.38 hours |
| 99.99% | 0.01% | 4.32 minutes | 52.6 minutes |
| 99.999% | 0.001% | 26 seconds | 5.26 minutes |
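
Every row follows from the same arithmetic: multiply the minutes in the period by the error budget fraction. For the 99.9% row:

Specification
Monthly: 30 days * 24 hours * 60 minutes * 0.001 = 43.2 minutes
Yearly: 365 days * 24 hours * 60 minutes * 0.001 = 525.6 minutes ≈ 8.76 hours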

The "nine" you can afford depends on your dependencies:

If your database has 99.9% availability and your cache has 99.9% availability, your maximum possible availability is approximately 99.8% (0.999 * 0.999 = 0.998). You cannot promise more reliability than your least reliable dependency.

Choosing the Right Target

| SLO | Appropriate For | Operational Reality |
|---|---|---|
| 99% | Internal tools, batch jobs | One 7-hour outage per month acceptable |
| 99.5% | Internal services | One 3.5-hour outage per month acceptable |
| 99.9% | User-facing APIs | 43 minutes/month — about one incident |
| 99.99% | Payment systems, auth | 4 minutes/month — near-zero tolerance |
| 99.999% | Life-critical systems | Requires massive redundancy investment |

For Task API, we choose 99.9% availability. This gives us 43.2 minutes of error budget per month — enough to deploy safely but requiring investigation when things go wrong.

Error Budgets: The Currency of Velocity

The error budget transforms reliability from a vague aspiration into a concrete resource:

Specification
Error Budget = 100% - SLO

For a 99.9% SLO:

  • Error budget = 0.1%
  • Monthly budget = 30 days * 24 hours * 60 minutes * 0.001 = 43.2 minutes

What Consumes Error Budget?

Every reliability failure consumes budget:

| Event | Downtime Equivalent | Budget Consumed |
|---|---|---|
| Deployment causes 1% error rate for 5 minutes | 5 min * 1% = 3 seconds | 0.12% |
| Complete outage for 10 minutes | 10 min * 100% = 10 minutes | 23.1% |
| 50% error rate for 20 minutes | 20 min * 50% = 10 minutes | 23.1% |
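
The arithmetic behind each row: convert the incident into equivalent full-outage time (duration times the error fraction), then divide by the 43.2-minute monthly budget. Taking the 10-minute complete outage as a worked example:

Specification
Downtime equivalent = 10 min * 100% = 10 minutes
Budget consumed = 10 / 43.2 = 0.231 ≈ 23.1% of the monthly budget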

Error Budget Policies

The power of error budgets comes from policy — agreed-upon actions based on budget state:

| Budget Remaining | Policy |
|---|---|
| > 50% | Ship features normally; experimental deployments allowed |
| 25–50% | Increased review for risky changes; prioritize reliability work |
| 10–25% | Feature freeze except critical fixes; focus on stability |
| < 10% | Emergency mode: no deployments except to improve reliability |
| Exhausted | Complete freeze until budget resets or SLO is renegotiated |

This makes reliability objective. Instead of arguing "Is this deployment safe?", you ask "Do we have enough budget left to absorb the impact if it fails?"
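
That check can even be automated. As a sketch (the gate itself is hypothetical), once the recording rules from the next section exist, a pre-deployment step could run one instant query and block the rollout when the result is 0:

promql
# Returns 1 when more than 25% of the 30-day error budget remains, 0 otherwise
task_api:error_budget_remaining:ratio30d > bool 0.25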

Implementing SLO Recording Rules in Prometheus

Recording rules pre-compute expensive queries so dashboards load instantly. Here's a complete SLO implementation for Task API:

yaml
# task-api-slo-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: monitoring
  labels:
    release: prometheus  # Required for kube-prometheus-stack
spec:
  groups:
    - name: task-api.slo.recording
      interval: 30s
      rules:
        # Total request rate
        - record: task_api:requests:rate5m
          expr: sum(rate(http_requests_total{service="task-api"}[5m]))
        # Error rate (5xx responses)
        - record: task_api:errors:rate5m
          expr: sum(rate(http_requests_total{service="task-api",status=~"5.."}[5m]))
        # Error ratio (SLI)
        - record: task_api:error_ratio:rate5m
          expr: |
            task_api:errors:rate5m
            /
            task_api:requests:rate5m
        # Availability (inverse of error ratio)
        - record: task_api:availability:rate5m
          expr: 1 - task_api:error_ratio:rate5m
        # Error budget consumption (30-day window)
        - record: task_api:error_budget_consumed:ratio30d
          expr: |
            (
              sum(rate(http_requests_total{service="task-api",status=~"5.."}[30d]))
              /
              sum(rate(http_requests_total{service="task-api"}[30d]))
            )
            / 0.001  # SLO error budget = 0.1%
        # Error budget remaining
        - record: task_api:error_budget_remaining:ratio30d
          expr: 1 - task_api:error_budget_consumed:ratio30d

Apply the rules:

bash
kubectl apply -f task-api-slo-rules.yaml

Output:

Specification
prometheusrule.monitoring.coreos.com/task-api-slo created

Verify the recording rules are working:

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

Then query in Prometheus UI (http://localhost:9090):

promql
task_api:availability:rate5m

Output:

Specification
task_api:availability:rate5m{} 0.9997
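
You can sanity-check the budget rules the same way:

promql
task_api:error_budget_remaining:ratio30d

Keep in mind that the 30-day ratios only become meaningful once Prometheus has accumulated enough history, and only if its retention actually covers the 30-day window.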

Building SLO Dashboards in Grafana

A good SLO dashboard answers three questions at a glance:

  1. Are we meeting our SLO right now?
  2. How much error budget have we consumed?
  3. At current burn rate, when will we exhaust budget?

Create the Dashboard

Create a ConfigMap with the Grafana dashboard JSON:

yaml
# task-api-slo-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: task-api-slo-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # Auto-discovery by Grafana sidecar
data:
  task-api-slo.json: |
    {
      "title": "Task API SLO Dashboard",
      "uid": "task-api-slo",
      "panels": [
        {
          "title": "Current Availability (SLO: 99.9%)",
          "type": "stat",
          "gridPos": {"x": 0, "y": 0, "w": 8, "h": 4},
          "targets": [{"expr": "task_api:availability:rate5m * 100", "legendFormat": "Availability"}],
          "fieldConfig": {
            "defaults": {
              "unit": "percent",
              "thresholds": {
                "steps": [
                  {"color": "red", "value": 99.5},
                  {"color": "yellow", "value": 99.9},
                  {"color": "green", "value": 99.95}
                ]
              }
            }
          }
        },
        {
          "title": "Error Budget Remaining",
          "type": "gauge",
          "gridPos": {"x": 8, "y": 0, "w": 8, "h": 4},
          "targets": [{"expr": "task_api:error_budget_remaining:ratio30d * 100"}],
          "fieldConfig": {
            "defaults": {
              "unit": "percent",
              "min": 0,
              "max": 100,
              "thresholds": {
                "steps": [
                  {"color": "red", "value": 0},
                  {"color": "yellow", "value": 25},
                  {"color": "green", "value": 50}
                ]
              }
            }
          }
        }
      ]
    }

Apply the dashboard:

bash
kubectl apply -f task-api-slo-dashboard.yaml

Output:

Specification
configmap/task-api-slo-dashboard created

Access Grafana and find the dashboard under "Task API SLO Dashboard".

Interpreting the Dashboard

| Panel | What It Shows | Action Triggers |
|---|---|---|
| Current Availability | Real-time SLI vs SLO | Green = meeting SLO; Red = investigate now |
| Error Budget Remaining | How much budget is left | < 25% = slow deployments; < 10% = freeze |
| Burn Rate | How fast you're consuming budget | > 6x = investigate; > 14.4x = page immediately |
| 30-Day Trend | Historical availability | Identifies patterns and recurring issues |
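
The dashboard manifest above defines only the first two panels; the burn-rate thresholds in this table are usually enforced by alerting rules rather than by a panel you have to watch (alerting and incident response get a full treatment in the next chapter). A minimal sketch, combining an inline one-hour error ratio with the five-minute recording rule from earlier; the rule and alert names are illustrative, and the 14.4x threshold corresponds to consuming roughly 2% of the 30-day budget per hour on a 99.9% SLO:

yaml
# task-api-slo-alerts.yaml (sketch, not part of the chapter's manifests)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: task-api.slo.alerts
      rules:
        # Page when the error budget is burning at more than 14.4x the sustainable rate
        - alert: TaskAPIHighErrorBudgetBurn
          expr: |
            (
              sum(rate(http_requests_total{service="task-api",status=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{service="task-api"}[1h]))
            ) > (14.4 * 0.001)
            and
            task_api:error_ratio:rate5m > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Task API is burning its error budget at more than 14.4x the sustainable rate"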

Reflect on Your Skill

Your observability-cost-engineer skill now understands SRE foundations. Test and improve it:

Test Your Skill

Ask your skill to help you define SLIs for a new service:

Specification
I'm deploying a new image processing API. Users upload images and receive processed results. What SLIs should I track, and what SLO targets would you recommend?

Expected behavior: The skill should recommend:

  • Availability SLI (successful uploads / total uploads)
  • Latency SLI (processing time for P95)
  • Possibly correctness SLI (if output quality is measurable)
  • SLO targets based on user-facing vs internal nature

Identify Gaps

Does your skill know:

  • The difference between SLI, SLO, and SLA?
  • How to calculate error budget from SLO percentage?
  • When to use availability vs latency as primary SLI?
  • How to write Prometheus recording rules for SLOs?

Improve Your Skill

Add SRE foundations knowledge:

Specification
Add to your skill: SLI/SLO/SLA definitions, error budget calculation (100% - SLO), the downtime table (99.9% = 43.2 min/month), and the four-level budget policy (>50% ship normally, 25-50% increased review, 10-25% feature freeze, <10% emergency mode).

Try With AI

Prompt 1: Calculate Error Budget Impact

Specification
I have a 99.9% availability SLO for my Task API. Yesterday we had an incident where 20% of requests failed for 15 minutes. Help me calculate:
1. How much error budget this consumed
2. What percentage of my monthly budget this represents
3. Whether I should slow down deployments based on current budget state

What you're learning: Connecting abstract SLO percentages to concrete operational decisions. Error budgets only matter when you use them to make choices about velocity vs stability.

Prompt 2: Design SLIs for Your Service

Specification
I'm building [describe your service: API, data pipeline, background job, etc.]. Help me design appropriate SLIs by asking:
- What do my users care about most?
- What failure modes would hurt them?
- What can I actually measure?
Then recommend 2-3 SLIs with specific metric calculations.

What you're learning: SLI selection is a design exercise, not a checklist. The right SLIs depend on what your users actually experience.

Prompt 3: Create Recording Rules

Specification
Given this metric: http_requests_total with labels {service, status, method, path}
Help me create Prometheus recording rules for:
1. Request rate by service
2. Error ratio (5xx / total)
3. 30-day availability percentage
4. Error budget remaining
Use the task_api prefix and explain why pre-computing these helps.

What you're learning: Recording rules are performance optimization with semantic value. They give names to important calculations and make dashboards instantaneous.

Safety Note

When implementing SLOs in production, start with a monitoring-only phase. Track your SLIs and proposed SLOs for 2-4 weeks before making policy decisions based on them. This prevents setting unrealistic targets that immediately trigger false emergencies.