Every engineering team faces the same fundamental tension: ship features faster or keep the system stable. Product managers want new functionality yesterday. Operations teams want zero incidents ever. Both goals seem reasonable. They cannot both be achieved at once.
Google's Site Reliability Engineering (SRE) practice resolved this tension with a powerful insight: reliability is a feature you can dial. Instead of arguing about whether a deployment is "safe enough," you measure exactly how much reliability you have, how much you've used, and how much you can afford to spend on velocity.
This dial has three components that work together:
The gap between your SLO and 100% is your error budget — the amount of unreliability you can afford while still meeting your target. This budget becomes currency: you spend it on deployments, experiments, and infrastructure changes. When it runs low, you slow down. When it's healthy, you move fast.
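To make the budget-as-currency idea concrete, here's a small sketch. The 30-day month and the incident durations are illustrative assumptions:

```python
# Error budget for a 99.9% SLO over a 30-day month (43,200 minutes),
# drawn down by incidents. Incident durations below are made-up examples.
budget = (1 - 0.999) * 30 * 24 * 60   # 43.2 minutes of allowed downtime

incidents = [12.0, 8.5]               # minutes of downtime from two incidents
remaining = budget - sum(incidents)

print(f"remaining budget: {remaining:.1f} min")  # prints: remaining budget: 22.7 min
```

When `remaining` approaches zero, that's the signal to trade velocity for stability.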
By the end of this lesson, you'll define SLIs for Task API, set realistic SLO targets, calculate error budgets, and implement recording rules in Prometheus to track it all.
These three terms are often confused, but they represent distinct layers of reliability thinking:
An SLI is a quantitative measure of some aspect of the service level you provide. It's raw measurement, not judgment. Common SLIs include:
For Task API, your primary SLI is availability:
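As a sketch, assuming the service exposes a standard `http_requests_total` counter with `job` and `status` labels (instrumentation details are assumptions, not confirmed by this lesson), the availability SLI in PromQL is the ratio of successful requests to all requests:

```promql
# Availability SLI: fraction of non-5xx responses over the last 5 minutes.
sum(rate(http_requests_total{job="task-api", status!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="task-api"}[5m]))
```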
Output:
An SLO is the target value for an SLI. It's your internal quality bar — the line between "we're healthy" and "we need to investigate." SLOs are strategic decisions, not technical ones.
Key insight: Your SLO should be slightly below what you typically achieve. If you measure 99.97% regularly, setting a 99.9% SLO gives you headroom. Setting 99.99% means you're always failing.
An SLA is a contractual commitment with consequences — typically financial penalties or service credits when you fail to meet it. SLAs are business decisions owned by legal and product, not engineering.
The critical relationship: SLA < SLO < typical performance
Why the buffer? When you breach your SLO, engineering investigates and fixes. When you breach your SLA, customers get money back. You want early warning before financial consequences.
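A tiny sketch makes the buffer concrete; the specific percentages here are illustrative assumptions, not Task API's real numbers:

```python
# Illustrative targets showing the SLA < SLO < typical-performance buffer.
typical = 0.9997   # what you actually measure month to month
slo = 0.999        # internal bar: breach -> engineering investigates
sla = 0.995        # contractual floor: breach -> customers get credits

assert sla < slo < typical  # the buffer gives early warning before penalties
```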
Not all metrics make good SLIs. A good SLI must be:
Google's SRE book recommends focusing on the Four Golden Signals:
For Task API (user-facing REST API), we focus on:
The difference between 99.9% and 99.99% seems small — just 0.09 percentage points. But in operational terms, it's massive:
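The arithmetic behind the comparison, assuming a 30-day month:

```python
# Downtime allowed per 30-day month and per year for common SLO "nines".
MONTH_MIN = 30 * 24 * 60        # 43,200 minutes in a 30-day month
YEAR_MIN = 365 * 24 * 60        # 525,600 minutes in a year

def downtime(slo: float, period_minutes: int) -> float:
    """Minutes of downtime the SLO allows over the period."""
    return (1 - slo) * period_minutes

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%}: {downtime(slo, MONTH_MIN):6.1f} min/month, "
          f"{downtime(slo, YEAR_MIN):7.1f} min/year")
```

Each added nine shrinks the budget tenfold: 99.9% allows 43.2 minutes per month, while 99.99% allows only 4.32.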
The "nine" you can afford depends on your dependencies:
If your database has 99.9% availability and your cache has 99.9% availability, your maximum possible availability is approximately 99.8% (0.999 * 0.999 = 0.998). You cannot promise more reliability than your least reliable dependency.
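For dependencies in series, the composite availability is the product of the individual availabilities — a quick sketch:

```python
# Maximum achievable availability of a service whose request path passes
# through several components in series: the product of their availabilities.
def serial_availability(*components: float) -> float:
    result = 1.0
    for availability in components:
        result *= availability
    return result

# Database at 99.9% and cache at 99.9% cap the service at ~99.8%.
print(f"{serial_availability(0.999, 0.999):.4f}")  # prints: 0.9980
```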
For Task API, we choose 99.9% availability. This gives us 43.2 minutes of error budget per 30-day month — enough to deploy safely, but small enough that incidents demand investigation.
The error budget transforms reliability from a vague aspiration into a concrete resource:
For a 99.9% SLO:
Every reliability failure consumes budget:
The power of error budgets comes from policy — agreed-upon actions based on budget state:
This makes reliability objective. Instead of arguing "Is this deployment safe?", you ask "Do we have budget to absorb the impact if it fails?"
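A budget policy can be expressed as a simple decision function. The thresholds and actions below are illustrative assumptions — each team agrees on its own:

```python
# Sketch of an error-budget policy: map the fraction of budget remaining
# to an agreed action. Thresholds here are illustrative, not canonical.
def budget_policy(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "freeze: reliability work only, no feature deployments"
    if remaining_fraction < 0.25:
        return "slow down: deploy with extra review and canaries"
    return "normal: ship at full velocity"

print(budget_policy(0.8))    # prints: normal: ship at full velocity
print(budget_policy(0.1))    # prints: slow down: deploy with extra review and canaries
print(budget_policy(-0.05))  # prints: freeze: reliability work only, no feature deployments
```

The value is that the policy is agreed on in advance, so the deploy/freeze decision is mechanical rather than a negotiation during an incident.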
Recording rules pre-compute expensive queries so dashboards load instantly. Here's a complete SLO implementation for Task API:
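As a sketch in Prometheus rule-file format — the metric names (`http_requests_total` with a `status` label), job label, and rule names are assumptions about the setup, not confirmed details:

```yaml
groups:
  - name: task-api-slo
    interval: 30s
    rules:
      # Availability SLI: fraction of non-5xx responses over 5 minutes.
      - record: task_api:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="task-api", status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="task-api"}[5m]))
      # Fraction of the 30-day error budget still remaining for a 99.9% SLO:
      # 1 - (observed unavailability / allowed unavailability).
      - record: task_api:error_budget:remaining_ratio
        expr: |
          1 - (
            (1 - avg_over_time(task_api:availability:ratio_rate5m[30d]))
            / (1 - 0.999)
          )
```

The `level:metric:operation` naming convention makes pre-computed series easy to find later.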
Apply the rules:
Output:
Verify the recording rules are working:
Then query in Prometheus UI (http://localhost:9090):
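For example, querying by recording-rule name (the names below follow the `level:metric:operation` convention and are hypothetical examples — use whatever names your rules define):

```promql
# Pre-computed availability SLI:
task_api:availability:ratio_rate5m

# Remaining error budget as a fraction (1.0 = untouched, 0 = exhausted):
task_api:error_budget:remaining_ratio
```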
Output:
A good SLO dashboard answers three questions at a glance:
Create a ConfigMap with the Grafana dashboard JSON:
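A minimal sketch of the ConfigMap shape — the `grafana_dashboard` label assumes the common sidecar-based dashboard discovery, and the names and panel JSON are illustrative placeholders, not the full dashboard:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: task-api-slo-dashboard
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana dashboard sidecar
data:
  task-api-slo.json: |
    {
      "title": "Task API SLO Dashboard",
      "panels": [
        {
          "title": "Availability (5m)",
          "type": "stat",
          "targets": [
            {"expr": "task_api:availability:ratio_rate5m"}
          ]
        }
      ]
    }
```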
Apply the dashboard:
Output:
Access Grafana and find the dashboard under "Task API SLO Dashboard".
Your observability-cost-engineer skill now understands SRE foundations. Test and improve it:
Ask your skill to help you define SLIs for a new service:
Expected behavior: The skill should recommend:
Does your skill know:
Add SRE foundations knowledge:
What you're learning: Connecting abstract SLO percentages to concrete operational decisions. Error budgets only matter when you use them to make choices about velocity vs stability.
What you're learning: SLI selection is a design exercise, not a checklist. The right SLIs depend on what your users actually experience.
What you're learning: Recording rules are performance optimization with semantic value. They give names to important calculations and make dashboards instantaneous.
When implementing SLOs in production, start with a monitoring-only phase. Track your SLIs and proposed SLOs for 2-4 weeks before making policy decisions based on them. This prevents setting unrealistic targets that immediately trigger false emergencies.