Every engineering team faces the same fundamental tension: ship features faster or keep the system stable. Product managers want new functionality yesterday. Operations teams want zero incidents ever. Both goals seem reasonable. They cannot both be achieved at once.
Google's Site Reliability Engineering (SRE) practice resolved this tension with a powerful insight: reliability is a feature you can dial. Instead of arguing about whether a deployment is "safe enough," you measure exactly how much reliability you have, how much you've used, and how much you can afford to spend on velocity.
This dial has three components that work together:
The gap between your SLO and 100% is your error budget — the amount of unreliability you can afford while still meeting your target. This budget becomes currency: you spend it on deployments, experiments, and infrastructure changes. When it runs low, you slow down. When it's healthy, you move fast.
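To make the budget-as-currency idea concrete, here's a small sketch. The 30-day month and the incident durations are illustrative assumptions:

```python
# Error budget for a 99.9% SLO over a 30-day month (43,200 minutes),
# drawn down by incidents. Incident durations below are made-up examples.
budget = (1 - 0.999) * 30 * 24 * 60   # 43.2 minutes of allowed downtime

incidents = [12.0, 8.5]               # minutes of downtime from two incidents
remaining = budget - sum(incidents)

print(f"remaining budget: {remaining:.1f} min")  # prints: remaining budget: 22.7 min
```

When `remaining` approaches zero, that's the signal to trade velocity for stability.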
By the end of this lesson, you'll define SLIs for Task API, set realistic SLO targets, calculate error budgets, and implement recording rules in Prometheus to track it all.
These three terms are often confused, but they represent distinct layers of reliability thinking:
An SLI is a quantitative measure of some aspect of the service level you provide. It's raw measurement, not judgment. Common SLIs include:
For Task API, your primary SLI is availability:
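As a sketch, assuming the service exposes a standard `http_requests_total` counter with `job` and `status` labels (instrumentation details are assumptions, not confirmed by this lesson), the availability SLI in PromQL is the ratio of successful requests to all requests:

```promql
# Availability SLI: fraction of non-5xx responses over the last 5 minutes.
sum(rate(http_requests_total{job="task-api", status!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="task-api"}[5m]))
```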
Output:
An SLO is the target value for an SLI. It's your internal quality bar — the line between "we're healthy" and "we need to investigate." SLOs are strategic decisions, not technical ones.
Key insight: Your SLO should be slightly below what you typically achieve. If you measure 99.97% regularly, setting a 99.9% SLO gives you headroom. Setting 99.99% means you're always failing.
An SLA is a contractual commitment with consequences — typically financial penalties or service credits when you fail to meet it. SLAs are business decisions owned by legal and product, not engineering.
The critical relationship: SLA < SLO < typical performance
Why the buffer? When you breach your SLO, engineering investigates and fixes. When you breach your SLA, customers get money back. You want early warning before financial consequences.
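A tiny sketch makes the buffer concrete; the specific percentages here are illustrative assumptions, not Task API's real numbers:

```python
# Illustrative targets showing the SLA < SLO < typical-performance buffer.
typical = 0.9997   # what you actually measure month to month
slo = 0.999        # internal bar: breach -> engineering investigates
sla = 0.995        # contractual floor: breach -> customers get credits

assert sla < slo < typical  # the buffer gives early warning before penalties
```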
Not all metrics make good SLIs. A good SLI must be:
Google's SRE book recommends focusing on the Four Golden Signals:
For Task API (user-facing REST API), we focus on:
The difference between 99.9% and 99.99% seems small — just 0.09 percentage points. But in operational terms, it's massive:
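The arithmetic behind the comparison, assuming a 30-day month:

```python
# Downtime allowed per 30-day month and per year for common SLO "nines".
MONTH_MIN = 30 * 24 * 60        # 43,200 minutes in a 30-day month
YEAR_MIN = 365 * 24 * 60        # 525,600 minutes in a year

def downtime(slo: float, period_minutes: int) -> float:
    """Minutes of downtime the SLO allows over the period."""
    return (1 - slo) * period_minutes

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%}: {downtime(slo, MONTH_MIN):6.1f} min/month, "
          f"{downtime(slo, YEAR_MIN):7.1f} min/year")
```

Each added nine shrinks the budget tenfold: 99.9% allows 43.2 minutes per month, while 99.99% allows only 4.32.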
The "nine" you can afford depends on your dependencies:
If your database has 99.9% availability and your cache has 99.9% availability, your maximum possible availability is approximately 99.8% (0.999 * 0.999 = 0.998). You cannot promise more reliability than your least reliable dependency.
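For dependencies in series, the composite availability is the product of the individual availabilities — a quick sketch:

```python
# Maximum achievable availability of a service whose request path passes
# through several components in series: the product of their availabilities.
def serial_availability(*components: float) -> float:
    result = 1.0
    for availability in components:
        result *= availability
    return result

# Database at 99.9% and cache at 99.9% cap the service at ~99.8%.
print(f"{serial_availability(0.999, 0.999):.4f}")  # prints: 0.9980
```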
For Task API, we choose 99.9% availability. This gives us 43.2 minutes of error budget per 30-day month — enough to deploy safely, but small enough that incidents demand investigation.
The error budget transforms reliability from a vague aspiration into a concrete resource:
For a 99.9% SLO:
Every reliability failure consumes budget:
The power of error budgets comes from policy — agreed-upon actions based on budget state:
This makes reliability objective. Instead of arguing "Is this deployment safe?", you ask "Do we have budget to absorb the impact if it fails?"
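A budget policy can be expressed as a simple decision function. The thresholds and actions below are illustrative assumptions — each team agrees on its own:

```python
# Sketch of an error-budget policy: map the fraction of budget remaining
# to an agreed action. Thresholds here are illustrative, not canonical.
def budget_policy(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "freeze: reliability work only, no feature deployments"
    if remaining_fraction < 0.25:
        return "slow down: deploy with extra review and canaries"
    return "normal: ship at full velocity"

print(budget_policy(0.8))    # prints: normal: ship at full velocity
print(budget_policy(0.1))    # prints: slow down: deploy with extra review and canaries
print(budget_policy(-0.05))  # prints: freeze: reliability work only, no feature deployments
```

The value is that the policy is agreed on in advance, so the deploy/freeze decision is mechanical rather than a negotiation during an incident.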
Recording rules pre-compute expensive queries so dashboards load instantly. Here's a complete SLO implementation for Task API:
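As a sketch in Prometheus rule-file format — the metric names (`http_requests_total` with a `status` label), job label, and rule names are assumptions about the setup, not confirmed details:

```yaml
groups:
  - name: task-api-slo
    interval: 30s
    rules:
      # Availability SLI: fraction of non-5xx responses over 5 minutes.
      - record: task_api:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="task-api", status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="task-api"}[5m]))
      # Fraction of the 30-day error budget still remaining for a 99.9% SLO:
      # 1 - (observed unavailability / allowed unavailability).
      - record: task_api:error_budget:remaining_ratio
        expr: |
          1 - (
            (1 - avg_over_time(task_api:availability:ratio_rate5m[30d]))
            / (1 - 0.999)
          )
```

The `level:metric:operation` naming convention makes pre-computed series easy to find later.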
Apply the rules:
Output:
Verify the recording rules are working:
Then query in Prometheus UI (http://localhost:9090):
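For example, querying by recording-rule name (the names below follow the `level:metric:operation` convention and are hypothetical examples — use whatever names your rules define):

```promql
# Pre-computed availability SLI:
task_api:availability:ratio_rate5m

# Remaining error budget as a fraction (1.0 = untouched, 0 = exhausted):
task_api:error_budget:remaining_ratio
```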
Output:
A good SLO dashboard answers three questions at a glance:
Create a ConfigMap with the Grafana dashboard JSON:
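A minimal sketch of the ConfigMap shape — the `grafana_dashboard` label assumes the common sidecar-based dashboard discovery, and the names and panel JSON are illustrative placeholders, not the full dashboard:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: task-api-slo-dashboard
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana dashboard sidecar
data:
  task-api-slo.json: |
    {
      "title": "Task API SLO Dashboard",
      "panels": [
        {
          "title": "Availability (5m)",
          "type": "stat",
          "targets": [
            {"expr": "task_api:availability:ratio_rate5m"}
          ]
        }
      ]
    }
```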
Apply the dashboard:
Output:
Access Grafana and find the dashboard under "Task API SLO Dashboard".
Your observability-cost-engineer skill now understands SRE foundations. Test and improve it:
Ask your skill to help you define SLIs for a new service:
Expected behavior: The skill should recommend:
Does your skill know:
Add SRE foundations knowledge:
What you're learning: Connecting abstract SLO percentages to concrete operational decisions. Error budgets only matter when you use them to make choices about velocity vs stability.
What you're learning: SLI selection is a design exercise, not a checklist. The right SLIs depend on what your users actually experience.
What you're learning: Recording rules are performance optimization with semantic value. They give names to important calculations and make dashboards instantaneous.
When implementing SLOs in production, start with a monitoring-only phase. Track your SLIs and proposed SLOs for 2-4 weeks before making policy decisions based on them. This prevents setting unrealistic targets that immediately trigger false emergencies.