It's 3am. Your phone buzzes with an alert: "Task API response time exceeds threshold." You open your laptop, eyes adjusting to the screen. The AI agent that handles customer task creation is responding slowly. Users are waiting. Revenue is at risk.
You need answers. Is the database slow? Is there a network issue? Is one specific request type causing problems? Is the AI inference service overloaded? Without observability, you're guessing. You might restart services hoping something fixes itself. You might check logs randomly, scrolling through thousands of lines. You might ping the database directly, but that doesn't explain the pattern.
Observability transforms this chaos into clarity. Instead of guessing, you look at metrics: "Response time jumped from 100ms to 2 seconds at 2:47am." You drill into traces: "Requests to the inference service are waiting 1.8 seconds for responses." You check logs: "Inference service shows 'GPU memory exhausted' warnings starting at 2:45am." In minutes, you understand the problem. The inference service needs more memory. You scale it, response times recover, and you go back to sleep.
This lesson teaches the conceptual foundation of observability: the three pillars (metrics, traces, logs), the four golden signals (latency, traffic, errors, saturation), and how to choose the right signal for different debugging scenarios.
AI applications are harder to debug than traditional software. A web server returns predictable responses: the same input produces the same output. AI agents are different. They interact with language models, make decisions based on context, and chain multiple service calls together. When something goes wrong, the failure mode is often subtle: the agent returns a valid response, but it's wrong or slow.
The debugging challenge: Your Task API agent receives a request to create a task. The request flows through:

- The API gateway
- The FastAPI service
- The AI inference service
- The database
If the response takes 5 seconds, which component caused the delay? Without observability, you'd add print statements, redeploy, test, and repeat. With observability, you query existing data: "Show me the latency breakdown for requests in the last hour."
The cost of blindness: Production systems without observability operate on hope. You hope the system is healthy. You hope no users are experiencing errors. When problems occur, you spend hours or days diagnosing. With observability, you know the system state at all times. Problems are detected in minutes, often before users notice.
Observability rests on three complementary data types. Each answers different questions. You need all three for complete visibility.
What it is: Metrics are numerical measurements collected over time. They're aggregated (summed, averaged, percentiled) and stored efficiently. A metric might be "request count per second" or "P95 latency in milliseconds."
Tool: Prometheus (with PromQL query language)
What it tells you: System-wide trends and thresholds. Metrics answer: "How many requests are we handling? What's the error rate? Are we running out of memory?"
Characteristics:

- Aggregated: you see trends across all requests, not individual requests
- Cheap to store and fast to query, even at high request volumes
- Ideal for dashboards, alerts, and long-term trend analysis
Example: Your Task API handles 1,000 requests per second. The P95 latency is 150ms. The error rate is 0.1%. These three metrics describe system health without storing every request detail.
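To make the aggregation concrete, here is a minimal stdlib-only sketch (not Prometheus; the `Request` and `summarize` names are hypothetical) that rolls raw request records up into the three health metrics from the example:

```python
from dataclasses import dataclass

@dataclass
class Request:
    """One completed request observed during a measurement window."""
    latency_ms: float
    status: int  # HTTP status code

def summarize(requests: list[Request], window_seconds: float) -> dict:
    """Aggregate raw requests into rate, error rate, and worst latency."""
    count = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rps": count / window_seconds,
        "error_rate": errors / count if count else 0.0,
        "max_latency_ms": max((r.latency_ms for r in requests), default=0.0),
    }

# Simulated window: 4 requests over 2 seconds, one 503 failure.
window = [Request(80, 200), Request(120, 200), Request(95, 200), Request(400, 503)]
print(summarize(window, window_seconds=2.0))
```

In a real deployment a Prometheus client library does this continuously and the server stores the results as time series; the point here is only that individual requests collapse into a few cheap numbers.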
What it is: A trace follows a single request through every service it touches. It's like a detailed receipt showing: "This request entered Service A at time T1, called Service B at time T2, waited for the database at T3."
Tool: Jaeger (with OpenTelemetry for instrumentation)
What it tells you: Why a specific request was slow or failed. Traces answer: "Where did this request spend its time? Which service failed?"
Characteristics:

- Per-request: each trace captures one request's full journey
- Composed of spans, one for each operation or service hop
- Usually sampled in high-traffic systems to control storage cost
Example: A user reports slow task creation. You search for their trace ID, which shows: API gateway (10ms), FastAPI service (20ms), Inference service (1,800ms), Database (15ms). The inference service took 1.8 seconds. That's your bottleneck.
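The bottleneck-finding step can be sketched in a few lines. This is a simplified model (a real Jaeger trace nests spans with start and end times; the flat `Span` shape here is an assumption for illustration):

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One hop of a trace: which service ran and for how long."""
    service: str
    duration_ms: float

def bottleneck(trace: list[Span]) -> Span:
    """Return the span where the request spent the most time."""
    return max(trace, key=lambda s: s.duration_ms)

# The trace from the example above, flattened to per-service durations.
trace = [
    Span("api-gateway", 10),
    Span("fastapi-service", 20),
    Span("inference-service", 1800),
    Span("database", 15),
]
slowest = bottleneck(trace)
total = sum(s.duration_ms for s in trace)
print(f"{slowest.service} took {slowest.duration_ms:.0f}ms of {total:.0f}ms total")
```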
What it is: Logs are timestamped text records of events. They capture what happened in human-readable (or structured JSON) form: "User 123 created task with title 'Fix bug' at 2025-12-30 03:47:12 UTC."
Tool: Loki (with LogQL query language)
What it tells you: Exactly what happened and why. Logs answer: "What error message did the service produce? What parameters did the request include?"
Characteristics:

- The most detailed pillar: full event context, error messages, request parameters
- Verbose and expensive to store at scale, so retention windows are usually limited
- Best queried narrowly: filtered by time range, service, and severity
Example: The trace shows the inference service was slow. You check logs for that time period: "GPU memory exhausted, falling back to CPU inference." Now you understand why: the GPU ran out of memory, forcing slower CPU processing.
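The "check logs for that time period" step amounts to filtering structured entries by time window and severity. A minimal sketch with stdlib JSON parsing (Loki's LogQL does this server-side; the log lines and the `around` helper here are illustrative assumptions):

```python
import json
from datetime import datetime, timedelta

raw_logs = """
{"ts": "2025-12-30T02:44:58Z", "level": "INFO", "msg": "inference request served"}
{"ts": "2025-12-30T02:45:03Z", "level": "WARNING", "msg": "GPU memory exhausted, falling back to CPU inference"}
{"ts": "2025-12-30T02:47:10Z", "level": "WARNING", "msg": "GPU memory exhausted, falling back to CPU inference"}
""".strip().splitlines()

def around(logs, center: datetime, minutes: int = 5):
    """Yield non-INFO entries within +/- `minutes` of the incident time."""
    lo, hi = center - timedelta(minutes=minutes), center + timedelta(minutes=minutes)
    for line in logs:
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["ts"].replace("Z", "+00:00"))
        if lo <= ts <= hi and entry["level"] != "INFO":
            yield entry

incident = datetime.fromisoformat("2025-12-30T02:47:00+00:00")
warnings = list(around(raw_logs, incident))
print(f"{len(warnings)} warnings near the incident: {warnings[0]['msg']}")
```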
The pillars aren't alternatives; they're layers of detail. You move from broad to specific:
The debugging workflow:

1. A metric alert fires: P95 latency exceeded its threshold at 2:47am.
2. You pull traces from that window: the inference service span dominates each slow request.
3. You read the inference service logs: "GPU memory exhausted" warnings explain the slowdown.
Each pillar adds context. Metrics told you something was wrong. Traces showed you where. Logs explained why.
Rule of thumb: metrics tell you that something is wrong, traces tell you where it went wrong, and logs tell you why.
Google's Site Reliability Engineering (SRE) book defines four signals that capture most of what matters about service health. These are the starting point for any observability strategy.
What it measures: How long requests take to complete.
Why it matters: Users experience latency directly. A service with 0% errors but 10-second response times is broken from the user's perspective.
What to track:

- P50 (median), P95, and P99 latency, not just the average
- The latency of successful requests separately from failed requests, since fast failures can mask slow successes
Example: Task API P50 is 80ms, P95 is 150ms, P99 is 800ms. The P99 is high—1% of users wait almost a second. Worth investigating.
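Percentiles are just positions in the sorted latency distribution. A rough nearest-rank sketch (real metrics backends estimate percentiles from histograms instead of sorting raw samples):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# 100 simulated latencies: mostly fast, with the slow 1% tail from the example.
latencies = [80.0] * 50 + [120.0] * 45 + [800.0] * 5
print(percentile(latencies, 50), percentile(latencies, 95), percentile(latencies, 99))
```

Notice how the average (about 134ms here) hides the tail entirely, while P99 exposes it: that is why the text recommends tracking percentiles rather than means.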
What it measures: How much demand the system is handling.
Why it matters: Traffic correlates with resource usage, revenue, and capacity planning. A sudden traffic spike explains why latency increased.
What to track:

- Requests per second (RPS)
- Traffic broken down by endpoint or request type, so spikes can be attributed to a source
Example: Task API normally handles 500 RPS. At 2:45am, traffic spiked to 2,000 RPS due to a scheduled batch job. The inference service couldn't handle the load, causing latency to spike.
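Detecting a spike like this is a comparison against a known baseline. A toy sketch (the bucket sizes and the 2x threshold are illustrative assumptions, not a standard):

```python
from collections import Counter

def rps_per_minute(timestamps: list[int]) -> dict[int, float]:
    """Bucket request timestamps (epoch seconds) into per-minute RPS."""
    buckets = Counter(ts // 60 for ts in timestamps)
    return {minute: count / 60 for minute, count in buckets.items()}

def spikes(series: dict[int, float], baseline: float, factor: float = 2.0) -> list[int]:
    """Minutes where traffic exceeded `factor` times the normal baseline."""
    return [m for m, rps in series.items() if rps > factor * baseline]

# Minute 0: 60 requests (1 RPS, normal). Minute 1: 300 requests (5 RPS, spike).
timestamps = [i % 60 for i in range(60)] + [60 + (i % 60) for i in range(300)]
print(spikes(rps_per_minute(timestamps), baseline=1.0))
```

In PromQL the equivalent idea is a `rate()` over a counter compared against historical values; the mechanics above are only the intuition.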
What it measures: How many requests fail.
Why it matters: Errors are the clearest signal that something is wrong. A 5% error rate means 1 in 20 users experiences a failure.
What to track:

- Error rate as a percentage of total requests
- Errors broken down by status code and by the service that produced them
Example: Task API error rate jumped from 0.1% to 5% at 2:45am. All errors are 503 (Service Unavailable) from the inference service. The inference service is the problem.
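The "all errors are 503" conclusion comes from grouping failures by status code. A minimal sketch of that breakdown (function name is hypothetical):

```python
from collections import Counter

def error_breakdown(statuses: list[int]) -> tuple[float, Counter]:
    """Return the overall error rate and a count of failing status codes."""
    errors = [s for s in statuses if s >= 500]
    return len(errors) / len(statuses), Counter(errors)

# 100 responses: 95 succeed, 5 fail with 503 from the inference service.
statuses = [200] * 95 + [503] * 5
rate, by_code = error_breakdown(statuses)
print(f"error rate {rate:.1%}, breakdown {dict(by_code)}")
```

A single dominant code pointing at a single upstream service, as here, usually localizes the fault faster than the raw rate alone.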
What it measures: How "full" the system is—how close resources are to their limits.
Why it matters: Saturation is a leading indicator. Before latency spikes or errors occur, saturation shows you're approaching limits.
What to track:

- CPU, memory, and GPU utilization
- Queue depths, connection pool usage, and disk capacity
Example: Inference service CPU is at 95%, memory at 88%. The service is saturated. Adding more traffic will cause failures. You need to scale before that happens.
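A saturation alert is a threshold check over resource utilization. A sketch using the numbers from the example (the 80% alerting limit is a common convention, not a rule):

```python
def saturation_overrun(utilization: dict[str, float], limit: float = 0.8) -> dict[str, float]:
    """Return resources running hotter than the alerting limit, and by how much."""
    return {name: round(used - limit, 2)
            for name, used in utilization.items() if used > limit}

# Utilization from the example: CPU 95%, memory 88%, disk comfortably low.
inference = {"cpu": 0.95, "memory": 0.88, "disk": 0.40}
print(saturation_overrun(inference))
```

Because saturation leads errors and latency, this is typically the signal you alert on proactively, before users feel anything.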
When a problem occurs, which signal do you examine first?
The investigation pattern: check errors first (the clearest failure signal), then latency (what users actually feel). If either looks wrong, check traffic (did demand change?) and saturation (did a resource hit its limit?) to find the cause.
Before we deploy observability tools in later lessons, internalize this framework:
Metrics are the dashboard: They tell you if the system is healthy at a glance. They power alerts. They track trends over weeks and months.
Traces are the debugger: They follow individual requests through the system. They show you exactly where time is spent.
Logs are the source code of events: They capture the details that explain behavior. They're the final level of detail when metrics and traces point to a component.
The 4 Golden Signals are your starting questions:

- Latency: how long do requests take?
- Traffic: how much demand are we handling?
- Errors: how many requests are failing?
- Saturation: how close are resources to their limits?
These prompts help you apply observability concepts to your own projects.
Prompt 1: Observability Strategy
What you're learning: How to plan observability before implementation. AI helps you think through which signals matter for your specific architecture.
Prompt 2: Debugging with Signals
What you're learning: How to apply the debugging workflow systematically. AI demonstrates the thought process of an SRE diagnosing performance issues.
Prompt 3: Signal Selection
What you're learning: How to quickly triage production issues. The right starting point saves debugging time.
Safety note: Observability systems collect sensitive data (request parameters, user IDs, error details). In production, configure retention policies, access controls, and data masking. Never expose observability dashboards publicly.
You built an observability-cost-engineer skill in Lesson 0. Test and improve it based on what you learned.
Ask yourself:
If you found gaps: