Your Task API runs in Kubernetes. Users report slow responses, but you have no visibility. Is it the database? The network? A memory leak? Without observability, you're debugging in the dark.
This chapter teaches you to instrument applications, build dashboards, and manage cloud costs. But you won't start by reading documentation and hoping you remember the right PromQL syntax. You'll start by owning a skill that encodes production patterns from Prometheus, OpenTelemetry, Jaeger, Loki, and OpenCost.
When you face a production incident at 2 a.m., you won't search Stack Overflow. You'll invoke your skill, and it will generate the queries, dashboards, and alerts you need. That's the difference between learning observability and owning observability.
Clone a fresh skills lab to ensure a clean starting point:
Before building the skill, define what you want to accomplish. Create a file called LEARNING-SPEC.md:
This spec guides both you and Claude on what the skill should cover.
Use the /fetching-library-docs skill (or Context7 directly) to gather authoritative sources:
Claude fetches production-relevant patterns from official sources, not Stack Overflow answers from 2019.
Now build your observability skill, grounding everything in what you just fetched:
Claude will:
Your skill appears at .claude/skills/observability-cost-engineer/.
Verify the skill works by asking it a question:
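If the skill returns a correct histogram_quantile query (typically the function applied to a rate() over the histogram's _bucket series, aggregated so that the le label is preserved), it's working. If it hallucinates syntax, refine the skill with corrections and ask again.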
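To judge whether a returned query is plausible, it helps to know what histogram_quantile computes. The Python sketch below approximates the estimate for classic histograms: find the bucket containing the target rank, then interpolate linearly inside it. It is an illustration under stated assumptions, not Prometheus's implementation: it assumes non-negative bucket bounds, and the metric name in the comment and the sample bucket counts are hypothetical.

```python
import math

# Illustrative PromQL this sketch mirrors (metric name is hypothetical):
#   histogram_quantile(0.95,
#     sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes non-negative upper bounds and a final bucket with le=+Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    # Rank of the observation we are looking for.
    rank = q * total
    # First bucket whose cumulative count reaches that rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)

    # Rank falls in the +Inf bucket: report the highest finite bound.
    if idx == len(buckets) - 1:
        return buckets[-2][0]

    upper, count = buckets[idx]
    # The lowest bucket is assumed to start at 0.
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical per-second bucket rates: le=0.1, 0.5, 1.0, +Inf
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

With these sample buckets the p95 estimate lands about halfway through the 0.5 to 1.0 bucket. The interpolation is why bucket boundaries matter: the estimate can never be more precise than the bucket the rank falls into.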
Each lesson in this chapter tests and improves your skill:
By chapter end, your skill contains production-tested patterns for the entire observability stack.
What you're learning: Self-assessment of current observability gaps through Socratic dialogue. Your skill will encode these patterns, but understanding the "why" helps you know when to apply each pillar.
What you're learning: Skill validation through expert review. You're teaching your AI partner about your context so it can identify gaps specific to your situation.
What you're learning: Strategic prioritization under constraints. Not every stack needs every tool. Your skill contains patterns for all of them; this exercise teaches you which patterns apply to your situation first.
Observability tools collect sensitive data about your applications and infrastructure. When configuring Prometheus scrape targets, Loki log streams, or OpenTelemetry traces, ensure you're not capturing secrets, PII, or credentials. The patterns in your skill include security considerations, but always validate against your organization's data handling policies.
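One way to act on this is to scrub log lines before they leave the application. The sketch below is a minimal Python example, assuming regex-based redaction at the logging layer; the pattern list and the redact helper are hypothetical and deliberately incomplete, so treat them as a starting point to adapt to your own data handling policy rather than a complete filter.

```python
import re

# Hypothetical patterns; extend them to match your organization's policy.
REDACTION_PATTERNS = [
    # key=value or key: value pairs with sensitive-looking keys
    (
        re.compile(
            r"\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+",
            re.IGNORECASE,
        ),
        r"\1\2[REDACTED]",
    ),
    # HTTP bearer credentials
    (
        re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
        "Bearer [REDACTED]",
    ),
    # Email addresses, as a simple stand-in for PII
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
]


def redact(line: str) -> str:
    """Return the log line with every matching pattern replaced."""
    for pattern, replacement in REDACTION_PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(redact("login failed user=ana@example.com password=hunter2"))
print(redact("Authorization: Bearer abc.def-123"))
```

Redacting in the application keeps sensitive values out of every downstream system at once. Pattern matching will still miss secrets that do not follow a recognizable shape, so it complements rather than replaces avoiding logging them in the first place.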