USMAN’S INSIGHTS
AI ARCHITECT
Your AI Agents Are Deployed. Nobody Can See What They're Doing.


© 2026 Muhammad Usman Akbar. All rights reserved.


Build Your Observability Skill

Your Task API runs in Kubernetes. Users report slow responses, but you have no visibility. Is it the database? The network? A memory leak? Without observability, you're debugging in the dark.

This chapter teaches you to instrument applications, build dashboards, and manage cloud costs. But you won't start by reading documentation and hoping you remember the right PromQL syntax. You'll start by owning a skill that encodes production patterns from Prometheus, OpenTelemetry, Jaeger, Loki, and OpenCost.

When you face a production incident at 2am, you won't search Stack Overflow. You'll invoke your skill, and it will generate the exact queries, dashboards, and alerts you need. That's the difference between learning observability and owning observability.


Step 1: Get the Skills Lab

Download a fresh copy of the skills lab to ensure a clean starting point:

  1. Go to github.com/fistasolutions/claude-code-skills-lab
  2. Click the green Code button
  3. Select Download ZIP
  4. Extract the ZIP file
  5. Open the extracted folder in your terminal
```bash
cd claude-code-skills-lab
claude
```

Step 2: Write Your Learning Spec

Before building the skill, define what you want to accomplish. Create a file called LEARNING-SPEC.md:

```markdown
# Observability Learning Spec

## What I Want to Learn

- Monitor Kubernetes applications with Prometheus metrics
- Trace requests across services with OpenTelemetry and Jaeger
- Aggregate logs with Loki and query with LogQL
- Define SLOs and create multi-burn-rate alerts
- Track cloud costs with OpenCost

## Success Criteria

- I can deploy a full observability stack in under 30 minutes
- I can write PromQL queries for the 4 golden signals
- I can instrument a FastAPI app and see traces in Jaeger
- I can identify the top 3 cost drivers in my cluster
```

This spec guides both you and Claude on what the skill should cover.


Step 3: Fetch Official Documentation

Use the /fetching-library-docs skill (or Context7 directly) to gather authoritative sources:

Specification

Using your fetching-library-docs skill, gather official documentation for:

  1. Prometheus - metrics collection and PromQL
  2. OpenTelemetry Python SDK - instrumentation
  3. Grafana Loki - log aggregation and LogQL

I need patterns for Kubernetes monitoring, not just API references.

Claude fetches production-relevant patterns from official sources, not Stack Overflow answers from 2019.


Step 4: Create the Skill

Now build your observability skill with everything grounded in what you just fetched:

Specification

Using your skill creator skill, create a new skill for Kubernetes observability and cost engineering. I will use it to monitor applications from basic metrics to production SRE practices. Include:

  • Prometheus installation and ServiceMonitor configuration
  • PromQL patterns for the 4 golden signals
  • OpenTelemetry FastAPI instrumentation
  • Loki LogQL query patterns
  • SLO/error budget alerting
  • OpenCost cost allocation

Use the documentation you just fetched - no self-assumed knowledge.

Claude will:

  1. Reference the official docs it gathered
  2. Ask clarifying questions (sampling rates, retention, alert thresholds)
  3. Create the complete skill with tested patterns

Your skill appears at .claude/skills/observability-cost-engineer/.
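
If you want to inspect what was generated, a skill is just a directory of files. A plausible minimal layout is sketched below — the exact file names and frontmatter fields are an assumption for illustration; trust whatever the skill creator actually emits:

```text
.claude/skills/observability-cost-engineer/
├── SKILL.md        # frontmatter (name, description) plus core patterns
├── promql.md       # golden-signal queries and recording rules
└── references/     # distilled notes from the fetched documentation
```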


Step 5: Test Your Skill

Verify the skill works by asking it a question:

Specification
Using your observability-cost-engineer skill, write a PromQL query that shows the P95 latency for the task-api service over the last hour, broken down by endpoint.

If the skill returns a correct histogram_quantile query, it's working. If it hallucinates syntax, refine the skill with corrections.
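
For reference, a correct answer should look roughly like the query below. The metric name `http_request_duration_seconds_bucket` and the `endpoint` label are assumptions based on common Prometheus histogram conventions — substitute whatever labels your instrumentation actually exports:

```promql
histogram_quantile(
  0.95,
  sum by (le, endpoint) (
    rate(http_request_duration_seconds_bucket{service="task-api"}[1h])
  )
)
```

The `sum by (le, ...)` aggregation matters: `histogram_quantile` needs the `le` bucket label preserved, and any label you add alongside it becomes a breakdown dimension.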


What Happens Next

Each lesson in this chapter tests and improves your skill:

| Lesson | What You Learn | Skill Improvement |
| --- | --- | --- |
| L01 | Three Pillars of Observability | Add decision framework: metrics vs traces vs logs |
| L02 | Prometheus + PromQL | Add recording rules and ServiceMonitor templates |
| L03 | Grafana Dashboards | Add dashboard JSON templates for golden signals |
| L04 | OpenTelemetry + Jaeger | Add FastAPI instrumentation code and sampling config |
| L05 | Loki + LogQL | Add structured logging patterns and log-trace correlation |
| L06 | SLIs, SLOs, Error Budgets | Add SLO calculation formulas and budget burn rates |
| L07 | Alerting | Add multi-burn-rate PrometheusRule templates |
| L08 | Cost Engineering | Add OpenCost queries and right-sizing recommendations |
| L09 | Dapr Integration | Add Dapr observability Configuration CRD |
| L10 | Capstone | Complete integration testing and skill finalization |

By chapter end, your skill contains production-tested patterns for the entire observability stack.
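
The SLO and burn-rate material in lessons L06 and L07 reduces to arithmetic you can sanity-check yourself. A minimal sketch in plain Python — the 99.9% target and 30-day window are illustrative numbers, not values from this chapter:

```python
# Error-budget arithmetic behind SLOs and multi-burn-rate alerting.

def error_budget_fraction(slo: float) -> float:
    """Fraction of requests allowed to fail over the SLO window."""
    return 1.0 - slo

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the budget is consumed: 1.0 means exactly on budget."""
    return observed_error_rate / error_budget_fraction(slo)

def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """At a constant burn rate, when is the whole window's budget spent?"""
    return window_hours / rate

# A 99.9% SLO allows 0.1% errors, so a sustained 1% error rate burns
# the 30-day budget ten times too fast and exhausts it in three days.
rate_now = burn_rate(observed_error_rate=0.01, slo=0.999)
print(round(rate_now, 6))                       # 10.0
print(round(hours_to_exhaustion(rate_now), 1))  # 72.0
```

This is exactly why multi-burn-rate alerts exist: a burn rate of 10 deserves a page now, while a burn rate of 1.5 can wait for a ticket.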


Try With AI

Prompt 1: Explore the Three Pillars

Specification
I'm about to learn Kubernetes observability. Before diving into tools, help me understand the landscape. Ask me about a recent debugging experience where I wished I had more visibility. Based on that, explain which of the three pillars (metrics, traces, logs) would have helped most and why.

What you're learning: Self-assessment of current observability gaps through Socratic dialogue. Your skill will encode these patterns, but understanding the "why" helps you know when to apply each pillar.

Prompt 2: Validate Your Skill's Coverage

Specification
I just created an observability skill. Here's what it covers: [paste your skill's key sections]. Review this against production SRE requirements. What's missing? What would a senior SRE add? Ask me about my deployment environment so your recommendations are specific to my needs.

What you're learning: Skill validation through expert review. You're teaching your AI partner about your context so it can identify gaps specific to your situation.

Prompt 3: Plan Your Observability Stack

Specification
I'm deploying Task API to Kubernetes and need to choose observability tools. I know about Prometheus, Grafana, Jaeger, and Loki from my new skill. Help me prioritize: If I can only deploy two tools this week, which two give me the most value? Ask me about my team size, budget, and biggest operational pain points.

What you're learning: Strategic prioritization under constraints. Not every stack needs every tool. Your skill contains patterns for all; this exercise teaches you which patterns apply to your situation first.

Safety Note

Observability tools collect sensitive data about your applications and infrastructure. When configuring Prometheus scrape targets, Loki log streams, or OpenTelemetry traces, ensure you're not capturing secrets, PII, or credentials. The patterns in your skill include security considerations, but always validate against your organization's data handling policies.
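
As one concrete guard, you can scrub obvious secret-bearing fields before log records ever reach a handler. A minimal sketch using only the standard library — the key names in `SENSITIVE_KEYS` and the `task-api` logger name are illustrative; extend them to match your own payloads:

```python
import logging

SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}

class RedactingFilter(logging.Filter):
    """Replace values of sensitive keys in dict-shaped log arguments."""

    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(record.args, dict):
            record.args = {
                k: ("[REDACTED]" if k.lower() in SENSITIVE_KEYS else v)
                for k, v in record.args.items()
            }
        return True  # never drop the record, only scrub it

logger = logging.getLogger("task-api")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactingFilter())
logger.setLevel(logging.INFO)

# The token value is scrubbed before the formatter ever sees it,
# so it can never reach Loki or stdout.
logger.info("login user=%(user)s token=%(token)s",
            {"user": "alice", "token": "s3cr3t"})
```

The same idea applies at other layers: Loki pipeline stages can drop labels, and OpenTelemetry span processors can strip attributes, but redacting at the source is the only place a secret is guaranteed never to leave the process.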