USMAN’S INSIGHTS
AI ARCHITECT
Having the Tools Installed Is Not the Same as Being Observable



© 2026 Muhammad Usman Akbar. All rights reserved.

Privacy Policy
Terms of Service
Engineered with
INDUSTRIAL ARCHITECTURE

Capstone: Full Observability Stack for Task API

You've learned each piece of the observability puzzle across this chapter: Prometheus for metrics, Grafana for visualization, OpenTelemetry and Jaeger for tracing, Loki for logging, SLOs and error budgets for reliability, alerting for incident response, OpenCost for FinOps, and Dapr integration patterns. Now you bring them together.

This capstone deploys a complete, production-ready observability stack for Task API. By the end, you'll have:

  • Metrics: Prometheus collecting application and infrastructure metrics
  • Visualization: Grafana dashboards showing the four golden signals
  • Tracing: Jaeger receiving distributed traces from OpenTelemetry
  • Logging: Loki aggregating structured logs with trace correlation
  • SLOs: 99.9% availability and P95 latency targets with error budget tracking
  • Alerting: Multi-burn-rate alerts that page when SLO is at risk
  • Cost: OpenCost showing resource costs by team and service

This is the observability infrastructure your Digital FTE products need in production. Every AI agent you deploy deserves this level of visibility.

Step 1: Deploy Complete Observability Stack via Helm

Start by deploying all observability components. This is the infrastructure layer that receives telemetry from your applications.

Stack Architecture

┌───────────────────────────────────────────────────────────────────┐
│                        Kubernetes Cluster                         │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐    ┌─────────────┐    ┌──────────────┐              │
│  │ Task API │───►│ Prometheus  │◄───│ServiceMonitor│              │
│  │ /metrics │    │   (TSDB)    │    │    (CRD)     │              │
│  └──────────┘    └──────┬──────┘    └──────────────┘              │
│                         │                                         │
│                  ┌──────▼──────┐                                  │
│                  │   Grafana   │ ◄── Dashboards + Alerts          │
│                  │ (Visualize) │                                  │
│                  └─────────────┘                                  │
│                                                                   │
│  ┌────────────┐  ┌────────────┐                                   │
│  │  Task API  │─►│   Jaeger   │ ◄── Trace Analysis                │
│  │  (traces)  │  │(Collector) │                                   │
│  └────────────┘  └────────────┘                                   │
│                                                                   │
│  ┌────────────┐  ┌────────────┐                                   │
│  │  Task API  │─►│    Loki    │ ◄── Log Aggregation               │
│  │   (logs)   │  │ + Promtail │                                   │
│  └────────────┘  └────────────┘                                   │
│                                                                   │
│  ┌─────────────┐                                                  │
│  │  OpenCost   │ ◄── Cost Allocation by Namespace/Team            │
│  │  (FinOps)   │                                                  │
│  └─────────────┘                                                  │
└───────────────────────────────────────────────────────────────────┘

Install All Helm Repositories

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

Output:

"prometheus-community" has been added to your repositories
"grafana" has been added to your repositories
"jaegertracing" has been added to your repositories
"opencost" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
Update Complete. Happy Helming!

Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)

bash
kubectl create namespace monitoring

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=observability-demo \
  --set prometheus.prometheusSpec.retention=7d

Output:

NAME: prometheus
LAST DEPLOYED: Mon Dec 30 10:00:00 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Install Loki for Logging

bash
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

Output:

NAME: loki
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Install Jaeger for Tracing

bash
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true \
  --set query.ingress.enabled=false

Output:

NAME: jaeger
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Install OpenCost for Cost Monitoring

bash
helm install opencost opencost/opencost \
  --namespace monitoring \
  --set prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus \
  --set prometheus.internal.namespaceName=monitoring

Output:

NAME: opencost
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Verify All Components Running

bash
kubectl get pods -n monitoring

Output:

NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          3m
jaeger-agent-daemonset-xxxxx                             1/1     Running   0          2m
jaeger-collector-yyyyy                                   1/1     Running   0          2m
jaeger-query-zzzzz                                       1/1     Running   0          2m
loki-0                                                   1/1     Running   0          2m
loki-promtail-xxxxx                                      1/1     Running   0          2m
opencost-yyyyy                                           1/1     Running   0          1m
prometheus-grafana-xxxxx                                 3/3     Running   0          3m
prometheus-kube-prometheus-operator-yyyyy                1/1     Running   0          3m
prometheus-kube-state-metrics-zzzzz                      1/1     Running   0          3m
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          3m

All components are running. The observability infrastructure is ready.

Step 2: Instrument Task API with Metrics, Traces, and Logs

With the stack deployed, instrument Task API to emit telemetry.

Application Dependencies

# requirements.txt
fastapi>=0.109.0
uvicorn>=0.25.0
prometheus-client>=0.19.0
opentelemetry-api>=1.22.0
opentelemetry-sdk>=1.22.0
opentelemetry-instrumentation-fastapi>=0.43b0
opentelemetry-exporter-otlp>=1.22.0
structlog>=24.1.0

Complete Instrumented Application

python
# main.py - Task API with full observability
import time
from contextlib import asynccontextmanager

import structlog
from fastapi import FastAPI, Request, Response, HTTPException
from pydantic import BaseModel
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure structured logging with trace correlation
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)
logger = structlog.get_logger()

# Configure tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
    endpoint="jaeger-collector.monitoring:4317",
    insecure=True
)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))
tracer = trace.get_tracer(__name__)

# Define Prometheus metrics
REQUEST_COUNT = Counter(
    "task_api_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "task_api_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
TASK_OPERATIONS = Counter(
    "task_api_operations_total",
    "Task operations count",
    ["operation", "status"]
)

# In-memory task store (replace with a database in production)
tasks: dict = {}


class Task(BaseModel):
    title: str
    priority: str = "medium"
    completed: bool = False


class TaskResponse(BaseModel):
    id: str
    title: str
    priority: str
    completed: bool


@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("task_api_starting", version="1.0.0")
    yield
    logger.info("task_api_shutting_down")


app = FastAPI(title="Task API", lifespan=lifespan)
FastAPIInstrumentor.instrument_app(app)


@app.middleware("http")
async def observability_middleware(request: Request, call_next):
    """Add metrics and logging to every request"""
    start_time = time.time()
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, "032x") if span else "no-trace"

    response = await call_next(request)
    latency = time.time() - start_time

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    logger.info(
        "http_request",
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        latency_ms=round(latency * 1000, 2),
        trace_id=trace_id
    )
    return response


@app.get("/health")
async def health_check():
    return {"status": "healthy", "version": "1.0.0"}


@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)


@app.post("/tasks", response_model=TaskResponse, status_code=201)
async def create_task(task: Task):
    with tracer.start_as_current_span("create_task") as span:
        task_id = f"task-{len(tasks) + 1}"
        span.set_attribute("task.id", task_id)
        span.set_attribute("task.priority", task.priority)
        tasks[task_id] = {
            "id": task_id,
            "title": task.title,
            "priority": task.priority,
            "completed": task.completed
        }
        TASK_OPERATIONS.labels(operation="create", status="success").inc()
        logger.info("task_created", task_id=task_id, priority=task.priority)
        return TaskResponse(**tasks[task_id])


@app.get("/tasks/{task_id}", response_model=TaskResponse)
async def get_task(task_id: str):
    with tracer.start_as_current_span("get_task") as span:
        span.set_attribute("task.id", task_id)
        if task_id not in tasks:
            TASK_OPERATIONS.labels(operation="get", status="not_found").inc()
            logger.warning("task_not_found", task_id=task_id)
            raise HTTPException(status_code=404, detail="Task not found")
        TASK_OPERATIONS.labels(operation="get", status="success").inc()
        return TaskResponse(**tasks[task_id])


@app.put("/tasks/{task_id}/complete")
async def complete_task(task_id: str):
    with tracer.start_as_current_span("complete_task") as span:
        span.set_attribute("task.id", task_id)
        if task_id not in tasks:
            TASK_OPERATIONS.labels(operation="complete", status="not_found").inc()
            raise HTTPException(status_code=404, detail="Task not found")
        tasks[task_id]["completed"] = True
        TASK_OPERATIONS.labels(operation="complete", status="success").inc()
        logger.info("task_completed", task_id=task_id)
        return {"status": "completed", "task_id": task_id}

Output (application logs on startup):

json
{"event": "task_api_starting", "version": "1.0.0", "level": "info", "timestamp": "2025-12-30T10:10:00Z"}
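Because the middleware stamps every log line with the active trace_id, logs can be joined to traces mechanically, which is what Loki's trace correlation relies on later. A stdlib-only sketch of that join (the sample log lines below are illustrative, not real output):

```python
import json

# Illustrative JSON log lines, shaped like the structlog output above
lines = [
    '{"event": "http_request", "path": "/tasks", "status": 201, "trace_id": "abc123"}',
    '{"event": "task_created", "task_id": "task-1", "trace_id": "abc123"}',
    '{"event": "http_request", "path": "/health", "status": 200, "trace_id": "def456"}',
]

def logs_for_trace(raw_lines, trace_id):
    """Parse JSON log lines and keep only those belonging to one trace."""
    parsed = (json.loads(line) for line in raw_lines)
    return [event for event in parsed if event.get("trace_id") == trace_id]

# Both log lines from the /tasks request share one trace_id
assert len(logs_for_trace(lines, "abc123")) == 2
```

This is exactly the pivot you perform manually in Grafana: copy a trace_id from a Jaeger span, paste it into a Loki query.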

Kubernetes Deployment with Observability Labels

yaml
# task-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: default
  labels:
    app: task-api
    cost-center: platform
    team: agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
        cost-center: platform
        team: agents
    spec:
      containers:
        - name: task-api
          image: ghcr.io/fistasolutions/task-api:1.0.0
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://jaeger-collector.monitoring:4317"
            - name: OTEL_SERVICE_NAME
              value: "task-api"
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: task-api
  namespace: default
  labels:
    app: task-api
spec:
  selector:
    app: task-api
  ports:
    - port: 8000
      targetPort: 8000
      name: http

ServiceMonitor for Prometheus

yaml
# task-api-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: task-api
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: task-api
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
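Prometheus discovers the scrape target through label matching: every key/value pair in the ServiceMonitor's matchLabels must appear on the Service. A minimal sketch of that subset check, with labels taken from the manifests above:

```python
def selector_matches(match_labels: dict, resource_labels: dict) -> bool:
    """True when every selector key/value pair appears on the resource,
    mirroring how a label selector matches a Kubernetes object."""
    return all(resource_labels.get(k) == v for k, v in match_labels.items())

selector = {"app": "task-api"}                        # spec.selector.matchLabels
service_labels = {"app": "task-api"}                  # labels on the Service
assert selector_matches(selector, service_labels)     # discovered
assert not selector_matches(selector, {"app": "other-api"})  # ignored
```

The same logic explains the `release: prometheus` label on the ServiceMonitor itself: the kube-prometheus-stack operator selects ServiceMonitors by that label (relaxed here by the `serviceMonitorSelectorNilUsesHelmValues=false` flag from Step 1).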

Apply the manifests:

bash
kubectl apply -f task-api-deployment.yaml
kubectl apply -f task-api-servicemonitor.yaml

Output:

deployment.apps/task-api created
service/task-api created
servicemonitor.monitoring.coreos.com/task-api created

Step 3: Define SLOs for Task API

Define Service Level Objectives that matter for a task management API.

SLO Targets

SLI             SLO Target   Error Budget (30 days)
Availability    99.9%        43.2 minutes of downtime
Latency (P95)   < 200ms      0.1% of requests may exceed 200ms
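The 43.2-minute figure follows directly from the SLO arithmetic. A quick sketch in plain Python, using the numbers from the table above:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% availability over 30 days allows 43.2 minutes of downtime
assert abs(error_budget_minutes(0.999) - 43.2) < 1e-6
```

Tightening the SLO to 99.99% would shrink the budget to about 4.3 minutes a month, which is why targets beyond 99.9% demand serious automation.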

PrometheusRule for SLO Recording and Alerting

yaml
# task-api-slo-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: task-api-slo-recording
      interval: 30s
      rules:
        # Availability SLI: successful requests / total requests
        - record: task_api:availability:5m
          expr: |
            sum(rate(task_api_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(task_api_requests_total[5m]))
        # Latency SLI: requests under 200ms / total requests
        - record: task_api:latency_sli:5m
          expr: |
            sum(rate(task_api_request_duration_seconds_bucket{le="0.2"}[5m]))
            /
            sum(rate(task_api_request_duration_seconds_count[5m]))
        # Error budget burn rate (5-minute window)
        - record: task_api:error_budget_burn_rate:5m
          expr: 1 - task_api:availability:5m
        # 1-hour burn rate for alerting
        - record: task_api:error_budget_burn_rate:1h
          expr: |
            1 - (
              sum(rate(task_api_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(task_api_requests_total[1h]))
            )
    - name: task-api-slo-alerts
      rules:
        # Fast burn: 2% of the monthly budget in 1 hour (14.4x burn rate)
        - alert: TaskAPIHighErrorBudgetBurn
          expr: |
            task_api:error_budget_burn_rate:5m > (14.4 * 0.001)
            and
            task_api:error_budget_burn_rate:1h > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            service: task-api
          annotations:
            summary: "Task API burning error budget rapidly"
            description: "Error rate {{ $value | humanizePercentage }} is consuming budget at 14.4x normal rate."
            runbook_url: "https://runbooks.example.com/task-api-high-error-rate"
        # Slow burn: sustained 2x burn rate (exhausts the monthly budget in ~15 days)
        - alert: TaskAPIElevatedErrorBudgetBurn
          expr: task_api:error_budget_burn_rate:1h > (2 * 0.001)
          for: 30m
          labels:
            severity: warning
            service: task-api
          annotations:
            summary: "Task API error budget consumption elevated"
            description: "Error rate is elevated. Investigate before it becomes critical."
        # Latency SLO breach
        - alert: TaskAPILatencySLOBreach
          expr: task_api:latency_sli:5m < 0.999
          for: 10m
          labels:
            severity: warning
            service: task-api
          annotations:
            summary: "Task API P95 latency exceeding 200ms"
            description: "{{ $value | humanizePercentage }} of requests complete under 200ms (target: 99.9%)"
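The 14.4x and 2x thresholds in these rules come from simple budget arithmetic: a burn rate is a multiple of the error rate the SLO budgets for, so the budget fraction consumed is burn rate × window ÷ SLO period. A sketch of that arithmetic for a 30-day period:

```python
def budget_consumed(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the total error budget consumed by sustaining
    `burn_rate` times the budgeted error rate for `hours` hours."""
    return burn_rate * hours / (window_days * 24)

# Fast burn: 14.4x for one hour eats 2% of the 30-day budget -> page someone
assert abs(budget_consumed(14.4, 1) - 0.02) < 1e-9

# At a sustained 14.4x burn, the entire budget is gone in 50 hours
assert abs((30 * 24) / 14.4 - 50.0) < 1e-9
```

The alert expressions multiply these rates by 0.001 (the budgeted error rate for a 99.9% SLO) to get absolute error-rate thresholds: 1.44% for the fast burn, 0.2% for the slow burn.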

Apply the rules:

bash
kubectl apply -f task-api-slo-rules.yaml

Output:

prometheusrule.monitoring.coreos.com/task-api-slo created

Verify rules are loaded:

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s localhost:9090/api/v1/rules | jq '.data.groups[].name' | grep task-api

Output:

"task-api-slo-recording"
"task-api-slo-alerts"

Step 4: Create Task API SLO Dashboard in Grafana

Create a comprehensive dashboard showing availability, latency, error budget, and golden signals.

Dashboard JSON

json
{
  "title": "Task API SLO Dashboard",
  "uid": "task-api-slo",
  "timezone": "browser",
  "panels": [
    {
      "title": "Availability (SLO: 99.9%)",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
      "targets": [{"expr": "task_api:availability:5m * 100", "legendFormat": "Availability %"}],
      "fieldConfig": {
        "defaults": {
          "min": 99, "max": 100, "unit": "percent",
          "thresholds": {"steps": [
            {"value": 99.9, "color": "green"},
            {"value": 99.5, "color": "yellow"},
            {"value": 0, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "P95 Latency (SLO: <200ms)",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
      "targets": [{"expr": "histogram_quantile(0.95, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P95 Latency (ms)"}],
      "fieldConfig": {
        "defaults": {
          "min": 0, "max": 500, "unit": "ms",
          "thresholds": {"steps": [
            {"value": 200, "color": "green"},
            {"value": 300, "color": "yellow"},
            {"value": 400, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Error Budget Remaining",
      "type": "stat",
      "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
      "targets": [{"expr": "(1 - ((1 - task_api:availability:5m) / 0.001)) * 100", "legendFormat": "Budget %"}],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {"steps": [
            {"value": 50, "color": "green"},
            {"value": 20, "color": "yellow"},
            {"value": 0, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Error Budget Burn Rate",
      "type": "stat",
      "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0},
      "targets": [{"expr": "task_api:error_budget_burn_rate:1h / 0.001", "legendFormat": "Burn Rate (x normal)"}],
      "fieldConfig": {
        "defaults": {
          "thresholds": {"steps": [
            {"value": 1, "color": "green"},
            {"value": 2, "color": "yellow"},
            {"value": 14.4, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
      "targets": [{"expr": "sum(rate(task_api_requests_total[5m]))", "legendFormat": "Requests/sec"}]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
      "targets": [{"expr": "sum(rate(task_api_requests_total{status=~\"5..\"}[5m])) / sum(rate(task_api_requests_total[5m])) * 100", "legendFormat": "Error %"}],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {"steps": [
            {"value": 0.1, "color": "green"},
            {"value": 0.5, "color": "yellow"},
            {"value": 1.0, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Latency Distribution",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
      "targets": [
        {"expr": "histogram_quantile(0.50, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P50"},
        {"expr": "histogram_quantile(0.95, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P95"},
        {"expr": "histogram_quantile(0.99, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P99"}
      ],
      "fieldConfig": {"defaults": {"unit": "ms"}}
    },
    {
      "title": "Task Operations",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
      "targets": [{"expr": "sum(rate(task_api_operations_total[5m])) by (operation, status)", "legendFormat": "{{operation}} ({{status}})"}]
    }
  ]
}
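The latency panels lean on PromQL's histogram_quantile, which estimates a quantile by linearly interpolating inside the bucket that crosses the target rank. A simplified stdlib sketch of that interpolation, assuming cumulative buckets as Prometheus exports them (the real function also handles +Inf and NaN edge cases):

```python
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending.
    Returns the linearly interpolated quantile, like PromQL's histogram_quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket that crosses the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 90 finished under 0.1s, all 100 under 0.25s.
# P95 falls 5/10 of the way into the 0.1–0.25s bucket -> 0.175s.
assert abs(histogram_quantile(0.95, [(0.05, 50), (0.1, 90), (0.25, 100)]) - 0.175) < 1e-9
```

This is also why the bucket boundaries chosen in `REQUEST_LATENCY` matter: the estimate is only as precise as the bucket that contains the quantile.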

Import the dashboard to Grafana:

bash
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &

# Login: admin / observability-demo
# Import dashboard: Dashboards > Import > Paste JSON

Output (after import):

Dashboard "Task API SLO Dashboard" imported successfully
URL: http://localhost:3000/d/task-api-slo

Step 5: Set Up Multi-Burn-Rate Alerts

The PrometheusRule from Step 3 already defines multi-burn-rate alerts. Now configure Alertmanager to route them.

Alertmanager Configuration

yaml
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: 'default-receiver'
      group_by: ['alertname', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
          continue: true
        - match:
            severity: warning
          receiver: 'slack-warnings'
    receivers:
      - name: 'default-receiver'
        webhook_configs:
          - url: 'http://alertmanager-webhook-logger:8080/webhook'
      - name: 'pagerduty-critical'
        webhook_configs:
          - url: 'http://alertmanager-webhook-logger:8080/pagerduty'
      - name: 'slack-warnings'
        webhook_configs:
          - url: 'http://alertmanager-webhook-logger:8080/slack'

Apply and verify:

bash
kubectl apply -f alertmanager-config.yaml
kubectl rollout restart statefulset/alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring

Output:

secret/alertmanager-prometheus-kube-prometheus-alertmanager configured
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager restarted

Verify Alert Routing

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 &
curl -s localhost:9093/api/v2/status | jq '.config.route'

Output:

json
{
  "receiver": "default-receiver",
  "group_by": ["alertname", "service"],
  "routes": [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"}
  ]
}

Step 6: Configure Cost Allocation Labels

Cost allocation was configured in the Deployment (Step 2). Now verify OpenCost is collecting the data.

Verify Cost Labels

bash
kubectl get pods -n default --show-labels | grep task-api

Output:

task-api-xxxxx   1/1   Running   app=task-api,cost-center=platform,team=agents
task-api-yyyyy   1/1   Running   app=task-api,cost-center=platform,team=agents
task-api-zzzzz   1/1   Running   app=task-api,cost-center=platform,team=agents

Query OpenCost

bash
kubectl port-forward -n monitoring svc/opencost 9003:9003 &
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=namespace" | jq '.data[0]'

Output:

json
{
  "default": {
    "cpuCost": 0.0432,
    "memoryCost": 0.0216,
    "totalCost": 0.0648,
    "cpuEfficiency": 0.15,
    "memoryEfficiency": 0.45
  },
  "monitoring": {
    "cpuCost": 0.1296,
    "memoryCost": 0.0864,
    "totalCost": 0.2160,
    "cpuEfficiency": 0.35,
    "memoryEfficiency": 0.60
  }
}

Cost by Team Label

bash
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=label:team" | jq '.data[0]'

Output:

json
{
  "agents": {
    "cpuCost": 0.0432,
    "memoryCost": 0.0216,
    "totalCost": 0.0648
  }
}

The team=agents label enables cost attribution to specific teams.
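Under the hood, aggregating by `label:team` is essentially a group-by over per-pod cost allocations. A sketch with illustrative numbers (three replicas at $0.0216/day each, matching the totals above):

```python
from collections import defaultdict

# Illustrative per-pod allocations, shaped like OpenCost's per-pod data
pods = [
    {"labels": {"team": "agents"}, "totalCost": 0.0216},
    {"labels": {"team": "agents"}, "totalCost": 0.0216},
    {"labels": {"team": "agents"}, "totalCost": 0.0216},
]

def cost_by_label(allocations, key):
    """Sum pod costs grouped by one label's value."""
    totals = defaultdict(float)
    for a in allocations:
        totals[a["labels"].get(key, "unlabeled")] += a["totalCost"]
    return dict(totals)

assert abs(cost_by_label(pods, "team")["agents"] - 0.0648) < 1e-9
```

The "unlabeled" bucket is worth watching in real clusters: any workload missing the team label ends up there, and unattributed cost is the first thing a FinOps review flags.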

Step 7: Final Skill Test and Verification Checklist

Complete System Verification

Run through this checklist after deployment to confirm every signal is flowing:

Component         Verification Command                                                  Expected Result
Prometheus        kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus   Running
Grafana           kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana      Running
Loki              kubectl get pods -n monitoring -l app.kubernetes.io/name=loki         Running
Jaeger            kubectl get pods -n monitoring -l app.kubernetes.io/name=jaeger       Running
OpenCost          kubectl get pods -n monitoring -l app.kubernetes.io/name=opencost     Running
Metrics flowing   Query task_api_requests_total in Prometheus                           Non-empty result
Traces visible    Jaeger UI search for service=task-api                                 Traces found
Logs aggregated   Loki query {namespace="default", app="task-api"}                      Logs returned
SLO calculated    Query task_api:availability:5m                                        ~1.0
Costs tracked     OpenCost API with aggregate=label:team                                Cost data by team
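The metric and SLO rows in this checklist hit Prometheus's HTTP API, so the only subtlety is URL-encoding the query expression (the colons in recording-rule names must become %3A). A sketch using only the standard library, assuming the port-forward from the earlier steps:

```python
from urllib.parse import urlencode

def prom_query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"

url = prom_query_url("task_api:availability:5m")
assert url == "http://localhost:9090/api/v1/query?query=task_api%3Aavailability%3A5m"
```

Feeding these URLs to `curl` (or `urllib.request`) and asserting on `.data.result` turns the checklist into a repeatable smoke test.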

Infrastructure Verification

bash
kubectl get pods -n monitoring | grep -E "prometheus|grafana|loki|jaeger|opencost"

Expected: All pods in Running state.

Metrics Verification

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s "localhost:9090/api/v1/query?query=task_api_requests_total" | jq '.data.result | length'

Expected: Non-zero result indicating metrics are being collected.

Tracing Verification

bash
# Port-forward the Task API service and generate a trace
kubectl port-forward svc/task-api 8000:8000 &
curl -X POST http://localhost:8000/tasks \
  -H "Content-Type: application/json" \
  -d '{"title":"Test task"}'

# Port-forward Jaeger, then open http://localhost:16686 and search for service=task-api
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &

Expected: Traces visible in Jaeger UI for the POST /tasks operation.

Logging Verification

bash
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -s 'localhost:3100/loki/api/v1/query?query={namespace="default",app="task-api"}' | \
  jq '.data.result | length'

Expected: Logs found for Task API.

SLO Verification

bash
curl -s "localhost:9090/api/v1/query?query=task_api:availability:5m" | \
  jq '.data.result[0].value[1]'

Expected: Value close to "1" (100% availability).

Final Skill Test

Using my observability-cost-engineer skill, deploy a complete observability stack
for a new FastAPI service called "order-service" with:

- 99.9% availability SLO
- P95 latency target of 150ms
- Cost allocation labels: cost-center=commerce, team=orders

Your skill should produce:

1. ServiceMonitor for the new service
2. PrometheusRule with SLO recording rules and multi-burn-rate alerts
3. Dashboard JSON for the service
4. Deployment YAML with proper labels and probes

If your skill missed any of these, update it:

My observability-cost-engineer skill doesn't include multi-burn-rate alerting patterns. Update it to include the 14.4x and 2x burn rate thresholds for fast and slow burns, with proper alert annotations including runbook URLs.

Try With AI

Prompt 1: Extend the Stack

I've deployed the complete observability stack for Task API. Now I want to add observability for a new microservice called "notification-service" that sends emails and push notifications. What instrumentation do I need to add, and what SLOs make sense for a notification service?

What you're learning: Applying observability patterns to different service types. Notification services have different reliability characteristics than synchronous APIs.

Prompt 2: Debug with Observability

My Task API SLO dashboard shows availability dropped to 99.5% in the last hour. Walk me through how to use the observability stack to identify the root cause. What should I check in Prometheus, Jaeger, and Loki?

What you're learning: Using the three pillars together for incident investigation. Metrics tell you something is wrong, traces show where, logs explain why.

Prompt 3: Optimize Costs

OpenCost shows my monitoring namespace costs $0.22/day but my application namespace only costs $0.06/day. Is this ratio normal? How can I optimize observability costs without losing visibility?

What you're learning: FinOps for observability infrastructure. Retention policies, sampling rates, and resource right-sizing reduce costs while maintaining visibility.
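One of the biggest levers behind that prompt is trace sampling: storing only a fraction of traces cuts trace storage roughly proportionally. Illustrative back-of-envelope arithmetic (the span volume and size below are assumptions, not measured values):

```python
def sampled_storage_gb(spans_per_day: int, bytes_per_span: int,
                       sample_rate: float, retention_days: int) -> float:
    """Approximate trace storage under head-based sampling."""
    return spans_per_day * bytes_per_span * sample_rate * retention_days / 1e9

# Hypothetical workload: 10M spans/day at ~500 bytes each, 7-day retention
full = sampled_storage_gb(10_000_000, 500, 1.0, 7)     # keep everything
sampled = sampled_storage_gb(10_000_000, 500, 0.1, 7)  # 10% head sampling
assert abs(full - 35.0) < 1e-9
assert abs(sampled - 3.5) < 1e-9
```

The same shape of calculation applies to log retention in Loki and metric retention in Prometheus; trading window length and resolution against storage cost is the core FinOps move here.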

Safety note: When testing alerts, use a non-production environment. Triggering real PagerDuty pages or Slack notifications during testing creates alert fatigue. Always configure test receivers that log but don't notify during development.

Reflect on Your Skill

This capstone integrated everything from Sub-Module 7. Your observability-cost-engineer skill should now be production-ready.

Verify Complete Coverage

Your skill should address:

  • Prometheus metrics via ServiceMonitor
  • OpenTelemetry tracing with Dapr correlation
  • Structured logging with trace_id
  • SLO definition with error budgets
  • Multi-burn-rate alerting rules
  • Cost allocation labels
  • Dapr-specific observability (actor metrics, workflow spans)