USMAN’S INSIGHTS
AI ARCHITECT
Having the Tools Installed Is Not the Same as Being Observable



© 2026 Muhammad Usman Akbar. All rights reserved.

Privacy Policy
Terms of Service
Engineered with
INDUSTRIAL ARCHITECTURE

Capstone: Full Observability Stack for Task API

You've learned each piece of the observability puzzle across this chapter: Prometheus for metrics, Grafana for visualization, OpenTelemetry and Jaeger for tracing, Loki for logging, SLOs and error budgets for reliability, alerting for incident response, OpenCost for FinOps, and Dapr integration patterns. Now you bring them together.

This capstone deploys a complete, production-ready observability stack for Task API. By the end, you'll have:

  • Metrics: Prometheus collecting application and infrastructure metrics
  • Visualization: Grafana dashboards showing the four golden signals
  • Tracing: Jaeger receiving distributed traces from OpenTelemetry
  • Logging: Loki aggregating structured logs with trace correlation
  • SLOs: 99.9% availability and P95 latency targets with error budget tracking
  • Alerting: Multi-burn-rate alerts that page when SLO is at risk
  • Cost: OpenCost showing resource costs by team and service

This is the observability infrastructure your Digital FTE products need in production. Every AI agent you deploy deserves this level of visibility.

Step 1: Deploy Complete Observability Stack via Helm

Start by deploying all observability components. This is the infrastructure layer that receives telemetry from your applications.

Stack Architecture

┌───────────────────────────────────────────────────────────────────┐
│                        Kubernetes Cluster                         │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐    ┌─────────────┐    ┌──────────────┐              │
│  │ Task API │───►│ Prometheus  │◄───│ServiceMonitor│              │
│  │ /metrics │    │   (TSDB)    │    │    (CRD)     │              │
│  └──────────┘    └──────┬──────┘    └──────────────┘              │
│                         │                                         │
│                  ┌──────▼──────┐                                  │
│                  │   Grafana   │ ◄── Dashboards + Alerts          │
│                  │ (Visualize) │                                  │
│                  └─────────────┘                                  │
│                                                                   │
│  ┌────────────┐  ┌────────────┐                                   │
│  │  Task API  │─►│   Jaeger   │ ◄── Trace Analysis                │
│  │  (traces)  │  │(Collector) │                                   │
│  └────────────┘  └────────────┘                                   │
│                                                                   │
│  ┌────────────┐  ┌────────────┐                                   │
│  │  Task API  │─►│    Loki    │ ◄── Log Aggregation               │
│  │   (logs)   │  │ + Promtail │                                   │
│  └────────────┘  └────────────┘                                   │
│                                                                   │
│  ┌─────────────┐                                                  │
│  │  OpenCost   │ ◄── Cost Allocation by Namespace/Team            │
│  │  (FinOps)   │                                                  │
│  └─────────────┘                                                  │
└───────────────────────────────────────────────────────────────────┘

Install All Helm Repositories

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

Output:

"prometheus-community" has been added to your repositories
"grafana" has been added to your repositories
"jaegertracing" has been added to your repositories
"opencost" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
Update Complete. Happy Helming!

Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)

bash
kubectl create namespace monitoring

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=observability-demo \
  --set prometheus.prometheusSpec.retention=7d

Output:

NAME: prometheus
LAST DEPLOYED: Mon Dec 30 10:00:00 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Install Loki for Logging

bash
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

Output:

NAME: loki
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Install Jaeger for Tracing

bash
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true \
  --set query.ingress.enabled=false

Output:

NAME: jaeger
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Install OpenCost for Cost Monitoring

bash
helm install opencost opencost/opencost \
  --namespace monitoring \
  --set prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus \
  --set prometheus.internal.namespaceName=monitoring

Output:

NAME: opencost
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Verify All Components Running

bash
kubectl get pods -n monitoring

Output:

NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          3m
jaeger-agent-daemonset-xxxxx                             1/1     Running   0          2m
jaeger-collector-yyyyy                                   1/1     Running   0          2m
jaeger-query-zzzzz                                       1/1     Running   0          2m
loki-0                                                   1/1     Running   0          2m
loki-promtail-xxxxx                                      1/1     Running   0          2m
opencost-yyyyy                                           1/1     Running   0          1m
prometheus-grafana-xxxxx                                 3/3     Running   0          3m
prometheus-kube-prometheus-operator-yyyyy                1/1     Running   0          3m
prometheus-kube-state-metrics-zzzzz                      1/1     Running   0          3m
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          3m

All components are running. The observability infrastructure is ready.

Step 2: Instrument Task API with Metrics, Traces, and Logs

With the stack deployed, instrument Task API to emit telemetry.

Application Dependencies

# requirements.txt
fastapi>=0.109.0
uvicorn>=0.25.0
prometheus-client>=0.19.0
opentelemetry-api>=1.22.0
opentelemetry-sdk>=1.22.0
opentelemetry-instrumentation-fastapi>=0.43b0
opentelemetry-exporter-otlp>=1.22.0
structlog>=24.1.0

Complete Instrumented Application

python
# main.py - Task API with full observability
import time
from contextlib import asynccontextmanager

import structlog
from fastapi import FastAPI, Request, Response, HTTPException
from pydantic import BaseModel
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure structured logging with trace correlation
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)
logger = structlog.get_logger()

# Configure tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
    endpoint="jaeger-collector.monitoring:4317",
    insecure=True
)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))
tracer = trace.get_tracer(__name__)

# Define Prometheus metrics
REQUEST_COUNT = Counter(
    "task_api_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "task_api_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
TASK_OPERATIONS = Counter(
    "task_api_operations_total",
    "Task operations count",
    ["operation", "status"]
)

# In-memory task store (replace with a database in production)
tasks: dict = {}


class Task(BaseModel):
    title: str
    priority: str = "medium"
    completed: bool = False


class TaskResponse(BaseModel):
    id: str
    title: str
    priority: str
    completed: bool


@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("task_api_starting", version="1.0.0")
    yield
    logger.info("task_api_shutting_down")


app = FastAPI(title="Task API", lifespan=lifespan)
FastAPIInstrumentor.instrument_app(app)


@app.middleware("http")
async def observability_middleware(request: Request, call_next):
    """Add metrics and logging to every request"""
    start_time = time.time()
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, "032x") if span else "no-trace"

    response = await call_next(request)
    latency = time.time() - start_time

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    logger.info(
        "http_request",
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        latency_ms=round(latency * 1000, 2),
        trace_id=trace_id
    )
    return response


@app.get("/health")
async def health_check():
    return {"status": "healthy", "version": "1.0.0"}


@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)


@app.post("/tasks", response_model=TaskResponse, status_code=201)
async def create_task(task: Task):
    with tracer.start_as_current_span("create_task") as span:
        task_id = f"task-{len(tasks) + 1}"
        span.set_attribute("task.id", task_id)
        span.set_attribute("task.priority", task.priority)
        tasks[task_id] = {
            "id": task_id,
            "title": task.title,
            "priority": task.priority,
            "completed": task.completed
        }
        TASK_OPERATIONS.labels(operation="create", status="success").inc()
        logger.info("task_created", task_id=task_id, priority=task.priority)
        return TaskResponse(**tasks[task_id])


@app.get("/tasks/{task_id}", response_model=TaskResponse)
async def get_task(task_id: str):
    with tracer.start_as_current_span("get_task") as span:
        span.set_attribute("task.id", task_id)
        if task_id not in tasks:
            TASK_OPERATIONS.labels(operation="get", status="not_found").inc()
            logger.warning("task_not_found", task_id=task_id)
            raise HTTPException(status_code=404, detail="Task not found")
        TASK_OPERATIONS.labels(operation="get", status="success").inc()
        return TaskResponse(**tasks[task_id])


@app.put("/tasks/{task_id}/complete")
async def complete_task(task_id: str):
    with tracer.start_as_current_span("complete_task") as span:
        span.set_attribute("task.id", task_id)
        if task_id not in tasks:
            TASK_OPERATIONS.labels(operation="complete", status="not_found").inc()
            raise HTTPException(status_code=404, detail="Task not found")
        tasks[task_id]["completed"] = True
        TASK_OPERATIONS.labels(operation="complete", status="success").inc()
        logger.info("task_completed", task_id=task_id)
        return {"status": "completed", "task_id": task_id}

Output (application logs on startup):

json
{"event": "task_api_starting", "version": "1.0.0", "level": "info", "timestamp": "2025-12-30T10:10:00Z"}
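Because the middleware stamps every log line with the active trace_id, logs can be joined to traces mechanically, which is what Loki's trace correlation relies on later. A stdlib-only sketch of that join (the sample log lines below are illustrative, not real output):

```python
import json

# Illustrative JSON log lines, shaped like the structlog output above
lines = [
    '{"event": "http_request", "path": "/tasks", "status": 201, "trace_id": "abc123"}',
    '{"event": "task_created", "task_id": "task-1", "trace_id": "abc123"}',
    '{"event": "http_request", "path": "/health", "status": 200, "trace_id": "def456"}',
]

def logs_for_trace(raw_lines, trace_id):
    """Parse JSON log lines and keep only those belonging to one trace."""
    parsed = (json.loads(line) for line in raw_lines)
    return [event for event in parsed if event.get("trace_id") == trace_id]

# Both log lines from the /tasks request share one trace_id
assert len(logs_for_trace(lines, "abc123")) == 2
```

This is exactly the pivot you perform manually in Grafana: copy a trace_id from a Jaeger span, paste it into a Loki query.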

Kubernetes Deployment with Observability Labels

yaml
# task-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: default
  labels:
    app: task-api
    cost-center: platform
    team: agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
        cost-center: platform
        team: agents
    spec:
      containers:
        - name: task-api
          image: ghcr.io/fistasolutions/task-api:1.0.0
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://jaeger-collector.monitoring:4317"
            - name: OTEL_SERVICE_NAME
              value: "task-api"
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: task-api
  namespace: default
  labels:
    app: task-api
spec:
  selector:
    app: task-api
  ports:
    - port: 8000
      targetPort: 8000
      name: http

ServiceMonitor for Prometheus

yaml
# task-api-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: task-api
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: task-api
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
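Prometheus discovers the scrape target through label matching: every key/value pair in the ServiceMonitor's matchLabels must appear on the Service. A minimal sketch of that subset check, with labels taken from the manifests above:

```python
def selector_matches(match_labels: dict, resource_labels: dict) -> bool:
    """True when every selector key/value pair appears on the resource,
    mirroring how a label selector matches a Kubernetes object."""
    return all(resource_labels.get(k) == v for k, v in match_labels.items())

selector = {"app": "task-api"}                        # spec.selector.matchLabels
service_labels = {"app": "task-api"}                  # labels on the Service
assert selector_matches(selector, service_labels)     # discovered
assert not selector_matches(selector, {"app": "other-api"})  # ignored
```

The same logic explains the `release: prometheus` label on the ServiceMonitor itself: the kube-prometheus-stack operator selects ServiceMonitors by that label (relaxed here by the `serviceMonitorSelectorNilUsesHelmValues=false` flag from Step 1).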

Apply the manifests:

bash
kubectl apply -f task-api-deployment.yaml
kubectl apply -f task-api-servicemonitor.yaml

Output:

deployment.apps/task-api created
service/task-api created
servicemonitor.monitoring.coreos.com/task-api created

Step 3: Define SLOs for Task API

Define Service Level Objectives that matter for a task management API.

SLO Targets

SLI             SLO Target   Error Budget (30 days)
Availability    99.9%        43.2 minutes of downtime
Latency (P95)   < 200ms      0.1% of requests may exceed 200ms
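The 43.2-minute figure follows directly from the SLO arithmetic. A quick sketch in plain Python, using the numbers from the table above:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% availability over 30 days allows 43.2 minutes of downtime
assert abs(error_budget_minutes(0.999) - 43.2) < 1e-6
```

Tightening the SLO to 99.99% would shrink the budget to about 4.3 minutes a month, which is why targets beyond 99.9% demand serious automation.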

PrometheusRule for SLO Recording and Alerting

yaml
# task-api-slo-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: task-api-slo-recording
      interval: 30s
      rules:
        # Availability SLI: successful requests / total requests
        - record: task_api:availability:5m
          expr: |
            sum(rate(task_api_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(task_api_requests_total[5m]))
        # Latency SLI: requests under 200ms / total requests
        - record: task_api:latency_sli:5m
          expr: |
            sum(rate(task_api_request_duration_seconds_bucket{le="0.2"}[5m]))
            /
            sum(rate(task_api_request_duration_seconds_count[5m]))
        # Error budget burn rate (5-minute window)
        - record: task_api:error_budget_burn_rate:5m
          expr: 1 - task_api:availability:5m
        # 1-hour burn rate for alerting
        - record: task_api:error_budget_burn_rate:1h
          expr: |
            1 - (
              sum(rate(task_api_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(task_api_requests_total[1h]))
            )
    - name: task-api-slo-alerts
      rules:
        # Fast burn: 2% of the monthly budget in 1 hour (14.4x burn rate)
        - alert: TaskAPIHighErrorBudgetBurn
          expr: |
            task_api:error_budget_burn_rate:5m > (14.4 * 0.001)
            and
            task_api:error_budget_burn_rate:1h > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            service: task-api
          annotations:
            summary: "Task API burning error budget rapidly"
            description: "Error rate {{ $value | humanizePercentage }} is consuming budget at 14.4x normal rate."
            runbook_url: "https://runbooks.example.com/task-api-high-error-rate"
        # Slow burn: sustained 2x burn rate (exhausts the monthly budget in ~15 days)
        - alert: TaskAPIElevatedErrorBudgetBurn
          expr: task_api:error_budget_burn_rate:1h > (2 * 0.001)
          for: 30m
          labels:
            severity: warning
            service: task-api
          annotations:
            summary: "Task API error budget consumption elevated"
            description: "Error rate is elevated. Investigate before it becomes critical."
        # Latency SLO breach
        - alert: TaskAPILatencySLOBreach
          expr: task_api:latency_sli:5m < 0.999
          for: 10m
          labels:
            severity: warning
            service: task-api
          annotations:
            summary: "Task API P95 latency exceeding 200ms"
            description: "{{ $value | humanizePercentage }} of requests complete under 200ms (target: 99.9%)"
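The 14.4x and 2x thresholds in these rules come from simple budget arithmetic: a burn rate is a multiple of the error rate the SLO budgets for, so the budget fraction consumed is burn rate × window ÷ SLO period. A sketch of that arithmetic for a 30-day period:

```python
def budget_consumed(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the total error budget consumed by sustaining
    `burn_rate` times the budgeted error rate for `hours` hours."""
    return burn_rate * hours / (window_days * 24)

# Fast burn: 14.4x for one hour eats 2% of the 30-day budget -> page someone
assert abs(budget_consumed(14.4, 1) - 0.02) < 1e-9

# At a sustained 14.4x burn, the entire budget is gone in 50 hours
assert abs((30 * 24) / 14.4 - 50.0) < 1e-9
```

The alert expressions multiply these rates by 0.001 (the budgeted error rate for a 99.9% SLO) to get absolute error-rate thresholds: 1.44% for the fast burn, 0.2% for the slow burn.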

Apply the rules:

bash
kubectl apply -f task-api-slo-rules.yaml

Output:

prometheusrule.monitoring.coreos.com/task-api-slo created

Verify rules are loaded:

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s localhost:9090/api/v1/rules | jq '.data.groups[].name' | grep task-api

Output:

"task-api-slo-recording"
"task-api-slo-alerts"

Step 4: Create Task API SLO Dashboard in Grafana

Create a comprehensive dashboard showing availability, latency, error budget, and golden signals.

Dashboard JSON

json
{
  "title": "Task API SLO Dashboard",
  "uid": "task-api-slo",
  "timezone": "browser",
  "panels": [
    {
      "title": "Availability (SLO: 99.9%)",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
      "targets": [{"expr": "task_api:availability:5m * 100", "legendFormat": "Availability %"}],
      "fieldConfig": {
        "defaults": {
          "min": 99, "max": 100, "unit": "percent",
          "thresholds": {"steps": [
            {"value": 99.9, "color": "green"},
            {"value": 99.5, "color": "yellow"},
            {"value": 0, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "P95 Latency (SLO: <200ms)",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
      "targets": [{"expr": "histogram_quantile(0.95, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P95 Latency (ms)"}],
      "fieldConfig": {
        "defaults": {
          "min": 0, "max": 500, "unit": "ms",
          "thresholds": {"steps": [
            {"value": 200, "color": "green"},
            {"value": 300, "color": "yellow"},
            {"value": 400, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Error Budget Remaining",
      "type": "stat",
      "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
      "targets": [{"expr": "(1 - ((1 - task_api:availability:5m) / 0.001)) * 100", "legendFormat": "Budget %"}],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {"steps": [
            {"value": 50, "color": "green"},
            {"value": 20, "color": "yellow"},
            {"value": 0, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Error Budget Burn Rate",
      "type": "stat",
      "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0},
      "targets": [{"expr": "task_api:error_budget_burn_rate:1h / 0.001", "legendFormat": "Burn Rate (x normal)"}],
      "fieldConfig": {
        "defaults": {
          "thresholds": {"steps": [
            {"value": 1, "color": "green"},
            {"value": 2, "color": "yellow"},
            {"value": 14.4, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
      "targets": [{"expr": "sum(rate(task_api_requests_total[5m]))", "legendFormat": "Requests/sec"}]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
      "targets": [{"expr": "sum(rate(task_api_requests_total{status=~\"5..\"}[5m])) / sum(rate(task_api_requests_total[5m])) * 100", "legendFormat": "Error %"}],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {"steps": [
            {"value": 0.1, "color": "green"},
            {"value": 0.5, "color": "yellow"},
            {"value": 1.0, "color": "red"}
          ]}
        }
      }
    },
    {
      "title": "Latency Distribution",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
      "targets": [
        {"expr": "histogram_quantile(0.50, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P50"},
        {"expr": "histogram_quantile(0.95, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P95"},
        {"expr": "histogram_quantile(0.99, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000", "legendFormat": "P99"}
      ],
      "fieldConfig": {"defaults": {"unit": "ms"}}
    },
    {
      "title": "Task Operations",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
      "targets": [{"expr": "sum(rate(task_api_operations_total[5m])) by (operation, status)", "legendFormat": "{{operation}} ({{status}})"}]
    }
  ]
}
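The latency panels lean on PromQL's histogram_quantile, which estimates a quantile by linearly interpolating inside the bucket that crosses the target rank. A simplified stdlib sketch of that interpolation, assuming cumulative buckets as Prometheus exports them (the real function also handles +Inf and NaN edge cases):

```python
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending.
    Returns the linearly interpolated quantile, like PromQL's histogram_quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket that crosses the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 90 finished under 0.1s, all 100 under 0.25s.
# P95 falls 5/10 of the way into the 0.1–0.25s bucket -> 0.175s.
assert abs(histogram_quantile(0.95, [(0.05, 50), (0.1, 90), (0.25, 100)]) - 0.175) < 1e-9
```

This is also why the bucket boundaries chosen in `REQUEST_LATENCY` matter: the estimate is only as precise as the bucket that contains the quantile.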

Import the dashboard to Grafana:

bash
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &

# Login: admin / observability-demo
# Import dashboard: Dashboards > Import > Paste JSON

Output (after import):

Dashboard "Task API SLO Dashboard" imported successfully
URL: http://localhost:3000/d/task-api-slo

Step 5: Set Up Multi-Burn-Rate Alerts

The PrometheusRule from Step 3 already defines multi-burn-rate alerts. Now configure Alertmanager to route them.

Alertmanager Configuration

yaml
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: 'default-receiver'
      group_by: ['alertname', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
          continue: true
        - match:
            severity: warning
          receiver: 'slack-warnings'
    receivers:
      - name: 'default-receiver'
        webhook_configs:
          - url: 'http://alertmanager-webhook-logger:8080/webhook'
      - name: 'pagerduty-critical'
        webhook_configs:
          - url: 'http://alertmanager-webhook-logger:8080/pagerduty'
      - name: 'slack-warnings'
        webhook_configs:
          - url: 'http://alertmanager-webhook-logger:8080/slack'

Apply and verify:

bash
kubectl apply -f alertmanager-config.yaml
kubectl rollout restart statefulset/alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring

Output:

secret/alertmanager-prometheus-kube-prometheus-alertmanager configured
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager restarted

Verify Alert Routing

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 &
curl -s localhost:9093/api/v2/status | jq '.config.route'

Output:

json
{
  "receiver": "default-receiver",
  "group_by": ["alertname", "service"],
  "routes": [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"}
  ]
}

Step 6: Configure Cost Allocation Labels

Cost allocation was configured in the Deployment (Step 2). Now verify OpenCost is collecting the data.

Verify Cost Labels

bash
kubectl get pods -n default --show-labels | grep task-api

Output:

task-api-xxxxx   1/1   Running   app=task-api,cost-center=platform,team=agents
task-api-yyyyy   1/1   Running   app=task-api,cost-center=platform,team=agents
task-api-zzzzz   1/1   Running   app=task-api,cost-center=platform,team=agents

Query OpenCost

bash
kubectl port-forward -n monitoring svc/opencost 9003:9003 &
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=namespace" | jq '.data[0]'

Output:

json
{
  "default": {
    "cpuCost": 0.0432,
    "memoryCost": 0.0216,
    "totalCost": 0.0648,
    "cpuEfficiency": 0.15,
    "memoryEfficiency": 0.45
  },
  "monitoring": {
    "cpuCost": 0.1296,
    "memoryCost": 0.0864,
    "totalCost": 0.2160,
    "cpuEfficiency": 0.35,
    "memoryEfficiency": 0.60
  }
}

Cost by Team Label

bash
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=label:team" | jq '.data[0]'

Output:

json
{
  "agents": {
    "cpuCost": 0.0432,
    "memoryCost": 0.0216,
    "totalCost": 0.0648
  }
}

The team=agents label enables cost attribution to specific teams.
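Under the hood, aggregating by `label:team` is essentially a group-by over per-pod cost allocations. A sketch with illustrative numbers (three replicas at $0.0216/day each, matching the totals above):

```python
from collections import defaultdict

# Illustrative per-pod allocations, shaped like OpenCost's per-pod data
pods = [
    {"labels": {"team": "agents"}, "totalCost": 0.0216},
    {"labels": {"team": "agents"}, "totalCost": 0.0216},
    {"labels": {"team": "agents"}, "totalCost": 0.0216},
]

def cost_by_label(allocations, key):
    """Sum pod costs grouped by one label's value."""
    totals = defaultdict(float)
    for a in allocations:
        totals[a["labels"].get(key, "unlabeled")] += a["totalCost"]
    return dict(totals)

assert abs(cost_by_label(pods, "team")["agents"] - 0.0648) < 1e-9
```

The "unlabeled" bucket is worth watching in real clusters: any workload missing the team label ends up there, and unattributed cost is the first thing a FinOps review flags.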

Step 7: Final Skill Test and Verification Checklist

Complete System Verification

Run through this checklist after deployment to confirm every signal is flowing:

Component         Verification Command                                                  Expected Result
Prometheus        kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus   Running
Grafana           kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana      Running
Loki              kubectl get pods -n monitoring -l app.kubernetes.io/name=loki         Running
Jaeger            kubectl get pods -n monitoring -l app.kubernetes.io/name=jaeger       Running
OpenCost          kubectl get pods -n monitoring -l app.kubernetes.io/name=opencost     Running
Metrics flowing   Query task_api_requests_total in Prometheus                           Non-empty result
Traces visible    Jaeger UI search for service=task-api                                 Traces found
Logs aggregated   Loki query {namespace="default", app="task-api"}                      Logs returned
SLO calculated    Query task_api:availability:5m                                        ~1.0
Costs tracked     OpenCost API with aggregate=label:team                                Cost data by team
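The metric and SLO rows in this checklist hit Prometheus's HTTP API, so the only subtlety is URL-encoding the query expression (the colons in recording-rule names must become %3A). A sketch using only the standard library, assuming the port-forward from the earlier steps:

```python
from urllib.parse import urlencode

def prom_query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"

url = prom_query_url("task_api:availability:5m")
assert url == "http://localhost:9090/api/v1/query?query=task_api%3Aavailability%3A5m"
```

Feeding these URLs to `curl` (or `urllib.request`) and asserting on `.data.result` turns the checklist into a repeatable smoke test.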

Infrastructure Verification

bash
kubectl get pods -n monitoring | grep -E "prometheus|grafana|loki|jaeger|opencost"

Expected: All pods in Running state.

Metrics Verification

bash
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s "localhost:9090/api/v1/query?query=task_api_requests_total" | jq '.data.result | length'

Expected: Non-zero result indicating metrics are being collected.

Tracing Verification

bash
# Port-forward the Task API service and generate a trace
kubectl port-forward svc/task-api 8000:8000 &
curl -X POST http://localhost:8000/tasks \
  -H "Content-Type: application/json" \
  -d '{"title":"Test task"}'

# Port-forward Jaeger, then open http://localhost:16686 and search for service=task-api
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &

Expected: Traces visible in Jaeger UI for the POST /tasks operation.

Logging Verification

bash
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -s 'localhost:3100/loki/api/v1/query?query={namespace="default",app="task-api"}' | \
  jq '.data.result | length'

Expected: Logs found for Task API.

SLO Verification

bash
curl -s "localhost:9090/api/v1/query?query=task_api:availability:5m" | \
  jq '.data.result[0].value[1]'

Expected: Value close to "1" (100% availability).

Final Skill Test

Using my observability-cost-engineer skill, deploy a complete observability stack
for a new FastAPI service called "order-service" with:

- 99.9% availability SLO
- P95 latency target of 150ms
- Cost allocation labels: cost-center=commerce, team=orders

Your skill should produce:

1. ServiceMonitor for the new service
2. PrometheusRule with SLO recording rules and multi-burn-rate alerts
3. Dashboard JSON for the service
4. Deployment YAML with proper labels and probes

If your skill missed any of these, update it:

My observability-cost-engineer skill doesn't include multi-burn-rate alerting patterns. Update it to include the 14.4x and 2x burn rate thresholds for fast and slow burns, with proper alert annotations including runbook URLs.

Try With AI

Prompt 1: Extend the Stack

I've deployed the complete observability stack for Task API. Now I want to add observability for a new microservice called "notification-service" that sends emails and push notifications. What instrumentation do I need to add, and what SLOs make sense for a notification service?

What you're learning: Applying observability patterns to different service types. Notification services have different reliability characteristics than synchronous APIs.

Prompt 2: Debug with Observability

My Task API SLO dashboard shows availability dropped to 99.5% in the last hour. Walk me through how to use the observability stack to identify the root cause. What should I check in Prometheus, Jaeger, and Loki?

What you're learning: Using the three pillars together for incident investigation. Metrics tell you something is wrong, traces show where, logs explain why.

Prompt 3: Optimize Costs

OpenCost shows my monitoring namespace costs $0.22/day but my application namespace only costs $0.06/day. Is this ratio normal? How can I optimize observability costs without losing visibility?

What you're learning: FinOps for observability infrastructure. Retention policies, sampling rates, and resource right-sizing reduce costs while maintaining visibility.
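One of the biggest levers behind that prompt is trace sampling: storing only a fraction of traces cuts trace storage roughly proportionally. Illustrative back-of-envelope arithmetic (the span volume and size below are assumptions, not measured values):

```python
def sampled_storage_gb(spans_per_day: int, bytes_per_span: int,
                       sample_rate: float, retention_days: int) -> float:
    """Approximate trace storage under head-based sampling."""
    return spans_per_day * bytes_per_span * sample_rate * retention_days / 1e9

# Hypothetical workload: 10M spans/day at ~500 bytes each, 7-day retention
full = sampled_storage_gb(10_000_000, 500, 1.0, 7)     # keep everything
sampled = sampled_storage_gb(10_000_000, 500, 0.1, 7)  # 10% head sampling
assert abs(full - 35.0) < 1e-9
assert abs(sampled - 3.5) < 1e-9
```

The same shape of calculation applies to log retention in Loki and metric retention in Prometheus; trading window length and resolution against storage cost is the core FinOps move here.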

Safety note: When testing alerts, use a non-production environment. Triggering real PagerDuty pages or Slack notifications during testing creates alert fatigue. Always configure test receivers that log but don't notify during development.

Reflect on Your Skill

This capstone integrated everything from Sub-Module 7. Your observability-cost-engineer skill should now be production-ready.

Verify Complete Coverage

Your skill should address:

  • Prometheus metrics via ServiceMonitor
  • OpenTelemetry tracing with Dapr correlation
  • Structured logging with trace_id
  • SLO definition with error budgets
  • Multi-burn-rate alerting rules
  • Cost allocation labels
  • Dapr-specific observability (actor metrics, workflow spans)