You've learned each piece of the observability puzzle across this chapter: Prometheus for metrics, Grafana for visualization, OpenTelemetry and Jaeger for tracing, Loki for logging, SLOs and error budgets for reliability, alerting for incident response, OpenCost for FinOps, and Dapr integration patterns. Now you bring them together.
This capstone deploys a complete, production-ready observability stack for Task API. By the end, you'll have metrics in Prometheus, dashboards in Grafana, traces in Jaeger, logs in Loki, SLO-backed alerts, and per-team cost reporting, all running against a single application.
This is the observability infrastructure your Digital FTE products need in production. Every AI agent you deploy deserves this level of visibility.
Start by deploying all observability components. This is the infrastructure layer that receives telemetry from your applications.
Output:
Output:
Output:
Output:
Output:
Output:
All components are running. The observability infrastructure is ready.
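The installs above follow a standard Helm workflow; here is a minimal sketch of that layer (release names, namespace, and chart choices are assumptions, not the chapter's exact manifests):

```shell
# Create a dedicated namespace for the observability stack
kubectl create namespace observability

# Metrics + dashboards: Prometheus Operator, Grafana, Alertmanager
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace observability

# Logs: Loki with Promtail agents on every node
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace observability

# Traces: Jaeger (in-memory all-in-one is fine for a tutorial, not for production storage)
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm install jaeger jaegertracing/jaeger --namespace observability

# Cost: OpenCost, reading from the Prometheus installed above
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost --namespace observability

# Confirm everything came up
kubectl get pods -n observability
```

These commands assume a running cluster and default chart values; in practice each chart takes a values file tuned to your environment.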
With the stack deployed, instrument Task API to emit telemetry.
Output (application logs on startup):
Apply the manifests:
Output:
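The heart of those manifests is the Deployment's pod template: Prometheus scrape annotations, OTLP trace export, and the cost-attribution label all live there. A minimal sketch (image name, port, and collector endpoint are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    team: agents                      # used later by OpenCost for per-team cost attribution
spec:
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
        team: agents
      annotations:
        prometheus.io/scrape: "true"  # let Prometheus discover the metrics endpoint
        prometheus.io/port: "8000"
        prometheus.io/path: /metrics
    spec:
      containers:
        - name: task-api
          image: task-api:latest      # placeholder image
          ports:
            - containerPort: 8000
          env:
            # Standard OpenTelemetry env vars: export traces over OTLP
            - name: OTEL_SERVICE_NAME
              value: task-api
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector.observability:4317
```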
Define Service Level Objectives that matter for a task management API.
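As a sketch, a 99.9% availability SLO can be encoded as a recording rule plus a fast-burn alert in a PrometheusRule. The metric name `http_requests_total` and its labels are assumptions about how Task API is instrumented:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: observability
spec:
  groups:
    - name: task-api-slo
      rules:
        # SLI: fraction of non-5xx responses over the last 5 minutes
        - record: task_api:availability:ratio_rate5m
          expr: |
            sum(rate(http_requests_total{app="task-api",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="task-api"}[5m]))
        # Fast burn: a 14.4x burn rate exhausts a 30-day error budget in ~2 days
        - alert: TaskAPIHighErrorBudgetBurn
          expr: (1 - task_api:availability:ratio_rate5m) > 14.4 * (1 - 0.999)
          labels:
            severity: critical
```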
Apply the rules:
Output:
Verify rules are loaded:
Output:
Create a comprehensive dashboard showing availability, latency, error budget, and golden signals.
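Each dashboard panel boils down to a PromQL query, and you can sanity-check those queries against Prometheus's HTTP API before importing anything. A sketch, assuming a local port-forward and the same metric names as earlier:

```shell
# e.g. kubectl port-forward -n observability svc/kube-prometheus-prometheus 9090
PROM=http://localhost:9090

# Availability (golden signal: errors, inverted)
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{app="task-api",code!~"5.."}[5m])) / sum(rate(http_requests_total{app="task-api"}[5m]))'

# p99 latency (golden signal: latency)
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{app="task-api"}[5m])))'

# Request rate (golden signal: traffic)
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{app="task-api"}[5m]))'
```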
Import the dashboard to Grafana:
Output (after import):
The PrometheusRule from Step 3 already defines multi-burn-rate alerts. Now configure Alertmanager to route them.
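A routing sketch: critical (fast-burn) alerts page a human, everything else files a ticket. Receiver names and credentials are placeholders:

```yaml
route:
  receiver: ticket                 # default: low-urgency alerts create tickets
  group_by: ["alertname", "app"]
  routes:
    - matchers:
        - severity = "critical"    # fast-burn alerts page immediately
      receiver: pager
receivers:
  - name: pager
    pagerduty_configs:
      - routing_key: <your-pagerduty-routing-key>
  - name: ticket
    slack_configs:
      - api_url: <your-slack-webhook-url>
        channel: "#task-api-alerts"
```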
Apply and verify:
Output:
Output:
Cost allocation was configured in the Deployment (Step 2). Now verify OpenCost is collecting the data.
Output:
Output:
Output:
The `team=agents` label enables cost attribution to specific teams.
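You can pull the attributed costs straight from OpenCost's allocation API. A sketch, assuming a local port-forward to the OpenCost service:

```shell
# Expose the OpenCost API locally
kubectl port-forward -n observability svc/opencost 9003:9003 &

# Cost allocation for the last day, aggregated by the team label
curl -sG http://localhost:9003/allocation \
  --data-urlencode window=1d \
  --data-urlencode aggregate=label:team
```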
Run through this checklist after deployment to confirm every signal is flowing:
Expected: All pods in Running state.
Expected: Non-zero result indicating metrics are being collected.
Expected: Traces visible in Jaeger UI for the POST /tasks operation.
Expected: Logs found for Task API.
Expected: Value close to "1" (100% availability).
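The checklist above can be run as a handful of commands once the relevant services are port-forwarded locally; the endpoints are standard defaults, and the recording-rule name is an assumption about how Step 3 named its rules:

```shell
# 1. All observability pods Running
kubectl get pods -n observability

# 2. Metrics flowing: expect a non-empty result
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{app="task-api"}[5m]))'

# 3. Traces: task-api should appear in Jaeger's known services
curl -s http://localhost:16686/api/services

# 4. Logs: query Loki for recent Task API lines
curl -sG http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={app="task-api"}'

# 5. Availability close to 1 (recording rule from the SLO step)
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=task_api:availability:ratio_rate5m'
```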
If your skill missed any of these, update it:
What you're learning: Applying observability patterns to different service types. Notification services have different reliability characteristics than synchronous APIs.
What you're learning: Using the three pillars together for incident investigation. Metrics tell you something is wrong, traces show where, logs explain why.
What you're learning: FinOps for observability infrastructure. Retention policies, sampling rates, and resource right-sizing reduce costs while maintaining visibility.
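Two of those levers map to small, concrete settings. A sketch (the numbers are illustrative, not a recommendation for every workload): a retention cap in the kube-prometheus-stack values file, and head-based trace sampling via the standard OpenTelemetry environment variables on the Task API container.

```yaml
# kube-prometheus-stack values: cap metric retention
prometheus:
  prometheusSpec:
    retention: 15d           # drop samples older than 15 days
    retentionSize: 20GB      # or once the TSDB exceeds 20 GB
---
# Task API container env: sample 10% of traces instead of 100%
- name: OTEL_TRACES_SAMPLER
  value: parentbased_traceidratio
- name: OTEL_TRACES_SAMPLER_ARG
  value: "0.1"
```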
Safety note: When testing alerts, use a non-production environment. Triggering real PagerDuty pages or Slack notifications during testing creates alert fatigue. Always configure test receivers that log but don't notify during development.
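One way to set that up, sketched here, is a development Alertmanager config whose only receiver is a log-only webhook you control (the logger service is hypothetical):

```yaml
# Dev/test Alertmanager config: alerts go to a log-only webhook,
# never to PagerDuty or Slack
route:
  receiver: dev-log
receivers:
  - name: dev-log
    webhook_configs:
      - url: http://alert-logger.observability:8080/log   # hypothetical in-cluster logger
```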
This capstone integrated everything from Sub-Module 7. Your observability-cost-engineer skill should now be production-ready.
Your skill should address: