Your Prometheus server is collecting thousands of metrics every 30 seconds. The data exists. But when your Task API starts throwing 500 errors at 3 AM, can you answer the critical questions in under 60 seconds: Is it a latency spike? An error surge? A traffic overload? Resource exhaustion?
Raw PromQL queries won't cut it. You need dashboards that surface the answers instantly—before your users notice and before your on-call engineer has their first coffee.
Grafana transforms your Prometheus metrics into operational intelligence. This lesson teaches you to build dashboards that make metrics actionable: panels for the four golden signals, variables for multi-service filtering, and community dashboard imports that save hours of configuration.
Google's Site Reliability Engineering book defines four golden signals that every service dashboard must display. These aren't arbitrary—they're the minimum information needed to diagnose most production issues:
Your Task API dashboard will include all four signals. When something breaks, you'll look at the dashboard and know within seconds whether it's a latency spike, error surge, traffic overload, or resource exhaustion.
Before creating panels, understand how Grafana structures dashboards:
Everything in Grafana is ultimately JSON. You can create dashboards through the UI, but the underlying model is a JSON document. This matters because:
Let's build a dashboard for the Task API metrics you instrumented in Lesson 2.
If you installed kube-prometheus-stack via Helm, Grafana is already running:
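To reach it locally, you can port-forward the Grafana service. The release name `kube-prometheus-stack` and the `monitoring` namespace below are assumptions; adjust them to match your Helm install:

```sh
# Forward local port 3000 to the Grafana service inside the cluster
# (release name and namespace are assumptions -- check with `kubectl get svc -A`)
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
```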
Output:
Open http://localhost:3000 in your browser. Default credentials are admin / prom-operator (or whatever you set in Helm values).
You'll see an empty panel with a query editor.
The first golden signal is latency. Configure your panel:
Query (PromQL):
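A p95 latency query might look like the following. The metric name `http_request_duration_seconds` is an assumption about how you instrumented the Task API in Lesson 2 (it follows standard Prometheus client-library naming for histograms):

```promql
# 95th percentile request latency over 5-minute windows,
# assuming a histogram metric named http_request_duration_seconds
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace="default"}[5m])) by (le)
)
```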
Panel Settings:
Click Apply to add the panel to your dashboard.
Output (Visual): The panel displays a line graph showing the 95th percentile latency over time. Spikes indicate periods when the slowest 5% of requests took significantly longer than usual.
Query (PromQL):
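A traffic query can be as simple as a summed rate. The counter name `http_requests_total` is an assumption based on common instrumentation conventions:

```promql
# Total request throughput in requests per second,
# assuming a counter named http_requests_total
sum(rate(http_requests_total{namespace="default"}[5m]))
```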
Panel Settings:
Output (Visual): A line showing request volume over time. Traffic patterns reveal usage spikes (lunch hour, batch jobs) and help contextualize errors.
Query (PromQL):
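One way to express the error rate as a percentage is to divide the rate of 5xx responses by the total rate. The `status` label name is an assumption; your instrumentation may call it `code` or `status_code`:

```promql
# Percentage of requests returning 5xx over the last 5 minutes
# (the "status" label name is an assumption about your instrumentation)
100 * sum(rate(http_requests_total{namespace="default", status=~"5.."}[5m]))
    / sum(rate(http_requests_total{namespace="default"}[5m]))
```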
Panel Settings:
Output (Visual): A gauge showing current error percentage. Green means healthy; yellow is warning territory; red requires immediate attention.
Query (PromQL):
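A CPU saturation query can compare usage against configured limits, using cAdvisor and kube-state-metrics metrics that kube-prometheus-stack scrapes by default. The `task-api.*` pod regex is an assumption about your Deployment's naming:

```promql
# CPU usage as a percentage of the container CPU limit
# (pod name pattern "task-api.*" is an assumption)
100 * sum(rate(container_cpu_usage_seconds_total{namespace="default", pod=~"task-api.*"}[5m]))
    / sum(kube_pod_container_resource_limits{namespace="default", pod=~"task-api.*", resource="cpu"})
```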
Panel Settings:
Output (Visual): A gauge showing CPU utilization as percentage of limits. High saturation (>85%) indicates your pods are resource-constrained and may need scaling.
Hardcoding namespace="default" in every query limits reusability. Dashboard variables let users filter dynamically.
Configure:
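A typical setup is a Query-type variable named `namespace` backed by a label-discovery query. The use of `kube_pod_info` assumes kube-state-metrics is installed (it ships with kube-prometheus-stack):

```promql
# Variable query: list every namespace that has pods
label_values(kube_pod_info, namespace)
```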
Output: A dropdown appears at the top of your dashboard. Selecting a namespace filters all panels.
Add another variable:
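For example, a `service` variable chained to the namespace selection. Using the `job` label assumes your ServiceMonitor names scrape targets after the service; adjust if your labels differ:

```promql
# Variable query: list scrape jobs in the currently selected namespace
label_values(up{namespace="$namespace"}, job)
```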
This variable depends on the namespace selection: Grafana refreshes the service list whenever the selected namespace changes.
Update your panel queries to use variables:
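For example, the latency panel's query becomes (metric and label names are the same assumptions as before):

```promql
# p95 latency, filtered by the dashboard's namespace and service variables
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace="$namespace", job="$service"}[5m])) by (le)
)
```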
Now the same dashboard works for any service in any namespace.
Everything you built through the UI exists as JSON. Export your dashboard:
Here's a simplified version of the Task API Golden Signals dashboard:
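The sketch below shows the shape of the exported JSON with a single panel; a real export includes layout (`gridPos`), datasource references, and more panels. Metric names remain the same assumptions as earlier:

```json
{
  "title": "Task API - Golden Signals",
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)"
      }
    ]
  },
  "panels": [
    {
      "title": "p95 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))"
        }
      ],
      "fieldConfig": { "defaults": { "unit": "s" } }
    }
  ]
}
```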
Store this JSON in your Git repository under observability/dashboards/task-api-golden-signals.json. Deploy it via ConfigMap with the kube-prometheus-stack's dashboard provisioning.
Don't reinvent the wheel. Grafana.com hosts thousands of community dashboards. For Kubernetes monitoring, the most popular include:
Output: The "Kubernetes Pods" dashboard appears with pre-built panels for CPU, memory, network, and filesystem metrics per pod.
Community dashboards are starting points. Customize for your needs:
Follow these patterns for dashboards that scale:
Group related panels into collapsible rows:
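In the dashboard JSON, a row is itself a panel with `"type": "row"`; collapsed rows nest their panels inside. A minimal sketch:

```json
{ "type": "row", "title": "Golden Signals", "collapsed": false, "panels": [] }
```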
Always specify units in panel settings:
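Units live under each panel's `fieldConfig`. Grafana ships unit identifiers such as `s` (seconds), `reqps` (requests/sec), and `percent`:

```json
"fieldConfig": { "defaults": { "unit": "reqps" } }
```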
Set thresholds based on SLOs:
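Thresholds also live in `fieldConfig`. The yellow/red values below are illustrative; derive yours from your actual SLOs (here, an assumed error-rate panel where 1% warns and 5% pages):

```json
"thresholds": {
  "mode": "absolute",
  "steps": [
    { "color": "green", "value": null },
    { "color": "yellow", "value": 1 },
    { "color": "red", "value": 5 }
  ]
}
```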
Add annotations for deployments and incidents:
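One way to annotate deployments is a Prometheus-backed annotation query that fires whenever a Deployment's observed generation changes. The `task-api` deployment name is an assumption:

```json
{
  "name": "Deployments",
  "datasource": "Prometheus",
  "enable": true,
  "expr": "changes(kube_deployment_status_observed_generation{deployment=\"task-api\"}[1m]) > 0",
  "titleFormat": "Deployment"
}
```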
This marks when deployments occurred, correlating code changes with metric changes.
Add dashboard links so operators can drill down:
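Dashboard-level links are a JSON array; the target URL below is a hypothetical per-pod detail dashboard, shown only to illustrate passing the current variable selection through:

```json
"links": [
  { "title": "Pod detail", "type": "link", "url": "/d/pod-detail?var-namespace=$namespace" }
]
```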
For production, deploy dashboards via Kubernetes ConfigMaps:
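A minimal ConfigMap sketch follows; the metadata name and namespace are assumptions, and the `data` value is a placeholder for the exported dashboard JSON:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: task-api-golden-signals
  namespace: monitoring          # must be a namespace the Grafana sidecar watches
  labels:
    grafana_dashboard: "1"       # the label the sidecar looks for
data:
  task-api-golden-signals.json: |-
    { ... paste the exported dashboard JSON here ... }
```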
The kube-prometheus-stack's Grafana sidecar watches for ConfigMaps with the grafana_dashboard: "1" label and auto-imports them.
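Apply the manifest with kubectl (the file path mirrors the repository layout suggested earlier and is an assumption):

```sh
kubectl apply -f observability/dashboards/task-api-golden-signals-configmap.yaml
```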
Output:
Within seconds, the dashboard appears in Grafana without manual import.
Now collaborate with AI to extend your dashboard capabilities.
Setup: You have the Task API Golden Signals dashboard from this lesson. You want to add per-endpoint breakdowns and a summary table.
Prompt 1: Per-Endpoint Latency Breakdown
What you're learning: Using by (endpoint) in PromQL aggregations to create multi-line charts where each line represents a different endpoint.
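For instance, adding `endpoint` to the `by` clause of the earlier p95 query yields one series per endpoint (the `endpoint` label name is an assumption about your instrumentation):

```promql
# One p95 latency series per endpoint
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace="$namespace"}[5m])) by (le, endpoint)
)
```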
Prompt 2: Summary Table Panel
What you're learning: Grafana's Table panel type with instant queries (not range queries) and value mappings for status colors.
Prompt 3: Dashboard for Multiple Services
What you're learning: Dashboard architecture patterns. The answer typically involves a hierarchy: fleet overview (all services on one page) linking to service-specific dashboards with full detail.
Safety note: When importing or creating dashboards, avoid exposing them publicly without authentication. Grafana dashboards can reveal infrastructure details. Always deploy behind authentication and consider read-only viewer roles for shared access.
You built an observability-cost-engineer skill in Lesson 0. Test and improve it based on what you learned about Grafana.
Ask yourself:
If you found gaps:
By the end of this lesson, your skill should generate production-ready Grafana dashboards that surface the four golden signals with proper filtering and thresholds.