Your Task API is running in Kubernetes. A user reports: "Creating a task takes 3 seconds, but it used to take 200ms." You check the Prometheus metrics from Lesson 2—latency is definitely high. But where? The request flows through your FastAPI service, then to the Dapr sidecar, then to the database. Which hop is slow?
Metrics tell you THAT something is slow. Traces tell you WHERE.
A distributed trace follows a single request across every service it touches, recording timing for each operation. Instead of guessing which service is the bottleneck, you see exactly which function call or database query is causing the 3-second delay.
This lesson teaches you to instrument your applications with OpenTelemetry, visualize traces in Jaeger, and configure sampling strategies so you capture the traces you need without overwhelming your storage.
A trace represents the complete journey of a single request through your system. Think of it as a detailed receipt that records every service that handled your request and how long each service took.
Each step in that journey is recorded as a span—a single timed operation within the trace.
A span represents one unit of work. Every span has: a name describing the operation, a trace ID shared by every span in the trace, its own span ID, a parent span ID (empty for the root span), start and end timestamps, and optional attributes and events.
When Service A calls Service B, how does Service B know it's part of the same trace?
Context propagation is the mechanism that passes trace context (trace ID, parent span ID) between services. OpenTelemetry handles this automatically by injecting headers into outgoing HTTP requests:
The traceparent header (part of the W3C Trace Context standard) carries four dash-separated fields: a version, the 32-hex-character trace ID, the 16-hex-character parent span ID, and trace flags such as the sampled bit.
Service B extracts this context, creating a child span that's automatically linked to Service A's span.
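You can see the structure by pulling a traceparent value apart by hand—a minimal sketch using the illustrative example value from the W3C specification:

```python
# Anatomy of a W3C traceparent header.
# The value below is the illustrative example from the W3C Trace Context spec.
traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, parent_span_id, trace_flags = traceparent.split("-")

print(version)         # "00" — spec version
print(trace_id)        # 32 hex chars identifying the whole trace
print(parent_span_id)  # 16 hex chars identifying the calling span
print(trace_flags)     # "01" — the sampled flag is set
```

In practice you never build or parse this header yourself—OpenTelemetry's propagators do it on every instrumented HTTP call.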
Add these dependencies to your requirements.txt:
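A sketch of the additions (versions are left unpinned here; pin them in a real project):

```text
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
```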
Output: (No output—these are dependency declarations)
Install with pip:
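For example:

```bash
pip install -r requirements.txt
```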
The fastest way to add tracing is auto-instrumentation. OpenTelemetry automatically instruments supported libraries (FastAPI, httpx, SQLAlchemy) without code changes.
Install the distro and bootstrap:
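A sketch of the two commands—opentelemetry-bootstrap scans your installed packages and adds the matching instrumentation libraries:

```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
```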
Run your app with auto-instrumentation:
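A sketch, assuming your app lives in main.py—the service name and endpoint here are illustrative:

```bash
OTEL_SERVICE_NAME=task-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument uvicorn main:app --host 0.0.0.0 --port 8000
```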
Every HTTP request to your FastAPI app now generates traces automatically.
Important limitation: Auto-instrumentation does NOT work with uvicorn --reload or --workers. For development with reload, use programmatic instrumentation.
For more control, configure OpenTelemetry in your code. This approach works with --reload and lets you create custom spans.
Create a tracing.py module:
Output: (No output—this is module code)
In your main.py:
Output: (No output—this is application code)
Now every request to /tasks creates a span that records the HTTP method, route, and status code as attributes, along with the request's duration.
Auto-instrumentation captures HTTP boundaries, but what about internal operations? You need custom spans to see time spent in validation, database queries, or business logic.
Output: (No output—this is application code that produces traces)
When you POST to /tasks, the trace shows the request's root span with a child span for each custom operation—validation, the database write—each with its own duration.
Attributes are key-value pairs attached to spans. Events are timestamped log entries within a span.
Output: (No output—spans with attributes/events visible in Jaeger)
Jaeger is an open-source distributed tracing system that stores and visualizes traces. You've been configuring exporters to send traces to Jaeger—now deploy it.
Install Jaeger:
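One way to install it with Helm—the release name and namespace are assumptions, and for production you should review the chart's storage options:

```bash
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger \
  --namespace observability --create-namespace
```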
Port-forward to access locally:
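Assuming the Helm chart's default query service name and namespace:

```bash
kubectl port-forward -n observability svc/jaeger-query 16686:16686
```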
Open http://localhost:16686 in your browser.
The horizontal bar lengths are proportional to duration. In this trace, save_to_database is clearly the bottleneck—130ms of a 156ms request.
Click on a span to see its tags (attributes), logs (events), and process metadata such as the service name.
Use Jaeger's search to find problematic traces: filter by service, operation, tags (for example, error=true), and minimum duration to surface only the slow requests.
In production, tracing every request creates massive data volumes. If your service handles 10,000 requests/second, that's 864 million traces/day. Storage costs explode.
Sampling reduces volume by tracing only a percentage of requests.
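The standard OpenTelemetry environment variables control this—for example, to keep 10% of traces:

```bash
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1   # sample 10% of new traces
```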
Output: (No output—environment variable configuration)
parentbased_traceidratio is recommended for production:
Output: (No output—configuration code)
Rule of thumb: Start with 100% in development, 1-10% in production. Increase temporarily when debugging issues.
Configure your Task API deployment to send traces to Jaeger:
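A sketch of the relevant Deployment excerpt—the image, namespace, and collector address are assumptions to adapt to your cluster:

```yaml
# Excerpt from the Task API Deployment (names and endpoint are assumptions).
spec:
  template:
    spec:
      containers:
        - name: task-api
          image: task-api:latest
          env:
            - name: OTEL_SERVICE_NAME
              value: "task-api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://jaeger-collector.observability:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
```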
Output: (Deployment manifest—apply with kubectl apply -f)
Key environment variables: OTEL_EXPORTER_OTLP_ENDPOINT tells the SDK where to send spans, OTEL_SERVICE_NAME controls how your service appears in Jaeger, and OTEL_TRACES_SAMPLER with OTEL_TRACES_SAMPLER_ARG configure sampling.
If your Task API uses Dapr (from Sub-Module 5), Dapr automatically propagates trace context through its sidecar. Configure Dapr to send traces to the same Jaeger:
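A sketch of such a Dapr Configuration resource—the collector address is an assumption; point it at your Jaeger OTLP gRPC endpoint:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: tracing
spec:
  tracing:
    samplingRate: "1"
    otel:
      endpointAddress: "jaeger-collector.observability:4317"
      isSecure: false
      protocol: grpc
```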
Output: (Dapr configuration—apply with kubectl apply -f)
Apply and restart your Dapr-enabled pods. Now traces flow through the entire chain: client → FastAPI service → Dapr sidecar → downstream services and the database.
All spans share the same trace ID, visible in Jaeger as a complete request flow.
Work through these scenarios with your AI assistant.
What you're learning: Thoughtful span design—creating spans that capture the information you'll actually need when debugging production issues.
What you're learning: Systematic troubleshooting—common issues include wrong endpoint format (missing http://), network policies blocking traffic, or missing instrumentation calls.
What you're learning: Production trade-offs—understanding that observability has costs and choosing appropriate settings for your scale.
Safety note: Traces can contain sensitive data (user IDs, request parameters). Never send traces to endpoints outside your control. In production, ensure your Jaeger deployment is secured and data is encrypted in transit.
You built an observability-cost-engineer skill in Lesson 0. Test and improve it based on what you learned.
Ask yourself:
If you found gaps: