Distributed Tracing with OpenTelemetry & Jaeger

Your Task API is running in Kubernetes. A user reports: "Creating a task takes 3 seconds, but it used to take 200ms." You check the Prometheus metrics from Lesson 2—latency is definitely high. But where? The request flows through your FastAPI service, then to Dapr sidecar, then to the database. Which hop is slow?

Metrics tell you THAT something is slow. Traces tell you WHERE.

A distributed trace follows a single request across every service it touches, recording timing for each operation. Instead of guessing which service is the bottleneck, you see exactly which function call or database query is causing the 3-second delay.

This lesson teaches you to instrument your applications with OpenTelemetry, visualize traces in Jaeger, and configure sampling strategies so you capture the traces you need without overwhelming your storage.


Understanding Distributed Tracing Concepts

What is a Trace?

A trace represents the complete journey of a single request through your system. Think of it as a detailed receipt that records every service that handled your request and how long each service took.

Specification
User Request: POST /tasks
│
├── [1] API Gateway (5ms)
│   │
│   └── [2] Task API (150ms)
│       │
│       ├── [3] Validate input (10ms)
│       │
│       ├── [4] Save to database (130ms)  ← BOTTLENECK!
│       │
│       └── [5] Publish event (8ms)
│
└── Total: 155ms (plus network overhead)

Each numbered item is a span—a single timed operation within the trace.

What is a Span?

A span represents one unit of work. Every span has:

| Field | Purpose | Example |
| --- | --- | --- |
| Name | What operation this represents | POST /tasks |
| Start time | When the operation began | 2025-01-15T10:30:00.123Z |
| Duration | How long it took | 150ms |
| Trace ID | Unique identifier for the entire request | abc123... |
| Span ID | Unique identifier for this operation | xyz789... |
| Parent Span ID | Which span called this one | def456... |
| Attributes | Key-value metadata | http.method=POST, task.id=42 |
| Status | Success or error | OK or ERROR |
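
To make these fields concrete, here is a minimal sketch that creates one span and reads its identifiers. It assumes the tracer provider setup shown later in this lesson; the printed IDs differ on every run.

python
# Minimal sketch: the span fields from the table above, seen from code.
# Assumes the TracerProvider setup from later in this lesson is in place.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("POST /tasks") as span:
    ctx = span.get_span_context()
    print(f"Trace ID: {ctx.trace_id:032x}")    # shared by every span in the request
    print(f"Span ID:  {ctx.span_id:016x}")     # unique to this operation
    span.set_attribute("http.method", "POST")  # attributes: key-value metadata
    span.set_attribute("task.id", 42)
    span.set_status(trace.Status(trace.StatusCode.OK))  # status: OK or ERROR
# Start time and duration are recorded automatically when the span ends.

Output: (Two printed hex IDs; values vary per run)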

Context Propagation

When Service A calls Service B, how does Service B know it's part of the same trace?

Context propagation is the mechanism that passes trace context (trace ID, parent span ID) between services. OpenTelemetry handles this automatically by injecting headers into outgoing HTTP requests:

Specification
Service A                              Service B
┌─────────────────┐                    ┌─────────────────┐
│ Create span     │    HTTP Request    │ Extract context │
│ Inject context  │  ───────────────►  │ Create child    │
│ into headers    │   traceparent:     │ span            │
│                 │   00-abc123-xyz789 │                 │
└─────────────────┘                    └─────────────────┘

The traceparent header (part of the W3C Trace Context standard) carries:

  • Trace ID: abc123...
  • Parent span ID: xyz789...

Service B extracts this context, creating a child span that's automatically linked to Service A's span.
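
The httpx instrumentation you install in the next section performs this injection automatically, but the mechanism is easy to see by hand. A minimal sketch, assuming the tracing setup from later in this lesson (the service URL is a placeholder):

python
# Manual context propagation (a sketch; with opentelemetry-instrumentation-httpx
# active, this header injection happens automatically).
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_service_b() -> httpx.Response:
    with tracer.start_as_current_span("call_service_b"):
        headers: dict = {}
        inject(headers)  # writes the W3C traceparent header for the current span
        # Service B extracts this header and creates a linked child span
        return httpx.get("http://service-b:8000/items", headers=headers)

Output: (No output; the outgoing request carries a traceparent header)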


Installing OpenTelemetry for FastAPI

Required Packages

Add these dependencies to your requirements.txt:

Specification
opentelemetry-api
opentelemetry-sdk
opentelemetry-instrumentation-fastapi
opentelemetry-exporter-otlp
opentelemetry-instrumentation-httpx
opentelemetry-instrumentation-sqlalchemy

Output: (No output—these are dependency declarations)

Install with pip:

bash
pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-fastapi opentelemetry-exporter-otlp

Output:

Specification
Successfully installed opentelemetry-api-1.24.0 opentelemetry-sdk-1.24.0 ...

Auto-Instrumentation (Quick Start)

The fastest way to add tracing is auto-instrumentation. OpenTelemetry automatically instruments supported libraries (FastAPI, httpx, SQLAlchemy) without code changes.

Install the distro and bootstrap:

bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

Output:

Specification
Installing instrumentation packages...
Installed opentelemetry-instrumentation-fastapi
Installed opentelemetry-instrumentation-httpx
...

Run your app with auto-instrumentation:

bash
OTEL_SERVICE_NAME=task-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317 \
opentelemetry-instrument uvicorn main:app --host 0.0.0.0 --port 8000

Output:

Specification
INFO:     Started server process [12345]
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Every HTTP request to your FastAPI app now generates traces automatically.

Important limitation: Auto-instrumentation does NOT work with uvicorn --reload or --workers. For development with reload, use programmatic instrumentation.


Programmatic Instrumentation

For more control, configure OpenTelemetry in your code. This approach works with --reload and lets you create custom spans.

Basic Setup

Create a tracing.py module:

python
# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def setup_tracing(service_name: str, otlp_endpoint: str) -> None:
    """Configure OpenTelemetry tracing with OTLP exporter."""
    # Create resource with service metadata
    resource = Resource.create({SERVICE_NAME: service_name})

    # Create tracer provider
    provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(provider)

    # Configure OTLP exporter (gRPC to Jaeger)
    otlp_exporter = OTLPSpanExporter(
        endpoint=otlp_endpoint,
        insecure=True,  # Set False in production with TLS
    )

    # Add batch processor for efficient export
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

Output: (No output—this is module code)

Instrument FastAPI

In your main.py:

python
# main.py
import os

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from tracing import setup_tracing

# Configure tracing before creating app
setup_tracing(
    service_name=os.getenv("OTEL_SERVICE_NAME", "task-api"),
    otlp_endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
)

# Create FastAPI app
app = FastAPI(title="Task API")

# Instrument the app
FastAPIInstrumentor.instrument_app(app)


@app.get("/tasks")
async def list_tasks():
    return {"tasks": []}


@app.get("/health")
async def health():
    return {"status": "healthy"}

Output: (No output—this is application code)

Now every request to /tasks creates a span with:

  • HTTP method, URL, status code
  • Request/response timing
  • Auto-linked parent spans from incoming headers
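
instrument_app also accepts optional parameters. Here is a sketch of the same call from main.py with two of them; the app.client_host attribute name is just an illustration:

python
# Optional instrumentation knobs (a sketch; attribute names are illustrative).
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.trace import Span

def server_request_hook(span: Span, scope: dict) -> None:
    # Runs when the request span is created; add extra attributes here.
    client = scope.get("client") or ("unknown", 0)
    if span and span.is_recording():
        span.set_attribute("app.client_host", client[0])

FastAPIInstrumentor.instrument_app(
    app,
    excluded_urls="health,metrics",           # keep noisy endpoints out of traces
    server_request_hook=server_request_hook,  # enrich every request span
)

Output: (No output; this variant replaces the bare instrument_app call above)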

Creating Custom Spans

Auto-instrumentation captures HTTP boundaries, but what about internal operations? You need custom spans to see time spent in validation, database queries, or business logic.

Creating Spans for Business Operations

python
# main.py (continued)
from opentelemetry import trace
from pydantic import BaseModel

tracer = trace.get_tracer(__name__)


class TaskCreate(BaseModel):
    title: str
    description: str = ""


@app.post("/tasks")
async def create_task(task: TaskCreate):
    # Create a span for the entire operation
    with tracer.start_as_current_span("create_task") as span:
        # Add attributes to the span
        span.set_attribute("task.title", task.title)

        # Child span for validation
        with tracer.start_as_current_span("validate_task"):
            if len(task.title) < 3:
                span.set_status(trace.Status(trace.StatusCode.ERROR))
                raise ValueError("Title too short")

        # Child span for database save
        with tracer.start_as_current_span("save_to_database") as db_span:
            # Simulate database latency
            import time
            time.sleep(0.1)
            new_task_id = 42
            db_span.set_attribute("db.task_id", new_task_id)

        # Child span for event publishing
        with tracer.start_as_current_span("publish_event") as event_span:
            event_span.set_attribute("event.type", "task.created")
            # Publish to message broker

        span.set_attribute("task.id", new_task_id)
        return {"id": new_task_id, "title": task.title}

Output: (No output—this is application code that produces traces)

When you POST to /tasks, the trace shows:

Specification
POST /tasks (auto-instrumented)
├── create_task (custom)
│   ├── validate_task (custom)
│   ├── save_to_database (custom)  ← 100ms, shows db.task_id=42
│   └── publish_event (custom)

Adding Attributes and Events

Attributes are key-value pairs attached to spans. Events are timestamped log entries within a span.

python
with tracer.start_as_current_span("process_task") as span:
    # Attributes: structured metadata
    span.set_attribute("task.priority", "high")
    span.set_attribute("task.assignee", "user@example.com")

    # Events: timestamped milestones
    span.add_event("validation_started")
    # ... validation logic ...
    span.add_event("validation_completed", {"rules_checked": 5})

    # Record exceptions
    try:
        risky_operation()
    except Exception as e:
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise

Output: (No output—spans with attributes/events visible in Jaeger)


Deploying Jaeger for Trace Visualization

Jaeger is an open-source distributed tracing system that stores and visualizes traces. You've been configuring exporters to send traces to Jaeger—now deploy it.

Deploy Jaeger with Helm

bash
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

Output:

Specification
"jaegertracing" has been added to your repositories Update Complete. Happy Helming!

Install Jaeger:

bash
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring --create-namespace \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true

Output:

Specification
NAME: jaeger
NAMESPACE: monitoring
STATUS: deployed
...

Verify Jaeger is Running

bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=jaeger

Output:

Specification
NAME                       READY   STATUS    RESTARTS   AGE
jaeger-collector-xxx-yyy   1/1     Running   0          2m
jaeger-query-xxx-zzz       1/1     Running   0          2m
jaeger-agent-xxx-aaa       1/1     Running   0          2m

Access the Jaeger UI

Port-forward to access locally:

bash
kubectl port-forward svc/jaeger-query -n monitoring 16686:16686

Output:

Specification
Forwarding from 127.0.0.1:16686 -> 16686

Open http://localhost:16686 in your browser.
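
If the UI shows no traces yet, generate some traffic first. A small sketch, assuming the Task API is also reachable locally on port 8000 (for example through another port-forward); adjust the address to your setup:

python
# Generate sample requests so traces appear in Jaeger (a sketch; the base
# URL assumes a local port-forward of the Task API).
import httpx

with httpx.Client(base_url="http://localhost:8000") as client:
    for i in range(20):
        client.post("/tasks", json={"title": f"demo task {i}"})
        client.get("/tasks")

Output: (No output; 40 requests, each producing a trace subject to sampling)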


Analyzing Traces in Jaeger

Finding Traces

  1. Select your service (task-api) from the dropdown
  2. Click "Find Traces"
  3. Click on a trace to see the waterfall view

Reading the Waterfall

Specification
Trace: abc123def456    Duration: 156ms    Spans: 5
┌──────────────────────────────────────────────────────────────┐
│ POST /tasks                                            156ms  │
│ ├── create_task                                        150ms  │
│ │   ├── validate_task                                    2ms  │
│ │   ├── save_to_database                               130ms  │  ← SLOW!
│ │   └── publish_event                                    8ms  │
└──────────────────────────────────────────────────────────────┘

The horizontal bar lengths are proportional to duration. In this trace, save_to_database is clearly the bottleneck—130ms of a 156ms request.

Examining Span Details

Click on a span to see:

  • Tags: Attributes like db.task_id=42
  • Logs: Events with timestamps
  • Process: Service info, hostname
  • References: Parent span relationship

Finding Slow Traces

Use Jaeger's search to find problematic traces:

  • Min Duration: 1s (find traces over 1 second)
  • Tags: error=true (find failed requests)
  • Operation: POST /tasks (filter by endpoint)
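
You can script the same search against the HTTP API served by jaeger-query, the API that backs the UI. It is not a stable public contract, so treat the endpoint and parameters below as assumptions and verify them against your Jaeger version:

python
# Find slow task-api traces via the jaeger-query HTTP API (a sketch; the
# endpoint and parameters mirror what the UI uses and may change).
import httpx

resp = httpx.get(
    "http://localhost:16686/api/traces",
    params={
        "service": "task-api",
        "operation": "POST /tasks",
        "minDuration": "1s",   # only traces slower than one second
        "lookback": "1h",
        "limit": 20,
    },
)
for trace_data in resp.json().get("data", []):
    slowest = max(trace_data["spans"], key=lambda s: s["duration"])
    # Jaeger reports span durations in microseconds
    print(trace_data["traceID"], f"{slowest['duration'] / 1000:.0f} ms", slowest["operationName"])

Output: (One line per matching trace: trace ID, slowest span duration, operation name)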

Sampling Strategies

In production, tracing every request creates massive data volumes. If your service handles 10,000 requests/second, that's 864 million traces/day. Storage costs explode.

Sampling reduces volume by tracing only a percentage of requests.

Environment-Based Configuration

bash
# development: trace everything
OTEL_TRACES_SAMPLER=always_on

# production: trace 1%
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.01

Output: (No output—environment variable configuration)

Sampler Options

| Sampler | Behavior | Use Case |
| --- | --- | --- |
| always_on | Trace 100% | Development, testing |
| always_off | Trace 0% | Disabled |
| traceidratio | Trace X% | Production (0.01 = 1%) |
| parentbased_* | Respect parent decision | Distributed systems |

parentbased_traceidratio is recommended for production:

  • If an incoming request already has a trace (from upstream service), continue tracing it
  • If no parent trace, apply the ratio (sample 1% of new requests)

Configure Sampling in Code

python
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio
from opentelemetry.sdk.trace import TracerProvider

# For production: 10% sampling with parent-based propagation
sampler = ParentBasedTraceIdRatio(0.1)

provider = TracerProvider(
    resource=resource,
    sampler=sampler,
)

Output: (No output—configuration code)

Sampling Trade-offs

| Sampling Rate | Traces/Day (10K rps) | Storage Cost | Debugging Visibility |
| --- | --- | --- | --- |
| 100% | 864M | Very high | Complete |
| 10% | 86.4M | High | Good |
| 1% | 8.64M | Moderate | Limited |
| 0.1% | 864K | Low | Minimal |

Rule of thumb: Start with 100% in development, 1-10% in production. Increase temporarily when debugging issues.
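
To sanity-check these numbers for your own traffic, a back-of-envelope script helps. The ~2 KB average trace size below is an assumption; measure your real average before trusting the storage estimate.

python
# Back-of-envelope sampling math (a sketch; bytes_per_trace is an assumption).
def traces_per_day(requests_per_second: float, sample_rate: float) -> float:
    return requests_per_second * 86_400 * sample_rate

def storage_gb_per_day(requests_per_second: float, sample_rate: float,
                       bytes_per_trace: int = 2_048) -> float:
    return traces_per_day(requests_per_second, sample_rate) * bytes_per_trace / 1e9

for rate in (1.0, 0.1, 0.01, 0.001):
    print(f"{rate:>6.1%}  {traces_per_day(10_000, rate):>13,.0f} traces/day  "
          f"~{storage_gb_per_day(10_000, rate):,.0f} GB/day")

Output: (One line per sampling rate with estimated traces/day and GB/day)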


Kubernetes Deployment with Tracing

Configure your Task API deployment to send traces to Jaeger:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
    spec:
      containers:
        - name: task-api
          image: task-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: OTEL_SERVICE_NAME
              value: "task-api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://jaeger-collector.monitoring.svc:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
            - name: OTEL_PYTHON_FASTAPI_EXCLUDED_URLS
              value: "health,metrics,readiness"

Output: (Deployment manifest—apply with kubectl apply -f)

Key environment variables:

  • OTEL_SERVICE_NAME: Identifies your service in Jaeger
  • OTEL_EXPORTER_OTLP_ENDPOINT: Jaeger collector address (gRPC port 4317)
  • OTEL_TRACES_SAMPLER: Sampling strategy
  • OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: Don't trace health checks (noise)

Dapr Observability Integration

If your Task API uses Dapr (from Sub-Module 5), Dapr automatically propagates trace context through its sidecar. Configure Dapr to send traces to the same Jaeger:

yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
  namespace: default
spec:
  tracing:
    samplingRate: "0.1"
    otel:
      endpointAddress: jaeger-collector.monitoring.svc:4317
      isSecure: false
      protocol: grpc

Output: (Dapr configuration—apply with kubectl apply -f)

Apply and restart your Dapr-enabled pods. Now traces flow through:

Specification
Client → Task API (OpenTelemetry) → Dapr Sidecar (auto-traced) → State Store

All spans share the same trace ID, visible in Jaeger as a complete request flow.
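
For example, a state-store call made through the sidecar's HTTP API joins the same trace, as long as the outgoing request carries the traceparent header. A sketch, assuming the httpx instrumentation is active and a state store component named statestore exists (both are assumptions):

python
# A traced call through the Dapr sidecar (a sketch; assumes
# HTTPXClientInstrumentor().instrument() has been called and a Dapr state
# store component named "statestore" is configured).
import httpx
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def save_task_state(task_id: int, payload: dict) -> None:
    with tracer.start_as_current_span("save_task_state"):
        async with httpx.AsyncClient() as client:
            # The instrumented client injects traceparent, so the sidecar's
            # spans share this request's trace ID.
            await client.post(
                "http://localhost:3500/v1.0/state/statestore",
                json=[{"key": f"task-{task_id}", "value": payload}],
            )

Output: (No output; the sidecar's spans appear under the same trace in Jaeger)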


Try With AI

Work through these scenarios with your AI assistant.

Prompt 1: Design Custom Spans

Specification
I'm instrumenting a payment processing function in my FastAPI app. The function: validate_card → check_fraud → charge_card → send_receipt. Help me design custom spans for this flow. What attributes should each span have? What would make debugging payment failures easier?

What you're learning: Thoughtful span design—creating spans that capture the information you'll actually need when debugging production issues.

Prompt 2: Troubleshoot Missing Traces

Specification
I instrumented my FastAPI app with OpenTelemetry and deployed Jaeger, but no traces appear.
My environment variables are:
OTEL_SERVICE_NAME=my-api
OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
What could be wrong? Walk me through a debugging checklist.

What you're learning: Systematic troubleshooting—common issues include wrong endpoint format (missing http://), network policies blocking traffic, or missing instrumentation calls.

Prompt 3: Choose a Sampling Strategy

Specification
My service handles 5,000 requests per second in production. I need to balance trace visibility for debugging with storage costs. My current Jaeger retention is 7 days. Help me calculate storage needs at different sampling rates and recommend a strategy.

What you're learning: Production trade-offs—understanding that observability has costs and choosing appropriate settings for your scale.

Safety note: Traces can contain sensitive data (user IDs, request parameters). Never send traces to endpoints outside your control. In production, ensure your Jaeger deployment is secured and data is encrypted in transit.


Reflect on Your Skill

You built an observability-cost-engineer skill in Lesson 0. Test and improve it based on what you learned.

Test Your Skill

Specification
Using my observability skill, instrument a FastAPI application with OpenTelemetry.
Does my skill include:
- TracerProvider configuration with OTLP exporter?
- FastAPIInstrumentor setup?
- Custom span creation with attributes?
- Sampling configuration for production?

Identify Gaps

Ask yourself:

  • Did my skill explain the trace → span → attribute hierarchy?
  • Did it cover context propagation and why it matters?
  • Did it include Jaeger deployment and UI navigation?
  • Did it address sampling trade-offs for production?

Improve Your Skill

If you found gaps:

Specification
My observability skill is missing distributed tracing patterns. Update it to include:
- OpenTelemetry instrumentation (auto and programmatic)
- Custom span creation with tracer.start_as_current_span
- Jaeger Helm deployment
- Sampling strategies with environment variables
- Kubernetes deployment configuration for OTLP export