Distributed Tracing with OpenTelemetry & Jaeger

Your Task API is running in Kubernetes. A user reports: "Creating a task takes 3 seconds, but it used to take 200ms." You check the Prometheus metrics from Lesson 2—latency is definitely high. But where? The request flows through your FastAPI service, then to Dapr sidecar, then to the database. Which hop is slow?

Metrics tell you THAT something is slow. Traces tell you WHERE.

A distributed trace follows a single request across every service it touches, recording timing for each operation. Instead of guessing which service is the bottleneck, you see exactly which function call or database query is causing the 3-second delay.

This lesson teaches you to instrument your applications with OpenTelemetry, visualize traces in Jaeger, and configure sampling strategies so you capture the traces you need without overwhelming your storage.


Understanding Distributed Tracing Concepts

What is a Trace?

A trace represents the complete journey of a single request through your system. Think of it as a detailed receipt that records every service that handled your request and how long each service took.

Specification
User Request: POST /tasks
│
├── [1] API Gateway (5ms)
│   │
│   └── [2] Task API (150ms)
│       │
│       ├── [3] Validate input (10ms)
│       │
│       ├── [4] Save to database (130ms)  ← BOTTLENECK!
│       │
│       └── [5] Publish event (8ms)
│
└── Total: 155ms (plus network overhead)

Each numbered item is a span—a single timed operation within the trace.

What is a Span?

A span represents one unit of work. Every span has:

| Field | Purpose | Example |
| --- | --- | --- |
| Name | What operation this represents | POST /tasks |
| Start time | When the operation began | 2025-01-15T10:30:00.123Z |
| Duration | How long it took | 150ms |
| Trace ID | Unique identifier for the entire request | abc123... |
| Span ID | Unique identifier for this operation | xyz789... |
| Parent Span ID | Which span called this one | def456... |
| Attributes | Key-value metadata | http.method=POST, task.id=42 |
| Status | Success or error | OK or ERROR |
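
To make these fields concrete, here is a minimal sketch that creates one span and reads its identifiers. It assumes the tracer provider setup shown later in this lesson; the printed IDs differ on every run.

python
# Minimal sketch: the span fields from the table above, seen from code.
# Assumes the TracerProvider setup from later in this lesson is in place.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("POST /tasks") as span:
    ctx = span.get_span_context()
    print(f"Trace ID: {ctx.trace_id:032x}")    # shared by every span in the request
    print(f"Span ID:  {ctx.span_id:016x}")     # unique to this operation
    span.set_attribute("http.method", "POST")  # attributes: key-value metadata
    span.set_attribute("task.id", 42)
    span.set_status(trace.Status(trace.StatusCode.OK))  # status: OK or ERROR
# Start time and duration are recorded automatically when the span ends.

Output: (Two printed hex IDs; values vary per run)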

Context Propagation

When Service A calls Service B, how does Service B know it's part of the same trace?

Context propagation is the mechanism that passes trace context (trace ID, parent span ID) between services. OpenTelemetry handles this automatically by injecting headers into outgoing HTTP requests:

Specification
Service A                              Service B
┌─────────────────┐                    ┌─────────────────┐
│ Create span     │    HTTP Request    │ Extract context │
│ Inject context  │  ───────────────►  │ Create child    │
│ into headers    │   traceparent:     │ span            │
│                 │   00-abc123-xyz789 │                 │
└─────────────────┘                    └─────────────────┘

The traceparent header (part of the W3C Trace Context standard) carries:

  • Trace ID: abc123...
  • Parent span ID: xyz789...

Service B extracts this context, creating a child span that's automatically linked to Service A's span.
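
The httpx instrumentation you install in the next section performs this injection automatically, but the mechanism is easy to see by hand. A minimal sketch, assuming the tracing setup from later in this lesson (the service URL is a placeholder):

python
# Manual context propagation (a sketch; with opentelemetry-instrumentation-httpx
# active, this header injection happens automatically).
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_service_b() -> httpx.Response:
    with tracer.start_as_current_span("call_service_b"):
        headers: dict = {}
        inject(headers)  # writes the W3C traceparent header for the current span
        # Service B extracts this header and creates a linked child span
        return httpx.get("http://service-b:8000/items", headers=headers)

Output: (No output; the outgoing request carries a traceparent header)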


Installing OpenTelemetry for FastAPI

Required Packages

Add these dependencies to your requirements.txt:

Specification
opentelemetry-api
opentelemetry-sdk
opentelemetry-instrumentation-fastapi
opentelemetry-exporter-otlp
opentelemetry-instrumentation-httpx
opentelemetry-instrumentation-sqlalchemy

Output: (No output—these are dependency declarations)

Install with pip:

bash
pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-fastapi opentelemetry-exporter-otlp

Output:

Specification
Successfully installed opentelemetry-api-1.24.0 opentelemetry-sdk-1.24.0 ...

Auto-Instrumentation (Quick Start)

The fastest way to add tracing is auto-instrumentation. OpenTelemetry automatically instruments supported libraries (FastAPI, httpx, SQLAlchemy) without code changes.

Install the distro and bootstrap:

bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

Output:

Specification
Installing instrumentation packages...
Installed opentelemetry-instrumentation-fastapi
Installed opentelemetry-instrumentation-httpx
...

Run your app with auto-instrumentation:

bash
OTEL_SERVICE_NAME=task-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317 \
opentelemetry-instrument uvicorn main:app --host 0.0.0.0 --port 8000

Output:

Specification
INFO:     Started server process [12345]
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Every HTTP request to your FastAPI app now generates traces automatically.

Important limitation: Auto-instrumentation does NOT work with uvicorn --reload or --workers. For development with reload, use programmatic instrumentation.


Programmatic Instrumentation

For more control, configure OpenTelemetry in your code. This approach works with --reload and lets you create custom spans.

Basic Setup

Create a tracing.py module:

python
# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def setup_tracing(service_name: str, otlp_endpoint: str) -> None:
    """Configure OpenTelemetry tracing with OTLP exporter."""
    # Create resource with service metadata
    resource = Resource.create({SERVICE_NAME: service_name})

    # Create tracer provider
    provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(provider)

    # Configure OTLP exporter (gRPC to Jaeger)
    otlp_exporter = OTLPSpanExporter(
        endpoint=otlp_endpoint,
        insecure=True,  # Set False in production with TLS
    )

    # Add batch processor for efficient export
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

Output: (No output—this is module code)

Instrument FastAPI

In your main.py:

python
# main.py
import os

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from tracing import setup_tracing

# Configure tracing before creating app
setup_tracing(
    service_name=os.getenv("OTEL_SERVICE_NAME", "task-api"),
    otlp_endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
)

# Create FastAPI app
app = FastAPI(title="Task API")

# Instrument the app
FastAPIInstrumentor.instrument_app(app)


@app.get("/tasks")
async def list_tasks():
    return {"tasks": []}


@app.get("/health")
async def health():
    return {"status": "healthy"}

Output: (No output—this is application code)

Now every request to /tasks creates a span with:

  • HTTP method, URL, status code
  • Request/response timing
  • Auto-linked parent spans from incoming headers
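
instrument_app also accepts optional parameters. Here is a sketch of the same call from main.py with two of them; the app.client_host attribute name is just an illustration:

python
# Optional instrumentation knobs (a sketch; attribute names are illustrative).
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.trace import Span

def server_request_hook(span: Span, scope: dict) -> None:
    # Runs when the request span is created; add extra attributes here.
    client = scope.get("client") or ("unknown", 0)
    if span and span.is_recording():
        span.set_attribute("app.client_host", client[0])

FastAPIInstrumentor.instrument_app(
    app,
    excluded_urls="health,metrics",           # keep noisy endpoints out of traces
    server_request_hook=server_request_hook,  # enrich every request span
)

Output: (No output; this variant replaces the bare instrument_app call above)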

Creating Custom Spans

Auto-instrumentation captures HTTP boundaries, but what about internal operations? You need custom spans to see time spent in validation, database queries, or business logic.

Creating Spans for Business Operations

python
# main.py (continued)
from opentelemetry import trace
from pydantic import BaseModel

tracer = trace.get_tracer(__name__)


class TaskCreate(BaseModel):
    title: str
    description: str = ""


@app.post("/tasks")
async def create_task(task: TaskCreate):
    # Create a span for the entire operation
    with tracer.start_as_current_span("create_task") as span:
        # Add attributes to the span
        span.set_attribute("task.title", task.title)

        # Child span for validation
        with tracer.start_as_current_span("validate_task"):
            if len(task.title) < 3:
                span.set_status(trace.Status(trace.StatusCode.ERROR))
                raise ValueError("Title too short")

        # Child span for database save
        with tracer.start_as_current_span("save_to_database") as db_span:
            # Simulate database latency
            import time
            time.sleep(0.1)
            new_task_id = 42
            db_span.set_attribute("db.task_id", new_task_id)

        # Child span for event publishing
        with tracer.start_as_current_span("publish_event") as event_span:
            event_span.set_attribute("event.type", "task.created")
            # Publish to message broker

        span.set_attribute("task.id", new_task_id)
        return {"id": new_task_id, "title": task.title}

Output: (No output—this is application code that produces traces)

When you POST to /tasks, the trace shows:

Specification
POST /tasks (auto-instrumented)
├── create_task (custom)
│   ├── validate_task (custom)
│   ├── save_to_database (custom)  ← 100ms, shows db.task_id=42
│   └── publish_event (custom)

Adding Attributes and Events

Attributes are key-value pairs attached to spans. Events are timestamped log entries within a span.

python
with tracer.start_as_current_span("process_task") as span:
    # Attributes: structured metadata
    span.set_attribute("task.priority", "high")
    span.set_attribute("task.assignee", "user@example.com")

    # Events: timestamped milestones
    span.add_event("validation_started")
    # ... validation logic ...
    span.add_event("validation_completed", {"rules_checked": 5})

    # Record exceptions
    try:
        risky_operation()
    except Exception as e:
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise

Output: (No output—spans with attributes/events visible in Jaeger)


Deploying Jaeger for Trace Visualization

Jaeger is an open-source distributed tracing system that stores and visualizes traces. You've been configuring exporters to send traces to Jaeger—now deploy it.

Deploy Jaeger with Helm

bash
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

Output:

Specification
"jaegertracing" has been added to your repositories Update Complete. Happy Helming!

Install Jaeger:

bash
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring --create-namespace \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true

Output:

Specification
NAME: jaeger
NAMESPACE: monitoring
STATUS: deployed
...

Verify Jaeger is Running

bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=jaeger

Output:

Specification
NAME                       READY   STATUS    RESTARTS   AGE
jaeger-collector-xxx-yyy   1/1     Running   0          2m
jaeger-query-xxx-zzz       1/1     Running   0          2m
jaeger-agent-xxx-aaa       1/1     Running   0          2m

Access the Jaeger UI

Port-forward to access locally:

bash
kubectl port-forward svc/jaeger-query -n monitoring 16686:16686

Output:

Specification
Forwarding from 127.0.0.1:16686 -> 16686

Open http://localhost:16686 in your browser.
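
If the UI shows no traces yet, generate some traffic first. A small sketch, assuming the Task API is also reachable locally on port 8000 (for example through another port-forward); adjust the address to your setup:

python
# Generate sample requests so traces appear in Jaeger (a sketch; the base
# URL assumes a local port-forward of the Task API).
import httpx

with httpx.Client(base_url="http://localhost:8000") as client:
    for i in range(20):
        client.post("/tasks", json={"title": f"demo task {i}"})
        client.get("/tasks")

Output: (No output; 40 requests, each producing a trace subject to sampling)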


Analyzing Traces in Jaeger

Finding Traces

  1. Select your service (task-api) from the dropdown
  2. Click "Find Traces"
  3. Click on a trace to see the waterfall view

Reading the Waterfall

Specification
Trace: abc123def456    Duration: 156ms    Spans: 5
┌──────────────────────────────────────────────────────────────┐
│ POST /tasks                                            156ms  │
│ ├── create_task                                        150ms  │
│ │   ├── validate_task                                    2ms  │
│ │   ├── save_to_database                               130ms  │  ← SLOW!
│ │   └── publish_event                                    8ms  │
└──────────────────────────────────────────────────────────────┘

The horizontal bar lengths are proportional to duration. In this trace, save_to_database is clearly the bottleneck—130ms of a 156ms request.

Examining Span Details

Click on a span to see:

  • Tags: Attributes like db.task_id=42
  • Logs: Events with timestamps
  • Process: Service info, hostname
  • References: Parent span relationship

Finding Slow Traces

Use Jaeger's search to find problematic traces:

  • Min Duration: 1s (find traces over 1 second)
  • Tags: error=true (find failed requests)
  • Operation: POST /tasks (filter by endpoint)
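
You can script the same search against the HTTP API served by jaeger-query, the API that backs the UI. It is not a stable public contract, so treat the endpoint and parameters below as assumptions and verify them against your Jaeger version:

python
# Find slow task-api traces via the jaeger-query HTTP API (a sketch; the
# endpoint and parameters mirror what the UI uses and may change).
import httpx

resp = httpx.get(
    "http://localhost:16686/api/traces",
    params={
        "service": "task-api",
        "operation": "POST /tasks",
        "minDuration": "1s",   # only traces slower than one second
        "lookback": "1h",
        "limit": 20,
    },
)
for trace_data in resp.json().get("data", []):
    slowest = max(trace_data["spans"], key=lambda s: s["duration"])
    # Jaeger reports span durations in microseconds
    print(trace_data["traceID"], f"{slowest['duration'] / 1000:.0f} ms", slowest["operationName"])

Output: (One line per matching trace: trace ID, slowest span duration, operation name)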

Sampling Strategies

In production, tracing every request creates massive data volumes. If your service handles 10,000 requests/second, that's 864 million traces/day. Storage costs explode.

Sampling reduces volume by tracing only a percentage of requests.

Environment-Based Configuration

bash
# development: trace everything
OTEL_TRACES_SAMPLER=always_on

# production: trace 1%
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.01

Output: (No output—environment variable configuration)

Sampler Options

| Sampler | Behavior | Use Case |
| --- | --- | --- |
| always_on | Trace 100% | Development, testing |
| always_off | Trace 0% | Disabled |
| traceidratio | Trace X% | Production (0.01 = 1%) |
| parentbased_* | Respect parent decision | Distributed systems |

parentbased_traceidratio is recommended for production:

  • If an incoming request already has a trace (from upstream service), continue tracing it
  • If no parent trace, apply the ratio (sample 1% of new requests)

Configure Sampling in Code

python
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio
from opentelemetry.sdk.trace import TracerProvider

# For production: 10% sampling with parent-based propagation
sampler = ParentBasedTraceIdRatio(0.1)

provider = TracerProvider(
    resource=resource,
    sampler=sampler,
)

Output: (No output—configuration code)

Sampling Trade-offs

| Sampling Rate | Traces/Day (10K rps) | Storage Cost | Debugging Visibility |
| --- | --- | --- | --- |
| 100% | 864M | Very high | Complete |
| 10% | 86.4M | High | Good |
| 1% | 8.64M | Moderate | Limited |
| 0.1% | 864K | Low | Minimal |

Rule of thumb: Start with 100% in development, 1-10% in production. Increase temporarily when debugging issues.
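
To sanity-check these numbers for your own traffic, a back-of-envelope script helps. The ~2 KB average trace size below is an assumption; measure your real average before trusting the storage estimate.

python
# Back-of-envelope sampling math (a sketch; bytes_per_trace is an assumption).
def traces_per_day(requests_per_second: float, sample_rate: float) -> float:
    return requests_per_second * 86_400 * sample_rate

def storage_gb_per_day(requests_per_second: float, sample_rate: float,
                       bytes_per_trace: int = 2_048) -> float:
    return traces_per_day(requests_per_second, sample_rate) * bytes_per_trace / 1e9

for rate in (1.0, 0.1, 0.01, 0.001):
    print(f"{rate:>6.1%}  {traces_per_day(10_000, rate):>13,.0f} traces/day  "
          f"~{storage_gb_per_day(10_000, rate):,.0f} GB/day")

Output: (One line per sampling rate with estimated traces/day and GB/day)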


Kubernetes Deployment with Tracing

Configure your Task API deployment to send traces to Jaeger:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
    spec:
      containers:
        - name: task-api
          image: task-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: OTEL_SERVICE_NAME
              value: "task-api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://jaeger-collector.monitoring.svc:4317"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
            - name: OTEL_PYTHON_FASTAPI_EXCLUDED_URLS
              value: "health,metrics,readiness"

Output: (Deployment manifest—apply with kubectl apply -f)

Key environment variables:

  • OTEL_SERVICE_NAME: Identifies your service in Jaeger
  • OTEL_EXPORTER_OTLP_ENDPOINT: Jaeger collector address (gRPC port 4317)
  • OTEL_TRACES_SAMPLER: Sampling strategy
  • OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: Don't trace health checks (noise)

Dapr Observability Integration

If your Task API uses Dapr (from Sub-Module 5), Dapr automatically propagates trace context through its sidecar. Configure Dapr to send traces to the same Jaeger:

yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
  namespace: default
spec:
  tracing:
    samplingRate: "0.1"
    otel:
      endpointAddress: jaeger-collector.monitoring.svc:4317
      isSecure: false
      protocol: grpc

Output: (Dapr configuration—apply with kubectl apply -f)

Apply and restart your Dapr-enabled pods. Now traces flow through:

Specification
Client → Task API (OpenTelemetry) → Dapr Sidecar (auto-traced) → State Store

All spans share the same trace ID, visible in Jaeger as a complete request flow.
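
For example, a state-store call made through the sidecar's HTTP API joins the same trace, as long as the outgoing request carries the traceparent header. A sketch, assuming the httpx instrumentation is active and a state store component named statestore exists (both are assumptions):

python
# A traced call through the Dapr sidecar (a sketch; assumes
# HTTPXClientInstrumentor().instrument() has been called and a Dapr state
# store component named "statestore" is configured).
import httpx
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def save_task_state(task_id: int, payload: dict) -> None:
    with tracer.start_as_current_span("save_task_state"):
        async with httpx.AsyncClient() as client:
            # The instrumented client injects traceparent, so the sidecar's
            # spans share this request's trace ID.
            await client.post(
                "http://localhost:3500/v1.0/state/statestore",
                json=[{"key": f"task-{task_id}", "value": payload}],
            )

Output: (No output; the sidecar's spans appear under the same trace in Jaeger)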


Try With AI

Work through these scenarios with your AI assistant.

Prompt 1: Design Custom Spans

Specification
I'm instrumenting a payment processing function in my FastAPI app. The function: validate_card → check_fraud → charge_card → send_receipt. Help me design custom spans for this flow. What attributes should each span have? What would make debugging payment failures easier?

What you're learning: Thoughtful span design—creating spans that capture the information you'll actually need when debugging production issues.

Prompt 2: Troubleshoot Missing Traces

Specification
I instrumented my FastAPI app with OpenTelemetry and deployed Jaeger, but no traces appear.
My environment variables are:
OTEL_SERVICE_NAME=my-api
OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
What could be wrong? Walk me through a debugging checklist.

What you're learning: Systematic troubleshooting—common issues include wrong endpoint format (missing http://), network policies blocking traffic, or missing instrumentation calls.

Prompt 3: Choose a Sampling Strategy

Specification
My service handles 5,000 requests per second in production. I need to balance trace visibility for debugging with storage costs. My current Jaeger retention is 7 days. Help me calculate storage needs at different sampling rates and recommend a strategy.

What you're learning: Production trade-offs—understanding that observability has costs and choosing appropriate settings for your scale.

Safety note: Traces can contain sensitive data (user IDs, request parameters). Never send traces to endpoints outside your control. In production, ensure your Jaeger deployment is secured and data is encrypted in transit.


Reflect on Your Skill

You built an observability-cost-engineer skill in Lesson 0. Test and improve it based on what you learned.

Test Your Skill

Specification
Using my observability skill, instrument a FastAPI application with OpenTelemetry.
Does my skill include:
- TracerProvider configuration with OTLP exporter?
- FastAPIInstrumentor setup?
- Custom span creation with attributes?
- Sampling configuration for production?

Identify Gaps

Ask yourself:

  • Did my skill explain the trace → span → attribute hierarchy?
  • Did it cover context propagation and why it matters?
  • Did it include Jaeger deployment and UI navigation?
  • Did it address sampling trade-offs for production?

Improve Your Skill

If you found gaps:

Specification
My observability skill is missing distributed tracing patterns. Update it to include:
- OpenTelemetry instrumentation (auto and programmatic)
- Custom span creation with tracer.start_as_current_span
- Jaeger Helm deployment
- Sampling strategies with environment variables
- Kubernetes deployment configuration for OTLP export