USMAN’S INSIGHTS
AI ARCHITECT
Your Traces Stop at the App. The Sidecar Is Still a Black Box.
Muhammad Usman Akbar Entity Profile

Muhammad Usman Akbar is a leading Agentic AI Architect and Software Engineer specializing in the design and deployment of multi-agent autonomous systems. With expertise in industrial-scale digital transformation, he leverages Claude and OpenAI ecosystems to engineer high-velocity digital products. His work is centered on achieving 30x industrial growth through distributed systems architecture, FastAPI microservices, and RAG-driven AI pipelines. Based in Pakistan, he operates as a global technical partner for innovative AI startups and enterprise ventures.

© 2026 Muhammad Usman Akbar. All rights reserved.

Dapr Observability Integration

You've built your observability stack. Prometheus collects metrics. Jaeger visualizes traces. Loki aggregates logs. Your Task API endpoints are instrumented, and you can answer questions like "What's our P95 latency?" and "Why did that request fail?"

But something is invisible. Every request to your Dapr-enabled services goes through a sidecar. That sidecar calls Redis for state, Kafka for pub/sub, and other services for invocations. When a request is slow, is it your application code or the Dapr sidecar? When an actor method fails, did the method throw an error or did the state store time out? When a workflow step takes too long, which activity is the bottleneck?

Without Dapr observability integration, you see your application and you see your infrastructure, but the bridge between them is a black box. You're debugging half the story.

This lesson integrates Dapr's native observability into your existing stack. You'll configure sidecars to export metrics to Prometheus and traces to Jaeger. You'll learn the Dapr-specific metrics that reveal actor and workflow behavior. And you'll connect the dots between your application traces and Dapr's internal operations.

The Dapr Observability Gap

When you deployed Dapr, you gained powerful abstractions: state management, pub/sub, service invocation, actors, workflows. But every abstraction hides complexity, and hidden complexity is hard to debug.

Consider this trace from your Task API:

Specification
task-api: POST /tasks/create
└── [45ms] Total request time

What happened inside that 45ms? Did your application spend 40ms and Dapr 5ms? Or did your application spend 5ms and Dapr 40ms waiting for Redis? Without Dapr observability, you can't answer this.

With Dapr observability integrated:

Specification
task-api: POST /tasks/create
├── [2ms]  Application logic
├── [38ms] dapr: state/set (statestore)
│   └── [35ms] Redis SET operation
└── [5ms]  dapr: publish (pubsub)
    └── [3ms]  Kafka produce

Now you know: the bottleneck is Redis, not your code. You can optimize in the right place.
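The same reading of the trace can be done programmatically. The sketch below is illustrative only: the `Span` class and `top_level_breakdown` helper are hypothetical names, and the durations are copied from the example trace above rather than pulled from Jaeger.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str                      # span name as shown in the trace view
    duration_ms: float             # wall-clock duration of the span
    parent: Optional[str] = None   # name of the parent span, if any


def top_level_breakdown(spans: List[Span], root: str) -> Dict[str, float]:
    """Return each direct child's share of the time spent under `root`."""
    children = [s for s in spans if s.parent == root]
    total = sum(s.duration_ms for s in children)
    if total == 0:
        return {}
    return {s.name: s.duration_ms / total for s in children}


# Durations taken from the example trace in this section.
ROOT = "POST /tasks/create"
spans = [
    Span("Application logic", 2, parent=ROOT),
    Span("dapr: state/set (statestore)", 38, parent=ROOT),
    Span("Redis SET operation", 35, parent="dapr: state/set (statestore)"),
    Span("dapr: publish (pubsub)", 5, parent=ROOT),
    Span("Kafka produce", 3, parent="dapr: publish (pubsub)"),
]

shares = top_level_breakdown(spans, ROOT)
slowest = max(shares, key=shares.get)
print(f"{slowest}: {shares[slowest]:.0%} of request time")
```

Only direct children of the root are compared, so nested spans such as the Redis call are not double counted.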

Configuring Dapr Metrics

Dapr sidecars expose Prometheus metrics on port 9090 by default. Declare the settings explicitly anyway so they are visible in version control, then tell Prometheus where to scrape.

Step 1: Create the Dapr Configuration

The Configuration CRD controls observability for all sidecars that reference it:

yaml
# components/dapr-observability.yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
  namespace: default
spec:
  metric:
    enabled: true
    port: 9090
    path: /metrics
  tracing:
    samplingRate: "1"
    otel:
      endpointAddress: "jaeger-collector.monitoring:4317"
      isSecure: false
      protocol: grpc

Apply it:

bash
kubectl apply -f components/dapr-observability.yaml

Output:

Specification
configuration.dapr.io/dapr-observability created

Each field serves a specific purpose:

Field | Value | Purpose
--- | --- | ---
metric.enabled | true | Expose Prometheus metrics endpoint
metric.port | 9090 | Port for metrics (default)
metric.path | /metrics | Endpoint path (default)
tracing.samplingRate | "1" | Trace 100% of requests (use "0.1" for 10% in production)
tracing.otel.endpointAddress | jaeger-collector.monitoring:4317 | Where to send traces
tracing.otel.protocol | grpc | Use efficient gRPC protocol
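If you maintain separate dev and production settings, generating the manifest from code can keep the sampling rate and endpoint consistent. This is a minimal sketch, not part of Dapr: `build_dapr_configuration` is a hypothetical helper, and it emits JSON on the assumption that you pipe it to `kubectl apply -f -`, which accepts JSON as well as YAML.

```python
import json


def build_dapr_configuration(name, namespace, collector, sampling_rate):
    """Build a Dapr Configuration manifest as a plain dict."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    return {
        "apiVersion": "dapr.io/v1alpha1",
        "kind": "Configuration",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "metric": {"enabled": True},
            "tracing": {
                # The manifest in this lesson quotes the rate as a string.
                "samplingRate": format(sampling_rate, "g"),
                "otel": {
                    "endpointAddress": collector,
                    "isSecure": False,
                    "protocol": "grpc",
                },
            },
        },
    }


dev = build_dapr_configuration(
    "dapr-observability", "default", "jaeger-collector.monitoring:4317", 1.0
)
prod = build_dapr_configuration(
    "dapr-observability", "default", "jaeger-collector.monitoring:4317", 0.1
)

if __name__ == "__main__":
    # Example: python build_config.py | kubectl apply -f -
    print(json.dumps(prod, indent=2))
```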

Step 2: Reference Configuration in Deployments

Your applications must reference this Configuration via annotation:

yaml
# kubernetes/task-api-deployment.yaml
# (replicas, selector, and pod labels omitted for brevity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  namespace: default
spec:
  template:
    metadata:
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "task-api"
        dapr.io/app-port: "8000"
        dapr.io/config: "dapr-observability"  # Reference the Configuration
        dapr.io/log-as-json: "true"           # Structured logging for Loki
    spec:
      containers:
        - name: task-api
          image: task-api:latest
          ports:
            - containerPort: 8000

The critical annotation is dapr.io/config: "dapr-observability". Without it, the sidecar never loads this Configuration, so its tracing settings are ignored and no spans reach Jaeger.
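Because a missing annotation fails silently, a small pre-deploy check can help. The sketch below assumes the Deployment manifest has already been loaded into a dict (for example from `kubectl get deployment task-api -o json`); `missing_dapr_annotations` and `REQUIRED` are hypothetical names for this illustration.

```python
REQUIRED = {
    "dapr.io/enabled": "true",
    "dapr.io/config": "dapr-observability",
}


def missing_dapr_annotations(deployment, required=REQUIRED):
    """Return the required pod annotations that are absent or have another value."""
    annotations = (
        deployment.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("annotations")
        or {}
    )
    return sorted(key for key, value in required.items() if annotations.get(key) != value)


# A manifest that enables Dapr but forgets to reference the Configuration.
deployment = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "dapr.io/enabled": "true",
                    "dapr.io/app-id": "task-api",
                }
            }
        }
    }
}

print(missing_dapr_annotations(deployment))
```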

Step 3: Create PodMonitor for Dapr Sidecars

Dapr sidecars don't have their own Service objects — they run inside pods alongside your application. A ServiceMonitor won't find them. Use a PodMonitor to scrape pods directly:

yaml
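# monitoring/dapr-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: dapr-sidecars
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      # Pod label added by the Dapr sidecar injector.
      # dapr.io/enabled is an annotation, and selectors only match labels.
      dapr.io/sidecar-injected: "true"
  podMetricsEndpoints:
    - port: dapr-metrics  # container port name on daprd; PodMonitor matches by name, not number
      path: /metrics
      interval: 15s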

Apply and verify:

bash
kubectl apply -f monitoring/dapr-podmonitor.yaml

Output:

Specification
podmonitor.monitoring.coreos.com/dapr-sidecars created

Check Prometheus targets:

bash
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "dapr-sidecars")'

Output:

json
{
  "discoveredLabels": {
    "pod": "task-api-7b9f5c6d4-x2k9j",
    "container": "daprd"
  },
  "labels": {
    "job": "dapr-sidecars"
  },
  "scrapeUrl": "http://10.244.1.23:9090/metrics",
  "health": "up"
}
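The same check can be scripted so it runs in CI or a smoke test. This sketch uses only the standard library and assumes the port-forward from the previous command is active; `fetch_targets` and `unhealthy_targets` are hypothetical helper names, and the job label value is taken from the sample output above (your cluster may label the job differently).

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the Prometheus targets API response as a dict."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload, job):
    """Return scrape URLs for targets of `job` whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        t.get("scrapeUrl", "")
        for t in targets
        if t.get("labels", {}).get("job") == job and t.get("health") != "up"
    ]


# Sample payload shaped like the API response, for offline illustration.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "dapr-sidecars"},
             "scrapeUrl": "http://10.244.1.23:9090/metrics", "health": "up"},
            {"labels": {"job": "dapr-sidecars"},
             "scrapeUrl": "http://10.244.1.24:9090/metrics", "health": "down"},
            {"labels": {"job": "node-exporter"},
             "scrapeUrl": "http://10.244.1.2:9100/metrics", "health": "down"},
        ]
    },
}

print(unhealthy_targets(sample, "dapr-sidecars"))
```

Against a live cluster, replace `sample` with `fetch_targets()`.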

Dapr Tracing with OpenTelemetry Collector

The Configuration we created sends traces directly to Jaeger. But in production, you often want traces to flow through an OpenTelemetry Collector for processing, filtering, and routing.

Architecture with OTel Collector

Specification
Your App --> Dapr Sidecar --> OTel Collector --> Jaeger
                                    |
                                    +--> (future: Tempo, Datadog, etc.)
                                    |
                                    v
                               Prometheus

Configure Dapr to Send to OTel Collector

Update your Configuration to point to the collector:

yaml
# components/dapr-observability.yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
  namespace: default
spec:
  metric:
    enabled: true
  tracing:
    samplingRate: "1"
    otel:
      endpointAddress: "otel-collector.monitoring:4317"
      isSecure: false
      protocol: grpc

The collector then routes to Jaeger (or any backend). This lets you change backends without touching Dapr configuration.

Observability for Dapr Actors

Dapr Actors have their own metrics that reveal activation patterns, method durations, and pending call queues.

Key Actor Metrics

Metric | What It Measures | Why It Matters
--- | --- | ---
dapr_actor_invocations_total | Total actor method calls | Request volume per actor type and method
dapr_actor_pending_calls | Calls waiting in actor queue | Turn-based concurrency backlog
dapr_actor_active_count | Currently activated actors | Memory pressure indicator
dapr_actor_operation_duration_seconds | Method execution time | Performance per method
dapr_actor_timers_count | Active timers | Timer resource usage
dapr_actor_reminders_count | Active reminders | Reminder resource usage

PromQL Queries for Actors

Request rate by actor type and method:

promql
sum(rate(dapr_actor_invocations_total[5m])) by (actor_type, method)

Output:

Specification
{actor_type="ChatAgent", method="ProcessMessage"} 23.4
{actor_type="ChatAgent", method="GetHistory"} 8.7
{actor_type="TaskActor", method="UpdateStatus"} 15.2

95th percentile method duration:

promql
histogram_quantile(
  0.95,
  sum by (le, actor_type, method) (
    rate(dapr_actor_operation_duration_seconds_bucket[5m])
  )
)

Output:

Specification
{actor_type="ChatAgent", method="ProcessMessage"} 0.045
{actor_type="ChatAgent", method="GetHistory"} 0.012

ChatAgent.ProcessMessage is at 45ms P95; GetHistory is 12ms. If ProcessMessage suddenly jumps to 500ms, you know where to investigate.
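To see why the percentile query needs the bucket boundaries (the `le` label), it helps to look at how a quantile is estimated from cumulative buckets. The sketch below is a simplified re-implementation of the linear interpolation Prometheus applies, written for illustration; the bucket counts are made up and the function ignores edge cases that the real implementation handles.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls beyond the last finite bound; report that bound.
        return buckets[-2][0]

    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Hypothetical cumulative counts for one actor method over the window.
buckets = [(0.01, 50), (0.05, 95), (0.1, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(round(p95, 3), round(p99, 3))
```

The estimate can never be more precise than the bucket layout, so a P95 that sits on a bucket boundary may hide a wider spread.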

Pending calls (turn-based concurrency backlog):

promql
dapr_actor_pending_calls{actor_type="ChatAgent"}

Output:

Specification
{actor_type="ChatAgent", app_id="task-api"} 3

Three calls are waiting. If this number grows continuously, the actor can't keep up with demand.
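"Grows continuously" can be made concrete by fitting a line through recent samples and checking its slope. This is one possible heuristic, not a Dapr feature; `slope_per_sample`, `backlog_is_growing`, and the 0.5 threshold are assumptions chosen for the example.

```python
from typing import Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def backlog_is_growing(values: Sequence[float], min_slope: float = 0.5) -> bool:
    """True when pending calls rise by at least `min_slope` per scrape on average."""
    return slope_per_sample(values) >= min_slope


# Hypothetical pending-call samples, one per 15s scrape.
steady = [3, 2, 4, 3, 2, 3]
climbing = [3, 4, 6, 7, 9, 12]

print(backlog_is_growing(steady), backlog_is_growing(climbing))
```

A trend check like this tolerates short spikes that a fixed threshold would flag.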

Tracing Actor Method Calls

In Jaeger, search for traces from your Dapr-enabled service. Actor method calls appear as spans:

Specification
task-api: POST /chat/user123
└── [48ms] task-api: actor/ChatAgent/user123/method/ProcessMessage
    ├── [15ms] task-api: state/get (statestore)
    └── [25ms] task-api: state/set (statestore)

The trace shows the full flow: HTTP request to actor invocation to state operations. You can see that state operations account for most of the time.

Observability for Dapr Workflows

Dapr Workflows orchestrate multi-step processes. Observability reveals which steps are slow, which fail, and how long workflows take end-to-end.

Key Workflow Metrics

Metric | What It Measures | Why It Matters
--- | --- | ---
dapr_workflow_execution_count | Workflow executions started | Throughput
dapr_workflow_activity_execution_count | Activity invocations | Per-step volume
dapr_workflow_execution_duration_seconds | Total workflow duration | End-to-end performance
dapr_workflow_activity_duration_seconds | Activity duration | Per-step performance
dapr_workflow_failure_count | Failed workflows | Error rate
dapr_workflow_activity_failure_count | Failed activities | Per-step error rate

PromQL Queries for Workflows

Workflow execution rate by workflow type:

promql
sum(rate(dapr_workflow_execution_count[5m])) by (workflow_name)

Output:

Specification
{workflow_name="OrderProcessingWorkflow"} 12.3
{workflow_name="TaskApprovalWorkflow"} 4.5

Activity step duration (identify slow steps):

promql
histogram_quantile(
  0.95,
  sum by (le, activity_name) (
    rate(dapr_workflow_activity_duration_seconds_bucket[5m])
  )
)

Output:

Specification
{activity_name="SendEmail"} 0.250
{activity_name="UpdateDatabase"} 0.045
{activity_name="CallExternalAPI"} 1.200

CallExternalAPI takes 1.2 seconds at P95. That's your bottleneck.

Workflow failure rate:

promql
sum(rate(dapr_workflow_failure_count[5m])) by (workflow_name)
/
sum(rate(dapr_workflow_execution_count[5m])) by (workflow_name)

Output:

Specification
{workflow_name="OrderProcessingWorkflow"} 0.02
{workflow_name="TaskApprovalWorkflow"} 0.00

OrderProcessingWorkflow has a 2% failure rate. Drill into traces to find the failing step.
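The ratio is simple arithmetic, which makes it easy to reuse in a report or a custom alert. The sketch below assumes you already have per-workflow rates as dicts (for example from the Prometheus HTTP API); `failure_ratios`, `breaching`, and the 1% threshold are hypothetical. Unlike the PromQL division, which drops workflows that have no failure series, this version reports them as 0.

```python
from typing import Dict, List


def failure_ratios(failures: Dict[str, float], executions: Dict[str, float]) -> Dict[str, float]:
    """Failure rate divided by execution rate, per workflow name."""
    return {
        name: (failures.get(name, 0.0) / rate if rate > 0 else 0.0)
        for name, rate in executions.items()
    }


def breaching(ratios: Dict[str, float], threshold: float = 0.01) -> List[str]:
    """Workflow names whose failure ratio exceeds the threshold."""
    return sorted(name for name, ratio in ratios.items() if ratio > threshold)


# Rates per second, matching the example outputs in this section.
executions = {"OrderProcessingWorkflow": 12.3, "TaskApprovalWorkflow": 4.5}
failures = {"OrderProcessingWorkflow": 0.246}

ratios = failure_ratios(failures, executions)
print(breaching(ratios))
```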

Tracing Workflow Execution

Workflow traces show the full orchestration:

Specification
task-api: Start OrderProcessingWorkflow
└── [2.5s] OrderProcessingWorkflow
    ├── [45ms]   Activity: ValidateOrder
    ├── [200ms]  Activity: ReserveInventory
    ├── [1200ms] Activity: ProcessPayment   <-- Bottleneck
    ├── [250ms]  Activity: SendConfirmation
    └── [100ms]  Activity: UpdateAnalytics

The trace reveals that ProcessPayment dominates workflow duration. Optimize there first.

Correlating App Traces with Dapr Traces

Your application might already emit its own traces using OpenTelemetry. How do you connect them with Dapr's traces?

Trace Context Propagation

Dapr automatically propagates trace context (W3C Trace Context headers) through sidecars. When your app makes an HTTP call to localhost:3500, Dapr extracts the trace context and includes it in downstream operations.
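The header that carries this context is `traceparent`, defined by the W3C Trace Context specification as `version-traceid-parentid-flags` in lowercase hex. The sketch below shows the format using only the standard library; it is an illustration of the header layout, not Dapr's or OpenTelemetry's implementation, and the function names are hypothetical.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed values."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or parent id is invalid")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def child_traceparent(header: str) -> str:
    """Build the header for a downstream call: same trace id, new span id."""
    parent = parse_traceparent(header)
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"


# Example value from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(outgoing)["trace_id"])
```

The shared trace ID is what lets Jaeger stitch application and sidecar spans into one trace, so any call your code makes to the sidecar should carry the incoming header forward.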

For full correlation, instrument your FastAPI app with OpenTelemetry and export to the same Jaeger instance:

python
# main.py
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracing: batch spans and ship them over OTLP/gRPC
provider = TracerProvider()
exporter = OTLPSpanExporter(
    endpoint="otel-collector.monitoring:4317",
    insecure=True,
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Instrument FastAPI so every request produces a server span
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

Now your app's spans and Dapr's spans share the same trace ID. In Jaeger, you see the complete picture:

Specification
task-api: POST /tasks/create
├── [2ms]  FastAPI middleware
├── [1ms]  Application: validate_task()
├── [40ms] dapr: state/set (statestore)
│   └── [38ms] Redis SET
├── [5ms]  dapr: publish (pubsub)
│   └── [4ms]  Kafka produce
└── [1ms]  Application: format_response()

Your code (2ms + 1ms + 1ms = 4ms) versus Dapr (40ms + 5ms = 45ms). Crystal clear.

Dapr System Components Observability

The Dapr control plane components (dapr-operator, dapr-placement, dapr-sentry) also expose metrics. Monitor them to ensure platform health:

yaml
# monitoring/dapr-system-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: dapr-system
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - dapr-system
  selector:
    matchLabels:
      app.kubernetes.io/part-of: dapr
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Key system metrics:

Component | Metric | Purpose
--- | --- | ---
dapr-placement | dapr_placement_actor_table_entries | Actors registered in placement table
dapr-operator | dapr_operator_reconcile_duration_seconds | Component reconciliation performance
dapr-sentry | dapr_sentry_cert_sign_count | Certificate signing operations

Reflect on Your Skill

Your observability-cost-engineer skill should now include Dapr integration patterns. Test it:

Test Your Skill

Specification
Using my observability-cost-engineer skill, configure Dapr observability
for my Kubernetes cluster. I need:
- Metrics scraped by Prometheus from all Dapr sidecars
- Traces exported to Jaeger via OpenTelemetry
- Actor and workflow metrics visible in Grafana

Generate the Configuration CRD, PodMonitor, and explain how to verify
it's working.

Does your skill produce:

  • Complete Dapr Configuration with metrics and tracing enabled?
  • PodMonitor for scraping sidecar metrics?
  • Verification steps to confirm observability is working?

Identify Gaps

Ask yourself:

  • Can my skill explain the difference between ServiceMonitor and PodMonitor for Dapr sidecars?
  • Does it know the key actor metrics (dapr_actor_invocations_total, dapr_actor_pending_calls)?
  • Can it generate PromQL queries for workflow step duration analysis?
  • Does it understand trace context propagation between app and Dapr?

Improve Your Skill

If gaps exist:

Specification
My observability-cost-engineer skill needs better Dapr coverage.
Update it to include:
- Dapr Configuration CRD with OpenTelemetry tracing settings
- PodMonitor for scraping sidecar metrics (not ServiceMonitor)
- Key actor metrics and their meanings
- Key workflow metrics and their meanings
- Trace correlation between application and Dapr spans
- Sampling rate guidance (100% dev, 10% production)

Try With AI

Prompt 1: Configure End-to-End Dapr Observability

Specification
Help me configure complete Dapr observability for my Kubernetes cluster.

Current setup:
- Prometheus operator installed (kube-prometheus-stack)
- Jaeger deployed in monitoring namespace
- Dapr installed in dapr-system namespace
- My Task API uses Dapr for state, pub/sub, and service invocation

I need:
1. Dapr Configuration CRD that enables metrics and OpenTelemetry tracing
2. PodMonitor to scrape Dapr sidecar metrics
3. The deployment annotation to apply the configuration
4. Verification commands to confirm everything is working

Also explain: why PodMonitor instead of ServiceMonitor for Dapr sidecars?

What you're learning: The complete flow from Dapr configuration to Prometheus/Jaeger integration. The AI helps you understand why sidecars require PodMonitor (no dedicated Service) rather than ServiceMonitor.

Prompt 2: Debug Actor Performance Issues

Specification
My Dapr Actors are responding slowly. Users report 2-3 second response
times for ChatAgent actors that should respond in under 100ms.

I have Prometheus and Jaeger configured for Dapr. Walk me through
systematic debugging:
1. What PromQL queries identify which actor methods are slow?
2. How do I find if pending_calls is building up (turn-based backlog)?
3. In Jaeger, how do I trace an actor method to see if state operations are slow?
4. What's the difference between actor method time and state store time?

Give me specific queries and what the results would indicate.

What you're learning: Using Dapr-specific metrics and traces to diagnose actor performance. The AI guides you through metrics-then-traces workflow for root cause analysis.

Prompt 3: Monitor Dapr Workflow Health

Specification
I'm running Dapr Workflows for order processing. Some workflows take
30+ seconds when they should complete in 5 seconds. Others are failing
silently.

Help me build observability for these workflows:
1. PromQL query to find which activity steps are slowest
2. PromQL query to calculate workflow failure rate by workflow type
3. How to trace a specific workflow execution in Jaeger
4. Alerting rules for workflow step timeouts and failure thresholds

My workflow has these activities: ValidateOrder, ReserveInventory,
ProcessPayment, SendConfirmation. Which metrics tell me where to
investigate?

What you're learning: Workflow-specific observability patterns. The AI helps you translate workflow concepts (steps, activities, execution) into PromQL queries and tracing strategies.

Safety note: Dapr observability adds overhead. With samplingRate: "1" (100% tracing), every request generates trace data. In high-throughput production: reduce sampling to 10% or 1%, set resource limits on sidecars via annotations (dapr.io/sidecar-cpu-limit, dapr.io/sidecar-memory-limit), and monitor the observability pipeline itself.
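To get a feel for what a fractional sampling rate means, the sketch below shows one common approach: a deterministic decision derived from the trace ID, so every service that sees the same trace makes the same choice. This is an illustrative ratio-based sampler under stated assumptions; it is not Dapr's internal implementation, and `should_sample` is a hypothetical name.

```python
def should_sample(trace_id_hex: str, rate: float) -> bool:
    """Decide whether to keep a trace, based only on its ID and the rate."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if len(trace_id_hex) != 32:
        raise ValueError("trace id must be 32 hex characters")
    # Treat the low 8 bytes of the trace ID as a number in [0, 2**64).
    value = int(trace_id_hex[-16:], 16)
    # Keep the trace when that number falls in the lowest `rate` fraction.
    return value < rate * 2**64


low_id = "0" * 31 + "1"   # maps to the very bottom of the range
high_id = "f" * 32        # maps to the very top of the range

print(should_sample(low_id, 0.1), should_sample(high_id, 0.1))
```

With a rate of 1 every trace is kept, and with 0.1 roughly one in ten randomly generated trace IDs falls below the cutoff.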