Your ChatAgent actors are running in production. Users are chatting. State is persisting. Everything looks fine. Then someone asks: "Why did user-456's conversation take 3 seconds to respond yesterday at 14:32?"
You check the application logs. Nothing unusual. You check the pod status. All healthy. You check Redis. The data is there. But you can't see inside the actor method call. You can't trace the request from the HTTP endpoint through actor activation to the state store operation. You're debugging blind.
This is the observability gap. Without distributed tracing, you see individual components but not the flow between them. Without metrics, you know something is slow but not which actors or which methods. Without systematic debugging, you're guessing.
Dapr's observability features close this gap. In this lesson, you'll configure OpenTelemetry tracing to follow requests through actor method calls, deploy Prometheus to collect actor metrics, and learn debugging strategies that turn "it's slow somewhere" into "the ProcessMessage method on ChatAgent is slow due to state store latency."
Dapr integrates with the three observability pillars: logs (structured output from the sidecar and your application), metrics (Prometheus-format counters and histograms exposed by each sidecar), and traces (distributed traces that follow a request across services).
For actor debugging, tracing and metrics are most powerful. Tracing shows you the flow; metrics show you the patterns.
Dapr uses OpenTelemetry for distributed tracing. You configure it once in a Dapr Configuration resource, and every actor method call automatically generates spans.
Jaeger collects and visualizes traces. Deploy it to your Kubernetes cluster:
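A minimal all-in-one deployment is enough for development (a sketch, not production-grade; the service name jaeger-collector and the namespace are assumptions to adjust for your cluster):

```yaml
# jaeger.yaml -- single-pod Jaeger for development
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED   # accept OTLP spans from Dapr
              value: "true"
          ports:
            - containerPort: 16686   # UI
            - containerPort: 4317    # OTLP gRPC
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
spec:
  selector:
    app: jaeger
  ports:
    - name: otlp-grpc
      port: 4317
    - name: ui
      port: 16686
```

Apply it with kubectl apply -f jaeger.yaml.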
Output after applying:
Create a Dapr Configuration that enables OpenTelemetry export to Jaeger:
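A Configuration of this shape does the job (the Jaeger service address assumes the development deployment above lives in the default namespace; adjust as needed):

```yaml
# dapr-config.yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: observability
spec:
  tracing:
    samplingRate: "1"        # trace 100% of requests -- reduce in production
    otel:
      endpointAddress: "jaeger-collector.default.svc.cluster.local:4317"
      isSecure: false
      protocol: grpc
  metric:
    enabled: true            # expose Prometheus metrics from the sidecar
```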
The key settings are samplingRate, which controls what fraction of requests generate traces ("1" means 100%), and the collector endpoint the sidecar exports spans to.
Your actor service must reference this configuration. Update your deployment annotations:
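The pod template annotations might look like this (the app-id matches this lesson's service name; the app port is an assumption):

```yaml
template:
  metadata:
    annotations:
      dapr.io/enabled: "true"
      dapr.io/app-id: "chat-agent-service"
      dapr.io/app-port: "8000"          # assumed application port
      dapr.io/config: "observability"   # reference the Configuration above
```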
The critical annotation is dapr.io/config: "observability". This tells Dapr to apply the tracing and metrics configuration to this service's sidecar.
After deploying, generate some actor activity:
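One way to generate activity is to call an actor method directly through the sidecar's HTTP API (a sketch; it assumes port-forward access to the sidecar's default HTTP port 3500, and the request payload shape is an assumption):

```bash
kubectl port-forward deploy/chat-agent-service 3500:3500 &

# POST /v1.0/actors/{actorType}/{actorId}/method/{method}
curl -X POST \
  http://localhost:3500/v1.0/actors/ChatAgent/user-456/method/ProcessMessage \
  -H "Content-Type: application/json" \
  -d '{"message": "hello"}'
```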
Open Jaeger UI at http://localhost:16686. Select service chat-agent-service and find traces:
What you see in the trace:
Each actor method call creates a span. State store operations appear as child spans. You can see exactly where time was spent.
Output in Jaeger UI (simplified):
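The span tree has roughly this shape (span names and timings here are illustrative, not real output):

```
chat-agent-service  POST .../actors/ChatAgent/user-456/method/ProcessMessage  1.21s
└─ CallActor ChatAgent/ProcessMessage                                         1.18s
   ├─ state get  (load conversation history)                                  0.42s
   └─ state set  (save updated history)                                       0.31s
```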
While tracing shows individual requests, metrics show patterns over time. Prometheus scrapes metrics from Dapr sidecars.
Output after applying:
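On the scrape side, Dapr's sidecar injector adds prometheus.io/* annotations to pods (sidecar metrics are served on port 9090 by default), so a standard annotation-based job picks them up. A sketch:

```yaml
# prometheus.yml scrape job (sketch; pairs with Dapr's injected annotations)
scrape_configs:
  - job_name: "dapr-sidecars"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated for scraping
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Scrape the advertised metrics port instead of the app port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```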
Open Prometheus UI at http://localhost:9090. Query these actor-specific metrics:
Example Prometheus Query Output:
This tells you: 23 ChatAgent instances are currently active, with 142 ProcessMessage calls and 58 GetConversationHistory calls.
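Queries of this shape produce those numbers (metric names come from the Dapr runtime and can vary by version; verify them against your sidecar's /metrics output):

```promql
# Currently active actor instances, by type
dapr_runtime_actor_active_actors{actor_type="ChatAgent"}

# Calls currently queued behind actor locks -- useful for spotting overload
dapr_runtime_actor_pending_actor_calls{actor_type="ChatAgent"}
```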
Use metrics to answer operational questions:
"Which methods are slowest?"
Output:
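A latency breakdown typically uses histogram_quantile over the sidecar's HTTP latency histogram (a sketch; the exact metric and label names are version-dependent, so check your sidecar's /metrics output first):

```promql
# p95 request latency per path over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le, path) (
    rate(dapr_http_server_latency_bucket{app_id="chat-agent-service"}[5m])
  )
)
```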
"Are actors accumulating?"
If the active count keeps climbing without ever dropping, idle actors aren't being deactivated (check the actorIdleTimeout setting in your actor runtime configuration).
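Graphing the active-actor gauge over time makes the pattern obvious:

```promql
# Plot this over hours: a staircase that only goes up and never steps
# down suggests idle actors are never deactivated
dapr_runtime_actor_active_actors{actor_type="ChatAgent"}
```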
When something goes wrong with actors, use this systematic approach:
Symptoms: Requests to actor methods return 404 or time out.
Debugging checklist:
Expected output:
If your actor type isn't listed, it's not registered. Check your startup code.
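One way to produce that listing is the sidecar's metadata endpoint, which reports every registered actor type (a sketch; it assumes port-forward access to the sidecar's default HTTP port 3500 and jq installed):

```bash
kubectl port-forward deploy/chat-agent-service 3500:3500 &

# The "actors" array lists registered types and their active counts
curl -s http://localhost:3500/v1.0/metadata | jq '.actors'
```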
Look for:
Symptoms: Actor calls hang or time out after the default 60-second timeout.
Debugging approach:
If pending calls are high (> 10), the actor is processing slowly and requests are queuing. This is expected with turn-based concurrency but indicates the actor is overloaded.
High state store latency affects all actor operations.
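Both symptoms are visible in Prometheus (metric names are version-dependent sketches; confirm the exact histogram names in your sidecar's /metrics output):

```promql
# Calls queued behind the turn-based lock, per actor type
dapr_runtime_actor_pending_actor_calls{actor_type="ChatAgent"}

# p95 state store component latency over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (
    rate(dapr_component_io_latencies_ms_bucket{component="statestore"}[5m])
  )
)
```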
Symptoms: Actor loses state after deactivation/reactivation.
Debugging approach:
Check for:
If keys exist, state is persisting. If not, check StateManager calls in your actor code.
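To check directly, exec into Redis and scan for the actor's keys (a sketch; the deployment name and the key pattern app-id||actor-type||actor-id||key are assumptions about how the Redis state store prefixes actor state, so confirm against your store):

```bash
kubectl exec -it deploy/redis -- redis-cli --scan \
  --pattern 'chat-agent-service||ChatAgent||user-456||*'
```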
Here's a Tiltfile that deploys the full observability stack:
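A sketch of such a Tiltfile (the manifest file names and resource names are assumptions; replace them with your own):

```python
# Tiltfile -- observability stack alongside the actor service (sketch)
k8s_yaml('k8s/jaeger.yaml')
k8s_yaml('k8s/prometheus.yaml')
k8s_yaml('k8s/dapr-config.yaml')          # the "observability" Configuration
k8s_yaml('k8s/chat-agent-service.yaml')

# Port-forwards for the UIs
k8s_resource('jaeger', port_forwards='16686:16686')
k8s_resource('prometheus', port_forwards='9090:9090')
k8s_resource('chat-agent-service', resource_deps=['jaeger', 'prometheus'])
```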
After running tilt up, you have:
For quick actor inspection without querying Prometheus, use Dapr Dashboard:
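The Dapr CLI launches the dashboard against the current cluster (port 9999 here is an arbitrary choice):

```bash
# -k targets Kubernetes; -p sets the local port
dapr dashboard -k -p 9999
```

Then open http://localhost:9999.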
Navigate to Actors tab to see:
This is faster than Prometheus for "How many ChatAgent instances are active right now?"
Your dapr-deployment skill should now include observability configuration. Test it:
Does your skill produce:
Ask yourself:
If gaps exist:
Open your AI companion and explore actor observability scenarios.
Prompt 1: Configure End-to-End Tracing

What you're learning: How to configure distributed tracing end-to-end. The AI helps you understand the relationship between Configuration, deployment annotations, and what appears in the tracing UI.
Prompt 2: Diagnose Slow Actor Performance
What you're learning: Systematic performance debugging using observability data. The AI guides you through the diagnostic workflow, connecting metrics and traces to root causes.
Prompt 3: Set Up Alerting for Actor Health
What you're learning: Proactive monitoring with alerting. The AI helps you translate operational concerns into actionable alerting rules that catch problems before users report them.
Safety Note: Tracing and metrics add overhead. With samplingRate: "1" (100% tracing), every request generates trace data. In high-throughput production systems, this can impact performance and storage. Start with 100% sampling during development, then reduce to 10% or 1% in production. Monitor the observability system itself: if Jaeger or Prometheus can't keep up, you'll lose visibility when you need it most.