You've learned that Dapr Workflows provide durable execution that survives failures. But how? When your workflow crashes mid-execution, how does it know where to resume? When your Kubernetes pod gets evicted, how does the workflow continue on a different node?
The answer lies in a clever architectural pattern: replay-based execution. Instead of trying to checkpoint every variable and program counter (as traditional process migration does), Dapr Workflows record what happened and replay the workflow code from the beginning, skipping over work that's already complete.
This approach is powerful, but it has a strict requirement: your workflow code must be deterministic. If your code produces different results on replay than it did originally, the workflow breaks. Understanding this constraint is critical before you write your first workflow.
When you call WorkflowRuntime().start() in your Python application, you're connecting to the workflow engine embedded inside the Dapr sidecar. Here's what's happening architecturally:
The workflow engine is built on the Durable Task Framework, a battle-tested orchestration library. Dapr contributes a storage backend that uses internal actors to manage workflow state, giving you the scalability and distribution characteristics of the actor model.
Key insight: Your application code and the workflow engine communicate over a gRPC stream. The engine sends work items ("start workflow X", "run activity Y"), and your code returns execution results. All the durability magic happens in the sidecar, not in your application.
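That division of labor can be sketched with a toy loop. Plain Python stands in for the gRPC stream here, and the work-item shape is illustrative rather than the engine's actual wire protocol:

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    kind: str      # e.g. "run_activity" -- illustrative, not the real wire format
    name: str
    payload: int

# Application side: executes work items and returns results.
# No durability logic lives here.
def app_execute(item: WorkItem) -> int:
    activities = {"double": lambda x: x * 2, "add_one": lambda x: x + 1}
    return activities[item.name](item.payload)

# "Engine" side: sends work items over the (simulated) stream and
# records each result before dispatching the next item.
def engine_run(items):
    history = []
    for item in items:
        result = app_execute(item)           # the "gRPC round trip"
        history.append((item.name, result))  # persisted before continuing
    return history

print(engine_run([WorkItem("run_activity", "double", 5),
                  WorkItem("run_activity", "add_one", 10)]))
# [('double', 10), ('add_one', 11)]
```

The key point the sketch preserves: the application only computes; the engine decides what runs next and records every outcome.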
Imagine recording a cooking show on a VCR (or DVR, if you're younger). You can pause at any point, turn off the TV, and later resume exactly where you left off. The recording doesn't store your current "state of understanding"; it stores the sequence of events, and you replay from the beginning, fast-forwarding through parts you've already seen.
Dapr Workflows work the same way:
The workflow "fast-forwards" through completed work by reading from history, then continues executing new work from where it left off.
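The fast-forward can be made concrete with a toy replay loop (self-contained Python, not the Durable Task implementation): the workflow is a generator, completed steps are answered from history, and only the first unrecorded step actually executes.

```python
# A workflow as a generator: each yield asks the engine to run an activity.
def workflow(ctx):
    a = yield ("charge_card", 100)
    b = yield ("reserve_stock", a)
    return b

def execute_activity(request):
    name, arg = request
    return f"{name}-done({arg})"

# Replay loop: feed recorded results back in; only when we run past the
# end of history does an activity actually execute, and its result is
# appended (never rewritten) before continuing.
def run_with_history(history):
    gen = workflow(None)   # ctx unused in this toy
    request = next(gen)
    step = 0
    try:
        while True:
            if step < len(history):
                result = history[step]              # fast-forward from history
            else:
                result = execute_activity(request)  # new work
                history.append(result)
            step += 1
            request = gen.send(result)
    except StopIteration as done:
        return done.value

# Original run "crashed" after the first activity completed:
history = [execute_activity(("charge_card", 100))]
# Recovery replays from the top, skips step 0, and finishes step 1:
final = run_with_history(history)
assert final == "reserve_stock-done(charge_card-done(100))"
```

Notice that `charge_card` never runs a second time; its recorded result is simply fed back into the generator.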
The workflow engine persists state to your configured state store (Redis, PostgreSQL, etc.) through internal actors. Each workflow instance is managed by a Workflow Actor that stores several types of data:
When your workflow yields at an activity call, here's what happens:
This append-only history model means the engine never modifies past events. It only adds new ones. This makes recovery simple: read the history, replay, continue.
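The lifecycle of a single activity call can be sketched as an append-only event log. The event names below mirror the Durable Task Framework's history events, but the real record shape is internal to the engine:

```python
# Toy append-only history for one activity call.
history = []

def append(event_type, **data):
    # Past events are never modified -- recovery is "read, replay, continue".
    history.append({"type": event_type, **data})

append("ExecutionStarted", input={"order_id": 42})
append("TaskScheduled", name="charge_card", task_id=0)   # workflow yielded here
append("TaskCompleted", task_id=0, result="payment-ok")  # activity finished
append("ExecutionCompleted", output="payment-ok")

assert [e["type"] for e in history] == [
    "ExecutionStarted", "TaskScheduled", "TaskCompleted", "ExecutionCompleted"]
```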
The number of records saved varies by workflow complexity:
A workflow with 10 chained activities might create 30-35 state store records. This is important for understanding state store load in high-volume scenarios.
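A rough back-of-the-envelope model consistent with the 30-35 figure above (the per-record breakdown here is an assumption for estimation, not documented engine behavior):

```python
def estimate_records(num_activities, per_activity=3, overhead=4):
    """Rough heuristic: assume roughly three state store records per
    activity (e.g. schedule, completion, state save) plus a few records
    of fixed per-instance overhead. Both parameters are assumptions."""
    return num_activities * per_activity + overhead

assert 30 <= estimate_records(10) <= 35   # matches the 30-35 figure above
```

For capacity planning, multiply this per-instance estimate by your expected workflow throughput to gauge state store write load.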
Here's where the replay model has a strict requirement: if your workflow code doesn't behave identically on replay, the engine can't trust the history.
Consider this broken workflow:
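A minimal sketch of such a workflow, written as a bare generator so the failure can be demonstrated without a running sidecar. The activity names and the `FakeCtx` harness are illustrative; in the real SDK this function would be registered with dapr-ext-workflow and receive a `DaprWorkflowContext`:

```python
import random
from datetime import datetime, timezone

def broken_order_workflow(ctx, order):
    # BROKEN: re-executes on every replay and returns a new value,
    # so the replayed code path no longer matches recorded history.
    discount = random.randint(1, 1_000_000)
    # BROKEN: wall-clock time differs between original run and replay.
    processed_at = datetime.now(timezone.utc)
    total = yield ctx.call_activity("apply_discount", input=discount)
    return {"total": total, "processed_at": processed_at.isoformat()}

# Two "replays" of the same workflow request different work:
class FakeCtx:
    def call_activity(self, name, input=None):
        return (name, input)  # surface what the workflow asked for

first = next(broken_order_workflow(FakeCtx(), {}))
second = next(broken_order_workflow(FakeCtx(), {}))
assert first != second   # replay diverges from the original: determinism violation
```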
What goes wrong: on the original run, random.randint() picks a value and the workflow schedules an activity with that value as input. On replay, random.randint() executes again and picks a different value, so the operation the replayed code requests no longer matches the operation recorded in history.
Similarly, datetime.utcnow() returns a different value on replay than during original execution, causing the return value to differ.
The rules for deterministic workflow code are straightforward once you understand why:

- Don't call nondeterministic APIs (random numbers, GUIDs, system time) directly in workflow code; use the context's replay-safe alternatives, such as ctx.current_utc_datetime.
- Don't perform I/O (network calls, file access, database queries) in the workflow body; move it into activities.
- Don't depend on anything that can change between executions, such as environment variables, global mutable state, or thread timing.
Here's how to fix the broken workflow:
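A deterministic rewrite, in the same bare-generator style (names and the `FakeCtx` harness remain illustrative; `ctx.current_utc_datetime` is the SDK's replay-safe timestamp):

```python
from datetime import datetime, timezone

def fixed_order_workflow(ctx, order):
    # Fixed: the random value is produced inside an activity, so on
    # replay the engine returns the recorded result instead of re-rolling.
    discount = yield ctx.call_activity("pick_discount")
    # Fixed: replay-safe timestamp supplied by the workflow context.
    processed_at = ctx.current_utc_datetime
    total = yield ctx.call_activity("apply_discount", input=discount)
    return {"total": total, "processed_at": processed_at.isoformat()}

# Replaying twice against the same history yields identical behavior:
class FakeCtx:
    current_utc_datetime = datetime(2024, 1, 1, tzinfo=timezone.utc)
    def call_activity(self, name, input=None):
        return (name, input)

def replay(history):
    gen = fixed_order_workflow(FakeCtx(), {})
    ops, op = [], next(gen)
    try:
        for recorded in history:
            ops.append(op)
            op = gen.send(recorded)
    except StopIteration as done:
        ops.append(done.value)
    return ops

history = [7, 93]  # recorded results: pick_discount -> 7, apply_discount -> 93
assert replay(history) == replay(history)
```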
Activities execute once and their results are recorded. On replay, activities don't re-execute; the engine returns the recorded result. This means activities can safely generate random numbers, read the current time, call external APIs, query databases, and perform any other nondeterministic or side-effecting work.
The workflow engine detects many violations at runtime. When replay produces operations that differ from what the history recorded, you'll see errors like:
Debugging approach:
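The engine's consistency check can be approximated in a few lines (toy code; the error text is illustrative, not the engine's actual message):

```python
# On replay, each operation the workflow emits must match the operation
# recorded in history at the same position.
def check_replay(recorded_ops, replayed_ops):
    for i, (rec, rep) in enumerate(zip(recorded_ops, replayed_ops)):
        if rec != rep:
            raise RuntimeError(
                f"Nondeterminism detected at step {i}: "
                f"history recorded {rec!r} but replay produced {rep!r}")

check_replay([("activity", "charge_card")], [("activity", "charge_card")])  # ok
try:
    check_replay([("activity", "charge_card")], [("activity", "send_email")])
except RuntimeError as err:
    print(err)
```

This is also a useful mental model when debugging: line up your workflow code, step by step, against the recorded history and find the first divergence.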
The durability model has performance implications: every activity call involves state store writes and actor round trips, so each step pays persistence latency, and recovery cost grows with the length of the workflow's history, since the entire event log is replayed.
Dapr Workflows are optimized for correctness over latency. They excel at operations measured in seconds to hours, where durability matters more than speed.
You extended your dapr-deployment skill in Module 7.0 to include workflow patterns. Does it understand workflow architecture?
If your skill covers the replay-based execution model and the ctx.current_utc_datetime alternative, it's working correctly.
What you're learning: How to reason about workflow recovery. The AI helps you trace the exact sequence of events that makes durability work.
Prompt 2: Identify Determinism Violations
Prompt 3: Design for Determinism
Safety Note: When debugging workflow failures, remember that history is your source of truth. If you suspect a determinism violation, compare your workflow code against the recorded history events. Don't modify workflow code while instances are still running.