USMAN’S INSIGHTS
AI ARCHITECT
  • Home
  • About
  • Thought Leadership
  • Book
Press / Contact
USMAN’S INSIGHTS
AI ARCHITECT
⌘F
HomeBook
HomeBookThe Replay Engine: Workflow Architecture
Previous Chapter
Dapr Workflows Overview
Next Chapter
Authoring Workflows
AI NOTICE: This is the table of contents for the SPECIFIC CHAPTER only. It is NOT the global sidebar. For all chapters, look at the main navigation.

On this page

16 sections

Progress0%
1 / 16

Muhammad Usman Akbar Entity Profile

Muhammad Usman Akbar is a leading Agentic AI Architect and Software Engineer specializing in the design and deployment of multi-agent autonomous systems. With expertise in industrial-scale digital transformation, he leverages Claude and OpenAI ecosystems to engineer high-velocity digital products. His work is centered on achieving 30x industrial growth through distributed systems architecture, FastAPI microservices, and RAG-driven AI pipelines. Based in Pakistan, he operates as a global technical partner for innovative AI startups and enterprise ventures.

USMAN’S INSIGHTS
AI ARCHITECT

Transforming businesses into autonomous AI ecosystems. Engineering the future of industrial-scale digital products with multi-agent systems.

30X Growth
AI-First
Innovation

Navigation

  • Home
  • Book
  • About
  • Contact
Let's Collaborate

Have a Project in Mind?

Let's build something extraordinary together. Transform your vision into autonomous AI reality.

Start Your Transformation

© 2026 Muhammad Usman Akbar. All rights reserved.

Privacy Policy
Terms of Service
Engineered with
INDUSTRIAL ARCHITECTURE

Workflow Architecture

You've learned that Dapr Workflows provide durable execution that survives failures. But how? When your workflow crashes mid-execution, how does it know where to resume? When your Kubernetes pod gets evicted, how does the workflow continue on a different node?

The answer lies in a clever architectural pattern: replay-based execution. Instead of trying to checkpoint every variable and program counter (like a traditional process migration), Dapr Workflows record what happened and replay the workflow code from the beginning, skipping over work that's already complete.

This approach is powerful, but it has a strict requirement: your workflow code must be deterministic. If your code produces different results on replay than it did originally, the workflow breaks. Understanding this constraint is critical before you write your first workflow.


The Workflow Engine: Inside the Sidecar

When you call WorkflowRuntime().start() in your Python application, you're connecting to the workflow engine embedded inside the Dapr sidecar. Here's what's happening architecturally:

text
+--------------------------------------------------------------------+ | Your Application | | +----------------------------------------------------------------+ | | | @wfr.workflow | | | | def task_processing(ctx, task): | | | | result = yield ctx.call_activity(validate_task, ...) | | | | # Your workflow logic | | | +----------------------------------------------------------------+ | | | | | gRPC stream | | v | +--------------------------------------------------------------------+ | Dapr Sidecar (daprd) | | +----------------------------------------------------------------+ | | | Workflow Engine (Durable Task Framework - Go implementation) | | | | | | | Uses internal actors for state management: | | | | - Workflow Actor: manages workflow instance state | | | | - Activity Actor: manages activity execution | | | +----------------------------------------------------------------+ | | | | | v | | State Store | | (Redis, PostgreSQL) | +--------------------------------------------------------------------+

The workflow engine is built on the Durable Task Framework, a battle-tested orchestration library. Dapr contributes a storage backend that uses internal actors to manage workflow state, giving you the scalability and distribution characteristics of the actor model.

Key insight: Your application code and the workflow engine communicate over a gRPC stream. The engine sends work items ("start workflow X", "run activity Y"), and your code returns execution results. All the durability magic happens in the sidecar, not in your application.


Replay-Based Execution: The VCR Analogy

Imagine recording a cooking show on a VCR (or DVR, if you're younger). You can pause at any point, turn off the TV, and later resume exactly where you left off. The recording doesn't store your current "state of understanding"; it stores the sequence of events, and you replay from the beginning, fast-forwarding through parts you've already seen.

Dapr Workflows work the same way:

  1. Recording Phase: As your workflow executes, every significant action (activity calls, timer creations, external events received) gets recorded as an "event" in the workflow's history.
  2. Crash: Your application pod crashes, or the node fails, or you simply restart for a deployment.
  3. Replay Phase: When the workflow resumes, the engine re-executes your workflow code from the beginning. But here's the trick: when the code hits an activity call, the engine checks history. If that activity already completed, it immediately returns the recorded result instead of executing the activity again.
text
Original Execution After Crash: Replay ================== ==================== def my_workflow(ctx, input): def my_workflow(ctx, input): │ │ v v result1 = yield call_activity(A) result1 = yield call_activity(A) │ │ │ [Activity A executes] │ [SKIP - return recorded result] │ [History: "A completed: X"] │ v v result2 = yield call_activity(B) result2 = yield call_activity(B) │ │ │ [Activity B executes] │ [SKIP - return recorded result] │ [History: "B completed: Y"] │ v v result3 = yield call_activity(C) result3 = yield call_activity(C) │ │ ╳ [CRASH before C completes] │ [Not in history - EXECUTE NOW] v return result3

The workflow "fast-forwards" through completed work by reading from history, then continues executing new work from where it left off.


State Persistence: What Gets Saved

The workflow engine persists state to your configured state store (Redis, PostgreSQL, etc.) through internal actors. Each workflow instance is managed by a Workflow Actor that stores several types of data:

State Key PatternContentsPurpose
inbox-NNNNNNPending messagesQueue of work items waiting to be processed
history-NNNNNNCompleted eventsAppend-only log of what happened (activity, timer)
metadataWorkflow metadataHistory length, inbox length, generation counter
customStatusUser-defined statusOptional status payload set by your workflow

When your workflow yields at an activity call, here's what happens:

  1. Intent recorded: The engine saves "intending to call activity X with input Y"
  2. Activity dispatched: An Activity Actor is created to run the activity
  3. Result recorded: When the activity completes, "activity X completed with result Z" is appended to history
  4. Checkpoint complete: Your workflow can now safely crash; the result is persisted

This append-only history model means the engine never modifies past events. It only adds new ones. This makes recovery simple: read the history, replay, continue.

State Store Record Counts

The number of records saved varies by workflow complexity:

Task TypeApproximate Records
Start workflow~5 records
Call activity~3 records
Create timer~3 records
Raise event~3 records
Start child workflow~8 records

A workflow with 10 chained activities might create 30-35 state store records. This is important for understanding state store load in high-volume scenarios.


Why Workflow Code Must Be Deterministic

Here's where the replay model has a strict requirement: if your workflow code doesn't behave identically on replay, the engine can't trust the history.

Consider this broken workflow:

python
import random from datetime import datetime def broken_workflow(ctx, input): # DON'T DO THIS: Non-deterministic decision if random.random() > 0.5: result = yield ctx.call_activity(path_a, input=input) else: result = yield ctx.call_activity(path_b, input=input) # DON'T DO THIS: Non-deterministic timestamp timestamp = datetime.utcnow().isoformat() return {"result": result, "processed_at": timestamp}

What goes wrong:

  1. Original execution: random.random() returns 0.7, so we call path_a. History records: "path_a completed with X"
  2. Crash and replay: random.random() returns 0.3 (different seed), so we try to call path_b. But history says path_a completed. Mismatch. Workflow fails.

Similarly, datetime.utcnow() returns a different value on replay than during original execution, causing the return value to differ.


Determinism Rules: DO and DON'T

The rules for deterministic workflow code are straightforward once you understand why:

CategoryDON'T (Non-Deterministic)DO (Deterministic)
Random valuesrandom.randint(1, 100)Generate in activity or pass as input
Current timedatetime.now()ctx.current_utc_datetime
External callshttpx.get("https://api...")yield ctx.call_activity(fetch_data)
Environment varsos.getenv("MY_CONFIG")Pass configuration as workflow input
UUIDsuuid.uuid4()Generate in activity
File I/Oopen("data.json").read()Read in activity, pass data as input

Correct Patterns

Here's how to fix the broken workflow:

python
def correct_workflow(ctx, input): # DO: Use context for current time timestamp = ctx.current_utc_datetime # DO: Make decisions based on input (deterministic) if input.get("priority") == high: result = yield ctx.call_activity(path_a, input=input) else: result = yield ctx.call_activity(path_b, input=input) # DO: If you need randomness, generate it in an activity if input.get("needs_random_value"): random_value = yield ctx.call_activity(generate_random, input={}) return {"result": result, "processed_at": timestamp.isoformat()}

Why Activities Can Be Non-Deterministic

Activities execute once and their results are recorded. On replay, activities don't re-execute; the engine returns the recorded result. This means activities can safely:

  • Call external APIs
  • Read environment variables
  • Generate random values
  • Query databases
  • Read files
python
def validate_task(ctx, task: dict) -> dict: """Activities CAN be non-deterministic - they're not replayed.""" # Safe: Call external service response = httpx.post("https://validation-service/api/validate", json=task) # Safe: Use current time (result is recorded) validated_at = datetime.utcnow().isoformat() return {"valid": response.json()["is_valid"], "validated_at": validated_at}

Detecting Determinism Violations

The workflow engine detects many violations at runtime. When replay produces different operations than history records, you'll see errors like:

  • NON_DETERMINISTIC: Workflow code changed or behaves differently
  • UNEXPECTED_ACTIVITY: Workflow tried to call different activity than recorded

Debugging approach:

  1. Check your workflow code for datetime.now(), random, os.getenv() calls
  2. Ensure all external service calls are in activities
  3. Verify you're not modifying workflow code while instances are running
  4. Use deterministic branching (based on input, not runtime values)

Workflow Latency Considerations

The durability model has performance implications:

Latency SourceCauseMitigation
State store writesEvery checkpoint writes to storeUse fast state store (Redis vs Postgres)
History growthLarge histories slow down replayUse continue_as_new for long runs
Reminder overheadActors use reminders for recoveryMonitor reminder count in cluster
Activity coordinationActivities route through actorsKeep activities focused and fast

Dapr Workflows are optimized for correctness over latency. They excel at operations measured in seconds to hours, where durability matters more than speed.


Key Vocabulary

TermDefinition
ReplayRe-executing code using history to skip completed work
HistoryAppend-only log of events (activities, timers, external events)
CheckpointPoint where state is persisted; occurs at every yield
DeterminismProperty that code produces identical results given identical inputs
Workflow ActorInternal Dapr actor managing one workflow instance
Activity ActorInternal Dapr actor managing one activity invocation
Durable Task FrameworkGo library powering the Dapr workflow engine

Reflect on Your Skill

You extended your dapr-deployment skill in Module 7.0 to include workflow patterns. Does it understand workflow architecture?

Test Your Skill

text
Using my dapr-deployment skill, explain why I can't use datetime.utcnow() in my workflow code but I can use it in my activity code.

If your skill covers the replay-based execution model and the ctx.current_utc_datetime alternative, it's working correctly.


Try With AI

Prompt 1: Trace Replay Execution

text
Walk me through exactly what happens when a Dapr workflow crashes and restarts. My workflow has 3 activities: validate_task, assign_task, send_notification. It crashed after assign_task completed but before send_notification started. Show me: 1. What's in the workflow history at crash time 2. What happens during replay 3. How the workflow knows to skip validate_task and assign_task 4. When send_notification actually executes Include the state store keys that would be involved.

What you're learning: How to reason about workflow recovery. The AI helps you trace the exact sequence of events that makes durability work.


Prompt 2: Identify Determinism Violations

text
Review this workflow code and identify all determinism violations: def problematic_workflow(ctx, order: dict): # Get current timestamp order_time = datetime.now() # Call external service directly pricing = httpx.get(f"https://api/pricing/{order['item']}").json() # Make random routing decision import random if random.random() > 0.5: warehouse = "east" else: warehouse = "west" # Read configuration config = json.load(open("config.json")) result = yield ctx.call_activity(process_order, input={ "order": order, "pricing": pricing, "warehouse": warehouse, "config": config }) return {"order_id": order["id"], "processed_at": order_time.isoformat()} For each violation, explain: - Why it's non-deterministic - What breaks during replay - The correct alternative

Prompt 3: Design for Determinism

text
I need to build a workflow that: 1. Gets the current price from an external API 2. Generates a unique order ID 3. Makes a routing decision based on time of day (morning vs afternoon) 4. Retries failed steps automatically Help me design this workflow to be fully deterministic: - What should be workflow code vs activity code? - How do I handle the time-of-day routing decision? - How do I generate a unique ID deterministically? - Where should retry logic live? Show me the correct code structure with activities and workflow.

Safety Note: When debugging workflow failures, remember that history is your source of truth. If you suspect a determinism violation, compare your workflow code against the recorded history events. Don't modify workflow code while instances are still running.