Workflow Patterns: Saga & Monitor

Name: Digital FTEs: Engineering — Achieving 10× Productivity
Author: Muhammad Usman Akbar

Your task processing system handles thousands of operations daily. Most complete successfully. But what happens when step 3 of a 5-step workflow fails? Do you leave steps 1 and 2 in an inconsistent state? What about a health monitoring job that needs to run forever, checking service status every 5 minutes? Can a workflow really run for months without running out of memory?

These are the problems that saga and monitor patterns solve. The saga pattern ensures transactional consistency across distributed operations without traditional database transactions. The monitor pattern creates eternal workflows that can run indefinitely without accumulating unbounded history. Together with human interaction patterns, they handle the complex, long-running scenarios that real agent systems encounter.

The Saga Pattern: Compensation for Consistency

Traditional database transactions follow ACID properties: if any step fails, everything rolls back automatically. But in distributed systems, each step might touch a different service, each with its own database. There's no global transaction coordinator.

The saga pattern solves this by recording compensating actions as you go. For each step that succeeds, you remember how to undo it. If a later step fails, you execute those compensations in reverse order.

Why Reverse Order Matters

Consider an order processing workflow:

Reserve inventory (compensation: release inventory)
Process payment (compensation: refund payment)
Ship order (compensation: cancel shipment)

If shipping fails, you undo the completed steps in reverse. The shipment itself never happened, so there is nothing to cancel; you refund the payment first, then release the inventory. If you compensated in forward order, you'd release inventory before refunding payment, potentially allowing someone else to buy that inventory while the original order is still only half undone.
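
The ordering above can be checked without any workflow engine. The following is a minimal, framework-free sketch: the step names mirror the order example, while `ORDER_STEPS`, `run_order_saga`, and the `fail_at` parameter are hypothetical names introduced only for illustration. A step registers its compensation only after it succeeds, so a failed step has nothing to undo.

```python
from typing import List, Optional, Tuple

# (action name, compensation name) pairs; illustrative stand-ins for real service calls
ORDER_STEPS: List[Tuple[str, str]] = [
    ("reserve_inventory", "release_inventory"),
    ("process_payment", "refund_payment"),
    ("ship_order", "cancel_shipment"),
]


def run_order_saga(fail_at: Optional[str] = None) -> List[str]:
    """Run the steps in order; on failure, undo completed steps newest-first.

    Returns a log of every action and compensation, in execution order.
    """
    log: List[str] = []
    completed: List[str] = []  # compensations for steps that actually succeeded
    try:
        for action, compensation in ORDER_STEPS:
            if action == fail_at:
                raise RuntimeError(f"{action} failed")
            log.append(action)
            completed.append(compensation)
    except RuntimeError as err:
        log.append(f"error: {err}")
        # reversed() walks from the most recent success back to the first
        for compensation in reversed(completed):
            log.append(compensation)
    return log
```

Calling `run_order_saga(fail_at="ship_order")` logs the two successful steps, the error, then `refund_payment` followed by `release_inventory`; `cancel_shipment` never appears because shipping never succeeded.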

Saga Implementation

Here's a task processing saga that handles failures gracefully:

python

import dapr.ext.workflow as wf
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TaskOrder:
    task_id: str
    title: str
    assignee: str
    priority: str

@dataclass
class SagaResult:
    status: str
    task_id: str
    error: str | None = None

def task_processing_saga(ctx: wf.DaprWorkflowContext, order: TaskOrder):
    """Saga workflow with compensation on failure."""
    compensations: List[Tuple[str, dict]] = []

    try:
        # Step 1: Create task record
        yield ctx.call_activity(create_task_record, input=order)
        compensations.append(("delete_task_record", {"task_id": order.task_id}))

        # Step 2: Reserve assignee capacity
        yield ctx.call_activity(reserve_assignee_capacity, input=order)
        compensations.append(("release_assignee_capacity", {"task_id": order.task_id, "assignee": order.assignee}))

        # Step 3: Send notification to assignee
        yield ctx.call_activity(notify_assignee, input=order)
        compensations.append(("send_cancellation_notice", {"task_id": order.task_id, "assignee": order.assignee}))

        # Step 4: Update dashboard (might fail due to external service)
        yield ctx.call_activity(update_dashboard, input=order)

        return SagaResult(status="success", task_id=order.task_id)

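    except Exception as e:
        # Compensate in reverse order: most recent successful step first
        failed_compensations: List[str] = []
        for comp_name, comp_data in reversed(compensations):
            try:
                yield ctx.call_activity(comp_name, input=comp_data)
            except Exception as comp_error:
                # Record the failure but keep compensating the remaining steps
                failed_compensations.append(f"{comp_name}: {comp_error}")
        error = str(e)
        if failed_compensations:
            error += "; compensation failures: " + ", ".join(failed_compensations)
        return SagaResult(status="failed", task_id=order.task_id, error=error)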

Key Saga Principles

Principle	Description
Track compensations	Don't wait until failure to figure out rollback
Compensate in reverse	Undo most recent operations first
Make compensations idempotent	Running a compensation twice should be safe
Handle compensation failures	Log and continue; don't stop mid-compensation
Keep it simple	Complex compensation is a design smell
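
To make the idempotency principle concrete, here is a small sketch of a compensation that is safe to run twice. `CapacityLedger` is a hypothetical in-memory stand-in for whatever store tracks assignee capacity; it is not part of Dapr or of the saga code above. The idea is that releasing a reservation that is already gone is a no-op rather than an error.

```python
from typing import Dict, Set


class CapacityLedger:
    """Hypothetical in-memory capacity store, used only to illustrate idempotency."""

    def __init__(self) -> None:
        # assignee -> set of task IDs currently holding capacity
        self.reserved: Dict[str, Set[str]] = {}

    def reserve(self, assignee: str, task_id: str) -> None:
        # Adding to a set is naturally idempotent
        self.reserved.setdefault(assignee, set()).add(task_id)

    def release(self, assignee: str, task_id: str) -> bool:
        """Release a reservation; return True only if something changed.

        A repeated call for the same task finds nothing to release and
        returns False instead of raising.
        """
        tasks = self.reserved.get(assignee, set())
        if task_id not in tasks:
            return False
        tasks.remove(task_id)
        return True

    def load(self, assignee: str) -> int:
        """Number of tasks currently reserved for the assignee."""
        return len(self.reserved.get(assignee, set()))
```

Keying the release on the task ID, rather than decrementing a counter, is what keeps a retried or replayed compensation from freeing capacity twice.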

The Monitor Pattern: Eternal Workflows

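Some workflows need to run forever: health monitors, SLA checkers, quota enforcers. Using a while True: loop inside the workflow is an anti-pattern because every activity call and timer in each iteration is appended to the workflow history, so the history keeps growing for as long as the loop runs.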

The continue_as_new method solves this. It restarts the workflow from the beginning with new state, discarding the accumulated history.

Monitor Implementation

python

from dataclasses import dataclass
from datetime import timedelta

import dapr.ext.workflow as wf

@dataclass
class MonitorState:
    job_id: str
    is_healthy: bool = True
    check_count: int = 0
    consecutive_failures: int = 0

def health_monitor_workflow(ctx: wf.DaprWorkflowContext, state: MonitorState):
    """Eternal monitoring workflow with continue_as_new."""
    # Check current status and count this check
    status = yield ctx.call_activity(check_service_status, input=state.job_id)
    state.check_count += 1

    # Determine sleep interval based on status
    if status == "healthy":
        state.is_healthy = True
        state.consecutive_failures = 0
        sleep_interval = timedelta(minutes=60)
    else:
        if state.is_healthy:
            state.is_healthy = False
            yield ctx.call_activity(send_alert, input={"job_id": state.job_id, "severity": "warning"})
        state.consecutive_failures += 1
        sleep_interval = timedelta(minutes=5)

    # Sleep until next check
    yield ctx.create_timer(sleep_interval)

    # Restart workflow with new state (keeps history bounded)
    ctx.continue_as_new(state)
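
One way to keep a monitor like this easy to test is to pull the decision logic into a pure function, so it can be exercised without a Dapr sidecar. The sketch below is an assumption about how that could look, not part of the original workflow: `next_monitor_step` is a hypothetical helper, the `MonitorState` class repeats the fields shown above so the block is self-contained, and incrementing `check_count` on every check is an illustrative choice.

```python
from dataclasses import dataclass, replace
from datetime import timedelta
from typing import Tuple


@dataclass
class MonitorState:
    # Same fields as the MonitorState used by the workflow
    job_id: str
    is_healthy: bool = True
    check_count: int = 0
    consecutive_failures: int = 0


def next_monitor_step(
    state: MonitorState, status: str
) -> Tuple[MonitorState, timedelta, bool]:
    """Return (new state, sleep interval, whether to send an alert) for one check."""
    if status == "healthy":
        new_state = replace(
            state,
            is_healthy=True,
            consecutive_failures=0,
            check_count=state.check_count + 1,
        )
        return new_state, timedelta(minutes=60), False

    # Alert only on the healthy -> unhealthy transition, not on every failed check
    should_alert = state.is_healthy
    new_state = replace(
        state,
        is_healthy=False,
        consecutive_failures=state.consecutive_failures + 1,
        check_count=state.check_count + 1,
    )
    return new_state, timedelta(minutes=5), should_alert
```

With this split, the workflow body only calls the status activity, applies the function, optionally sends the alert, sleeps on a timer, and hands the new state to continue_as_new.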

Human Interaction: Waiting for Approval

Real workflows often need human input: approvals, reviews, decisions. Your workflow pauses, waiting for an external event, ideally with a timeout.

Approval Workflow Implementation

python

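from datetime import timedelta

import dapr.ext.workflow as wf

# ApprovalRequest and ApprovalDecision are assumed to be defined elsewhere;
# this workflow only relies on ApprovalDecision exposing `approved` and `approver`.

def approval_workflow(ctx: wf.DaprWorkflowContext, request: ApprovalRequest):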
    """Workflow that waits for human approval with timeout."""
    # Request approval from manager
    yield ctx.call_activity(send_approval_request, input=request)

    # Wait for approval or timeout
    approval_event = ctx.wait_for_external_event("approval_received")
    timeout = ctx.create_timer(timedelta(days=3))

    winner = yield wf.when_any([approval_event, timeout])

    if winner == timeout:
        return {"status": "timeout", "reason": "No approval received within 3 days"}

    decision: ApprovalDecision = approval_event.get_result()
    if not decision.approved:
        return {"status": "rejected", "approver": decision.approver}

    # Approved - proceed with action
    yield ctx.call_activity(execute_approved_action, input=request)
    return {"status": "approved", "approver": decision.approver}
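
The workflow above only resumes once something raises the approval_received event for its instance. The sketch below shows one way the approving side might do that. It assumes `client` behaves like `dapr.ext.workflow.DaprWorkflowClient`, whose `raise_workflow_event` method takes an instance ID, an event name, and optional data; `submit_decision`, `build_decision`, and the payload shape are hypothetical. How the payload is deserialized inside the workflow (plain dict versus object with attributes) depends on the SDK's serialization, so treat the field access in the workflow as an assumption to verify.

```python
from typing import Any, Dict

# Must match the event name the workflow passes to wait_for_external_event
APPROVAL_EVENT = "approval_received"


def build_decision(approved: bool, approver: str) -> Dict[str, Any]:
    """Build the event payload; the field names mirror what the workflow reads."""
    return {"approved": approved, "approver": approver}


def submit_decision(client: Any, instance_id: str, approved: bool, approver: str) -> None:
    """Deliver a manager's decision to a waiting workflow instance.

    `client` is assumed to expose
    raise_workflow_event(instance_id, event_name, *, data=None),
    as DaprWorkflowClient does.
    """
    client.raise_workflow_event(
        instance_id=instance_id,
        event_name=APPROVAL_EVENT,
        data=build_decision(approved, approver),
    )
```

Passing the client in as a parameter keeps the function testable with a fake client; in a running service it would typically be called from the HTTP handler behind the approve and reject buttons.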

Pattern Comparison

Pattern	Use Case	Key Mechanism
Saga	Multi-step transactions needing rollback	Compensation list, reverse execution
Monitor	Eternal polling/checking	continue_as_new, bounded history
Human Interaction	Approval workflows, reviews	wait_for_external_event, timeout

Reflect on Your Skill

Does your dapr-deployment skill understand saga and monitor patterns?

Test Your Skill

text

Using my dapr-deployment skill, explain when I should use the saga pattern vs
just retrying failed operations. My task processing has 4 steps, and step 3
sometimes fails due to external API timeouts.

If your skill covers reverse-order compensation for sagas and continue_as_new for keeping workflow history bounded, it's working correctly.

Try With AI

Prompt 1: Design a Saga for Your Domain

text

I'm building a task management system with these steps:
1. Create task record in database
2. Reserve capacity from assignee's workload
3. Send notification to assignee
4. Update external analytics dashboard
Help me design a saga workflow that tracks compensation for each step.

Prompt 2: Implement an Eternal Monitor

text

I need a workflow that monitors my AI agent's health every 5 minutes forever.
Show me how to implement this using Dapr Workflows with continue_as_new.
Explain what happens to workflow history with vs without continue_as_new.

Prompt 3: Build an Approval Workflow

text

My task system needs manager approval for high-priority tasks. Design a
workflow that requests manager approval, waits up to 48 hours, and
transitions to a timeout state if no response is received.

Safety Note: Compensation logic is critical for data consistency. Test your compensations thoroughly for idempotency. Monitor patterns that run eternally can accumulate operational costs; ensure your health checks are appropriately tuned.