USMAN’S INSIGHTS
AI ARCHITECT
© 2026 Muhammad Usman Akbar. All rights reserved.


Workflow Patterns: Saga & Monitor

Your task processing system handles thousands of operations daily. Most complete successfully. But what happens when step 3 of a 5-step workflow fails? Do you leave steps 1 and 2 applied, with the system in an inconsistent state? What about a health monitoring job that needs to run forever, checking service status every 5 minutes? Can a workflow really run for months without running out of memory?

These are the problems that saga and monitor patterns solve. The saga pattern ensures transactional consistency across distributed operations without traditional database transactions. The monitor pattern creates eternal workflows that can run indefinitely without accumulating unbounded history. Together with human interaction patterns, they handle the complex, long-running scenarios that real agent systems encounter.


The Saga Pattern: Compensation for Consistency

Traditional database transactions follow ACID properties: if any step fails, everything rolls back automatically. But in distributed systems, each step might touch a different service, each with its own database. There's no global transaction coordinator.

The saga pattern solves this by recording compensating actions as you go. For each step that succeeds, you remember how to undo it. If a later step fails, you execute those compensations in reverse order.

Why Reverse Order Matters

Consider an order processing workflow:

  1. Reserve inventory (compensation: release inventory)
  2. Process payment (compensation: refund payment)
  3. Ship order (compensation: cancel shipment)

If shipping fails, you undo in reverse. The shipment step itself failed, so there is nothing to cancel; you refund the payment first, then release the inventory. If you compensated in forward order, you would release inventory while the customer's payment was still captured, so someone else could buy that inventory before the refund completes.
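To make the ordering concrete, here is a minimal in-process sketch in plain Python. It involves no Dapr and no durability. The step functions and the run_saga helper are hypothetical names used only for illustration. Each (action, compensation) pair registers its undo only after the action succeeds, so the failed shipping step contributes nothing to undo.

```python
from typing import Callable, List, Tuple

# Hypothetical step and compensation functions for the order example.
# Each records what it did so the execution order is visible.
log: List[str] = []

def reserve_inventory() -> None:
    log.append("reserve_inventory")

def process_payment() -> None:
    log.append("process_payment")

def ship_order() -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure

def release_inventory() -> None:
    log.append("release_inventory")

def refund_payment() -> None:
    log.append("refund_payment")

def cancel_shipment() -> None:
    log.append("cancel_shipment")

def run_saga(steps: List[Tuple[Callable[[], None], Callable[[], None]]]) -> bool:
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # The failed step never completed, so its compensation is not run.
            for undo in reversed(completed):
                undo()
            return False
        # Only record the undo once the action has actually succeeded.
        completed.append(compensation)
    return True

ok = run_saga([
    (reserve_inventory, release_inventory),
    (process_payment, refund_payment),
    (ship_order, cancel_shipment),
])
print(ok, log)
```

The log ends with the refund before the inventory release, and cancel_shipment never runs because its step never succeeded.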

Saga Implementation

Here's a task processing saga that handles failures gracefully:

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Tuple

import dapr.ext.workflow as wf

logger = logging.getLogger(__name__)


@dataclass
class TaskOrder:
    task_id: str
    title: str
    assignee: str
    priority: str


@dataclass
class SagaResult:
    status: str
    task_id: str
    error: str | None = None


def task_processing_saga(ctx: wf.DaprWorkflowContext, order: TaskOrder):
    """Saga workflow with compensation on failure.

    The activities (create_task_record, delete_task_record, ...) are assumed
    to be defined and registered with the workflow runtime elsewhere.
    """
    # Each entry: (compensating activity function, its input).
    # Store the function itself, since call_activity is given activity functions.
    compensations: List[Tuple[Callable, dict]] = []
    try:
        # Step 1: Create task record
        yield ctx.call_activity(create_task_record, input=order)
        compensations.append((delete_task_record, {"task_id": order.task_id}))

        # Step 2: Reserve assignee capacity
        yield ctx.call_activity(reserve_assignee_capacity, input=order)
        compensations.append((release_assignee_capacity,
                              {"task_id": order.task_id, "assignee": order.assignee}))

        # Step 3: Send notification to assignee
        yield ctx.call_activity(notify_assignee, input=order)
        compensations.append((send_cancellation_notice,
                              {"task_id": order.task_id, "assignee": order.assignee}))

        # Step 4: Update dashboard (might fail due to an external service)
        yield ctx.call_activity(update_dashboard, input=order)
        return SagaResult(status="success", task_id=order.task_id)

    except Exception as e:
        # Compensate in reverse order: undo the most recent successful step first
        for comp_activity, comp_data in reversed(compensations):
            try:
                yield ctx.call_activity(comp_activity, input=comp_data)
            except Exception as comp_error:
                # Log but continue compensating; skip duplicate logs during replay
                if not ctx.is_replaying:
                    logger.error("Compensation %s failed: %s",
                                 comp_activity.__name__, comp_error)
        return SagaResult(status="failed", task_id=order.task_id, error=str(e))
```

Key Saga Principles

| Principle | Description |
|---|---|
| Track compensations | Don't wait until failure to figure out rollback |
| Compensate in reverse | Undo most recent operations first |
| Compensations are idempotent | Running compensation twice should be safe |
| Handle compensation failures | Log and continue; don't stop mid-compensation |
| Keep it simple | Complex compensation is a design smell |
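Idempotency matters because a compensation can run more than once, for example after a retry or a crash partway through the call. The sketch below is a minimal illustration that keys the refund on a payment ID so a repeated call becomes a no-op. PaymentLedger is a hypothetical in-memory stand-in for a real payment service. A production version would rely on the service's own idempotency guarantees rather than process memory.

```python
from typing import Dict, Set

class PaymentLedger:
    """Hypothetical in-memory stand-in for a payment service."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}  # payment_id -> amount in cents
        self.refunded: Set[str] = set()    # payment_ids already refunded
        self.refund_calls = 0              # refunds actually issued

    def charge(self, payment_id: str, amount_cents: int) -> None:
        self.charges[payment_id] = amount_cents

    def refund(self, payment_id: str) -> bool:
        """Idempotent compensation: refunding the same payment twice is a no-op.

        Returns True if a refund was issued, False if there was nothing to do
        (unknown payment, or already refunded).
        """
        if payment_id not in self.charges or payment_id in self.refunded:
            return False
        self.refunded.add(payment_id)
        self.refund_calls += 1
        return True

ledger = PaymentLedger()
ledger.charge("pay-42", 1999)
first = ledger.refund("pay-42")   # issues the refund
second = ledger.refund("pay-42")  # repeated compensation: safe no-op
print(first, second, ledger.refund_calls)
```

Returning False for an unknown payment also covers the case where the compensation runs for a step whose forward action never took effect.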

The Monitor Pattern: Eternal Workflows

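Some workflows need to run forever: health monitors, SLA checkers, quota enforcers. Wrapping the body in a while True: loop is an anti-pattern. Every activity call and timer in each iteration appends events to the workflow history, so the history grows without bound. Each time the workflow wakes up, the engine must also replay that ever-longer history.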

The continue_as_new method solves this. It restarts the workflow from the beginning with new state, discarding the accumulated history.
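The difference can be illustrated with a toy model that simply counts history events. It does not use Dapr's real history format. It assumes four events per check (activity scheduled and completed, timer created and fired) and that every wake-up replays the whole current history. Both are simplifying assumptions chosen for illustration.

```python
from typing import List

EVENTS_PER_CHECK = 4  # assumption: schedule/complete activity + create/fire timer

def history_with_loop(checks: int) -> List[int]:
    """One long-lived run: every check appends to the same history."""
    history_len = 0
    sizes = []
    for _ in range(checks):
        history_len += EVENTS_PER_CHECK
        sizes.append(history_len)
    return sizes

def history_with_continue_as_new(checks: int) -> List[int]:
    """Each check ends with continue_as_new, so the next run starts empty."""
    sizes = []
    for _ in range(checks):
        history_len = 0  # fresh history for this run
        history_len += EVENTS_PER_CHECK
        sizes.append(history_len)
    return sizes

loop_sizes = history_with_loop(1000)
can_sizes = history_with_continue_as_new(1000)

# Final history size, and total events replayed across all wake-ups
print(loop_sizes[-1], max(can_sizes))
print(sum(loop_sizes), sum(can_sizes))
```

In this model the loop ends with 4,000 events and replays about two million events in total. With continue_as_new, history never exceeds 4 events, and total replay work grows only linearly.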

Monitor Implementation

```python
from dataclasses import dataclass
from datetime import timedelta

import dapr.ext.workflow as wf


@dataclass
class MonitorState:
    job_id: str
    is_healthy: bool = True
    check_count: int = 0
    consecutive_failures: int = 0


def health_monitor_workflow(ctx: wf.DaprWorkflowContext, state: MonitorState):
    """Eternal monitoring workflow with continue_as_new.

    check_service_status and send_alert are activities assumed to be defined
    and registered elsewhere.
    """
    # Check current status
    status = yield ctx.call_activity(check_service_status, input=state.job_id)
    state.check_count += 1

    # Determine sleep interval based on status
    if status == "healthy":
        state.is_healthy = True
        state.consecutive_failures = 0
        sleep_interval = timedelta(minutes=60)
    else:
        if state.is_healthy:
            # Alert only on the healthy -> unhealthy transition
            state.is_healthy = False
            yield ctx.call_activity(send_alert,
                                    input={"job_id": state.job_id, "severity": "warning"})
        state.consecutive_failures += 1
        sleep_interval = timedelta(minutes=5)  # check more often while unhealthy

    # Sleep until next check (a durable timer, not time.sleep)
    yield ctx.create_timer(sleep_interval)

    # Restart workflow with new state (keeps history bounded)
    ctx.continue_as_new(state)
```
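Workflow code is replayed and normally needs a running Dapr sidecar, so the branching in the monitor is awkward to unit-test directly. One option, sketched here as a hypothetical refactor, is to move the decision logic into a pure function that the workflow could call after check_service_status returns. The name next_check is illustrative and not part of Dapr. This sketch also increments check_count on every call and returns an updated copy instead of mutating its input.

```python
from dataclasses import dataclass, replace
from datetime import timedelta
from typing import Tuple

@dataclass
class MonitorState:
    job_id: str
    is_healthy: bool = True
    check_count: int = 0
    consecutive_failures: int = 0

def next_check(state: MonitorState, status: str) -> Tuple[MonitorState, timedelta, bool]:
    """Pure version of the monitor's decision logic (hypothetical helper).

    Returns (new_state, sleep_interval, should_alert). An alert fires only on
    the healthy -> unhealthy transition, mirroring the workflow's branching.
    """
    new = replace(state, check_count=state.check_count + 1)
    if status == "healthy":
        new.is_healthy = True
        new.consecutive_failures = 0
        return new, timedelta(minutes=60), False
    should_alert = state.is_healthy  # only alert when we were healthy before
    new.is_healthy = False
    new.consecutive_failures += 1
    return new, timedelta(minutes=5), should_alert

# Walk through: two failed checks, then recovery
s0 = MonitorState(job_id="svc-a")
s1, wait1, alert1 = next_check(s0, "degraded")  # first failure: alert
s2, wait2, alert2 = next_check(s1, "degraded")  # still failing: no duplicate alert
s3, wait3, alert3 = next_check(s2, "healthy")   # recovered: back to hourly checks
print(alert1, alert2, alert3, s2.consecutive_failures, wait3)
```

Because next_check is deterministic and has no side effects, calling it from the workflow would not affect replay.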

Human Interaction: Waiting for Approval

Real workflows often need human input: approvals, reviews, decisions. Your workflow pauses, waiting for an external event, ideally with a timeout.

Approval Workflow Implementation

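```python
from dataclasses import dataclass
from datetime import timedelta

import dapr.ext.workflow as wf


@dataclass
class ApprovalRequest:
    # Hypothetical request shape for illustration
    request_id: str
    requester: str
    description: str


def approval_workflow(ctx: wf.DaprWorkflowContext, request: ApprovalRequest):
    """Workflow that waits for human approval with timeout.

    send_approval_request and execute_approved_action are activities assumed
    to be defined and registered elsewhere.
    """
    # Request approval from manager
    yield ctx.call_activity(send_approval_request, input=request)

    # Wait for approval or timeout: race the external event against a durable timer
    approval_event = ctx.wait_for_external_event("approval_received")
    timeout = ctx.create_timer(timedelta(days=3))
    winner = yield wf.when_any([approval_event, timeout])

    if winner == timeout:
        return {"status": "timeout", "reason": "No approval received within 3 days"}

    # Assumption: the event is raised with a JSON object such as
    # {"approved": true, "approver": "alice"}, which arrives here as a dict.
    decision = approval_event.get_result()
    if not decision.get("approved"):
        return {"status": "rejected", "approver": decision.get("approver")}

    # Approved: proceed with action
    yield ctx.call_activity(execute_approved_action, input=request)
    return {"status": "approved", "approver": decision.get("approver")}
```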
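The "first of event or timer wins" race can also be illustrated outside Dapr with plain asyncio. This is an analogue, not Dapr code. asyncio.wait with FIRST_COMPLETED plays the role of wf.when_any, an asyncio.sleep task stands in for ctx.create_timer, and a Future resolved by call_later stands in for a manager raising the external event. Unlike a Dapr workflow, this state lives only in process memory and would not survive a restart. The function names are hypothetical.

```python
import asyncio

async def wait_for_approval(approval: "asyncio.Future[dict]", timeout_s: float) -> dict:
    """Race an approval event against a timer; whichever finishes first wins."""
    timer = asyncio.ensure_future(asyncio.sleep(timeout_s))
    done, _ = await asyncio.wait({approval, timer},
                                 return_when=asyncio.FIRST_COMPLETED)
    if approval not in done:
        return {"status": "timeout"}
    timer.cancel()  # the human answered first; stop the timer
    decision = approval.result()
    if not decision.get("approved"):
        return {"status": "rejected", "approver": decision.get("approver")}
    return {"status": "approved", "approver": decision.get("approver")}

async def demo() -> tuple:
    loop = asyncio.get_running_loop()

    # Case 1: a manager approves after 10 ms, well before the 1 s timeout.
    fut = loop.create_future()
    loop.call_later(0.01, fut.set_result, {"approved": True, "approver": "alice"})
    approved = await wait_for_approval(fut, timeout_s=1.0)

    # Case 2: nobody answers within 50 ms.
    silent = loop.create_future()
    timed_out = await wait_for_approval(silent, timeout_s=0.05)
    return approved, timed_out

approved, timed_out = asyncio.run(demo())
print(approved, timed_out)
```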

Pattern Comparison

| Pattern | Use Case | Key Mechanism |
|---|---|---|
| Saga | Multi-step transactions needing rollback | Compensation list, reverse execution |
| Monitor | Eternal polling/checking | continue_as_new, bounded history |
| Human Interaction | Approval workflows, reviews | wait_for_external_event, timeout |

Reflect on Your Skill

Does your dapr-deployment skill understand saga and monitor patterns?

Test Your Skill

```text
Using my dapr-deployment skill, explain when I should use the saga pattern vs just retrying failed operations. My task processing has 4 steps, and step 3 sometimes fails due to external API timeouts.
```

If your skill's answer covers reverse-order compensation and using continue_as_new to keep workflow history bounded, it's working correctly.


Try With AI

Prompt 1: Design a Saga for Your Domain

```text
I'm building a task management system with these steps:
1. Create task record in database
2. Reserve capacity from assignee's workload
3. Send notification to assignee
4. Update external analytics dashboard

Help me design a saga workflow that tracks compensation for each step.
```

Prompt 2: Implement an Eternal Monitor

```text
I need a workflow that monitors my AI agent's health every 5 minutes forever. Show me how to implement this using Dapr Workflows with continue_as_new. Explain what happens to workflow history with vs without continue_as_new.
```

Prompt 3: Build an Approval Workflow

```text
My task system needs manager approval for high-priority tasks. Design a workflow that requests manager approval, waits up to 48 hours, and moves to a timeout state if no response is received.
```

Safety Note: Compensation logic is critical for data consistency. Test your compensations thoroughly for idempotency. Monitor patterns that run eternally can accumulate operational costs; ensure your health checks are appropriately tuned.