Your Task API publishes events to Kafka reliably---acks=all ensures durability, idempotent producers prevent duplicates on retry. But what happens when a single operation must update multiple topics atomically? When a task is created, you might need to write to task-events, audit-log, and notification-queue as a single unit of work. If the producer crashes after writing to task-events but before audit-log, you have an inconsistent state that no amount of retries can fix.
This is the exactly-once challenge for stream processing. Consider a payment processor that consumes from payment-requests, processes the payment, and produces to both payment-completed and ledger-updates. If it crashes after producing to payment-completed but before ledger-updates, restarting will re-process the same payment---potentially charging a customer twice or leaving the ledger inconsistent.
Kafka transactions solve this by making the read-process-write cycle atomic. Either all outputs are committed together, or none are. This chapter covers the transactional producer lifecycle, consumer isolation levels, and the zombie fencing mechanism that prevents duplicate processing from crashed producers.
Consider this stream processing pattern:

1. Poll a message from payment-requests.
2. Process the payment.
3. Produce the result to payment-completed.
4. Produce the corresponding entry to ledger-updates.
5. Commit the consumer offset for the input message.
The crash window exists between steps 3 and 5. If the processor crashes after step 3, steps 4 and 5 never run.
The result: payment-completed has the message, ledger-updates doesn't, and on restart the processor will re-read and re-process---creating duplicates in payment-completed.
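The failure mode can be sketched as a non-transactional loop. This is a sketch assuming a confluent-kafka-style client; the consumer, producer, and `process` callback are passed in and the topic names mirror the scenario above:

```python
def process_one(consumer, producer, process):
    """Non-transactional read-process-write: each step stands alone."""
    msg = consumer.poll(timeout=1.0)              # step 1: read payment request
    if msg is None:
        return
    result, ledger_entry = process(msg.value())   # step 2: process payment
    producer.produce("payment-completed", result)  # step 3: first output
    producer.flush()
    # --- a crash here means payment-completed has the message... ---
    producer.produce("ledger-updates", ledger_entry)  # step 4: second output
    producer.flush()
    consumer.commit()                             # step 5: mark input consumed
    # ...but ledger-updates and the offset commit never happened.
```

On restart the consumer re-reads from the last committed offset, so the same payment is processed again.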
Transactions wrap the entire read-process-write cycle:
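A sketch of the transactional version of the same cycle, assuming confluent-kafka's transactional Producer API (the producer must already have had `init_transactions()` called; the `process` callback is hypothetical):

```python
def process_one_transactionally(consumer, producer, process):
    """The same cycle, but the outputs and the input offset commit as one unit."""
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        return False
    producer.begin_transaction()
    try:
        result, ledger_entry = process(msg.value())
        producer.produce("payment-completed", result)
        producer.produce("ledger-updates", ledger_entry)
        # The input offset rides inside the same transaction, so a crash can
        # never separate "outputs written" from "input consumed".
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata())
        producer.commit_transaction()
        return True
    except Exception:
        producer.abort_transaction()  # everything above disappears atomically
        raise
```

Either all three writes become visible together at commit, or the abort discards them all and the message is redelivered.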
Every transactional producer needs a unique, stable identifier:
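A minimal configuration sketch in the confluent-kafka style. The ID scheme (service name plus instance number) and the `KAFKA_BROKERS` environment variable are illustrative choices, not requirements:

```python
import os

def build_producer_config(service: str, instance: str) -> dict:
    """Producer config for transactions (confluent-kafka-style keys)."""
    return {
        "bootstrap.servers": os.environ.get("KAFKA_BROKERS", "localhost:9092"),
        # Unique per producer instance, stable across restarts: the broker
        # uses this ID to fence a crashed instance's zombie requests.
        "transactional.id": f"{service}-{instance}",
        # Setting transactional.id implies idempotence.
        "enable.idempotence": True,
        "acks": "all",
    }

config = build_producer_config("payment-processor", "0")
```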
Critical: The transactional.id must be unique per producer instance and stable across restarts. Uniqueness prevents two live producers from fencing each other; stability is what lets a restarted instance reclaim its identity and fence its own zombie.
Before any transactional operations, call init_transactions() exactly once:
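A small wrapper makes the "exactly once per instance" rule explicit. The helper name and timeout value are illustrative; the producer is assumed to follow confluent-kafka's transactional API:

```python
def init_transactional_producer(producer, timeout: float = 30.0):
    """Call once per producer instance, before any begin_transaction().

    init_transactions() registers the transactional.id with the coordinator,
    bumps the epoch (fencing any older instance with the same ID), and aborts
    any transaction a previous incarnation left open.
    """
    producer.init_transactions(timeout)
    return producer
```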
What happens during init: the producer contacts its transaction coordinator and receives a producer ID, the coordinator bumps the epoch for its transactional.id (fencing any older instance), and any transaction a previous incarnation left open is aborted.
Each transaction follows a strict lifecycle:
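The lifecycle can be sketched as a single function: begin, produce, commit, with abort on any failure. The producer is assumed to be a confluent-kafka-style transactional producer that has already completed `init_transactions()`:

```python
def run_transaction(producer, records):
    """One transaction over an iterable of (topic, value) pairs."""
    producer.begin_transaction()
    try:
        for topic, value in records:
            producer.produce(topic, value)
        producer.commit_transaction()   # all writes become visible together
        return True
    except Exception:
        producer.abort_transaction()    # all writes are discarded together
        return False
```

Note that abort is not optional: leaving a transaction open blocks read_committed consumers until the transaction times out.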
What happens when a transactional producer crashes and restarts?
The transactional.id maps to a monotonically increasing epoch:
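The fencing rule can be modeled in plain Python: the coordinator keeps one epoch per transactional.id, bumps it on every `init_transactions()`, and rejects any request carrying an older epoch. This is a toy model of the broker-side check, not Kafka code:

```python
class FencingCoordinator:
    """Toy model of broker-side epoch fencing per transactional.id."""

    def __init__(self):
        self.epochs = {}  # transactional.id -> current epoch

    def init_transactions(self, txn_id: str) -> int:
        """Each (re)initialization bumps the epoch; returns the new epoch."""
        self.epochs[txn_id] = self.epochs.get(txn_id, -1) + 1
        return self.epochs[txn_id]

    def commit(self, txn_id: str, epoch: int) -> str:
        if epoch < self.epochs.get(txn_id, 0):
            return "PRODUCER_FENCED"  # request from a stale (zombie) instance
        return "OK"

coordinator = FencingCoordinator()
old_epoch = coordinator.init_transactions("payments-0")  # instance that hung
new_epoch = coordinator.init_transactions("payments-0")  # restarted instance
```

After the restart, `coordinator.commit("payments-0", new_epoch)` returns `"OK"` while the hung instance's `old_epoch` gets `"PRODUCER_FENCED"`.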
When the old producer tries to commit, the broker compares the request's epoch against the current one and rejects it as stale.
The zombie producer receives PRODUCER_FENCED and must shut down. Only the producer with the current epoch can commit transactions.
By default, consumers see messages as soon as they're written, even if the transaction hasn't committed yet.
This consumer will see messages from in-progress transactions that might be aborted. If it processes and acts on an aborted message, data consistency is broken.
To see only committed messages, set isolation.level:
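A consumer configuration sketch in the confluent-kafka style; the group ID is illustrative:

```python
consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "ledger-consumer",
    # Deliver only messages from committed transactions; messages from
    # aborted transactions are filtered out before the application sees them.
    "isolation.level": "read_committed",
    # Commit offsets explicitly (or via send_offsets_to_transaction upstream).
    "enable.auto.commit": False,
}
```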
The latency trade-off: With read_committed, consumers must wait for transactions to complete. A partition's last stable offset only advances as transactions commit or abort, so one long-running transaction delays every consumer reading that partition.
Here's a complete stream processor that consumes orders, processes them, and atomically writes to multiple output topics:
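A sketch of such a processor, assuming confluent-kafka-style transactional clients: the consumer is subscribed to the input topic, `init_transactions()` has run, and the output topic names, `transform` callback, and batch limits are illustrative:

```python
def process_batches(consumer, producer, transform, batches=1, max_poll=100):
    """Consume orders and atomically write each batch to two output topics."""
    for _ in range(batches):
        msgs = []
        for _ in range(max_poll):
            msg = consumer.poll(0.1)
            if msg is None:
                break
            msgs.append(msg)
        if not msgs:
            continue
        producer.begin_transaction()
        try:
            for msg in msgs:
                event, audit = transform(msg.value())
                producer.produce("order-events", event)
                producer.produce("order-audit", audit)
            # Offsets ride in the same transaction as the outputs.
            producer.send_offsets_to_transaction(
                consumer.position(consumer.assignment()),
                consumer.consumer_group_metadata())
            producer.commit_transaction()
        except Exception:
            producer.abort_transaction()
            raise  # let the group rebalance and redeliver the batch
```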
Transactions add latency at two points: when the producer first adds a partition to a transaction (a round trip to the transaction coordinator) and at commit, when the coordinator runs a two-phase commit and writes markers to every partition in the transaction.
For high-throughput systems, batch multiple operations within a single transaction:
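One commit per record pays the coordinator round trip every time; one commit per batch amortizes it. A sketch with an illustrative batch size, again assuming a confluent-kafka-style transactional producer that has been initialized:

```python
def _commit_batch(producer, batch):
    """Write one batch inside one transaction; returns 1 commit on success."""
    producer.begin_transaction()
    try:
        for topic, value in batch:
            producer.produce(topic, value)
        producer.commit_transaction()
        return 1
    except Exception:
        producer.abort_transaction()
        raise

def produce_in_batches(producer, records, batch_size=500):
    """Amortize transaction overhead: one commit per batch_size records."""
    batch, commits = [], 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            commits += _commit_batch(producer, batch)
            batch = []
    if batch:
        commits += _commit_batch(producer, batch)
    return commits
```

Larger batches mean fewer commits but a longer window during which read_committed consumers see nothing, so the batch size is a throughput-versus-latency dial.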
Transactions have a timeout (transaction.timeout.ms, default 60 seconds), and long-running transactions risk timeout failures.
If a transaction takes longer than transaction.timeout.ms, the coordinator aborts it automatically.
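The timeout is a producer-side config fragment (confluent-kafka-style key; the transactional.id shown is illustrative):

```python
producer_config = {
    "transactional.id": "order-processor-0",
    # The coordinator aborts any transaction still open after this long.
    # Must not exceed the broker's transaction.max.timeout.ms cap.
    "transaction.timeout.ms": 60_000,  # the 60-second default, made explicit
}
```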
You've implemented transactional producers, but choosing when to use them requires careful analysis.
Your scenario:
Your Task API creates tasks and must notify three downstream systems: task-events, audit-log, and analytics-events.
Evaluating the trade-offs:
Consider these questions before choosing transactions:
Must all writes succeed together?
What's your latency budget?
Are your consumers idempotent?
Questioning the approach:
For the Task API scenario, consider:
What emerged from this analysis:
You might use transactions for task-events + audit-log only, and publish to analytics-events separately with an idempotent producer. This shrinks the transaction scope while preserving consistency where it's critical.
The decision isn't "use transactions everywhere" but rather "use transactions where atomicity is required, and simpler patterns elsewhere."
You built a kafka-events skill in Chapter 1. Test and improve it based on what you learned.
Ask yourself:
If you found gaps:
Apply what you've learned by designing transactional systems for real scenarios.
Setup: Open Claude Code or your preferred AI assistant in your Kafka project directory.
Prompt 1: Design Transaction Boundaries
What you're learning: Transaction boundary design---grouping related writes while avoiding performance bottlenecks from overly broad transactions.
Prompt 2: Debug a Zombie Fencing Issue
What you're learning: transactional.id uniqueness requirements in distributed systems. Each instance needs a unique ID, typically combining the service name with its partition assignment.
Prompt 3: Implement Consume-Transform-Produce
What you're learning: The full exactly-once pattern including send_offsets_to_transaction() for atomic offset commits---the most complex but most reliable stream processing pattern.
Safety Note: Transactional producers require proper cleanup. Always call abort_transaction() on errors and ensure transactions complete within timeout limits. Orphaned transactions can block consumers with read_committed isolation.