USMAN’S INSIGHTS
AI ARCHITECT


© 2026 Muhammad Usman Akbar. All rights reserved.


Production Kafka with Strimzi

Your development cluster works perfectly on Docker Desktop. One broker, ephemeral storage, no authentication. But production is a different world.

In production, Kafka clusters must survive broker failures without data loss. They must encrypt traffic to prevent eavesdropping. They must authenticate clients to prevent unauthorized access. And they must have enough resources to handle peak load without throttling.

The gap between your development setup and production readiness is significant. Chapter 4 got Kafka running quickly. This chapter makes it production-grade.

Development vs Production: The Gap

Before diving into configuration, understand what changes between environments:

| Aspect | Development | Production | Why It Matters |
|---|---|---|---|
| Node Architecture | Single dual-role node | Separate controller (3) + broker (3+) pools | Controller failures don't affect message processing; scale independently |
| Storage | Ephemeral | Persistent (SSD-backed PVCs) | Data survives pod restarts and node failures |
| Replication | Factor 1 | Factor 3, min.insync.replicas 2 | Survive broker failures without data loss |
| Encryption | None (plain listener) | TLS everywhere | Prevent network-level eavesdropping |
| Authentication | None | SCRAM-SHA-512 or mTLS | Prevent unauthorized client access |
| Authorization | None | Topic-level ACLs | Least-privilege access control |
| Resources | Default (minimal) | Explicit CPU/memory limits | Predictable performance, prevent noisy neighbors |

Every difference addresses a specific production failure mode. Let's configure each one.

Separate Controller and Broker Node Pools

In Chapter 4, you deployed a single node running both controller and broker roles. This works for development but creates problems in production:

  1. Blast radius: A controller failure takes down message processing
  2. Scaling constraints: Controllers don't need to scale like brokers
  3. Resource contention: Controller metadata operations compete with broker I/O

Production clusters separate these roles into dedicated node pools.

Controller Node Pool

Controllers manage cluster metadata through Raft consensus. They don't handle client traffic.

```yaml
# kafka-controller-pool.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - controller
  storage:
    type: persistent-claim
    size: 10Gi
    class: standard  # Use your cluster's storage class
  resources:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  jvmOptions:
    -Xms: 512m
    -Xmx: 1g
```

Key production settings:

| Field | Value | Rationale |
|---|---|---|
| replicas: 3 | Odd number for Raft quorum | 3 controllers tolerate 1 failure; 5 tolerate 2 |
| roles: [controller] | Controller only | Dedicated to metadata, not message handling |
| storage.type: persistent-claim | Durable storage | Metadata survives pod restarts |
| storage.size: 10Gi | Modest size | Controllers store metadata, not messages |
| resources.limits | Explicit bounds | Prevent runaway memory; enable capacity planning |
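The quorum arithmetic behind `replicas: 3` is simple to check: a Raft quorum of n voters needs a strict majority to stay available, so it tolerates at most ⌊(n − 1)/2⌋ failures. A quick sketch (illustrative only, not Strimzi code):

```python
def tolerated_controller_failures(n: int) -> int:
    # Raft needs a strict majority of the n voters to elect a leader
    # and commit metadata, so up to (n - 1) // 2 controllers can fail.
    return (n - 1) // 2

print(tolerated_controller_failures(3))  # → 1
print(tolerated_controller_failures(5))  # → 2
```

This is also why even counts are wasteful: 4 controllers tolerate the same single failure as 3, while adding one more node that can break quorum.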

Broker Node Pool

Brokers handle producer/consumer traffic and store message data. They scale based on throughput requirements.

```yaml
# kafka-broker-pool.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
    class: standard
  resources:
    requests:
      memory: 4Gi
      cpu: 1000m
    limits:
      memory: 8Gi
      cpu: 2000m
  jvmOptions:
    -Xms: 2g
    -Xmx: 4g
```

Key production settings:

| Field | Value | Rationale |
|---|---|---|
| replicas: 3 | Minimum for RF=3 | Each partition has 3 copies across brokers |
| roles: [broker] | Broker only | Dedicated to message handling |
| storage.size: 100Gi | Production sizing | Sized for retention period and throughput |
| resources.limits.memory: 8Gi | Generous memory | JVM heap + page cache for read performance |
| jvmOptions.-Xmx: 4g | Half of limit | Leave memory for the OS page cache |

Why Separate Pools?

The separation creates independent failure domains:

```text
┌───────────────────────────────────────────────────────────┐
│ Controller Pool (Metadata)                                │
│   ┌───────────┐  ┌───────────┐  ┌───────────┐             │
│   │  ctrl-0   │  │  ctrl-1   │  │  ctrl-2   │             │
│   │ (leader)  │  │ (follower)│  │ (follower)│             │
│   └───────────┘  └───────────┘  └───────────┘             │
│   Raft consensus for cluster metadata                     │
├───────────────────────────────────────────────────────────┤
│ Broker Pool (Messages)                                    │
│   ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────┐   │
│   │ broker-0  │ │ broker-1  │ │ broker-2  │ │ broker-3│   │
│   │ 100Gi SSD │ │ 100Gi SSD │ │ 100Gi SSD │ │ 100Gi   │   │
│   └───────────┘ └───────────┘ └───────────┘ └─────────┘   │
│   Partition replicas distributed                          │
└───────────────────────────────────────────────────────────┘
```

Benefits:

  • Scale brokers independently (add broker-3, broker-4 for more throughput)
  • Controller failure doesn't stop message processing (existing brokers continue)
  • Different resource profiles (controllers need less memory, brokers need more)

TLS Encryption

In development, you used the plain listener on port 9092. Production traffic should be encrypted.

Update Kafka CRD for TLS

```yaml
# kafka-cluster-production.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: task-events
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 4.1.1
    metadataVersion: 4.1-IV0
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: scram-sha-512
      - name: external
        port: 9094
        type: nodeport
        tls: true
        authentication:
          type: scram-sha-512
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      auto.create.topics.enable: false
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

Key security settings:

| Field | Value | Purpose |
|---|---|---|
| listeners[].tls: true | Enabled | Encrypt traffic with TLS 1.2+ |
| listeners[].authentication.type | scram-sha-512 | Require username/password auth |
| min.insync.replicas: 2 | 2 of 3 | Require 2 acks for durability |
| auto.create.topics.enable: false | Disabled | Prevent accidental topic creation |

How Strimzi Handles Certificates

Strimzi automatically manages TLS certificates:

  1. Cluster CA: Signs broker certificates; clients verify server identity
  2. Clients CA: Signs client certificates (for mTLS); brokers verify client identity
  3. Auto-renewal: Strimzi rotates certificates before expiration

You don't need to create certificates manually. Strimzi's Cluster Operator handles the PKI lifecycle, generating and renewing both CAs.

To extract the CA certificate for clients:

```bash
# Get cluster CA certificate
kubectl get secret task-events-cluster-ca-cert -n kafka \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
```

Clients use this CA certificate to verify they're connecting to the real Kafka cluster, not an impersonator.

SCRAM-SHA-512 Authentication

TLS encrypts traffic but doesn't identify clients. Add authentication so only authorized clients can connect.

Create KafkaUser with ACLs

```yaml
# kafka-user-task-api.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: task-api
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Produce to task-* topics
      - resource:
          type: topic
          name: task-
          patternType: prefix
        operations:
          - Write
          - Describe
        host: "*"
      # Consume from task-* topics (for testing)
      - resource:
          type: topic
          name: task-
          patternType: prefix
        operations:
          - Read
          - Describe
        host: "*"
      # Consumer group for task-api
      - resource:
          type: group
          name: task-api-
          patternType: prefix
        operations:
          - Read
        host: "*"
```

ACL breakdown:

| Resource | Pattern | Operations | Purpose |
|---|---|---|---|
| topic: task- | prefix | Write, Describe | Produce to any topic starting with "task-" |
| topic: task- | prefix | Read, Describe | Consume from any topic starting with "task-" |
| group: task-api- | prefix | Read | Use consumer groups starting with "task-api-" |
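What `patternType: prefix` grants is just a starts-with match on the resource name. A sketch of that semantics (illustrative only, not Strimzi's authorizer code):

```python
def prefix_acl_matches(acl_name: str, resource_name: str) -> bool:
    # patternType: prefix authorizes any resource whose name
    # begins with the ACL's name field.
    return resource_name.startswith(acl_name)

assert prefix_acl_matches("task-", "task-events")         # allowed
assert not prefix_acl_matches("task-", "billing-events")  # denied
assert prefix_acl_matches("task-api-", "task-api-workers")
```

This is why prefix ACLs are convenient but broad: the task-api user can write to any future topic that happens to start with "task-", so choose prefixes deliberately.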

Apply the user:

```bash
kubectl apply -f kafka-user-task-api.yaml
```

Output:

```text
kafkauser.kafka.strimzi.io/task-api created
```

Retrieve Credentials

Strimzi stores the generated password in a Kubernetes Secret:

```bash
# Get the password
kubectl get secret task-api -n kafka \
  -o jsonpath='{.data.password}' | base64 -d
```

For a Python producer, you'd configure authentication like this:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'task-events-kafka-bootstrap:9093',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'SCRAM-SHA-512',
    'sasl.username': 'task-api',
    'sasl.password': '<password-from-secret>',
    'ssl.ca.location': '/path/to/ca.crt'
})
```

Critical settings for authenticated connections:

| Config | Value | Purpose |
|---|---|---|
| security.protocol | SASL_SSL | TLS encryption + SASL authentication |
| sasl.mechanism | SCRAM-SHA-512 | Password-based auth (not plaintext) |
| ssl.ca.location | Path to ca.crt | Verify the server certificate |
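A consumer uses the same security settings, plus a group id that satisfies the group ACL created earlier. A sketch as a plain dict you would pass to `confluent_kafka.Consumer(...)`; the group id and offset policy here are illustrative assumptions, not values from the chapter:

```python
# Consumer settings mirroring the producer configuration above.
consumer_config = {
    'bootstrap.servers': 'task-events-kafka-bootstrap:9093',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'SCRAM-SHA-512',
    'sasl.username': 'task-api',
    'sasl.password': '<password-from-secret>',
    'ssl.ca.location': '/path/to/ca.crt',
    # Must start with "task-api-" to match the group ACL prefix.
    'group.id': 'task-api-workers',
    'auto.offset.reset': 'earliest',  # illustrative choice
}
```

If the group id does not match the ACL prefix, the broker rejects the join with an authorization error even though the SCRAM credentials are valid.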

Resource Limits and Requests

Production clusters need explicit resource boundaries. Without them:

  • Pods get OOMKilled during traffic spikes
  • Other workloads starve when Kafka uses all available resources
  • Capacity planning becomes guesswork

Sizing Guidelines

| Component | Memory Request | Memory Limit | CPU Request | CPU Limit |
|---|---|---|---|---|
| Controller | 1Gi | 2Gi | 500m | 1000m |
| Broker (small) | 4Gi | 8Gi | 1000m | 2000m |
| Broker (medium) | 8Gi | 16Gi | 2000m | 4000m |
| Broker (large) | 16Gi | 32Gi | 4000m | 8000m |

JVM heap sizing rules:

  • Set -Xmx to half the memory limit
  • Leave the other half for OS page cache (critical for read performance)
  • Set -Xms equal to -Xmx for predictable performance

Example for a broker with 8Gi memory limit:

```yaml
jvmOptions:
  -Xms: 4g
  -Xmx: 4g
  gcLoggingEnabled: true
```
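The half-of-limit rule is easy to encode as a helper when templating configs (an illustrative sketch, not a Strimzi utility):

```python
def jvm_heap_gb(memory_limit_gb: float) -> float:
    """Heap sizing rule from the text: -Xmx = half the container limit."""
    # The other half is deliberately left to the OS page cache, which
    # Kafka depends on for fast sequential reads.
    return memory_limit_gb / 2

# -Xms is set equal to -Xmx for predictable performance.
print(f"-Xmx: {jvm_heap_gb(8):g}g")  # → -Xmx: 4g
```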

Persistent Storage Configuration

Production data must survive pod restarts. Configure storage classes that match your cloud provider:

```yaml
# AWS EKS
storage:
  type: persistent-claim
  size: 500Gi
  class: gp3
  deleteClaim: false

# GKE
storage:
  type: persistent-claim
  size: 500Gi
  class: premium-rwo
  deleteClaim: false

# Azure AKS
storage:
  type: persistent-claim
  size: 500Gi
  class: managed-premium
  deleteClaim: false
```

Key settings:

| Field | Value | Purpose |
|---|---|---|
| deleteClaim: false | Preserve PVCs | Data survives accidental Kafka CRD deletion |
| size | Based on retention | Calculate: throughput × retention period |

Storage Sizing Formula

```text
Required Storage per Broker =
  (Messages/sec × Avg Message Size × Retention Seconds × Replication Factor) / Broker Count

Example:
- 10,000 messages/second
- 1 KB average message size
- 7 days (604,800 seconds) retention
- Replication factor 3
- 3 brokers

Storage = (10,000 × 1 KB × 604,800 × 3) / 3
        = 18,144,000,000 KB / 3
        = 6,048,000,000 KB ≈ 6,048 GB (about 6 TB) per broker
```

Round up and add 20% headroom for operational flexibility.
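The formula translates directly into a small calculator (an illustrative helper; it uses decimal units, 1 GB = 10^6 KB, and bakes in the 20% headroom as a default):

```python
def storage_per_broker_gb(msgs_per_sec: int, avg_msg_kb: float,
                          retention_s: int, replication: int,
                          brokers: int, headroom: float = 0.2) -> float:
    # Total replicated data over the retention window,
    # spread evenly across the brokers.
    total_kb = msgs_per_sec * avg_msg_kb * retention_s * replication
    per_broker_gb = total_kb / brokers / 1_000_000
    return per_broker_gb * (1 + headroom)

# 10k msg/s, 1 KB messages, 7 days retention, RF=3, 3 brokers:
print(storage_per_broker_gb(10_000, 1, 604_800, 3, 3, headroom=0))  # → 6048.0
```

Swapping in your own throughput and retention numbers makes it easy to re-run the sizing whenever requirements change.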

Complete Production Configuration

Here's the full production deployment combining all security and reliability settings:

```yaml
# production-kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - controller
  storage:
    type: persistent-claim
    size: 20Gi
    class: standard
    deleteClaim: false
  resources:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  jvmOptions:
    -Xms: 512m
    -Xmx: 1g
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers
  namespace: kafka
  labels:
    strimzi.io/cluster: task-events
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
    class: standard
    deleteClaim: false
  resources:
    requests:
      memory: 4Gi
      cpu: 1000m
    limits:
      memory: 8Gi
      cpu: 2000m
  jvmOptions:
    -Xms: 2g
    -Xmx: 4g
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: task-events
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 4.1.1
    metadataVersion: 4.1-IV0
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: scram-sha-512
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      auto.create.topics.enable: false
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      num.partitions: 6
  entityOperator:
    topicOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
    userOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
```

Verifying Production Readiness

After applying the production configuration:

```bash
# Check all pods are running
kubectl get pods -n kafka

# Verify node pools
kubectl get kafkanodepools -n kafka

# Check Kafka status
kubectl get kafka task-events -n kafka -o yaml | grep -A 20 status:
```

Expected output:

```text
NAME                              READY   STATUS    RESTARTS   AGE
task-events-controllers-0         1/1     Running   0          5m
task-events-controllers-1         1/1     Running   0          5m
task-events-controllers-2         1/1     Running   0          5m
task-events-brokers-0             1/1     Running   0          4m
task-events-brokers-1             1/1     Running   0          4m
task-events-brokers-2             1/1     Running   0          4m
task-events-entity-operator-...   2/2     Running   0          3m
```

Migration Path: Development to Production

If you have an existing development cluster, here's the migration approach:

  1. Create new production node pools alongside existing dual-role pool
  2. Apply updated Kafka CRD with new listeners and replication settings
  3. Rebalance partitions onto the new brokers (via Strimzi's Cruise Control integration and a KafkaRebalance resource)
  4. Create KafkaUser resources for all clients
  5. Update client configurations to use TLS + authentication
  6. Remove development node pool once all traffic migrated

New brokers start empty: Kafka does not move existing partitions onto them automatically. Strimzi's Cruise Control integration handles the redistribution when you apply a KafkaRebalance resource. Done carefully, the migration requires zero downtime.


Reflect on Your Skill

You built a kafka-events skill in Chapter 1. Test and improve it based on what you learned.

Test Your Skill

```text
Using my kafka-events skill, configure a production Kafka cluster with
TLS encryption, SCRAM authentication, and node pools.

Does my skill generate Strimzi CRDs with proper security and resource
allocation?
```

Identify Gaps

Ask yourself:

  • Did my skill include TLS listener configuration and certificate management?
  • Did it cover SCRAM user authentication and ACL setup?

Improve Your Skill

If you found gaps:

```text
My kafka-events skill is missing production Kafka configuration
(TLS, SCRAM, node pools, resource quotas).

Update it to include how to secure and scale Kafka clusters in production.
```

Try With AI

You've configured production Kafka with security and reliability features. Now explore how to validate and optimize your configuration.

Prompt 1: Security Audit

```text
I've configured production Kafka with:
- Separate controller (3 nodes) and broker (3 nodes) pools
- TLS on port 9093
- SCRAM-SHA-512 authentication
- KafkaUser with topic-prefix ACLs

Review my security configuration:
1. What attack vectors am I still exposed to?
2. How would you improve the ACL configuration for least privilege?
3. What monitoring should I add to detect unauthorized access attempts?

Start by asking about my specific security requirements (compliance,
multi-tenancy, external access) so you can give targeted recommendations.
```

What you're learning: Security configuration involves tradeoffs between usability and protection. AI can help you understand your threat model and prioritize hardening efforts.

Prompt 2: Capacity Planning

```text
I'm planning a production Kafka cluster for an event-driven Task API with:
- Expected load: 5,000 events/second at peak
- Average event size: 2KB
- Retention: 30 days
- Must survive 1 broker failure without data loss

Help me size the cluster:
1. How many brokers do I need?
2. What storage per broker?
3. What memory and CPU?

Walk me through your calculations so I can adjust them as our
requirements change.
```

What you're learning: Capacity planning requires understanding the relationship between throughput, retention, replication, and resources. AI can teach you the formulas while applying them to your specific scenario.

Prompt 3: Troubleshooting Production Issues

```text
My production Kafka cluster is experiencing issues:
- Producers getting NotEnoughReplicasException intermittently
- Consumer lag increasing on some partitions
- One broker showing higher disk usage than others

Here's my configuration:
- 3 brokers, replication factor 3, min.insync.replicas 2
- Storage: 100Gi per broker, 60% used on broker-2

Help me diagnose:
1. What's likely causing each symptom?
2. What commands should I run to investigate?
3. What's the priority order for fixing these issues?
```

What you're learning: Production debugging requires correlating symptoms with root causes. AI can help you develop a systematic troubleshooting methodology for distributed systems.

Safety note: Always test configuration changes in a staging environment before production. Incorrect authentication or replication settings can cause client failures or data loss. Keep your development cluster configuration separate so you can iterate quickly without risking production.