You've built producers and consumers in Python. They work, but now you're asked to integrate Kafka with a PostgreSQL database, an Elasticsearch cluster, and an S3 bucket. Writing custom producers and consumers for each integration means maintaining three separate codebases, each with its own retry handling, offset management, and fault tolerance.
This is exactly what Kafka Connect solves. It's a framework for building standardized, scalable data pipelines without writing custom code. Over 200 pre-built connectors exist for databases, message queues, cloud storage, and search engines. Instead of implementing integration logic, you deploy a connector with configuration.
In this chapter, you'll deploy Kafka Connect on Kubernetes using Strimzi and configure connectors using declarative YAML—the same GitOps pattern you've used throughout this book. The pattern you learn here lets you deploy in minutes data pipelines that would take weeks to code by hand.
Kafka Connect consists of workers that execute connectors and tasks. This separation lets Connect distribute work across multiple nodes for scalability and fault tolerance.
Key components:
- Workers: JVM processes that host the Connect runtime; on Kubernetes, Strimzi runs them as pods.
- Connectors: reusable integration logic that defines what data to copy and where.
- Tasks: the units of parallel work a connector splits into; each task copies a slice of the data.
Connectors come in two types based on data flow direction:
Source connectors pull data from external systems and produce it into Kafka topics: a JDBC source streams database rows, and Debezium captures change events from transaction logs.
Sink connectors consume from Kafka topics and write to external systems: an Elasticsearch sink indexes records for search, and an S3 sink archives them to object storage.
Strimzi provides two CRDs for Kafka Connect: KafkaConnect, which defines the Connect cluster itself, and KafkaConnector, which defines individual connector instances declaratively.
The KafkaConnect CRD deploys Connect workers and optionally builds a custom image with connector plugins:
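A minimal sketch of such a manifest is shown below; the cluster name `my-connect`, the bootstrap address, the registry, and the plugin URL are placeholder assumptions to adapt for your environment:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect                   # placeholder cluster name
  annotations:
    # Let the operator manage connectors via KafkaConnector resources
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1
  bootstrapServers: my-cluster-kafka-bootstrap:9092   # your Kafka bootstrap service
  config:
    group.id: my-connect-cluster
    offset.storage.topic: connect-offsets
    config.storage.topic: connect-configs
    status.storage.topic: connect-status
  build:
    output:
      type: docker
      image: registry.example.com/my-connect:latest   # a registry you can push to
    plugins:
      - name: my-connector-plugin
        artifacts:
          - type: tgz
            url: https://example.com/connector-plugin.tgz   # plugin archive URL
```

The `strimzi.io/use-connector-resources` annotation is what enables the declarative KafkaConnector workflow; without it, connectors must be created through the Connect REST API.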
Key configuration explained:
Output:
The build pod builds the connector image and pushes it to the configured registry, then the Connect workers start.
With Kafka Connect running, deploy a connector using KafkaConnector CRD. This example uses the File Source connector (useful for testing):
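A minimal sketch of the manifest, assuming a Connect cluster named `my-connect` (the file path is whatever exists inside the Connect pod):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: file-source
  labels:
    strimzi.io/cluster: my-connect   # must match the KafkaConnect name
spec:
  class: org.apache.kafka.connect.file.FileStreamSourceConnector
  tasksMax: 1
  config:
    file: /tmp/input.txt             # file read inside the Connect pod
    topic: file-events               # destination topic
```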
Output:
Sink connectors write Kafka data to external systems. This example writes to a file (for demonstration):
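A matching sketch for the file sink, again assuming a Connect cluster named `my-connect`:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: file-sink
  labels:
    strimzi.io/cluster: my-connect   # must match the KafkaConnect name
spec:
  class: org.apache.kafka.connect.file.FileStreamSinkConnector
  tasksMax: 1
  config:
    topics: file-events              # source topic(s), comma-separated
    file: /tmp/output.txt            # file written inside the Connect pod
```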
Now data flows: input.txt → file-events topic → output.txt
A realistic pipeline might sync database changes to Elasticsearch for search. Here's how you'd configure it:
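One way to sketch this is a Debezium PostgreSQL source paired with the Confluent Elasticsearch sink. The connector classes are real, but every hostname, credential, topic name, and config key below is illustrative and varies by connector version; check each connector's documentation before deploying:

```yaml
# Source: capture PostgreSQL changes (illustrative values throughout)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: postgres-source
  labels:
    strimzi.io/cluster: my-connect
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    database.hostname: postgres.default.svc   # your database service
    database.port: "5432"
    database.user: replicator                 # use read-only/replication creds
    database.password: changeme               # use a Secret provider in production
    database.dbname: inventory
    topic.prefix: inventory                   # topics become <prefix>.<schema>.<table>
---
# Sink: index the change events into Elasticsearch
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: elasticsearch-sink
  labels:
    strimzi.io/cluster: my-connect
spec:
  class: io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
  tasksMax: 1
  config:
    connection.url: http://elasticsearch:9200
    topics: inventory.public.products         # one of the Debezium-created topics
    key.ignore: "true"
```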
Data flow: source database → change-capture source connector → Kafka topic → Elasticsearch sink connector → search index.
Not every integration needs Kafka Connect. Here's a decision framework:
Choose Kafka Connect when:
- A maintained connector already exists for the system you're integrating.
- The integration moves data with little or no transformation.
- You want retries, offset management, and fault tolerance handled by the framework.
- You manage infrastructure declaratively—the KafkaConnector CRD fits GitOps workflows.
Choose custom code when:
- No connector exists for your system, such as a proprietary REST API.
- The integration requires complex business logic beyond simple transforms.
- The integration is small enough that running a Connect cluster costs more than it saves.
Check connector and task status:
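The Strimzi operator surfaces connector and task state in the KafkaConnector resource's status block. Assuming a connector named `file-source`:

```shell
# Full resource, including the status block the operator maintains
kubectl get kafkaconnector file-source -o yaml

# Just the status section
kubectl get kafkaconnector file-source -o jsonpath='{.status}'
```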
Output (status section):
Common connector states:
- RUNNING: the connector and its tasks are healthy.
- PAUSED: the connector is deployed but not processing.
- FAILED: the connector or a task hit an unrecoverable error; check the status trace and the worker logs.
- UNASSIGNED: the connector or task has not yet been scheduled to a worker.
View connector logs:
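Strimzi labels the Connect worker pods with the cluster name, so a label selector works regardless of how the pods are named. Assuming a cluster named `my-connect` and a connector named `file-source`:

```shell
# Logs from all Connect worker pods
kubectl logs -l strimzi.io/cluster=my-connect --tail=100

# Filter for messages about a specific connector
kubectl logs -l strimzi.io/cluster=my-connect --tail=1000 | grep file-source
```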
You built a kafka-events skill in Chapter 1. Test and improve it based on what you learned.
Ask yourself:
If you found gaps:
Setup: You need to build a data pipeline integrating multiple systems with Kafka.
Prompt 1: Design a pipeline architecture
What you're learning: AI helps you think through pipeline topology. It will likely point out that enrichment (joining streams) happens in Kafka Streams or ksqlDB, not in Connect itself—Connect just moves data; transformation is a separate concern.
Prompt 2: Troubleshoot a connector failure
What you're learning: AI walks through systematic debugging—network connectivity, credentials, service discovery in Kubernetes, and how to test database access from within the Connect pod.
Prompt 3: Evaluate for your use case
What you're learning: AI helps you apply the Connect vs custom code decision framework to your specific constraints, likely recommending custom code for the REST API integration while suggesting Connect patterns you might adopt in your custom implementation.
Safety note: When testing connectors, especially source connectors reading from databases, use read-only credentials or a replica database. A misconfigured connector can create load on production databases.