Your company just launched a new feature. Within seconds, millions of events start flooding in: user clicks, purchases, page views, API calls. Traditional databases are choking. Message queues are dropping data. Your monitoring dashboard shows error rates climbing.
This is exactly the situation LinkedIn faced in 2010. They were processing billions of events daily, and nothing on the market could handle the load. So they built something new. Something that would eventually power systems at Netflix, Uber, Airbnb, and thousands of other companies.
They called it Kafka.
But here’s the thing: Kafka is not a message queue. It’s not a database. It’s something far more interesting.
The Problem: Why Traditional Systems Break
Before we talk about Kafka, let’s understand why existing solutions failed.
Traditional Message Queues: The Delivery Truck Problem
Think of RabbitMQ or ActiveMQ as delivery trucks. When you send a message:
- Producer sends message to queue
- Queue stores it in memory
- Consumer reads and processes it
- Message gets deleted
This works great for low volumes. But what happens when you need to:
- Process millions of messages per second?
- Let multiple consumers read the same data?
- Replay old messages for debugging?
- Keep messages for days or weeks?
Traditional queues struggle because they delete messages after delivery. Once it’s gone, it’s gone forever.
Traditional Databases: The Filing Cabinet Problem
Databases are built for structured data with complex queries. But when you need to:
- Append millions of events per second
- Read in strict chronological order
- Scale horizontally without complex sharding
Databases become the bottleneck. They’re solving the wrong problem.
The Insight: Think Like a Log
Here’s the key insight that makes Kafka different: Everything is a log.
Not a log file you grep through. Not application logs. We’re talking about an append-only sequence of records.
This simple shift changes everything. With a log:
- Producers append at the end (blazing fast)
- Consumers track their own position (called an offset)
- Multiple consumers can read independently
- Messages persist for configurable retention
- Replay is just resetting your position
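To make this concrete, here is a minimal, Kafka-free sketch of the idea in Java; the class name MiniLog and its methods are made up purely for illustration:
import java.util.ArrayList;
import java.util.List;

// A toy append-only log: producers append at the end, consumers track their own offsets.
class MiniLog {
    private final List<String> records = new ArrayList<>();

    long append(String record) {          // producers only ever add at the end
        records.add(record);
        return records.size() - 1;        // offset of the newly written record
    }

    List<String> readFrom(long offset) {  // each consumer reads from wherever it left off
        return records.subList((int) offset, records.size());
    }
}
Nothing is deleted on read, so replaying is just calling readFrom with an older offset.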
LinkedIn’s Jay Kreps (one of Kafka’s creators) realized that this pattern shows up everywhere:
- Database replication logs
- File systems
- Version control (Git commits)
- Event sourcing
Kafka took this universal pattern and built a distributed system around it.
Kafka Architecture: The Core Components
Let’s break down how Kafka actually works.
Topics: The Category of Your Data
A topic is like a folder or channel where related events live. Examples:
user-clicks
payment-transactions
sensor-readings
application-logs
Each topic is completely independent. You can have thousands of topics in a single Kafka cluster.
Partitions: The Secret to Scale
Here’s where it gets interesting. Each topic is split into partitions. Think of partitions as lanes on a highway.
graph TB
subgraph "Topic: user-events"
P0[Partition 0<br/>Event 0, 3, 6, 9...]
P1[Partition 1<br/>Event 1, 4, 7, 10...]
P2[Partition 2<br/>Event 2, 5, 8, 11...]
end
Producer[Producer] -->|Hash user_id| P0
Producer -->|Hash user_id| P1
Producer -->|Hash user_id| P2
style P0 fill:#e3f2fd
style P1 fill:#e8f5e9
style P2 fill:#fff3e0
Why partitions matter:
- Parallelism: Different partitions on different servers
- Ordering: Messages within a partition are strictly ordered
- Scalability: Add more partitions to handle more load
Key insight: Kafka guarantees ordering within a partition, not across partitions. This is a crucial trade-off for performance.
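The routing itself is simple: for a keyed record, the producer hashes the key and takes it modulo the partition count (Kafka’s default partitioner uses a murmur2 hash; the sketch below substitutes hashCode() just to show the shape of the logic):
// Illustrative only - Kafka's real default partitioner uses murmur2, not hashCode()
int partitionFor(String key, int numPartitions) {
    return (key.hashCode() & 0x7fffffff) % numPartitions;  // mask keeps the hash non-negative
}
Because the same key always hashes to the same partition, every event for a given user lands in one partition and stays in order.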
Brokers: The Workers
A broker is a Kafka server. It stores data and serves client requests. A Kafka cluster typically has multiple brokers (3, 5, 10, or even hundreds).
graph TB
subgraph "Kafka Cluster"
B1[Broker 1<br/>Stores: P0, P3]
B2[Broker 2<br/>Stores: P1, P4]
B3[Broker 3<br/>Stores: P2, P5]
end
Producer[Producers] --> B1
Producer --> B2
Producer --> B3
B1 --> Consumer[Consumers]
B2 --> Consumer
B3 --> Consumer
Each partition lives on one broker (called the leader) and is replicated to others (called followers). If a broker dies, followers take over.
Producers: Writing Data
Producers send data to Kafka. Here’s what happens:
- Producer creates a record with a key and value
- Producer decides which partition to send it to
- Record gets appended to the partition’s log
- Kafka returns an acknowledgment
Partition selection strategies:
- Key-based: Same key always goes to same partition (order preserved for that key)
- Round-robin: Distribute evenly across partitions
- Custom: You define the logic
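Here is a minimal Java producer showing a key-based send; the topic name, key, and bootstrap address are placeholders:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key "user123" -> always the same partition -> per-user ordering preserved
            producer.send(new ProducerRecord<>("user-clicks", "user123", "clicked_button"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("partition=%d, offset=%d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        } // close() flushes any pending records
    }
}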
Consumers: Reading Data
Consumers read data from Kafka. But here’s the clever part: consumers track their own position.
sequenceDiagram
participant C as Consumer
participant K as Kafka
C->>K: Give me messages from offset 100
K->>C: Here are messages 100-110
C->>C: Process messages
C->>K: Commit offset 110
Note over C,K: If consumer crashes, restart from offset 110
Consumer Groups: Multiple consumers can work together. Kafka automatically distributes partitions among them.
graph TB
subgraph TopicPartitions["Topic Partitions"]
P0["Partition 0"]
P1["Partition 1"]
P2["Partition 2"]
end
subgraph ConsumerGroup["Consumer Group: analytics"]
C1["Consumer 1"]
C2["Consumer 2"]
C3["Consumer 3"]
end
P0 --> C1
P1 --> C2
P2 --> C3
style C1 fill:#e3f2fd
style C2 fill:#e8f5e9
style C3 fill:#fff3e0
Key rule: One partition is read by only one consumer in a group. This guarantees ordering and prevents duplicate processing.
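A minimal Java consumer joining a group might look like the sketch below (group id, topic, and address are placeholders); Kafka hands it a share of the partitions automatically:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics");        // all members of this group share the partitions
        props.put("enable.auto.commit", "false");  // commit manually, only after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync();             // advance the group's offset once processed
            }
        }
    }
}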
How Data Flows: A Complete Journey
Let’s trace what happens when you send a message through Kafka.
Step 1: Producer Sends Data
sequenceDiagram
participant P as Producer
participant B1 as Broker 1 (Leader)
participant B2 as Broker 2 (Follower)
participant B3 as Broker 3 (Follower)
P->>B1: Send record (key="user123", value="clicked_button")
B1->>B1: Append to log
B1->>B2: Replicate to follower
B1->>B3: Replicate to follower
B2->>B1: Ack replication
B3->>B1: Ack replication
B1->>P: Ack success (offset: 4567)
Acknowledgment levels:
- acks=0: Don’t wait (fire and forget, fastest but risky)
- acks=1: Leader acknowledges (balanced)
- acks=all: All replicas acknowledge (safest but slower)
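On the producer these are ordinary configuration settings; a hedged sketch of a safety-first setup:
props.put("acks", "all");                  // wait for all in-sync replicas before acknowledging
props.put("retries", Integer.MAX_VALUE);   // retry transient failures instead of dropping records
props.put("delivery.timeout.ms", 120000);  // overall upper bound on how long a send may take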
Step 2: Data Gets Replicated
Each partition has a replication factor (typically 3). The data lives on multiple brokers for fault tolerance.
graph TB
subgraph "Partition 0 - Replication Factor 3"
Leader[Broker 1: Leader<br/>Handles reads/writes]
Follower1[Broker 2: Follower<br/>Keeps synced copy]
Follower2[Broker 3: Follower<br/>Keeps synced copy]
end
Leader -.->|Replicates| Follower1
Leader -.->|Replicates| Follower2
style Leader fill:#4caf50
style Follower1 fill:#e8f5e9
style Follower2 fill:#e8f5e9
ISR (In-Sync Replicas): Followers that are caught up with the leader. If the leader dies, Kafka promotes a follower from the ISR.
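Replication is configured per topic when the topic is created. A sketch using the Java AdminClient (topic name, counts, and address are placeholders):
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("user-events", 3, (short) 3)  // 3 partitions, replication factor 3
                    .configs(Map.of("min.insync.replicas", "2"));       // acks=all requires 2 replicas in the ISR
            admin.createTopics(List.of(topic)).all().get();             // block until the topic is created
        }
    }
}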
Step 3: Consumer Reads Data
sequenceDiagram
participant C as Consumer
participant K as Kafka Broker
C->>K: Subscribe to topic "user-events"
K->>C: Assigned partition 0
loop Every poll
C->>K: Fetch from offset 4565
K->>C: Return records 4565-4575
C->>C: Process records
C->>K: Commit offset 4575
end
Consumer commits can be:
- Automatic: Kafka commits periodically
- Manual: You control when to commit (safer)
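The choice is a consumer configuration setting (a sketch; values are illustrative):
// Automatic: the client commits in the background on a timer
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "5000");

// Manual (safer): disable auto-commit and call commitSync()/commitAsync() after processing
props.put("enable.auto.commit", "false");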
The Write-Ahead Log: Kafka’s Performance Secret
Kafka’s speed comes from treating storage as a sequential log. Here’s why this is brilliant:
Sequential Disk I/O is Fast
Random disk access: ~100-200 IOPS (seeks kill performance)
Sequential disk access: ~500 MB/s (same as memory in some cases)
Kafka only appends to the end of log files. No random seeks. No updates. Just append.
graph LR
subgraph "Kafka Log Segment"
M1[Offset 0<br/>Message] --> M2[Offset 1<br/>Message] --> M3[Offset 2<br/>Message] --> M4[...]
end
M4 -.->|Always append at end| New[New Message]
style New fill:#4caf50
Real numbers from LinkedIn:
- 2 million writes/second on a three-broker cluster (LinkedIn’s published benchmark)
- Linear scaling: Add more brokers for more throughput
- Retention: Can keep data for days with minimal performance impact
Similar to the Write-Ahead Log pattern used in databases, Kafka’s log-based architecture ensures durability and high performance.
Zero-Copy: Skip the Application Layer
When consumers read data, Kafka uses the sendfile() system call to transfer data directly from disk to the network socket without copying it through user space.
graph TB
subgraph "Traditional Copy"
Disk1[Disk] --> Kernel1[Kernel<br/>Buffer]
Kernel1 --> App1[Application<br/>Buffer]
App1 --> Socket1[Socket<br/>Buffer]
Socket1 --> Network1[Network]
end
subgraph "Kafka Zero-Copy"
Disk2[Disk] --> Kernel2[Kernel<br/>Buffer]
Kernel2 --> Socket2[Socket<br/>Buffer]
Socket2 --> Network2[Network]
end
style Disk2 fill:#4caf50
style Network2 fill:#4caf50
This eliminates expensive memory copies and context switches. It’s one reason Kafka can saturate network bandwidth.
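Java exposes this path through FileChannel.transferTo, which Kafka’s design documentation describes for plaintext transfers (TLS forces data back through user space). A standalone sketch of the idea; the file name and destination address are placeholders:
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Path.of("segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo can use sendfile() under the hood: disk -> kernel -> socket,
                // without copying the bytes into this application's buffers
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}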
Consumer Groups: The Coordination Magic
Consumer groups are Kafka’s way of distributing work. Let’s see how this works in practice.
Scenario: Processing Orders
You have an orders topic with 6 partitions. You want to process orders as fast as possible.
With 1 consumer:
graph LR
P0[P0] --> C1[Consumer 1]
P1[P1] --> C1
P2[P2] --> C1
P3[P3] --> C1
P4[P4] --> C1
P5[P5] --> C1
One consumer handles all partitions. Slow and no parallelism.
With 3 consumers in same group:
graph TB
P0[P0] --> C1[Consumer 1]
P1[P1] --> C1
P2[P2] --> C2[Consumer 2]
P3[P3] --> C2
P4[P4] --> C3[Consumer 3]
P5[P5] --> C3
style C1 fill:#e3f2fd
style C2 fill:#e8f5e9
style C3 fill:#fff3e0
Each consumer gets 2 partitions. Processing happens in parallel. 3x faster.
With 6 consumers in same group:
graph LR
P0[P0] --> C1[Consumer 1]
P1[P1] --> C2[Consumer 2]
P2[P2] --> C3[Consumer 3]
P3[P3] --> C4[Consumer 4]
P4[P4] --> C5[Consumer 5]
P5[P5] --> C6[Consumer 6]
One partition per consumer. Maximum parallelism for this topic.
With 8 consumers in same group:
graph LR
P0[P0] --> C1[Consumer 1]
P1[P1] --> C2[Consumer 2]
P2[P2] --> C3[Consumer 3]
P3[P3] --> C4[Consumer 4]
P4[P4] --> C5[Consumer 5]
P5[P5] --> C6[Consumer 6]
C7[Consumer 7<br/>Idle]
C8[Consumer 8<br/>Idle]
style C7 fill:#ffcdd2
style C8 fill:#ffcdd2
Two consumers sit idle. More consumers than partitions provides no benefit.
Key takeaway: Number of partitions = maximum parallelism for a consumer group.
Rebalancing: When Things Change
When a consumer joins or leaves the group, Kafka rebalances partitions.
sequenceDiagram
participant C1 as Consumer 1
participant C2 as Consumer 2
participant K as Kafka
participant C3 as Consumer 3 (New)
Note over C1,C2: C1 has P0,P1,P2<br/>C2 has P3,P4,P5
C3->>K: Join consumer group
K->>C1: Stop consuming, rebalancing
K->>C2: Stop consuming, rebalancing
K->>C1: You now have P0,P1
K->>C2: You now have P2,P3
K->>C3: You now have P4,P5
Note over C1,C3: Resume consuming with new assignments
Rebalancing can cause brief pauses, so keep your consumer group stable when possible.
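To soften the impact, a consumer can hook into rebalances and commit its progress before giving up partitions. Building on the consumer sketch above (ConsumerRebalanceListener and TopicPartition come from the Kafka client library):
consumer.subscribe(List.of("user-events"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        consumer.commitSync();  // about to lose these partitions: save our progress first
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // new assignment received; consumption resumes from the committed offsets
    }
});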
Fault Tolerance: What Happens When Things Break
Kafka is built for failures. Let’s walk through different failure scenarios.
Broker Failure
graph TB
subgraph "Before Failure"
B1[Broker 1<br/>Leader P0]
B2[Broker 2<br/>Follower P0]
B3[Broker 3<br/>Follower P0]
end
subgraph "After Broker 1 Crashes"
B1_Dead[Broker 1<br/>Dead]
B2_Leader[Broker 2<br/>NEW Leader P0]
B3_Follower[Broker 3<br/>Follower P0]
end
style B1_Dead fill:#ffcdd2
style B2_Leader fill:#4caf50
What happens:
- Broker 1 dies
- Kafka detects the failure (via heartbeats)
- Broker 2 (in-sync follower) becomes the new leader
- Producers and consumers automatically switch to Broker 2
- Total downtime: seconds or less
No data loss if you used acks=all, because all replicas had the data.
Consumer Failure
sequenceDiagram
participant C1 as Consumer 1
participant C2 as Consumer 2
participant K as Kafka
Note over C1,C2: C1 reading P0,P1<br/>C2 reading P2,P3
C1->>K: Heartbeat
K->>C1: OK
Note over C1: Consumer 1 crashes
K->>K: Detect C1 failure<br/>(missed heartbeats)
K->>C2: Rebalancing: you now have P0,P1,P2,P3
C2->>K: Resume from last committed offsets
No messages are lost because the consumer commits its offset only after processing. The new consumer picks up where the old one left off.
Network Partition
This is the tricky one. What if a consumer loses connection but doesn’t crash?
sequenceDiagram
participant C1 as Consumer 1
participant K as Kafka
participant C2 as Consumer 2
C1->>K: Read messages
K->>C1: Messages 100-110
Note over C1,K: Network partition
K->>K: C1 missed heartbeats<br/>Trigger rebalance
K->>C2: You now own C1's partitions
Note over C1: Network restored
C1->>K: Commit offset 110
K->>C1: Error: You don't own this partition anymore
A generation ID prevents this consumer from committing stale offsets: Kafka tracks which generation of the consumer group each member belongs to.
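In the Java client this fencing shows up as a CommitFailedException; the right response is to let the rebalance finish and accept some reprocessing rather than retry the commit. A sketch:
try {
    consumer.commitSync();
} catch (CommitFailedException e) {
    // The group rebalanced while we were processing: our generation is stale and another
    // consumer now owns these partitions. Drop the commit; the new owner resumes from the
    // last offset that was successfully committed, so these records get reprocessed.
    System.err.println("Commit fenced by rebalance; records will be reprocessed");
}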
Real-World Use Cases: Where Kafka Shines
Use Case 1: Activity Tracking at LinkedIn
The problem: Track every action users take (views, clicks, searches) for analytics and recommendations.
Volume: Billions of events per day
Solution:
graph LR
Web[Web App] -->|user events| Kafka[Kafka Topic:<br/>user-activity]
Mobile[Mobile App] -->|user events| Kafka
Kafka --> Analytics[Real-time Analytics]
Kafka --> Recommendations[Recommendation Engine]
Kafka --> Hadoop[Hadoop Data Lake]
Why Kafka:
- Handles millions of events/second
- Multiple teams consume the same data
- Can replay for debugging or new features
- Retention allows historical analysis
Use Case 2: Log Aggregation at Uber
The problem: Collect logs from thousands of microservices running across multiple datacenters.
Volume: Petabytes per day
Solution:
graph TB
subgraph "Datacenter 1"
S1[Service A] -->|logs| A1[Agent]
S2[Service B] -->|logs| A1
end
subgraph "Datacenter 2"
S3[Service C] -->|logs| A2[Agent]
S4[Service D] -->|logs| A2
end
A1 --> K[Kafka Cluster]
A2 --> K
K --> ES[Elasticsearch<br/>Search]
K --> Archive[S3<br/>Archive]
K --> Alert[Alerting System]
Why Kafka:
- Buffers log spikes (prevents data loss)
- Decouples producers from consumers
- Easy to add new consumers (monitoring, alerting, etc.)
- Retention for debugging production issues
Use Case 3: Event Sourcing at Netflix
The problem: Track every state change in the system for debugging and auditing.
Solution: Store all events in Kafka and rebuild current state by replaying events.
sequenceDiagram
participant App as Application
participant K as Kafka
participant View as Materialized View
App->>K: Event: "User123 started video V456"
App->>K: Event: "User123 paused at 00:05:32"
App->>K: Event: "User123 resumed at 00:05:45"
View->>K: Replay all events
K->>View: Return event stream
View->>View: Build current state
Why Kafka:
- Immutable event log
- Can rebuild state from any point in time
- Debugging is replaying events
- Multiple views from same events
Similar to how CQRS separates reads and writes, Kafka enables event-sourced architectures where the event log is the source of truth.
The Coordination Problem: ZooKeeper vs KRaft
For years, Kafka depended on Apache ZooKeeper for coordination:
- Electing partition leaders
- Tracking cluster membership
- Storing metadata
But ZooKeeper adds operational complexity. In 2020, Kafka started replacing it with KRaft (Kafka Raft), a built-in consensus protocol.
graph TB
subgraph "Old Architecture (ZooKeeper)"
K1[Kafka Broker 1]
K2[Kafka Broker 2]
K3[Kafka Broker 3]
Z1[ZooKeeper 1]
Z2[ZooKeeper 2]
Z3[ZooKeeper 3]
K1 -.->|Metadata| Z1
K2 -.->|Metadata| Z2
K3 -.->|Metadata| Z3
end
subgraph "New Architecture (KRaft)"
C1[Controller 1]
C2[Controller 2]
C3[Controller 3]
B1[Broker 1]
B2[Broker 2]
B1 -.->|Metadata| C1
B2 -.->|Metadata| C1
end
style Z1 fill:#ffcdd2
style Z2 fill:#ffcdd2
style Z3 fill:#ffcdd2
style C1 fill:#4caf50
Benefits of KRaft:
- Simpler operations (one system instead of two)
- Faster metadata propagation
- Support for millions of partitions
- Easier to scale
If you’re starting fresh, use KRaft mode. It’s production-ready as of Kafka 3.3.
Performance Tuning: Making Kafka Fly
Producer Optimization
Batching: Don’t send every message immediately. Batch them.
// Good: High throughput
props.put("linger.ms", 10); // Wait up to 10ms to batch
props.put("batch.size", 32768); // 32KB batches
props.put("compression.type", "lz4"); // Compress batches
Compression saves network bandwidth:
- Text data: 5-10x compression
- JSON logs: 3-5x compression
- Binary data: 1-2x compression
Idempotence prevents duplicates:
props.put("enable.idempotence", true); // Exactly-once semantics
Consumer Optimization
Increase fetch size:
props.put("fetch.min.bytes", 50000); // Wait for 50KB
props.put("fetch.max.wait.ms", 500); // Or 500ms timeout
Parallel processing:
// Use a thread pool to process messages in parallel
messages.forEach(msg -> threadPool.submit(() -> process(msg)));
// Caution: only commit offsets after all submitted tasks have completed
Commit strategy:
// Manual commits after processing
ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));
processMessages(messages);
consumer.commitSync(); // Only commit after successful processing
Partition Design
How many partitions?
- Too few: can’t scale consumption
- Too many: metadata overhead, slower leader elections
Rule of thumb:
partitions = max(target_throughput / producer_throughput,
target_throughput / consumer_throughput)
Example: Need 1 MB/s throughput, producer can do 100 KB/s, consumer can do 50 KB/s
partitions = max(1000/100, 1000/50) = max(10, 20) = 20 partitions
Lessons from the Trenches
Lesson 1: Partition Keys Matter
Wrong approach:
// Keyed by orderId - each order hashes to an arbitrary partition, so a customer's orders
// are scattered across partitions and their relative ordering is lost
producer.send(new ProducerRecord<>("orders", orderId, orderData));
Right approach:
// Use customer_id as key - all orders for a customer go to same partition
producer.send(new ProducerRecord<>("orders", customerId, orderData));
Why: If you care about order (and you probably do), use meaningful keys.
Lesson 2: Monitor Consumer Lag
Consumer lag = how far behind your consumers are from the latest messages.
graph LR
subgraph "Kafka Partition"
M1[Offset 1000] --> M2[Offset 1500] --> M3[Latest: 2000]
end
Consumer[Consumer<br/>At offset 1200]
M2 -.->|Lag: 800 messages| Consumer
style Consumer fill:#fff3e0
High lag means:
- Consumers can’t keep up
- Need more consumer instances
- Or consumers have bugs/slow processing
Monitor with:
kafka-consumer-groups --bootstrap-server localhost:9092 \
--describe --group my-group
Lesson 3: Retention is a Safety Net
Keep data longer than you think you need:
// Retain for 7 days instead of 1 day
configs.put("retention.ms", "604800000"); // 7 days
Real story: A company had 1-day retention. They discovered a bug that corrupted data. But the bug was 2 days old. All correct data was already deleted. They couldn’t recover.
With 7-day retention, they could replay from before the bug.
Lesson 4: Replication Factor = 3
// Always use replication factor of 3 in production
// (set when creating the topic, or via the broker's default.replication.factor)
configs.put("replication.factor", "3");
configs.put("min.insync.replicas", "2"); // With acks=all, require 2 in-sync replicas for each ack
Why 3?
- Tolerates one broker failure (for maintenance)
- Tolerates another unexpected failure
- Industry standard
Lesson 5: Design for Reprocessing
Assume consumers will crash and reprocess messages.
// BAD: Not idempotent
void process(Message msg) {
balance += msg.amount; // Double-processing = wrong balance
db.save(balance);
}
// GOOD: Idempotent
void process(Message msg) {
if (!db.hasProcessed(msg.id)) {
balance += msg.amount;
db.save(balance);
db.markProcessed(msg.id);
}
}
Make your processing idempotent - processing the same message twice has the same effect as processing it once.
When NOT to Use Kafka
Kafka is powerful but not always the right choice.
Don’t use Kafka when:
- Low latency is critical (< 1ms)
  - Kafka has millisecond latency, not microsecond
  - Use: in-memory message queues or Redis
- Small scale (< 1000 messages/second)
  - Kafka's complexity isn't worth it
  - Use: RabbitMQ, AWS SQS, or simple pub/sub
- Request-reply patterns
  - Kafka is for event streaming, not RPC
  - Use: REST APIs, gRPC
- Complex routing
  - Kafka has simple topic-based routing
  - Use: RabbitMQ with topic exchanges
- Tiny messages (< 10 bytes)
  - Kafka's per-message overhead is high
  - Use: Redis Streams, or batch messages together
The Bigger Picture: Kafka in Modern Architecture
Kafka has become the central nervous system for data-driven companies.
graph TB
subgraph "Data Sources"
DB[(Databases)]
API[APIs]
Mobile[Mobile Apps]
IoT[IoT Devices]
end
subgraph "Kafka as Data Hub"
K[Apache Kafka<br/>Event Streaming Platform]
end
subgraph "Data Destinations"
Analytics[Real-time Analytics]
ML[ML Training]
DW[Data Warehouse]
Search[Search Index]
Cache[Caches]
Alerts[Alerting]
end
DB --> K
API --> K
Mobile --> K
IoT --> K
K --> Analytics
K --> ML
K --> DW
K --> Search
K --> Cache
K --> Alerts
style K fill:#4caf50
Instead of point-to-point connections (N systems talking to M systems = N*M connections), everything flows through Kafka.
Benefits:
- Decoupling: Add new data sources or destinations without changing existing systems
- Replay: Reprocess historical data for new use cases
- Real-time: All systems get data in real-time
- Scalability: One scalable platform instead of many queues
This is why companies like LinkedIn built Kafka - they needed a universal data pipeline.
Getting Started: Your First Kafka Application
Want to try Kafka? Here’s a quick start:
1. Run Kafka locally (using Docker):
docker run -d --name kafka \
-p 9092:9092 \
apache/kafka:latest
2. Create a topic:
docker exec kafka /opt/kafka/bin/kafka-topics.sh \
--create --topic my-topic \
--bootstrap-server localhost:9092 \
--partitions 3 \
--replication-factor 1
3. Send some data:
docker exec -it kafka /opt/kafka/bin/kafka-console-producer.sh \
--topic my-topic \
--bootstrap-server localhost:9092
4. Read the data:
docker exec -it kafka /opt/kafka/bin/kafka-console-consumer.sh \
--topic my-topic \
--from-beginning \
--bootstrap-server localhost:9092
For production, explore managed services like:
- Confluent Cloud: Kafka-as-a-Service by Kafka’s creators
- AWS MSK: Managed Streaming for Kafka
- Azure Event Hubs: Kafka-compatible service
Conclusion: Why Kafka Changed Everything
Kafka succeeded because it solved a fundamental problem: how to move data between systems at scale.
The key insights:
- Treat data as a log - append-only, immutable, replayable
- Partition for parallelism - scale horizontally
- Decouple producers and consumers - let them scale independently
- Persist everything - storage is cheap, lost data is expensive
What started as LinkedIn’s internal tool became the backbone of modern data architecture. Companies process trillions of messages through Kafka daily.
The next time you order an Uber, stream on Netflix, or buy something online, there’s a good chance Kafka is moving data behind the scenes.
For more distributed systems patterns, check out our posts on Write-Ahead Log, Paxos Consensus Algorithm, and Distributed Counter Architecture. Want to understand Kafka’s deployment infrastructure? Read our Kubernetes Architecture guide.