Scale or Signal? How One Config Change Rewrites Kafka's Behaviour

- Baidyanath Prasad on January 31, 2026

"Is your system designed for Teamwork or Broadcast? The difference is just one line of code."

Imagine you’re debugging a critical production issue. You’re staring at the logs, coffee in hand, trying to track an order through your system.

You have a Kafka Topic named orders with 2 Partitions (P1 & P2). You publish two messages:

Order #101 (lands on Partition P1).
Order #102 (lands on Partition P2).

You fire up two pods of your microservice (let’s call them Consumer A and Consumer B) to process these orders. You watch the terminal, waiting for them to light up.

Consumer A picks up Order #101.
Consumer B picks up Order #102.

But then you notice something unsettling. Consumer A never saw Order #102. And Consumer B completely ignored Order #101.

If you come from a traditional pub-sub world (like JMS or ActiveMQ), the panic starts to set in. "Did the message get lost? Why didn't Consumer A see both orders? Is the partition broken?"

The short answer: the data isn't lost. The system is doing exactly what it was designed to do. You are just operating in "Scale" mode when you might have expected "Signal" mode. So, what are these modes? Let's walk through them:

Mode 1: Scale (The "Work Sharing" Mindset)

The reason your consumers didn't see every message is that Kafka is designed for high throughput. When you start multiple consumers with the same group.id (say, group.id = cleaning_service), Kafka treats them as a Single Team for the cleaning service and divides the topic's partitions among them.

Think of it like a Railway Station.

The Topic is the Station itself (e.g., Central Station).
The Partitions are the physical Platforms (Platform 1 & Platform 2).
The Consumers are the Cleaning Crew (Cleaner A and Cleaner B).

The goal is to clean trains as fast as possible. Kafka acts as the Station Master. It assigns duties to prevent chaos:

Cleaner A is assigned exclusively to Platform 1.
Cleaner B is assigned exclusively to Platform 2.

If a train arrives on Platform 2, Cleaner A ignores it. Is Cleaner A lazy? No. If Cleaner A ran over to Platform 2, they would just bump into Cleaner B. It would be an inefficient duplication of work.

In Kafka, this is called Load Balancing. So, when Consumer A "misses" the message on Partition 2, it’s not a bug. It’s effective Station Management. In the scenario above, each pod processes only the messages from its own partition (Consumer A handles Order #101, Consumer B handles Order #102), so no work is duplicated.
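The Station Master's duty roster can be sketched as a toy model. This is only a round-robin illustration, not Kafka's actual assignment logic (real assignors are configurable and more involved); the function name assign_partitions and the consumer names are hypothetical.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Toy round-robin assignment inside ONE consumer group.

    Each partition gets exactly one owner, so no two members of the
    group ever process the same message.
    """
    assignment = defaultdict(list)
    members = sorted(consumers)
    for i, partition in enumerate(sorted(partitions)):
        owner = members[i % len(members)]
        assignment[owner].append(partition)
    return dict(assignment)


# Two pods sharing group.id = cleaning_service; topic "orders" has P1 and P2.
group = assign_partitions(["P1", "P2"], ["consumer-A", "consumer-B"])
print(group)  # {'consumer-A': ['P1'], 'consumer-B': ['P2']}
```

Consumer A owns P1 only, which is why it never sees Order #102.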

Mode 2: Signal (The "Broadcast" Mindset)

But what if you do want everyone to see everything? Let's say you have two different departments at the station:

The Cleaning Crew: Needs to clean every train.
The Security Squad: Needs to scan every train for safety.

If you put a Cleaner and a Security Guard in the same consumer group, Kafka will split the platforms between them. The Cleaner will clean the train on Platform 1 (but not scan it), and the Guard will scan the train on Platform 2 (but not clean it). Disaster.

The Solution: Give them different Badges.

To switch Kafka from "Work Sharing" mode to "Broadcast" mode, you simply give each application a unique identity.

Cleaning Crew: Configures group.id = cleaning_service.
- Result: Kafka assigns Platform 1 + 2 to this group.
Security Squad: Configures group.id = security_service.
- Result: Kafka also assigns Platform 1 + 2 to this group.

Now, both the Cleaner and the Guard visit every single train on every platform. You have successfully pivoted from "Scale" to "Signal."
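Here is a rough sketch of the "different badges" idea. It is a simulation in plain Python rather than a real Kafka client, and the deliver function, the broker address, and the member names are assumptions made for illustration. The two config dictionaries use the standard bootstrap.servers and group.id settings and differ in one line only.

```python
# The only difference between the two applications is the group.id value.
# "localhost:9092" is a placeholder broker address.
cleaning_config = {"bootstrap.servers": "localhost:9092", "group.id": "cleaning_service"}
security_config = {"bootstrap.servers": "localhost:9092", "group.id": "security_service"}


def deliver(messages, groups):
    """Simulate delivery of (partition, payload) pairs.

    groups maps group_id -> {member_name: [owned partitions]}.
    Every group receives every message; inside a group, only the
    member that owns the partition receives it.
    """
    seen = {}
    for group_id, members in groups.items():
        for partition, payload in messages:
            for member, owned in members.items():
                if partition in owned:
                    seen.setdefault((group_id, member), []).append(payload)
    return seen


messages = [("P1", "Order #101"), ("P2", "Order #102")]
groups = {
    cleaning_config["group.id"]: {"cleaner": ["P1", "P2"]},
    security_config["group.id"]: {"guard": ["P1", "P2"]},
}
seen = deliver(messages, groups)
print(seen[("cleaning_service", "cleaner")])  # ['Order #101', 'Order #102']
print(seen[("security_service", "guard")])    # ['Order #101', 'Order #102']
```

Both departments see both orders because each group gets its own full assignment of the platforms.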

The Partition Myth: Why 'More' Isn't Always 'Faster'?

Now that you understand how groups work, you might be tempted to think: "Okay, if Platforms (Partitions) allow for parallelism, why don't I just build 10,000 Platforms? Then I can have 10,000 Cleaners working at once!"

Don't do this.

While Consumer Groups are lightweight (just badges), Partitions are heavy.

Each partition is a physical folder on the disk with open file handles for its segment files (.log, .index, .timeindex).
If a broker fails, the cluster must elect new leaders for every partition that broker was leading. If you have too many, this "election storm" can freeze your cluster for minutes.
Rule of Thumb: Try to keep under 4,000 partitions per broker. If you need more parallelism than that, you need a bigger cluster.
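To get a feel for the rule of thumb, here is some back-of-the-envelope arithmetic. It assumes the limit counts every replica a broker hosts and that replicas are spread evenly, neither of which the rule itself spells out. The cluster numbers and function names are hypothetical.

```python
import math

LIMIT_PER_BROKER = 4000  # the rule of thumb quoted above


def replicas_per_broker(total_partitions, replication_factor, brokers):
    """Average partition replicas hosted by each broker (even spread assumed)."""
    return math.ceil(total_partitions * replication_factor / brokers)


def brokers_needed(total_partitions, replication_factor, limit=LIMIT_PER_BROKER):
    """Smallest broker count that keeps every broker at or under the limit."""
    return math.ceil(total_partitions * replication_factor / limit)


# Hypothetical cluster: 10,000 partitions, replication factor 3, 6 brokers.
load = replicas_per_broker(10_000, 3, 6)
print(load, load <= LIMIT_PER_BROKER)  # 5000 False
print(brokers_needed(10_000, 3))       # 8
```

Under these assumptions, the 10,000-platform station needs at least 8 brokers before it fits within the guideline.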

How does this compare to the others?

If you’re coming from a RabbitMQ or Redis background, Kafka's logic can feel alien. Here is the quick translation guide:

1. RabbitMQ (The "Smart Broker")

RabbitMQ doesn't use Consumer Groups. It uses Queues.

For Load Balancing: Bind multiple consumers to the same Queue.
For Broadcast: Use a Fanout Exchange. This effectively copies the message into a unique Queue for every single subscriber. (Note: This increases storage cost linearly!)
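The storage difference between the two broadcast styles can be sketched with two toy classes. Neither is a real RabbitMQ or Kafka client; FanoutExchange, SharedLog, and the "billing" subscriber are hypothetical names, and replication is ignored.

```python
class FanoutExchange:
    """Toy fanout: every bound queue receives its own copy of each message."""

    def __init__(self):
        self.queues = {}

    def bind(self, queue_name):
        self.queues[queue_name] = []

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)

    def stored_messages(self):
        return sum(len(queue) for queue in self.queues.values())


class SharedLog:
    """Toy log: each message is stored once; a group is just a read position."""

    def __init__(self):
        self.records = []
        self.offsets = {}

    def publish(self, message):
        self.records.append(message)

    def poll(self, group_id):
        start = self.offsets.get(group_id, 0)
        self.offsets[group_id] = len(self.records)
        return self.records[start:]

    def stored_messages(self):
        return len(self.records)


subscribers = ["cleaning", "security", "billing"]
exchange, log = FanoutExchange(), SharedLog()
for name in subscribers:
    exchange.bind(name)
for order in ["Order #101", "Order #102"]:
    exchange.publish(order)
    log.publish(order)

print(exchange.stored_messages())  # 6 (2 messages x 3 queues)
print(log.stored_messages())       # 2 (stored once, read by any group)
print(log.poll("security"))        # ['Order #101', 'Order #102']
```

In the log model, adding a subscriber adds one offset entry rather than another copy of the data.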

2. Redis Pub/Sub (The "Fire & Forget")

Redis Pub/Sub is purely for broadcast, but with a catch: Zero Memory.

There is no "Platform" or "Log."
If you publish a message and a consumer is offline, that message is lost forever.
Use Case: Real-time signals like "User is typing..." Never use it for orders.
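The "Zero Memory" behaviour is easy to picture with a toy channel. This is a plain-Python stand-in, not the Redis client; FireAndForgetChannel is a hypothetical name. Like the Redis PUBLISH command, its publish method returns how many subscribers received the message.

```python
class FireAndForgetChannel:
    """Toy pub/sub channel with no storage: a message reaches only the
    subscribers that are connected at the moment it is published."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)
        return len(self.subscribers)


channel = FireAndForgetChannel()
received = []

# Published while the consumer is offline: nobody receives it.
delivered_before = channel.publish("Order #101")

# The consumer connects afterwards and only sees later messages.
channel.subscribe(received.append)
delivered_after = channel.publish("User is typing...")

print(delivered_before, delivered_after)  # 0 1
print(received)                           # ['User is typing...']
```

Order #101 is gone for good, which is fine for a typing indicator and unacceptable for an order.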

Summary

If you are designing a Kafka system, keep these rules in your back pocket:

Same group.id = Scale. Use this for high-speed parallel processing (The Cleaning Crew).
Different group.id = Signal. Use this for independent applications (Security vs. Cleaning).
The Limit Exists: You cannot have more active consumers than partitions in a single group. You can start extra consumers, but they receive no partitions. If you have 2 Platforms and 3 Cleaners, the 3rd Cleaner will simply stand on the tracks with nothing to do.
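The idle third Cleaner shows up clearly in a toy round-robin model. As before, this is an illustration and not Kafka's real assignor; assign_partitions and the cleaner names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Toy round-robin assignment; members with no partition stay in the
    result with an empty list so idle consumers are visible."""
    assignment = {consumer: [] for consumer in sorted(consumers)}
    members = sorted(assignment)
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


# 2 Platforms, 3 Cleaners in the same group.
group = assign_partitions(["P1", "P2"], ["cleaner-1", "cleaner-2", "cleaner-3"])
idle = [consumer for consumer, owned in group.items() if not owned]

print(group)  # {'cleaner-1': ['P1'], 'cleaner-2': ['P2'], 'cleaner-3': []}
print(idle)   # ['cleaner-3']
```

An idle member is not wasted entirely: it can take over a platform if one of the others leaves the group.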

So, the next time you see a consumer "ignoring" a message, don't panic. Most likely, it is just trusting its teammate on the next platform to handle it.

Comments

Sanjeev Kumar Gupta, February 1, 2026 at 9:11 AM
A very well-written and informative article, explained clearly with real-life practical examples. Complex Kafka concepts such as producers, consumers, topics, partitions, offsets, and consumer groups are broken down in a simple and easy-to-understand manner. The article provides valuable insights into event-driven architecture, message durability, fault tolerance, and performance tuning, which are extremely helpful for designing robust, scalable, and high-throughput Kafka-based solutions. Overall, it is an excellent resource for both beginners and experienced developers working with Apache Kafka.