- Kafka is not just a Big Data tool and can be used for small data as well.
- For most Big Data technologies, not having a Big Data problem now, and not expecting one in the future, is the reason not to use them.
- Kafka is a distributed publish-subscribe system that can provide value to companies without clear Big Data problems.
- Pros of using Kafka with small data include data replication, removal of single points of failure, and the ability for consumers to move freely through the commit log.
- Cons of using Kafka compared to traditional small data pub/sub include a more complex programmatic API and more complex concepts such as partitions and offsets.
I’ve been teaching Kafka at companies that don’t have textbook Big Data problems. They don’t have, and will not have in the future, what you’d define as Big Data problems. As a result, the students ask me if using Kafka is appropriate for their use cases. Put another way, is Kafka only a Big Data tool?
For most Big Data technologies, such as Apache Hadoop or Apache Spark, not having a Big Data problem now or in the foreseeable future is reason enough not to use them. It’s a pretty clear pass/fail because the technical and operational overhead of these projects immediately negates any other benefits. Using Big Data tools for small data isn’t just massive overkill; it’s going to waste a lot of time and money.
For Kafka, it’s different. I define Kafka as a distributed publish-subscribe system. Companies without clear Big Data problems are gaining value from it. They’re able to use the other interesting features of Kafka.
Here are some of the pros I see for using Kafka with small data:
- All data can be replicated to more than one computer
- Kafka removes single points of failure for the brokers
- Kafka removes single points of failure for consumers with consumer groups
- Consumers can move freely through the commit log and go back in time
- Consumers don’t miss data as a result of downtime because the data is saved
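The last two pros can be sketched with a toy model. This is not the Kafka client API; `CommitLog` and `Consumer` are hypothetical classes written only for illustration, and the log here keeps records forever, whereas a real cluster keeps them according to its retention settings. The point it shows is that reading a record does not remove it, so a consumer is just a position (an offset) in the log.

```python
class CommitLog:
    """Toy append-only log (hypothetical): records stay after being read."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store a record and return its offset (its position in the log)."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Toy consumer (hypothetical) that tracks its own position in the log."""

    def __init__(self, log):
        self._log = log
        self.position = 0

    def poll(self, max_records=10):
        """Read from the current position and advance past what was read."""
        records = self._log.read(self.position, max_records)
        self.position += len(records)
        return records

    def seek(self, offset):
        """Move to any offset, including one that was already read."""
        self.position = offset


log = CommitLog()
consumer = Consumer(log)

log.append("order-1")
log.append("order-2")
first_batch = consumer.poll()  # the consumer is up and reads both records

# The consumer goes down; the producer keeps writing.
log.append("order-3")
log.append("order-4")

# The consumer comes back and resumes from its last position.
after_downtime = consumer.poll()

# Go back in time: rewind to the start and re-read everything.
consumer.seek(0)
replayed = consumer.poll()
```

In this sketch `after_downtime` holds the two records written while the consumer was away, and `replayed` holds all four. A traditional queue that deletes a message once it is delivered cannot offer the second behavior.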
Here are some of the cons I see for using Kafka compared to a traditional small data pub/sub:
- Programmatic API is more complex than others
- Conceptually more complex (e.g. partitions and offsets) than others
- Ordering is no longer global and is only on a partition basis
- Consumer groups will need to handle state transitions for failures
- Fewer people available with Kafka skills (you will probably need to train)
- Operationally, more processes will need to be monitored
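The ordering con is the one that most often surprises people coming from a single-queue system, so here is a minimal sketch of it. It is a toy model, not Kafka code: `NUM_PARTITIONS`, `partition_for`, and `events_for` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever hash a real partitioner uses. The assumption is the common setup where records with the same key are routed to the same partition.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for this sketch


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition from the key. crc32 is a stand-in, not Kafka's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is its own ordered list; there is no single global list.
partitions = defaultdict(list)

# Events in the order the producer sent them.
events = [
    ("alice", "created"),
    ("bob", "created"),
    ("alice", "updated"),
    ("bob", "deleted"),
    ("alice", "deleted"),
]

for key, action in events:
    partitions[partition_for(key)].append((key, action))


def events_for(key):
    """Return one key's actions in the order its partition stored them."""
    return [action for k, action in partitions[partition_for(key)] if k == key]


alice_history = events_for("alice")
bob_history = events_for("bob")
```

Each key's history comes back in the order it was sent, because all of that key's records live in one partition. What is lost is the relative order between keys that land in different partitions: nothing in the partitions records whether a given "bob" event happened before or after a given "alice" event. If your use case needs that, you either use a single partition or carry the ordering information in the records themselves.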
With these pros and cons in mind, you can make a choice between Kafka and your small data pub/sub of choice. If the pros are really compelling and outweigh the cons, I suggest you start looking at Kafka. If the cons outweigh the pros, you’re probably better off with your small data pub/sub.
