Apache Kafka has grown a lot in functionality & reach in the last couple of years. It’s used in production by one third of the Fortune 500, including 7 of the top 10 global banks, 8 of the top 10 insurance companies, and 9 of the top 10 U.S. telecom companies [source].
This article gives you a quick tour of the core functionality offered by Kafka. I present a lot of examples to help you understand common usage patterns. Hopefully you’ll find some correlation with your own workflows & start leveraging the power of Kafka. Let’s start by looking at the 2 core functionalities offered by Kafka.
1. Kafka as a Messaging System
Messaging is widely used in 2 ways:
- Queuing (SQS, Celery etc.): Queue consumers act as a worker group. Each message goes to only one of the worker processes, effectively dividing the work.
- Publish-Subscribe (SNS, PubNub etc.): Subscribers are typically independent of each other. Each subscriber gets a copy of each message. Acts like a notification system.
Both of these are useful paradigms. Queuing divides up the work, and is great for fault tolerance and scale. Publish-Subscribe allows multiple subscribers, which lets you decouple your systems. The beauty of Kafka is that it combines both paradigms, queuing & publish-subscribe, into a single robust messaging system. I highly recommend reading the documentation, which explains the underlying design & how this combination is achieved with the help of topics, partitions, and consumer groups. To be fair, this functionality can also be achieved with RabbitMQ or an SNS-SQS combination.
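To make the combination concrete, here is a minimal plain-Python sketch of the delivery rule: every consumer group sees every message (publish-subscribe), while inside a group each message lands on exactly one member (queuing). It does not use any Kafka client. The group names, the round-robin assignment, and the toy `partition_for` hash are assumptions made for illustration, not how a real broker or client picks partitions.

```python
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Toy deterministic partitioner (hypothetical); real clients hash differently.
    return sum(key.encode("utf-8")) % num_partitions


def assign_partitions(members, num_partitions):
    # Round-robin: each partition is owned by exactly one member of a group.
    return {p: members[p % len(members)] for p in range(num_partitions)}


def deliver(messages, groups, num_partitions):
    # groups maps a group name to its list of member names.
    received = defaultdict(list)
    for group, members in groups.items():
        owners = assign_partitions(members, num_partitions)
        for key, value in messages:
            member = owners[partition_for(key, num_partitions)]
            received[(group, member)].append(value)
    return received


messages = [
    ("order-1", "created"),
    ("order-2", "paid"),
    ("order-3", "shipped"),
    ("order-1", "refunded"),
]
groups = {"workers": ["worker-a", "worker-b"], "audit": ["audit-a"]}
received = deliver(messages, groups, num_partitions=4)

for (group, member), values in sorted(received.items()):
    print(group, member, values)
```

The single-member "audit" group receives all four messages, while the two "workers" split the same four between them. Messages with the same key always go to the same member, which is what preserves per-key ordering.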
2. Kafka for Stream Processing
Once you have a robust, scalable messaging system, all you need is an easy way to process the stream of messages. The Streams API provides just that. It’s a Java client library (now Scala too) that provides a higher-level abstraction than the producer & consumer APIs.
It makes it easy to perform:
- stateless operations such as filtering & transforming stream messages
- stateful operations such as join & aggregation over a time window
The Streams API handles serializing/deserializing of messages & maintains the state required for stateful operations.
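The difference between the two kinds of operations can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Streams API: the event shape and the function names (`only_views`, `to_user_ts`, `count_per_window`) are made up for this example. The filter and the transform look at one message at a time, while the windowed count has to remember something between messages.

```python
from collections import Counter

# Hypothetical page-activity events; "ts" is seconds since some start time.
events = [
    {"ts": 3, "user": "a", "action": "view"},
    {"ts": 12, "user": "b", "action": "search"},
    {"ts": 14, "user": "a", "action": "view"},
    {"ts": 61, "user": "a", "action": "view"},
    {"ts": 75, "user": "b", "action": "view"},
]


def only_views(stream):
    # Stateless filter: each event is judged on its own.
    return (e for e in stream if e["action"] == "view")


def to_user_ts(stream):
    # Stateless transform: reshape each event into a (user, ts) pair.
    return ((e["user"], e["ts"]) for e in stream)


def count_per_window(stream, window_seconds):
    # Stateful aggregation: keeps a counter keyed by (window_start, user).
    counts = Counter()
    for user, ts in stream:
        window_start = ts - ts % window_seconds
        counts[(window_start, user)] += 1
    return dict(counts)


result = count_per_window(to_user_ts(only_views(events)), window_seconds=60)
print(result)
```

In a real deployment the counter would not live in a local variable. Keeping that state safe across restarts is the part the library takes off your hands.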
Show me some code
Here is a Streams API example which reads plain text from the input stream, counts occurrences of each word, and writes the counts to an output stream. See the full version here. With windowing it’s easy to aggregate over a time range and keep track of things like the top-N words that day (not demonstrated here).
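If you want to see the word-count logic without any Kafka dependency, a rough plain-Python equivalent looks like this. To be clear, this is not the Streams API and not the Java version referred to above. The `word_count` function and the list of `updates` are stand-ins I made up to mimic an input stream and an output stream of running counts.

```python
from collections import Counter


def word_count(lines):
    # counts plays the role of the state; updates mimics the output stream.
    counts = Counter()
    updates = []
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            # Emit the latest count each time a word is seen.
            updates.append((word, counts[word]))
    return counts, updates


counts, updates = word_count(["hello kafka", "Hello streams"])
for word, count in updates:
    print(word, count)
```

Notice that the output is a stream of updates rather than one final table: "hello" is emitted first with a count of 1 and later with a count of 2.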
Typical use cases of Kafka (examples)
- Imagine you run a travel website. Prices of hotels and flights keep changing all the time. A few components of your system (price alerts, analytics) need to be informed of these changes. You post the changes on Kafka topics and each component that needs to be notified acts as a subscriber. All nodes of a single subscriber system form a single consumer group. A given message is sent to only one node in the consumer group. This way each component gets a copy of the message, plus the work gets effectively divided inside each component.
- Website activity (page views, searches, or other actions users may take) can be tracked and analysed through Kafka. In fact, this was the original use case for which Kafka was invented at LinkedIn. Website activities are published to central topics with one topic per activity type. The feed can be processed in real time to gain insights into user engagement, drop-offs, page flows etc.
- Imagine you have location data coming in from GPS beacons or smartphone devices and you want to process it in real time to show vehicle path, distance travelled etc. Incoming data can be published on Kafka topics & processed with the Streams API. Stateful processing with windowing comes in handy when you need to extract & process all location data of a given user for a certain period of time.
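The location example above boils down to a per-user, per-window aggregation. Here is a small plain-Python sketch of that idea. It assumes flat x/y coordinates in metres rather than real GPS latitude/longitude, and the `distance_per_window` function and the sample points are invented for illustration. It also needs Python 3.8+ for `math.dist`.

```python
import math
from collections import defaultdict


def distance_per_window(points, window_seconds):
    # points: (user, ts, x, y) tuples, ordered by ts for each user.
    last_seen = {}                 # state: last position per (user, window)
    totals = defaultdict(float)    # state: distance per (user, window)
    for user, ts, x, y in points:
        key = (user, ts - ts % window_seconds)
        if key in last_seen:
            totals[key] += math.dist(last_seen[key], (x, y))
        else:
            totals[key] += 0.0     # first point in a window adds no distance
        last_seen[key] = (x, y)
    return dict(totals)


points = [
    ("car-1", 0, 0.0, 0.0),
    ("car-1", 10, 3.0, 4.0),
    ("car-2", 20, 0.0, 0.0),
    ("car-1", 30, 3.0, 10.0),
    ("car-1", 65, 3.0, 20.0),
    ("car-2", 70, 6.0, 8.0),
]
result = distance_per_window(points, window_seconds=60)
print(result)
```

One simplification to be aware of: the hop that crosses a window boundary is not counted in either window here. Whether that matters depends on what you want the numbers to mean.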
When not to use Kafka
- If you can’t/don’t want to move to Java/Scala for services talking to the Kafka cluster, then you are going to miss out on all the higher-level abstractions provided by Kafka Streams. The Streams API is essentially a client library talking to the Kafka cluster. Confluent, the company behind Kafka, is focused on Java at the moment. A popular language like Python has had an open issue for streaming support for over 1.5 years now.
- If all you need is a task queue, consider RabbitMQ instead. With Kafka, each partition can be consumed by only a single consumer within a consumer group, and you have to decide the partition while putting the task on the queue. So it’s possible that a flood of tasks on a given partition causes starvation, and you can’t do anything about it as adding consumers doesn’t help.
- If you are only processing a few thousand messages each day then Kafka is probably overkill. Kafka is really built for handling large-scale stream processing, so setting up & maintaining it is not worth it if you don’t have/anticipate scale.
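The starvation concern in the task-queue point above is easy to see with a toy simulation. This is plain Python with a made-up hash-by-key partitioner (`partition_for`) and invented tenant names, so treat it as a sketch of the effect and not of how a real Kafka client assigns partitions.

```python
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Toy deterministic partitioner (hypothetical); same key, same partition.
    return sum(key.encode("utf-8")) % num_partitions


# One noisy tenant submits 90 tasks; ten small tenants submit one each.
tasks = ["tenant-big"] * 90 + [f"tenant-{i}" for i in range(10)]

load = Counter(partition_for(key, 4) for key in tasks)
hot_partition, hot_count = load.most_common(1)[0]

print("tasks per partition:", dict(load))
print("hottest partition:", hot_partition, "with", hot_count, "tasks")
```

Since only one consumer in the group reads the hot partition, adding more consumers leaves those 90 tasks waiting behind each other, and any small tenant that hashes to the same partition waits with them.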
That’s all folks. This covers the important things you need to know about Apache Kafka. If you enjoyed reading it, follow my blog & let me know if you would like to see an overview of any other tool.