March 14, 2017
Apache Kafka is an open-source, distributed streaming platform developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. In other words, it is a scalable, fault-tolerant, publish-subscribe messaging system that enables distributed applications and powers web-scale Internet companies such as LinkedIn, Twitter and Airbnb. Kafka has gained strong momentum among unicorns as well as traditional enterprises. As a Java web development company, we often work on projects that require real-time distributed streaming, and Kafka excels at this.
Kafka is used largely in the big data realm as a dependable means to ingest and move massive amounts of data very quickly. One example is Netflix, which moved from writing its own ingestion framework to using Kafka as the primary backbone for ingestion via Java or REST APIs, in order to meet the demand for real-time (sub-minute) analytics.
Kafka is often used instead of traditional message brokers built on JMS or AMQP because of its higher throughput, reliability and replication.
How does it work?
Kafka operates as a publish-subscribe system that delivers sequential, persistent and scalable messaging, and it allows you to process streams of records in real time. Its design is akin to that of a distributed commit log, where incoming data is written sequentially to disk. The log is a time-ordered, append-only sequence of data inserts, and the data can be anything. Four main components are involved in moving data in and out of Kafka:
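The commit-log idea above can be illustrated with a few lines of Java. This is a minimal, in-memory sketch of an append-only log (the `CommitLog` class name and its methods are illustrative, not Kafka's actual API): records are appended sequentially, never modified, and each is addressed by its offset, just as records in a Kafka partition are.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of an append-only commit log, the core idea
// behind a Kafka partition. Entries are written sequentially and
// addressed by offset; existing entries are never changed.
class CommitLog {
    private final List<String> entries = new ArrayList<>();

    // Append a record and return its offset (its position in the log).
    long append(String record) {
        entries.add(record);
        return entries.size() - 1;
    }

    // Read every record from the given offset onward, the way a
    // consumer replays a partition from its committed position.
    List<String> readFrom(long offset) {
        return entries.subList((int) offset, entries.size());
    }
}
```

Because readers only track an offset into an immutable sequence, many consumers can read the same log independently at their own pace, which is what makes the structure naturally multi-subscriber.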
A topic is a category or feed name to which records are published; topics in Kafka are always multi-subscriber. Producers publish data to one or more topics of their choice, and the producer also chooses which partition within the topic to assign each record to. Consumers subscribe to topics and process the published messages. Lastly, a Kafka cluster consists of one or more servers called brokers, which manage the persistence and replication of message data.
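To make the producer's partition choice concrete, here is a small sketch of key-based partitioning: hash the record key and take it modulo the partition count, so all records sharing a key land in the same partition. The `KeyPartitioner` class is a simplified illustration (Kafka's default partitioner uses a murmur2 hash rather than `String.hashCode()`), not the real client API.

```java
// Illustrative sketch of keyed partition assignment: records with the
// same key always map to the same partition, preserving per-key order.
class KeyPartitioner {
    private final int numPartitions;

    KeyPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    int partitionFor(String key) {
        // Mask off the sign bit so negative hash codes still yield a
        // valid partition index in [0, numPartitions).
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

This stable key-to-partition mapping is why Kafka can guarantee ordering per key while still spreading load across partitions.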
Several factors account for Kafka's widespread adoption and the steady rise in its popularity.
Kafka and Big data services
Owing to its performance capabilities and scalability, Kafka has seen a meteoric rise in adoption in the big data space. It is considered the most reliable way to ingest and move large amounts of information as seen in the Netflix example earlier. LinkedIn, where Kafka originated, has reported ingestion rates of 1 trillion messages a day.
Internet of Things (IoT)
While Kafka is extremely useful for big data ingestion, its "log" data structure also has interesting benefits for applications built around the Internet of Things, microservices, and cloud-native architectures.
Should you consider using Kafka? That really depends on your use case. If your solution can benefit from combining publish/subscribe messaging with queueing at scale, then Kafka is certainly worth considering. What are your views on this? Please share in the comments section.