Blog: What is Kafka?
Have you been trying to get a handle on Kafka, hearing all the buzz around it in the big data world but never quite getting a clear picture? Don’t worry, this post should make your life easier.
I was introduced to Kafka recently while working on a web service testing project in which several applications, databases and an ESB are being integrated with Kafka, and I was keen to understand what this technology is and what it does.
To understand Kafka, we first need to understand the message queuing (messaging system) paradigm.
What is a messaging system?
When dealing with large amounts of data, collecting, handling and analysing that data is a challenge. This is where a messaging system comes into play. It is responsible for transferring data from one application to another, so that the applications can focus on the data and not on the overhead of sharing it. Distributed messaging is based on the principle of reliable message queuing: messages are queued asynchronously between the client applications and the messaging system.
Two types of messaging patterns:
· Point to point
In this model, messages are sent by the producer and stored in a queue. Multiple consumers can read from the queue, but each message is consumed by only one of them.
· Publish-subscribe
In the publish-subscribe model, messages are published to a topic by the producer, and multiple consumers can subscribe to one or more topics to consume messages from them. In effect, each message is broadcast to every subscribed consumer.
For example, take Dish TV: the network operator (the producer) streams a large number of channels (the topics), and the user (the consumer) subscribes to, and pays for, the channels of their choice.
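To make the difference between the two patterns concrete, here is a minimal in-memory sketch in Python. It does not use Kafka or any real messaging library; the class names PointToPointQueue and PubSubTopic are made up purely for illustration.

```python
from collections import deque


class PointToPointQueue:
    """Hypothetical queue: each message is delivered to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Taking the message off the queue means no other consumer sees it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Hypothetical topic: each message is delivered to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Broadcast: every subscriber receives the message.
        for callback in self._subscribers:
            callback(message)


# Point to point: two consumers share one queue, each message is taken once.
queue = PointToPointQueue()
queue.send("order-1")
queue.send("order-2")
consumer_a_got = queue.receive()
consumer_b_got = queue.receive()

# Publish-subscribe: both viewers receive the same broadcast.
sports_channel = PubSubTopic()
viewer_1, viewer_2 = [], []
sports_channel.subscribe(viewer_1.append)
sports_channel.subscribe(viewer_2.append)
sports_channel.publish("match highlights")
```

In the queue, "order-1" goes to one consumer and "order-2" to the other, whereas on the topic both viewers end up with their own copy of "match highlights".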
What is Kafka?
Kafka is an open-source messaging system that is scalable, fast and fault tolerant. It handles large volumes of real-time data feeds with high throughput and low latency. It is written in Scala and Java. It was originally developed at LinkedIn to handle the large streams of activity data generated by its professional network, and in 2011 it was open-sourced and handed over to the Apache Software Foundation, which now oversees its development.
Let’s take a real-world example to understand it better. If you wanted to find out the “number of shoes sold in a month” or the “number of sales between 1pm and 2pm” in a shoe store, you would do it by analysing the sales data. A conventional database, which lets you store and sort information, would be enough for a single store, but Kafka comes into play when a chain of shoe stores is processing thousands of shoe sales every minute. This is achieved using a component known as a Producer, which is an interface between applications (e.g. the software monitoring the shoe stores’ structured but unsorted transaction database) and the topics, Kafka’s own store of ordered, segmented data, known as the Kafka Topic Log. Another interface, known as the Consumer, enables topic logs to be read and the information stored in them to be passed on to other applications that might need it, for example the shoe store’s system for renewing depleted stock or discarding out-of-date items.
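The shoe-store flow can be sketched in a few lines of Python. This is only an in-memory illustration under my own assumptions: TopicLog, Producer and Consumer here are hypothetical stand-ins and not the real Kafka client classes, and the sales figures are made up.

```python
class TopicLog:
    """Hypothetical stand-in for a Kafka topic: an append-only list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the newly written record

    def read_from(self, offset):
        return self._records[offset:]


class Producer:
    """Writes records from an application into the topic log."""

    def __init__(self, log):
        self._log = log

    def send(self, record):
        return self._log.append(record)


class Consumer:
    """Reads records from the topic log, remembering how far it has read."""

    def __init__(self, log):
        self._log = log
        self._next_offset = 0  # position of the next unread record

    def poll(self):
        records = self._log.read_from(self._next_offset)
        self._next_offset += len(records)
        return records


sales = TopicLog()
producer = Producer(sales)
# Made-up sample data: store, hour of day (24h clock), pairs of shoes sold.
producer.send({"store": "A", "hour": 13, "pairs": 2})
producer.send({"store": "B", "hour": 13, "pairs": 1})
producer.send({"store": "A", "hour": 15, "pairs": 4})

stock_system = Consumer(sales)
batch = stock_system.poll()
# "Number of sales between 1pm and 2pm" across the whole chain.
sold_1pm_to_2pm = sum(r["pairs"] for r in batch if r["hour"] == 13)
```

The point to notice is that the producer and the consumer never talk to each other directly; they only share the log, and the consumer keeps track of its own read position.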
As a streaming platform, Kafka has three key capabilities:
· Publish and subscribe to streams of records
· Store streams of records in a fault-tolerant way
· Process streams of records as they occur, to speed up analysis
Key benefits of Kafka:
- Reliability. Kafka is distributed, partitioned and replicated, so it keeps working when individual brokers fail. Thanks to its distributed nature and the streamlined way it manages incoming data, it also operates very quickly: large clusters can monitor and react to millions of changes to a dataset every second, which makes it possible to work with, and react to, streaming data in real time.
- Scalability. Kafka is a distributed system that can be scaled out quickly and easily, without incurring downtime.
- Durability. Kafka uses a distributed commit log: messages are persisted to disk as quickly as possible and replicated within the cluster, which makes them durable.
- Performance. Kafka has high throughput for both publishing and subscribing messages. It maintains stable performance even when dealing with many terabytes of stored messages.
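To get a feel for what "commit log" means in the durability point above, here is a toy append-only log backed by a file. It is only a sketch under my own assumptions: the CommitLog class and the one-JSON-record-per-line format are made up, and Kafka's real storage layer (segment files, replication across brokers) is far more elaborate.

```python
import json
import os
import tempfile


class CommitLog:
    """Hypothetical append-only log: one JSON record per line in a file."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Records are only ever added at the end, never changed in place.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the write to disk

    def replay(self):
        # Read every record back, in the order it was written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


log_path = os.path.join(tempfile.mkdtemp(), "sales.log")
CommitLog(log_path).append({"store": "A", "pairs": 2})
CommitLog(log_path).append({"store": "B", "pairs": 1})

# A fresh object simulates a restart: the records are still on disk.
recovered = CommitLog(log_path).replay()
```

Because the records live in the file and not in the object, a "restarted" CommitLog still sees everything that was written before, in the original order.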
Architecture of Kafka:
Kafka consists of Records, Topics, Consumers, Producers, Brokers, Logs, Partitions, and Clusters. Each record has a key (optional), a value and a timestamp. Kafka records are immutable.
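As a rough picture of what such a record looks like, here is a small Python sketch. The Record class and its field names are my own illustration, not the actual class from the Kafka client.

```python
import time
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)  # frozen=True makes instances immutable
class Record:
    value: bytes
    key: Optional[bytes] = None  # the key is optional
    timestamp_ms: int = 0


record = Record(
    value=b"size 9, black",
    key=b"store-A",
    timestamp_ms=int(time.time() * 1000),
)

try:
    record.value = b"changed"  # attempts to modify the record fail
    modified = True
except FrozenInstanceError:
    modified = False
```

Once a record has been created, it can be read as often as needed, but any attempt to change it raises an error.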
Topics: You can think of a topic as a feed name or category to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, the Kafka cluster maintains a partitioned log, in which each partition is an ordered, append-only sequence of records.
Partition: A topic log is broken up into partitions, and each partition into segments, so that it can handle large amounts of data.
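One common way to decide which partition a record lands in is to hash its key, so that records with the same key always end up in the same partition and keep their order there. The sketch below illustrates that idea only: it uses CRC32 as a stand-in hash (Kafka's own partitioner uses a different hash function), and the function name choose_partition is made up.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # CRC32 is a stand-in here; it is deterministic, unlike Python's hash().
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # one log per partition

# Made-up sample records: (key, value).
incoming = [(b"store-A", "sale 1"), (b"store-B", "sale 2"), (b"store-A", "sale 3")]
for key, value in incoming:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

# All records for store-A sit in one partition, in the order they arrived.
store_a_partition = partitions[choose_partition(b"store-A", NUM_PARTITIONS)]
store_a_sales = [v for k, v in store_a_partition if k == b"store-A"]
```

Ordering is only kept within a single partition, which is why the choice of key matters when the order of related records is important.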
Broker: Brokers are the servers responsible for storing published data. Kafka brokers are largely stateless, so they rely on ZooKeeper to maintain the cluster state. Each broker may hold zero or more partitions per topic.
For example, if a topic has 10 partitions and there are 10 brokers, each broker will hold one partition. If there are 10 partitions and 15 brokers, the first 10 brokers will hold one partition each and the remaining five will not hold any partition for that topic. However, if there are 15 partitions but only 10 brokers, some brokers will have to hold more than one partition, leading to unequal load distribution among the brokers.
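The arithmetic of the three cases above can be checked with a short sketch. It assumes a plain round-robin assignment of partitions to brokers, which is a simplification of mine (real Kafka also has to place replicas); the function names are hypothetical.

```python
from collections import Counter


def assign_partitions(num_partitions: int, num_brokers: int) -> dict:
    """Map each partition number to a broker number, round-robin."""
    return {p: p % num_brokers for p in range(num_partitions)}


def partitions_per_broker(num_partitions: int, num_brokers: int) -> list:
    """Return how many partitions each broker holds, indexed by broker."""
    counts = Counter(assign_partitions(num_partitions, num_brokers).values())
    return [counts.get(b, 0) for b in range(num_brokers)]


equal = partitions_per_broker(10, 10)   # one partition per broker
spare = partitions_per_broker(10, 15)   # the last five brokers stay empty
uneven = partitions_per_broker(15, 10)  # five brokers hold two partitions
```

With 15 partitions on 10 brokers, five brokers carry twice the load of the other five, which is the unequal distribution described above.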
Clusters: A Kafka cluster consists of many Kafka brokers running on many servers. The cluster manages the persistence and replication of message data.
ZooKeeper: ZooKeeper is used to monitor the brokers continuously and to coordinate the Kafka cluster. It notifies producers and consumers when a broker is added or removed, or when a broker becomes inactive or fails, so that they can coordinate their tasks accordingly. Newer Kafka releases can also run without ZooKeeper, in what is called KRaft mode.
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
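To illustrate the idea behind the Streams API (consume from an input topic, transform, produce to an output topic), here is a tiny in-memory sketch. It does not use the real Kafka Streams library, which is a Java library; plain Python lists stand in for topics, and run_stream_processor is a made-up name.

```python
from collections import defaultdict

# Input topic with made-up sample data: (store, pairs of shoes sold).
input_topic = [("store-A", 2), ("store-B", 1), ("store-A", 4)]
output_topic = []


def run_stream_processor(source, sink):
    """Turn a stream of sales into a stream of running totals per store."""
    totals = defaultdict(int)
    for store, pairs in source:              # consume from the input topic
        totals[store] += pairs               # transform: update running total
        sink.append((store, totals[store]))  # produce to the output topic


run_stream_processor(input_topic, output_topic)
```

Every input record yields one output record, so the output topic is itself a stream that another consumer could subscribe to.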
So that is pretty much the theory of Kafka that I have learnt so far. I would love to share more as my learning continues.
Till then, Happy learning!
#ArtificialIntelligence #BigData #MachineLearning #Kafka #MessageQueue