Apache Kafka is an open-source stream-processing software platform which is used to handle the real-time data storage. It works as a broker between two parties, i.e., a sender and a receiver. It can handle about trillions of data events in a day. Lets learn all in Kafka Tutorial post.
Apache Kafka was originated at LinkedIn and later became an open sourced Apache project in 2011, then First-class Apache project in 2012. Kafka is written in Scala and Java. Apache Kafka is publish-subscribe based fault tolerant messaging system. It is fast, scalable and distributed by design.
Kafka is designed for distributed high throughput systems. Kafka tends to work very well as a replacement for a more traditional message broker. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication and inherent fault-tolerance, which makes it a good fit for large-scale message processing applications.
What is Apache Kafka?
Apache Kafka is a software platform which is based on a distributed streaming process. It is a publish-subscribe messaging system which let exchanging of data between applications, servers, and processors as well. Apache Kafka was originally developed by LinkedIn, and later it was donated to the Apache Software Foundation. Currently, it is maintained by Confluent under Apache Software Foundation. Apache Kafka has resolved the lethargic trouble of data communication between a sender and a receiver.
We use Apache Kafka when it comes to enabling communication between producers and consumers using message-based topics. Apache Kafka is a fast, scalable, fault-tolerant, publish-subscribe messaging system. Basically, it designs a platform for high-end new generation distributed applications. Also, it allows a large number of permanent or ad-hoc consumers. One of the best features of Kafka is, it is highly available and resilient to node failures and supports automatic recovery. This feature makes Apache Kafka ideal for communication and integration between components of large-scale data systems in real-world data systems.
Kafka technology replaces the conventional message brokers, with the ability to give higher throughput, reliability, and replication like JMS, AMQP and many more. In addition, core abstraction Kafka offers a Kafka broker, a Kafka Producer, and a Kafka Consumer. Kafka broker is a node on the Kafka cluster, its use is to persist and replicate the data. A Kafka Producer pushes the message into the message container called the Kafka Topic. Whereas a Kafka Consumer pulls the message from the Kafka Topic.
Messaging Systems in Kafka:
The main task of managing system is to transfer data from one application to another so that the applications can mainly work on data without worrying about sharing it.
Distributed messaging is based on the reliable message queuing process. Messages are queued non-synchronously between the messaging system and client applications.
There are two types of messaging patterns available:
- Point to point messaging system
- Publish-subscribe messaging system
1. Point to Point Messaging System
Here, messages are persisted in a queue. Although, a particular message can be consumed by a maximum of one consumer only, even if one or more consumers can consume the messages in the queue. Also, it makes sure that as soon as a consumer reads a message in the queue, it disappears from that queue.
2. Publish-Subscribe Messaging System
Here, messages are persisted in a topic. In this system, Kafka Consumers can subscribe to one or more topic and consume all the messages in that topic. Moreover, message producers refer publishers and message consumers are subscribers here.
History of Apache Kafka:
Previously, LinkedIn was facing the issue of low latency ingestion of huge amount of data from the website into a lambda architecture which could be able to process real-time events.
As a solution, Apache Kafka was developed in the year 2010, since none of the solutions was available to deal with this drawback, before.
However, there were technologies available for batch processing, but the deployment details of those technologies were shared with the downstream users. Hence, while it comes to Real-time Processing, those technologies were not enough suitable. Then, in the year 2011 Kafka was made public.
Apache Kafka Components:
Using the following components, Kafka achieves messaging:
- Kafka Topic
- Kafka Producer
- Kafka Consumer
- Kafka Broker
- Kafka Zookeeper
- Log Anatomy in Kafka
- Data log in Kafka
- Partition in Kafka
1. Kafka Topic
A bunch of messages that belong to a particular category is known as a Topic. Data stores in topics. In addition, we can replicate and partition Topics. Here, replicate refers to copies and partition refers to the division. Also, visualize them as logs wherein, Kafka stores messages. However, this ability to replicate and partitioning topics is one of the factors that enable Kafka’s fault tolerance and scalability.
2. Kafka Producer
It publishes messages to a Kafka topic.
3. Kafka Consumer
Consumers take one or more topics and consume messages that are already published through extracting data from the brokers.
4. Kafka Broker
These are basically systems which maintain the published data. A single broker can have zero or more partitions per topic.
5. Kafka Zookeeper
With the help of zookeeper, Kafka provides the brokers with metadata regarding the processes running in the system and grants health checking and broker leadership election.
6. Log Anatomy in Kafka:
We view log as the partitions. Basically, a data source writes messages to the log. One of the advantages is, at any time one or more consumers can read from the log they select. Here, the below diagram shows a log is being written by the data source and read by consumers at different offsets.
7. Data Log in Kafka:
By Kafka, messages are retained for a considerable amount of time. Also, consumers can read as per their convenience. However, if Kafka is configured to keep messages for 24 hours and a consumer is down for time greater than 24 hours, the consumer will lose messages. And, messages can be read from last known offset, if the downtime on part of the consumer is just 60 minutes. Kafka doesn’t keep state on what consumers are reading from a topic.
8. Partition in Kafka
There are few partitions in every Kafka broker. Each partition can be either a leader or a replica of a topic. In addition, along with updating of replicas with new data, Leader is responsible for all writes and reads to a topic. The replica takes over as the new leader if somehow the leader fails.
Apache Kafka Architecture:
We have already learned the basic concepts of Apache Kafka. These basic concepts, such as Topics, partitions, producers, consumers, etc., together forms the Kafka architecture.
Below are the 4 important API’s of Apache Kafka:
1) Producer API:
This API allows/permits an application to publish streams of records to one or more topics. (discussed in later section)
2) Consumer API:
This API allows an application to subscribe one or more topics and process the stream of records produced to them.
3) Streams API:
This API allows an application to effectively transform the input streams to the output streams. It permits an application to act as a stream processor which consumes an input stream from one or more topics, and produce an output stream to one or more output topics.
4) Connector API:
This API executes the reusable producer and consumer APIs with the existing data systems or applications.
Following essential parts required to design Apache Kafka architecture.
Several applications that use Apache Kafka forms an ecosystem. This ecosystem is built for data processing. It takes inputs in the form of applications that create data, and outputs are defined in the form of metrics, reports, etc. The below diagram represents a circulatory data ecosystem for Kafka.
A Kafka cluster is a system that comprises of different brokers, topics, and their respective partitions. Data is written to the topic within the cluster and read by the cluster itself.
Apache Kafka Components like Kafka Topic, Kafka Producer, Kafka Consumer, Kafka Broker, Kafka Zookeeper ,Log Anatomy in Kafka ,Data log in Kafka and Partition in Kafka
Advantages of Apache Kafka:
Here, we are listing out some of the advantages of Kafka. Basically, these Kafka advantages are making Kafka ideal for our data lake implementation.
Without having not so large hardware, Kafka is capable of handling high-velocity and high-volume data. Also, able to support message throughput of thousands of messages per second.
2. Low Latency
It is capable of handling these messages with the very low latency of the range of milliseconds, demanded by most of the new use cases.
One of the best advantages is Fault Tolerance. There is an inherent capability in Kafka, to be resistant to node/machine failure within a cluster.
Durability refers to the persistence of data/messages on disk. Also, messages replication is one of the reasons behind durability, hence messages are never lost.
Without incurring any downtime on the fly by adding additional nodes, Kafka can be scaled-out. Moreover, inside the Kafka cluster, the message handling is fully transparent and these are seamless.
The distributed architecture of Kafka makes it scalable using capabilities like replication and partitioning.
7. Message Broker Capabilities
Kafka tends to work very well as a replacement for a more traditional message broker. Here, a message broker refers to an intermediary program, which translates messages from the formal messaging protocol of the publisher to the formal messaging protocol of the receiver.
8. High Concurrency
Kafka is able to handle thousands of messages per second and that too in low latency conditions with high throughput. In addition, it permits the reading and writing of messages into it at high concurrency.
9. By Default Persistent
As we discussed above that the messages are persistent, that makes it durable and reliable.
10. Consumer Friendly
It is possible to integrate with the variety of consumers using Kafka. The best part of Kafka is, it can behave or act differently according to the consumer, that it integrates with because each customer has a different ability to handle these messages, coming out of Kafka. Moreover, Kafka can integrate well with a variety of consumers written in a variety of languages.
11. Batch Handling Capable (ETL like functionality)
Kafka could also be employed for batch-like use cases and can also do the work of a traditional ETL, due to its capability of persists messages.
12. Variety of Use Cases
It is able to manage the variety of use cases commonly required for a Data Lake. For Example log aggregation, web activity tracking, and so on.
13. Real-Time Handling
Kafka can handle real-time data pipeline. Since we need to find a technology piece to handle real-time messages from applications, it is one of the core reasons for Kafka as our choice.
Disadvantages of Apache Kafka:
It is good to know Kafka’s limitations even if its advantages appear more prominent then its disadvantages. However, consider it only when advantages are too compelling to omit.
Here is one more condition that some disadvantages might be more relevant for a particular use case but not really linked to ours. So, here we are listing out some of the disadvantage associated with Kafka:
1. No Complete Set of Monitoring Tools
It is seen that it lacks a full set of management and monitoring tools. Hence, enterprise support staff felt anxious or fearful about choosing Kafka and supporting it in the long run.
2. Issues with Message Tweaking
As we know, the broker uses certain system calls to deliver messages to the consumer. However, Kafka’s performance reduces significantly if the message needs some tweaking. So, it can perform quite well if the message is unchanged because it uses the capabilities of the system.
3. Not support wildcard topic selection
There is an issue that Kafka only matches the exact topic name, that means it does not support wildcard topic selection. Because that makes it incapable of addressing certain use cases.
4. Lack of Pace
There can be a problem because of the lack of pace, while API’s which are needed by other languages are maintained by different individuals and corporates.
5. Reduces Performance
In general, there are no issues with the individual message size. However, the brokers and consumers start compressing these messages as the size increases. Due to this, when decompressed, the node memory gets slowly used. Also, compress happens when the data flow in the pipeline. It affects throughput and also performance.
6. Behaves Clumsy
Sometimes, it starts behaving a bit clumsy and slow, when the number of queues in a Kafka cluster increases.
7. Lacks some Messaging Paradigms
Some of the messaging paradigms are missing in Kafka like request/reply, point-to-point queues and so on. Not always but for certain use cases, it sounds problematic.
Kafka Use Cases
There are several use cases of Kafka that show why we actually use Apache Kafka.
Kafka is the best substitute for traditional message brokers. We can say Kafka has better throughput, replication, fault-tolerance and built-in partitioning which makes it a better solution for large-scale message processing applications.
Kafka is mostly utilized for operational monitoring data. It includes aggregating statistics from distributed applications to generate centralized feeds of operational data.
Kafka is a great backend for applications of event sourcing since it supports very large stored log data.
In this Apache Kafka Tutorial, we have seen the whole concept of Apache Kafka and seen what is Kafka. Moreover, we discussed Kafka components, use cases, and Kafka architecture. In next blog, we will see Kafka Alternatives and it’s comparisons.
I hope you all understood the basic concepts of Kafka Tutorial. For more details please refer below related post.
Kafka Tutorial !! Happy Learning !!