Kafka provides an abstraction for streams of records called Topics. A topic is a category to which input records or messages are published. Topics in Kafka are multi-subscriber: a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered sequence of records. Records are appended to the tail of the commit log. Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies that record within the partition.
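The append-and-offset behavior can be sketched as a toy model. This is not Kafka's implementation or API; `Partition`, `append`, and `read` are illustrative names only:

```python
class Partition:
    """Toy model of one Kafka partition as an append-only log."""

    def __init__(self):
        self.records = []  # the commit log, ordered by offset

    def append(self, record):
        """Append a record to the tail; its offset is its position in the log."""
        offset = len(self.records)
        self.records.append(record)
        return offset

    def read(self, offset):
        """Fetch the record stored at a given offset."""
        return self.records[offset]


p = Partition()
assert p.append("a") == 0  # first record gets offset 0
assert p.append("b") == 1  # offsets are sequential within a partition
assert p.read(0) == "a"
```

Note that offsets are only meaningful within a single partition; the same offset in two different partitions refers to two unrelated records.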
The Kafka cluster retains all published records for a predefined retention period. The retention duration can be configured at the topic level, regardless of whether the records have been consumed.
Assume the retention period is set to 30 days. A record is then available for 30 days after being published. After that, the record is discarded to free up space. Kafka is designed to store data for long periods without negatively affecting performance.
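A minimal sketch of time-based retention, again as a toy model rather than Kafka's actual log-cleanup machinery. Records older than the retention period are dropped whether or not any consumer has read them:

```python
class Partition:
    """Toy partition that discards records older than a retention period."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.records = []  # list of (publish_timestamp, value)

    def append(self, value, now):
        self.records.append((now, value))

    def purge(self, now):
        # Drop records published before (now - retention), consumed or not.
        cutoff = now - self.retention
        self.records = [(t, v) for (t, v) in self.records if t >= cutoff]


retention = 30 * 24 * 3600  # 30 days, in seconds
p = Partition(retention)
p.append("old", now=0)
p.append("new", now=retention)  # published 30 days after "old"
p.purge(now=retention + 1)      # "old" has now outlived its retention
assert [v for _, v in p.records] == ["new"]
```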
When multiple consumers read from a log partition, each consumer only needs to keep track of its current offset. The offset is controlled by the consumer: normally it advances linearly as the consumer reads records, but because the consumer controls the offset, it is free to consume records in any order. A consumer can reset to an earlier offset to re-read records, or skip ahead past records it does not need.
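The consumer-controlled offset can be sketched like this. `poll` and `seek` here are toy stand-ins, though Kafka's real consumer API exposes similarly named operations:

```python
log = ["r0", "r1", "r2", "r3"]  # one partition's records, indexed by offset


class Consumer:
    """Each consumer tracks only its own offset into the partition."""

    def __init__(self):
        self.offset = 0

    def poll(self, log):
        record = log[self.offset]
        self.offset += 1  # normally advances linearly after each read
        return record

    def seek(self, offset):
        self.offset = offset  # rewind to reprocess, or jump ahead


c = Consumer()
assert c.poll(log) == "r0"
assert c.poll(log) == "r1"
c.seek(0)                  # rewind: re-read from the start
assert c.poll(log) == "r0"
c.seek(3)                  # skip ahead to the latest record
assert c.poll(log) == "r3"
```

Because each consumer holds its own offset, two consumers reading the same partition do not interfere with each other, which is why consumers are so cheap from the cluster's point of view.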
Kafka consumers are lightweight on system resources. They can come and go without much impact on the cluster or on other consumers.
The partitions serve several purposes:
- They allow the log to scale beyond the size limit of any single server. Each individual partition must fit on one server, but a topic may span many partitions, so it can store an arbitrarily large amount of data.
- They define the granularity of parallelism: consumers can read different partitions of the same topic concurrently.
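The two purposes above both rest on how records are assigned to partitions. Kafka's default partitioner hashes the record key (using murmur2); the sketch below uses a deliberately simple toy hash to show the same idea, that records with the same key always land in the same partition:

```python
def choose_partition(key: str, num_partitions: int) -> int:
    """Toy key -> partition mapping (Kafka's default uses a murmur2 hash)."""
    h = sum(key.encode())  # simple deterministic stand-in hash
    return h % num_partitions


# Same key -> same partition, so per-key ordering is preserved while
# different partitions can be produced to and consumed in parallel.
assert choose_partition("user-42", 4) == choose_partition("user-42", 4)
assert all(0 <= choose_partition(k, 4) < 4 for k in ["a", "b", "user-7"])
```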
Topic Distribution:
- The partitions of the log are distributed over the servers in the Kafka cluster. Each server handles data and requests for a share of the partitions.
- Each partition is replicated across a configurable number of servers for fault tolerance.
- Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
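The leader/follower arrangement and automatic failover can be sketched as a toy model. The broker names and the promotion rule (first follower takes over) are illustrative assumptions; Kafka actually elects the new leader from the in-sync replica set:

```python
class PartitionReplicas:
    """Toy model of one partition's replica set with leader failover."""

    def __init__(self, servers):
        self.leader = servers[0]            # handles all reads and writes
        self.followers = list(servers[1:])  # passively replicate the leader

    def fail_leader(self):
        # On leader failure, promote a follower to be the new leader.
        self.leader = self.followers.pop(0)


r = PartitionReplicas(["broker-1", "broker-2", "broker-3"])
assert r.leader == "broker-1"
r.fail_leader()
assert r.leader == "broker-2"   # a follower took over automatically
assert r.followers == ["broker-3"]
```

Since each broker leads some partitions and follows for others, the read/write load spreads across the whole cluster rather than concentrating on a single "primary" server.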