Storage and Streaming System

Kafka as a Storage System

  • Publishing in Kafka is decoupled from consuming. Published messages stay in Kafka for as long as the configured retention allows, whether or not they have been consumed. This makes Kafka a good storage system for in-flight messages.
  • Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
  • The disk structures used in Kafka scale very well. Kafka performs the same whether you have 50 KB or 50 TB of persistent data on the server.
  • Kafka lets clients control their read position. Combined with durable, replicated storage, this makes it act like a special-purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
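
The idea of a log whose readers control their own position can be sketched in a few lines of plain Python. This is only a toy in-memory model, not the Kafka client API: the class name ToyLog and its append and read methods are made up for illustration, and it ignores disk persistence, replication, and retention entirely.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ToyLog:
    """Toy in-memory stand-in for one partition: an append-only list of records."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return the offset it was stored at."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records starting at offset; reading removes nothing."""
        return self.records[offset:offset + max_records]


log = ToyLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

# Each consumer tracks its own position; the log itself keeps no cursor.
fast_position = 0
slow_position = 0

fast_batch = log.read(fast_position)
fast_position += len(fast_batch)      # fast consumer is now at the end

slow_batch = log.read(slow_position, max_records=1)
slow_position += len(slow_batch)      # slow consumer has seen one record

# Rewinding is just choosing an earlier offset; the data is still there.
replayed = log.read(0)
```

Because reading never removes records, the slow consumer and the replay both see data the fast consumer has already processed. That is the property that lets the log double as storage.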

Kafka for Stream Processing

  • It isn’t enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
  • In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
  • For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
  • It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing, such as computing aggregations over streams or joining streams together.
  • This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
  • The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
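
To make the retail example concrete, here is a minimal sketch of a stateful stream processor in plain Python. It does not use the Kafka Streams API: the input is an ordinary iterable standing in for the input topics, the output is a generator standing in for the output topic, and the event shape, the reorder_processor name, and both constants are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical event shape: (kind, item, quantity), kind is "sale" or "shipment".
Event = Tuple[str, str, int]

REORDER_THRESHOLD = 5   # assumed: reorder when stock falls below this
REORDER_QUANTITY = 20   # assumed: fixed reorder size


def reorder_processor(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Consume sale/shipment events, keep running stock per item, yield reorders."""
    stock: Dict[str, int] = defaultdict(int)  # the processor's state
    pending: Set[str] = set()                 # items with a reorder outstanding
    for kind, item, qty in events:
        if kind == "shipment":
            stock[item] += qty
            pending.discard(item)             # the outstanding reorder has arrived
        elif kind == "sale":
            stock[item] -= qty
        else:
            continue                          # ignore unknown event kinds
        if stock[item] < REORDER_THRESHOLD and item not in pending:
            pending.add(item)
            yield (item, REORDER_QUANTITY)


events = [
    ("shipment", "widget", 10),
    ("sale", "widget", 4),
    ("sale", "widget", 3),      # stock drops to 3, so a reorder is emitted
    ("sale", "widget", 1),      # still low, but a reorder is already pending
    ("shipment", "widget", 20),
    ("sale", "gadget", 1),      # never stocked, so a reorder is emitted
]
reorders = list(reorder_processor(events))
```

The stock dictionary is the stateful part. In this sketch it lives only in memory and is lost on restart, which is exactly the gap the Streams API closes by keeping processor state in Kafka.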