Apache Kafka is a broker application that stores the message as a distributed commit log.
The entire data storage system is just a transaction log.
LinkedIn created Apache Kafka to be the data exchange backbone of its organization. Linkedin Motivation: A unified platform for handling all the real-time |data feeds a large company might have.
The data dichotomy:
Data Systems are about exposing data,
Services are about hiding it
Apache Kafka is a distributed system designed for streams.
The Kafka log can be sharded and spread over a cluster of machines, and each shard is replicated for fault-tolerance. (Allowing parallelism, ordered consumption)
The log is the first data structure and tables are derived views and stored in RocksDB. Updates to these tables can be modeled as streams
Kafka can serve as a buffer for data:
Event thinking:
leads to this architecture:
forward-compatible data architecture the ability to add more applications that need to process the same data … differently
See also Kafka on Google because …
Kafka only stores Bytes – So where’s the schema?
See talk: Gwen Shapira discusses patterns of schema design, schema storage and schema evolution that help development teams build better contracts through better collaboration - and deliver resilient applications faster. Streaming Microservices: Contracts & Compatibility
To reduce the load time issue it’s useful to keep a snapshot of the event log using a compacted topic. It represents the ‘latest’ set of events, without any of the ‘version history’.
hold events twice: once in a retention-based topic and once in a compacted topic. The retention-based topic will be larger as it holds the ‘version history’ of your data. The compacted topic is just the ‘latest’ view, and will be smaller, so it’s faster to load into a Memory Image or State Store.
load one million products into memory, at say 100B each. This would take around 100MB of RAM and would load from Kafka in around a second on GbE. These days memory is typically easier to scale than network bandwidth so ‘worst case load time’ is often the more limiting factor of the two.