What is Kafka?
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
- Message broker
- Offset-based, sequential reads and writes on an append-only log, which is a major reason Kafka is fast
- Written in Scala and Java
- Named after the author Franz Kafka
- Originally developed at LinkedIn
- Development was led by Jay Kreps
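The offset-based read/write idea can be illustrated with a toy in-memory log. This is a minimal sketch, not Kafka's storage code: `PartitionLog`, `append`, and `read` are hypothetical names, and a Python list stands in for the segment files Kafka keeps on disk.

```python
class PartitionLog:
    """Toy append-only log: a record's offset is its position in the list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records, starting at the given offset."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
first = log.append("signup")
second = log.append("login")
third = log.append("purchase")

# A reader only needs to remember an offset to resume where it left off.
batch = log.read(offset=1)
```

Writes only ever go to the end and reads are a contiguous slice from a known position, so neither side has to search for a message.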
Features
- Fast ( high throughput and low latency )
- Scalable ( scales horizontally by adding brokers and partitions )
- Reliable ( fault tolerant and distributed )
- Durable ( messages are persisted to disk in an immutable log and replicated, which guards against data loss )
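Scaling with partitions depends on spreading records across them while keeping records with the same key together. The sketch below shows that idea under stated assumptions: `choose_partition` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash (Kafka's default partitioner uses a murmur2 hash of the key bytes).

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always maps to the same one."""
    # Stand-in hash for illustration only, not Kafka's own partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

for key, value in [("user-1", "a"), ("user-2", "b"), ("user-1", "c")]:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

# Both "user-1" records sit in one partition, in the order they were sent.
user1_partition = partitions[choose_partition("user-1", NUM_PARTITIONS)]
```

Ordering is only kept within a partition, which is why per-key placement matters. Changing the partition count changes the mapping.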
Use Cases
- Application analytics
- Monitoring/Metrics
- Log collecting
- Stream processing
- Recommendation engine
- Fraud and anomaly detection
- Integrate systems
Companies Using Kafka
- Uber
- Netflix
- Spotify
- Activision
- Slack
- Shopify
Concepts
- Producer
- Producer acknowledgment
- acks = 0: fastest but riskiest, with the highest chance of message loss. The producer sends the message to Kafka and continues without waiting for a response.
- acks = 1: a balance of speed and safety, with a small chance of message loss. The producer waits until the leader has written the message, but not until the followers have replicated it.
- acks = all ( or -1 ): slowest but safest, message loss is very unlikely. The producer waits until the leader and all in-sync followers have the message.
- Consumer ( Assign 1 consumer to 1 partition ⇒ best practice )
- Read Strategies
- At Most Once
- At Least Once ( most used )
- Exactly Once ( transactional, performance impact )
- Partition ( event/message/record holder )
- Record/Event/Message ( each item in partition )
- Offset ( message position/index in partition )
- Topic ( partition holder )
- Kafka Broker ( topics holder )
- Consumer Group ( allows parallel processing for partitions, like pub-sub pattern )
- Distributed Systems
- Leader ( Master )
- Follower ( Slave )
- Topic Based Scaling
- Partition Based Scaling
- Kafka Connect
- Kafka Streams
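The producer acknowledgment levels listed under Producer can be modelled by counting how many replicas are confirmed to hold a record when the producer stops waiting. This is a simplified sketch, not the Kafka client: `confirmed_copies` and `survives_leader_failure` are hypothetical helpers, and "survives" means guaranteed to survive, since followers may replicate a record even when the producer did not wait for them.

```python
def confirmed_copies(acks: str, in_sync_followers: int) -> int:
    """Replicas known to hold a record when the producer stops waiting."""
    if acks == "0":
        return 0  # fire and forget: not even the leader's write is confirmed
    if acks == "1":
        return 1  # only the leader's write is confirmed
    if acks in ("all", "-1"):
        return 1 + in_sync_followers  # leader plus every in-sync follower
    raise ValueError(f"unknown acks value: {acks!r}")


def survives_leader_failure(acks: str, in_sync_followers: int) -> bool:
    """A confirmed record outlives the leader only if a follower has it too."""
    return confirmed_copies(acks, in_sync_followers) > 1


results = {
    acks: survives_leader_failure(acks, in_sync_followers=2)
    for acks in ("0", "1", "all")
}
```

In this model `acks=all` with no in-sync followers still confirms only one copy, which is why it is commonly paired with the `min.insync.replicas` setting.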
Related Techs
- Apache ZooKeeper ( Cluster coordination ⇒ Which brokers are alive? Which broker is the controller? ZooKeeper uses its own atomic broadcast protocol rather than gossip, and newer Kafka versions can run without it in KRaft mode )
- Confluent Cloud
- Apache Flink ( Stateful Computations over Data Streams )
- Apache Hadoop
Key Differences With Other Messaging Systems
- Kafka retains a message after it has been consumed, for as long as the configured retention policy allows. By contrast, RabbitMQ removes a message from the queue once it has been consumed and acknowledged.
- RabbitMQ pushes messages to consumers, whereas Kafka consumers pull messages from the broker.
- Kafka scales horizontally by adding brokers and partitions, whereas traditional message queues typically scale vertically.
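The retention and pull differences can be sketched with a toy log that keeps every record and tracks a separate read position per consumer group. This is an illustration under stated assumptions, not a Kafka client: `RetainedLog`, `publish`, and `poll` are hypothetical names, and real retention is bounded by time or size settings.

```python
class RetainedLog:
    """Toy log that keeps records after they are read, with per-group offsets."""

    def __init__(self):
        self.records = []
        self.group_offsets = {}  # group name -> next offset to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, group, max_records=2):
        """Consumer-initiated fetch: return the group's next records."""
        start = self.group_offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.group_offsets[group] = start + len(batch)
        return batch


log = RetainedLog()
for event in ["a", "b", "c"]:
    log.publish(event)

first = log.poll("analytics")    # the consumer asks for data; nothing is pushed
second = log.poll("analytics")   # continues from the group's own offset
late = log.poll("billing", max_records=3)  # a later group still sees everything
```

Reading never removes anything from the log, so a new consumer group can start from the beginning without affecting the others.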