Pulsar
Pulsar is a cloud-native, distributed, open-source pub-sub messaging and streaming platform. It was originally developed at Yahoo!, open-sourced in 2016, and later donated to the Apache Software Foundation (ASF), where it became a top-level project in 2018.
Pulsar combines the best features of a traditional messaging system, such as RabbitMQ, with the capabilities of a pub-sub system like Apache Kafka.
What is an open-source pub-sub messaging system?
Pub-sub is short for publish-subscribe. With pub-sub, message senders or publishers do not send messages (or events) to specific recipients or subscribers. Instead, message consumers subscribe to topics or subjects that interest them. Whenever a message is published to a topic, it is delivered to every subscriber of that topic.
Publishers do not know who the subscribers are or what topics they are subscribed to, and subscribers receive the relevant messages without the publisher's knowledge. They communicate independently. The asynchronous nature of pub-sub provides loose coupling and scalability, making it an excellent choice for distributed applications and serverless and microservices architectures.
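The decoupling described above can be illustrated with a small in-memory sketch. This is a toy model, not Pulsar's client API: the class `ToyBroker`, its methods, and the topic name "orders" are hypothetical names chosen for illustration, and delivery here is a plain synchronous function call, whereas a real broker delivers asynchronously over the network.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ToyBroker:
    """Toy in-memory pub-sub broker (hypothetical; not the Pulsar API)."""

    def __init__(self) -> None:
        # topic name -> callbacks registered by subscribers
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber of the topic; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = ToyBroker()
billing_inbox: List[str] = []
audit_inbox: List[str] = []
broker.subscribe("orders", billing_inbox.append)
broker.subscribe("orders", audit_inbox.append)

# The publisher names only the topic, never the recipients.
delivered = broker.publish("orders", "order-1001 created")
```

Both inboxes end up with the same message even though the publishing code never mentions them, which is the loose coupling the pattern is meant to provide.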
How does Pulsar compare to other messaging systems, including Apache Kafka?
Unlike many other event streaming platforms, Apache Pulsar is cloud-native, highly scalable, and supports multi-data center and active-active configurations.
Pulsar has several advantages over Kafka:
- Cost
- Performance
- Ease of deployment
- Geo-replication
- Scaling
- Architecture (Pulsar has tiered storage, decoupled compute and storage, and multitenancy)
- Queuing
- Support for messaging semantics of MQ-based solutions
An analysis by GigaOm, a market research firm, revealed that Pulsar excels in terms of price and performance. Some findings from the report that highlight the advantages of Pulsar include:
- 81% lower cost compared to Kafka (over 3 years)
- 35% higher performance
- 73% savings for complex scenarios
- 81% savings for higher data volumes
Apache Pulsar Structural Basics
Let's take a look at the building blocks of Apache Pulsar.
Cloud-native architecture
Apache Pulsar is well-suited to cloud infrastructures because it uses a multi-layered approach that separates compute (broker) from storage (BookKeeper). Brokers are essentially stateless, and BookKeeper can be easily managed as a StatefulSet in container orchestration environments like Kubernetes, which is the de facto standard for cloud-native orchestration.
In fact, Apache Pulsar runs natively on Kubernetes and supports rolling updates, rollbacks, and horizontal scaling.
Client libraries
Pulsar has a wide range of client libraries managed by the core project, including Java, Python, C++, Go, Node.js, and C#. If you prefer not to use Pulsar client libraries, Pulsar also includes a WebSocket proxy.
There are many other clients being developed by the community, such as Scala and Rust. If you prefer to use HTTP to send and receive Pulsar messages, you can use Pulsar Beam.
Multi-tenancy and namespaces
When you have a high-performance, scalable messaging system, you will want to share it among different teams and groups within your organization. Running a separate copy of the system for each team is not practical, and neither is building complex shared infrastructure that merely simulates multi-tenancy to keep teams from affecting each other.
Pulsar was designed as a multi-tenant system from the beginning. Therefore, different teams can safely share the messaging system. Each tenant has its own authentication, authorization, and policies. Additionally, tenants can be divided into namespaces, which facilitate supporting different environments within a single tenant, such as development, staging, and production.
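Tenants and namespaces show up directly in topic names, which take the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The sketch below splits such a name into its parts. The helper `parse_topic_name` is a hypothetical function written for this example, not part of any Pulsar client library, and the tenant "payments" with its "dev" and "prod" namespaces is made up.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic_name(full_name: str) -> TopicName:
    """Split a fully qualified topic name into its parts."""
    domain, sep, rest = full_name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {full_name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {full_name!r}")
    return TopicName(domain, *parts)


# Same tenant and topic name, different namespaces for different environments.
dev = parse_topic_name("persistent://payments/dev/invoices")
prod = parse_topic_name("persistent://payments/prod/invoices")
```

Here `dev` and `prod` share a tenant and a topic name but live in different namespaces, so each environment can carry its own policies.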
Features of Apache Pulsar
Apache Pulsar has a rapidly growing set of features. Let's review some of its key features.
Built-in schema registry
One of the biggest challenges of any messaging system is ensuring that producers and consumers speak the same language. Because producers and consumers are decoupled, it is easy for one side to change the format of the messages it sends or expects without the other side noticing.
The solution is a schema registry that requires producers and consumers to use messages with a compatible schema. Pulsar includes a built-in schema registry. You just need to register the schema with a Pulsar topic, and it enforces the rules of that schema.
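To make the idea concrete, here is a toy registry in plain Python. It is only a sketch under simplifying assumptions: `ToySchemaRegistry` is a hypothetical class, a "schema" is just a mapping from field name to Python type, and the single compatibility rule (new versions may add fields but may not remove or retype existing ones) is chosen for this example rather than taken from Pulsar's own compatibility strategies.

```python
from typing import Dict, Type

Schema = Dict[str, Type]


class ToySchemaRegistry:
    """Toy per-topic schema registry (hypothetical; not the Pulsar API)."""

    def __init__(self) -> None:
        self._schemas: Dict[str, Schema] = {}

    def register(self, topic: str, schema: Schema) -> None:
        """Accept a new schema only if it keeps every existing field and type."""
        current = self._schemas.get(topic)
        if current is not None:
            for field, field_type in current.items():
                if schema.get(field) is not field_type:
                    raise ValueError(f"incompatible change to field {field!r}")
        self._schemas[topic] = dict(schema)

    def validate(self, topic: str, message: dict) -> bool:
        """Check that the message has every registered field with the right type."""
        schema = self._schemas[topic]
        return all(
            field in message and isinstance(message[field], field_type)
            for field, field_type in schema.items()
        )


registry = ToySchemaRegistry()
registry.register("users", {"id": int, "name": str})
ok = registry.validate("users", {"id": 1, "name": "Ada"})
bad = registry.validate("users", {"id": "1", "name": "Ada"})  # id has the wrong type
```

The point of the sketch is where the check lives: the topic enforces the contract, so neither side can drift silently.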
Built-in geo-replication
Replicating messages to remote locations is important for disaster recovery or enabling applications to work globally. When your application's users travel, you want them to have the same experience no matter where they are. With geo-replication, applications can connect to a local cluster and still send and receive data to and from global clusters.
With Pulsar, the replication of messages across geographies is a built-in feature. If you publish a message to a topic in a replicated namespace, that message is automatically copied to the configured remote location(s). There is no need for complex configurations or plugins.
IO connectors
One of the main functions of a messaging system is to connect data-intensive systems such as databases, stream-processing engines, and other messaging systems. It makes sense to provide a common framework and connectors to facilitate this. This is exactly what Pulsar does with its IO connectors.
Pulsar comes with a wide range of ready-to-use connectors, including MySQL, MongoDB, Cassandra, RabbitMQ, Kafka, Flume, Redis, and many more, making it easy to connect your systems together.
Benefits of Apache Pulsar
Apache Pulsar offers several advantages.
Infinite retention
One of the significant advantages of Pulsar's layered architecture is the ability to add new layers. For high performance, any durable messaging system needs fast disks, because messages that are not consumed immediately must be written to disk and read back later. But what if you need to retain old messages for replay or event-sourcing purposes? What if you want to store these messages indefinitely? Storing these old messages on high-performance disks can be expensive.
To solve this problem, Pulsar supports tiered storage, allowing older messages to be offloaded to cheaper storage options such as S3 buckets. When a consumer needs an older message, Pulsar automatically retrieves it from the configured remote storage and delivers it to the consumer.
Yes, the performance will be lower. However, when you are dealing with messages that are months or even years old, performance is not as critical. You just want these messages to be readily available when you need them without breaking the bank.
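Tiered storage builds on the segment-oriented storage of Apache BookKeeper: a topic's log is stored as a sequence of segments, and older, closed segments can be moved to cheaper storage.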
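The read path can be sketched with a toy two-tier log. Everything here is an assumption made for illustration: `ToyTieredLog` is a hypothetical class, the two dictionaries merely stand in for fast disks and object storage, and entries are offloaded one at a time for simplicity, whereas a real system would move larger units of data.

```python
from typing import Dict, Tuple


class ToyTieredLog:
    """Toy log: recent entries stay 'hot', older ones are offloaded to a 'cold' tier."""

    def __init__(self, hot_capacity: int) -> None:
        self.hot_capacity = hot_capacity
        self.hot: Dict[int, str] = {}    # stands in for fast disks
        self.cold: Dict[int, str] = {}   # stands in for cheap object storage
        self._next_id = 0

    def append(self, payload: str) -> int:
        message_id = self._next_id
        self._next_id += 1
        self.hot[message_id] = payload
        self._offload_if_needed()
        return message_id

    def _offload_if_needed(self) -> None:
        # Move the oldest entries to the cold tier until the hot tier fits.
        while len(self.hot) > self.hot_capacity:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)

    def read(self, message_id: int) -> Tuple[str, str]:
        """Return (payload, tier); the caller does not need to know where data lives."""
        if message_id in self.hot:
            return self.hot[message_id], "hot"
        return self.cold[message_id], "cold"


log = ToyTieredLog(hot_capacity=2)
ids = [log.append(f"event-{i}") for i in range(5)]
oldest = log.read(0)   # served from the cold tier
newest = log.read(4)   # served from the hot tier
```

The consumer-facing call is the same for both reads; only the tier that serves the data differs.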
Flexible subscriptions
Apache Pulsar supports four different subscription types: exclusive, failover, shared, and key shared. It also supports having multiple subscriptions on a single topic. By using subscriptions, you can easily configure messaging patterns such as queuing, pub-sub, fan-out, and competing consumers.
Apache Pulsar implements the competing-consumers pattern using shared subscriptions. With a shared subscription, you can smoothly scale the number of consumers up and down. Partitions are not involved, and adding a consumer allows it to start consuming messages immediately.
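A toy dispatcher shows why scaling consumers on a shared subscription needs no repartitioning. This is a sketch, not Pulsar's dispatcher: `ToySharedSubscription` and the worker names are hypothetical, and plain round-robin is used as a simple stand-in for the distribution policy.

```python
from itertools import count
from typing import Dict, List


class ToySharedSubscription:
    """Toy shared subscription: each message goes to exactly one consumer."""

    def __init__(self) -> None:
        self.consumers: Dict[str, List[str]] = {}   # consumer name -> received messages
        self._cursor = count()

    def add_consumer(self, name: str) -> None:
        self.consumers[name] = []

    def dispatch(self, message: str) -> str:
        """Hand the message to the next consumer in round-robin order."""
        if not self.consumers:
            raise RuntimeError("no consumers attached")
        names = list(self.consumers)
        chosen = names[next(self._cursor) % len(names)]
        self.consumers[chosen].append(message)
        return chosen


sub = ToySharedSubscription()
sub.add_consumer("worker-a")
before = [sub.dispatch(f"job-{i}") for i in range(2)]

# A second consumer joins and takes part in the very next dispatches.
sub.add_consumer("worker-b")
after = [sub.dispatch(f"job-{i}") for i in range(2, 6)]
```

Before the second worker joins, every job goes to `worker-a`; afterwards the jobs alternate between the two workers without any change to the topic itself.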
Low latency, high throughput
From the very beginning, Pulsar was designed to provide low and consistent latency at high throughput. It achieves this by separating two concerns: delivering messages between producers and consumers, and storing messages durably. In Pulsar's layered architecture, brokers deliver messages and Apache BookKeeper stores them. Instead of building its own storage layer, Pulsar takes advantage of BookKeeper's performance and durability.
BookKeeper is a distributed log designed to store messages persistently with IO isolation between writes and reads. This means it can provide consistent low latency even when writing or reading large volumes of data. Unlike traditional storage systems, it does not suffer from performance degradation under high write or read (consumer catch-up) loads. BookKeeper is a distributed system that can scale horizontally without the need to rebalance storage assignments.
Apache Pulsar Use Cases
Let's take a look at the use cases of Apache Pulsar.
Pub-sub, streaming and queueing
Apache Pulsar is adept at handling high-volume pub-sub messaging as well as the more complex messaging patterns typical of a message queuing system. Pulsar handles these patterns itself, so developers do not have to implement them in application code on top of a simple client.
Retention and message replay
In a traditional messaging system, the system keeps track of whether a particular message has been consumed. When the consumer client is finished with the message, it informs the messaging system that the message is no longer needed. A traditional messaging system then deletes the message from the persistent storage. After all, the message is no longer needed.
In a perfect world this might be true. But in the real world, things go wrong: applications crash, availability zones go down, and being able to retrieve lost messages can be critical to restoring your application's state. This is why message retention is important. If something goes wrong, Pulsar can replay messages published to a topic, even if they have already been consumed (provided retention is configured for them). After all, you never know when you might need that message again.
The ability to retain messages is also important for event-driven application architectures such as event sourcing, where it is important to record every state change as an event in the order in which it occurred.
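Replay and event sourcing can be sketched together with a toy retained log. The class `ToyRetainedTopic`, its `seek` method, and the `deposit`/`withdraw` event format are all hypothetical names for this example; the only idea carried over from the text is that consumed messages stay in the log and a cursor can be moved back.

```python
from typing import Iterable, List


class ToyRetainedTopic:
    """Toy topic that keeps messages after they are consumed."""

    def __init__(self) -> None:
        self.log: List[str] = []   # messages are kept after acknowledgment
        self.cursor = 0            # index of the next message to deliver

    def publish(self, message: str) -> None:
        self.log.append(message)

    def receive_all(self) -> List[str]:
        """Deliver everything from the cursor on, then advance the cursor."""
        pending = self.log[self.cursor:]
        self.cursor = len(self.log)
        return pending

    def seek(self, position: int) -> None:
        """Rewind the cursor so already-consumed messages are delivered again."""
        if not 0 <= position <= len(self.log):
            raise ValueError("position outside retained log")
        self.cursor = position


def rebuild_balance(events: Iterable[str]) -> int:
    """Event sourcing in miniature: derive state by applying events in order."""
    balance = 0
    for event in events:
        kind, amount = event.split(":")
        balance += int(amount) if kind == "deposit" else -int(amount)
    return balance


topic = ToyRetainedTopic()
for event in ("deposit:100", "withdraw:30", "deposit:5"):
    topic.publish(event)

first_pass = topic.receive_all()
nothing_new = topic.receive_all()   # everything was already consumed
topic.seek(0)                       # e.g. after a crash wiped the derived state
replayed = topic.receive_all()
balance = rebuild_balance(replayed)
```

Because the log still holds every event in order, the state can be rebuilt from scratch at any time.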
Dead letter topic, negative acknowledgment, delayed delivery
Apache Pulsar supports a variety of advanced messaging features that make it easy to build powerful and flexible applications. With negative acknowledgment, a consumer can signal that it could not process a message, so that Pulsar redelivers the message later, either to the same consumer or to another one. If a message still cannot be processed after repeated attempts, it can be routed to a dead-letter topic instead of blocking the consumer, which keeps the problematic message available for later analysis.
If you want messages to be delivered after a delay, Pulsar can do this using the delayed delivery feature. When you publish a message, you can specify how long Pulsar should wait before making it available to consumers.
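The interplay of redelivery and a dead-letter destination can be sketched as a small loop. This is a toy model under stated assumptions: `consume_with_dead_letter`, `max_redeliveries`, and the sample `parse` handler are hypothetical names, and a failed message simply goes to the back of an in-memory queue rather than being redelivered by a broker.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def consume_with_dead_letter(
    messages: List[str],
    handler: Callable[[str], None],
    max_redeliveries: int = 2,
) -> Tuple[List[str], List[str]]:
    """Return (processed, dead_letters) for a toy redelivery loop."""
    queue: Deque[Tuple[str, int]] = deque((m, 0) for m in messages)
    processed: List[str] = []
    dead_letters: List[str] = []
    while queue:
        message, redeliveries = queue.popleft()
        try:
            handler(message)
            processed.append(message)                      # acknowledge
        except Exception:
            if redeliveries >= max_redeliveries:
                dead_letters.append(message)               # park for later analysis
            else:
                queue.append((message, redeliveries + 1))  # negative ack: try again later
    return processed, dead_letters


def parse(message: str) -> None:
    """Sample handler that always fails on one poison message."""
    if message == "corrupt":
        raise ValueError("cannot parse message")


processed, dead_letters = consume_with_dead_letter(["a", "corrupt", "b"], parse)
```

The poison message is retried a bounded number of times and then set aside, so the healthy messages around it are still processed.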
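Delayed delivery can be modeled as a priority queue ordered by delivery time. The sketch below is a toy, not Pulsar's implementation: `ToyDelayedTopic` is a hypothetical class, and the current time is passed in explicitly as `now` (in seconds) so the example stays deterministic instead of sleeping.

```python
import heapq
from typing import List, Tuple


class ToyDelayedTopic:
    """Toy topic that holds each message back until its delivery time."""

    def __init__(self) -> None:
        # (deliver_at, sequence, payload); sequence keeps ordering stable for ties
        self._heap: List[Tuple[float, int, str]] = []
        self._sequence = 0

    def publish(self, payload: str, now: float, delay: float = 0.0) -> None:
        heapq.heappush(self._heap, (now + delay, self._sequence, payload))
        self._sequence += 1

    def poll(self, now: float) -> List[str]:
        """Return every message whose delivery time has been reached."""
        ready: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


topic = ToyDelayedTopic()
topic.publish("send-reminder", now=0.0, delay=60.0)
topic.publish("send-receipt", now=0.0)

at_start = topic.poll(now=0.0)
at_30s = topic.poll(now=30.0)
at_60s = topic.poll(now=60.0)
```

The undelayed message is available immediately, while the delayed one only appears once its waiting time has elapsed.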
Integrated streaming functions
We want to derive analytics from the data we collect in real time. Gone are the days when it was considered good enough to run an overnight batch job to process all the data and get analytics the next day. Today, we want our analytics in real time so we can react in real time.
To obtain real-time analytics, data needs to be processed in real-time. With Pulsar, you can seamlessly integrate lightweight functions into the message flow, performing real-time cleaning, enrichment and analysis of data. There is no need to throw everything into a data lake and process it later. With Pulsar Functions, you can process data as it flows through the messaging system. Pulsar Functions can be written in Java, Python, or Go and configured to run as Kubernetes pods.
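As a minimal sketch, assuming the language-native style in which a Pulsar Function written in Python is an ordinary function that receives a message payload and returns the value to publish, the cleaning, enrichment, and analysis steps might look like this. The JSON layout (`sensor`, `temp_c`) and the 30-degree alert threshold are hypothetical choices for the example.

```python
import json


def process(input):
    """Clean and enrich one JSON-encoded reading; return the JSON string to publish."""
    reading = json.loads(input)
    celsius = float(reading["temp_c"])
    enriched = {
        "sensor": reading["sensor"].strip().lower(),   # cleaning
        "temp_c": celsius,
        "temp_f": round(celsius * 9 / 5 + 32, 1),      # enrichment
        "alert": celsius > 30.0,                       # simple analysis
    }
    return json.dumps(enriched, sort_keys=True)


# Calling the function directly, as a unit test would:
sample_output = json.loads(process('{"sensor": " Roof-1 ", "temp_c": "35"}'))
```

Because the logic is a plain function, it can be tested locally like any other code before it is deployed into the message flow.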
Best practices for pub-sub messaging with Apache Pulsar
To get the most out of Apache Pulsar, we recommend that you follow these best practices.
If you store a lot of data in Pulsar, it can be very useful to run queries on that data and do so while Pulsar is doing its main job of sending and receiving messages. Pulsar makes this possible by leveraging the SQL query engine Presto. Pulsar integrates with Presto so you can perform SQL queries on data stored in topics. You can query data even if it is offloaded to tiered storage. And the queries bypass the broker, so they do not impact the Pulsar cluster's ability to send and receive messages in real time.
Apache Pulsar supports both partitioned and non-partitioned topics. For lower-throughput use cases, a non-partitioned topic keeps things simple. If you need to process high volumes of data in a single topic, a partitioned topic lets you take advantage of parallel processing. As throughput requirements grow, you can increase the number of partitions of an existing topic; keep in mind that doing so can change which partition a given key maps to.
Like Kafka, Pulsar can guarantee message order if you publish your message with keys. By assigning messages with the same key to the same partition, Pulsar guarantees the order of messages sent to that key.
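The mechanism behind per-key ordering is a stable mapping from key to partition. The sketch below illustrates it in plain Python. The helpers `partition_for` and `route` are hypothetical, and CRC32 is used only because it is a stable hash available in the standard library; it is this example's choice, not necessarily the hash a Pulsar client uses.

```python
import zlib
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is this sketch's choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages: List[Message], num_partitions: int) -> Dict[int, List[Message]]:
    """Append each message to the partition selected by its key."""
    partitions: Dict[int, List[Message]] = {p: [] for p in range(num_partitions)}
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add-to-cart"),
    ("user-1", "checkout"),
]
partitions = route(messages, num_partitions=4)

# All events for one key sit in one partition, in publish order.
user1_partition = partitions[partition_for("user-1", 4)]
user1_events = [value for key, value in user1_partition if key == "user-1"]
```

Ordering is guaranteed per key rather than across the whole topic: messages with different keys may land in different partitions and be consumed in parallel.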
Persistent messages are sent to Apache BookKeeper for storage on disk. These messages are guaranteed to be delivered at least once, even if the network, the application, or Pulsar itself fails.
However, there are some situations where this level of guaranteed delivery is not necessary and at-most-once delivery is sufficient. In these cases, Apache Pulsar supports non-persistent messages. Non-persistent messages are not stored on disk, providing high throughput and low latency while reducing resource requirements.
Sometimes, only the most recent instance of a piece of data is of interest. You don't care about all the historical values, just the latest value. If this is the case, you can use a compacted topic to store only the most recent value at a given key in a topic.
All data is published to a compacted topic. However, Pulsar periodically deletes old values of a key, leaving only the newest one. Compacted topics prevent the topic from growing indefinitely and provide quick access to the latest values in a topic.
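The effect of compaction can be shown in a few lines. This is a sketch of the idea only: `compact` is a hypothetical helper operating on an in-memory list, and the ticker symbols and prices are made-up sample data.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (key, value)


def compact(log: List[Entry]) -> List[Entry]:
    """Keep only the latest value per key, preserving the order of surviving entries."""
    latest_index: Dict[str, int] = {}
    for index, (key, _value) in enumerate(log):
        latest_index[key] = index          # later entries overwrite earlier ones
    keep = set(latest_index.values())
    return [entry for index, entry in enumerate(log) if index in keep]


log = [
    ("AAPL", "170"),
    ("MSFT", "310"),
    ("AAPL", "172"),
    ("AAPL", "171"),
    ("MSFT", "312"),
]
compacted = compact(log)
```

Five entries shrink to two, one per key, and a reader that only cares about the current value of each key never has to scan the history.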