[Webinar] Build Your GenAI Stack with Confluent and AWS | Register Now
Imagine your team wants to design a data streaming architecture and you’re in charge of creating the prototype. Within a few minutes, you provision a fully managed Apache Kafka® cluster on Confluent Cloud and run connectors that stream data from different event sources and database tables into the cluster. Then you focus on writing an application to process that data, cleanse and denormalize it, and apply your business logic to it. You show your prototype to your team, how they can discover data through the Streams Catalog, browse record schemas, and trace flows in the lineage graphs. Your team says, “Wow, that was incredible turnaround time!”
You continue on and show them the dashboards and the production and consumption metrics, and they ask, “Can it go even faster?”
But…what does “fast” mean in the first place? In some architectures, “fast” means low latency, which is the elapsed time moving records end to end, e.g., from producers to brokers to consumers. One scenario for low latency is a financial transaction where account balances should reflect debits and credits with as little delay as possible. In other architectures, “fast” means high throughput, which is the rate that data is moved, e.g., from producers to brokers or from brokers to consumers. One scenario for high throughput is an e-commerce site processing millions of events per second.
While latency and throughput do vary with workload, chances are your streams of real-time transactions or telemetry data or customer event information, or whatever, are already performing plenty fast. This is because you can achieve a good balance between low latency and high throughput. With regard to latency, Kafka brokers and the client application default parameter values are already optimized to deliver low latency, and regardless, network latency usually has the bigger latency impact. As for throughput, Confluent Cloud Basic and Standard clusters elastically scale from zero up to 100MBps, and 8.6TB per day is substantial throughput, especially considering that the average message size is small, about 1KB.
Over the years, incredible technical content has been written about data plane performance, general principles and tradeoffs, cloud-native architectures, etc. These writings describe how you can get low latency and high throughput without compromising on a mature and reliable platform that provides persistence, no data loss, audit logs, processing logs, and more—all the things that enable you to go from proof of concept to production. This blog post highlights the top five reading recommendations to help you gain a deeper understanding of what makes applications that run on Confluent Cloud so fast. They cover the key concepts and provide concrete examples of how we do it, and how you can do it too, with specific benchmark testing and configuration guidelines.
This whitepaper talks about key principles that go into a cloud-native design. It’s a very thorough whitepaper, but some of the design considerations discussed that impact performance include:
Related resources:
This whitepaper describes how to optimize your Kafka client applications for low latency or high throughput and how to monitor your application performance, consumer lag, and throttling, using JMX and the Metrics API. It discusses best practices for setting producer configuration parameters such as:
batch.size linger.ms compression.type acks buffer.memory enable.idempotence max.in.flight.requests.per.connection session.timeout.ms
And setting consumer configuration parameters such as:
fetch.min.bytes enable.auto.commit isolation.level
This whitepaper presents benchmarks that Confluent ran on a 2-CKU multi-zone Dedicated cluster, and shows the ability of a CKU (Confluent Unit for Apache Kafka®) to deliver the stated client bandwidth. The paper describes the setup up and execution of the benchmarks in sufficient detail, which may be useful if you want to apply the benchmarking framework to your own scenarios. It’s important to test in your production cloud environment so that the benchmarks reflect realistic cloud network performance.
This blog post shows how Kafka-based architectures deliver high throughput while providing the lowest end-to-end latencies up to the p99.9th percentile. It goes into technical detail on:
This blog post explains end-to-end latency, and how to configure and scale your application for throughput while keeping bounds on latency. It goes into more technical detail on the inner workings of Kafka and the various factors that impact latency. Some of the key callouts include:
There is one more perspective to going “fast”. Thus far we’ve talked about it in terms of data plane latency and throughput, but some people think of “fast” in terms of project time. There is the time to build a prototype, to extract and load data, to test the applications end-to-end, to integrate with operational tools for security, monitoring, and billing, to deploy the solution, and the whole ecosystem of things that achieve the business outcome. Overall this project time is called time to market and everyone wants that to be very fast!
When you’re developing the prototype, you want to focus on developing a client application that consumes that data and apply your own cool business logic. What you don’t want to do is take on additional operational burden, and Confluent takes care of that: Load is automatically distributed across Kafka brokers, partition leadership is automatically reassigned upon broker failure, consumer groups automatically rebalance when a consumer is added or removed, the state stores used by ksqlDB and applications using the Kafka Streams APIs are automatically backed to the cluster. There isn’t a document to point you to that would speak better than just telling you to try it out for yourself.
The fact that you can spin up a prototype quickly on Confluent Cloud, focus on your application, then work with your broader team to integrate it into your business toolchain and go live, enables your business to recognize new opportunities or revenue streams faster. You want something that allows developers to be highly productive and has reduced operational complexity so that it’s faster to deploy to production.
An event streaming architecture built on Confluent Cloud gives good performance on a mature and reliable platform that elastically scales to fit your business requirements. For more details on achieving low latency and high throughput, here is the summarized list of the five key resources for learning about how to make Kafka go fast(er) on Confluent Cloud:
When performance matters, you should always benchmark in your particular Confluent Cloud environment and with the expected production workloads. Refer to the resources above for guidance on how to configure your Kafka applications and do the testing. If you have any questions along the way, please ask us in the Confluent Forum! If you’re ready to get started with a free trial of Confluent Cloud, use the code CL60BLOG to get an additional $60 of free usage.*
We are proud to announce the release of Apache Kafka 3.9.0. This is a major release, the final one in the 3.x line. This will also be the final major release to feature the deprecated Apache ZooKeeper® mode. Starting in 4.0 and later, Kafka will always run without ZooKeeper.
In this third installment of a blog series examining Kafka Producer and Consumer Internals, we switch our attention to Kafka consumer clients, examining how consumers interact with brokers, coordinate their partitions, and send requests to read data from Kafka topics.