Développez l'apprentissage automatique prédictif avec Flink | Atelier du 18 déc. | S'inscrire
Since we launched the Open Preview of our serverless Apache Flink® service during last year’s Current, we’ve continued to add new capabilities to the product that make stream processing accessible and easy to use for everyone.
In this blog post, we will highlight some of the key features added this year. To start, we’ll deep dive into Topic Actions, which simplify the deployment of stream processing workloads for common use cases and enable users to focus on core tasks without having to become Flink experts. We’ll then cover additional enhancements we have added to the product, including Terraform support for Flink and expansion into GCP and Azure.
If you're interested in trying out the public preview of our cloud-native, serverless Flink service, sign up for a free trial of Confluent Cloud.
So, let's start by taking a look at Topic Actions and how they empower users to harness the full potential of streaming data without the steep learning curve.
As a product manager at Confluent, I have the privilege of working with some of the most sophisticated engineering teams in various industries. Our customers teach us a lot about their domains, use cases, business requirements, and preferences. One common challenge they face is building non-trivial complexity that is outside their core domain, which is a burden they would rather avoid.
Our customer base provides us with a unique perspective that helps us identify common overhead use cases and patterns. For instance, a mission-critical use case may require a topic to be free of duplicate records. While there are many ways to perform deduplication, most of them are not domain-specific. Many organizations perform deduplication in similar ways, yet independently of one another. Deduplication is just one of the many requirements that must be implemented as part of a larger use case, adding to the overhead.
To address this challenge, we have introduced Actions, which are pre-packaged, turn-key stream processing workloads that handle common, domain-agnostic requirements. Users only need to provide a minimal amount of basic configuration before submitting an Action, which runs internally on Flink. Actions are designed to be extremely simple to use, allowing users to leverage the power of Flink with just a few clicks. We are starting with support for two common workloads—topic deduplication and field masking—and plan to add more in the future.
In the next section, we will provide an example of how to use Topic Actions.
To illustrate how to use the deduplication Action, we will apply it to our orders topic. Our system that generates order events uses at least once delivery semantics, which means that retries can sometimes result in duplicate records. These duplicates cause problems for our downstream clients, who have to perform their own rudimentary local deduplication, consuming memory, CPU, and engineering time. To make our infrastructure more robust and efficient, we want to eliminate duplicates upstream before consumers receive them. This will greatly simplify our clients and reduce the resource overhead they incur.
For our use case, we define a duplicate record as one whose fields are identical to a previous record's fields within a time bound of 1 hour. Although most duplicates occur within a few seconds of the original, we use a bit of a time buffer to ensure accuracy.
To access Topic Actions, we use the Data Portal, which is a self-service interface for discovering, exploring, and accessing data on Confluent Cloud. Since Actions are applied at the topic level, we start by selecting the topic we want to deduplicate, which in this case is orders.
Clicking into it, we see that there is a new Actions button available to us. This button is the entry point for all Topic Actions that are currently available for turn-key processing in Confluent Cloud.
Clicking this button presents us with a list of available Actions to apply to the topic. We simply select the Deduplicate topic Action, and begin configuring it to align with our requirements.
Each Action has a unique configuration panel that requires only the essential information needed to run the workload on Flink. To put this configuration panel in context, let's review the requirements for our deduplication use case:
Duplicates occur when all fields in a record are identical to those in a previous record. To meet this requirement, we select the Deduplicate entire message radio button.
Duplicates are expected to occur within 1 hour of the original record. To meet this requirement, we set the Lookback Period field to 1 hour.
The remaining configuration options allow us to choose the compute pool where the Action will run and specify its security permissions.
Before we submit the Action to begin running, let's briefly explain how it works internally. Actions are simple. They use templatized forms of common workload elements and substitute user input into the templates to create a runnable SQL statement. This SQL statement is then automatically submitted to Flink, where it runs just like any other Flink SQL statement.
This simple design ensures that Actions rely exclusively on Flink's user-facing API, which means that there is no operational divergence between user-submitted Flink SQL statements and Actions-generated statements. They all share the same cost accounting, health and monitoring interfaces, and operational functionality.
If you want to understand exactly what an Action is doing under the hood, you can toggle the Show SQL switch near the bottom of the Action's configuration panel. For our deduplication Action, we see that two SQL statements have been generated based on our configuration.
The CREATE TABLE
statement creates the sink table that the deduplicated output will be written to. This is the table that we would connect our downstream consumers to, ensuring that they are only receiving deduplicated data.
The second statement—the INSERT
statement—is performing the heavy lifting for this Action, continuously INSERT
‘ing rows returned by its inner SELECT
statement into our sink table. We originally configured the Action to deduplicate across all topic fields, which are reflected in the inner SELECT
statement’s PARTITION BY
clause. If we had configured the Action to deduplicate over a subset of the topic’s fields, that subset would appear in the PARTITION BY
clause instead.
At this point, industry veterans may have a question: What if the generated SQL for an Action doesn't quite meet my needs? What if I need to modify the SQL?
Actions provide an escape hatch for this scenario. In the image above, just below the SQL preview area, you'll see an Open SQL editor link. This link opens a SQL Workspace pre-populated with the generated SQL, which you can edit to your liking. From here, you can make any necessary modifications and submit the resulting SQL via a Workspace instead of an Action. Both options run the same way from that point forward.
Clicking the Confirm button in the Actions configuration panel above will submit the generated SQL to Flink, providing you with feedback containing key metadata and a link to the statement that is now running. Clicking into this link will inspect the statement, allowing you to confirm that it is processing records as expected. Once the statement is up and running, the sink table will receive deduplicated records from the original input topic.
As mentioned above, this initial release of Topic Actions focuses on two specific actions: event deduplication and field masking. These two workloads are very common among our customer base. However, we are aware of a long list of common workload elements that many of our customers must write boilerplate code to support. Often, these tasks are outside the scope of their core business and are therefore largely considered as overhead.
Actions are designed to save customers time and effort for these kinds of use cases and requirements. You can expect the list of turn-key Actions offered in Confluent Cloud to continually grow over time. We are already working on our next set of Actions, and if there is an Action you would like to see us provide, please do not hesitate to let us know!
In addition to Topic Actions, we've added a number of additional enhancements to our fully managed Flink service. Here’s a roundup of the highlights:
By introducing Terraform support for Flink statements, the deployment process can be automated, which removes manual operations in order to support repeatable and production-ready deployments. The code for the Flink statements can be stored in a version control system, such as Git, which allows developers to track changes, collaborate on the code, and keep a history of the changes made to the code. This can help ensure that the code is consistent, repeatable, and reliable, and can reduce the risk of errors or downtime during deployment.
With Terraform support, Flink statements can be deployed consistently and reliably across different environments, such as development, testing, and production. Terraform can be integrated into the CI/CD pipeline, which can further automate the deployment process and reduce the manual effort required to deploy changes.
Together with programmatic management of Flink-specific API keys and compute pools to easily scale workloads up or down based on demand, you can now define your entire Flink pipeline using infrastructure-as-code, ensuring faster and more reliable delivery of Flink applications.
Additionally, developers can integrate with existing tools, such as GitHub, GitLab, and Jenkins, to automate their build, test, and deployment pipeline. Here's an example of an end-to-end CI/CD workflow that deploys a Flink SQL statement programmatically on Confluent Cloud for Apache Flink.
When we launched at Current, our fully managed Flink service was available for preview in a few select regions on AWS. As we continued to build out the offering based on feedback from customers, we expanded to four additional AWS regions and also made Flink available for preview in Azure in Q1.
Today, we're excited to announce that we're expanding Flink into GCP as well! With this new capability, our Flink offering now stands as a true multi-cloud solution. This means that our customers can seamlessly deploy their Flink applications across all three major cloud service providers, without being locked into a single one. This demonstrates our commitment to enable serverless stream processing everywhere your data and applications reside. Please refer to our docs for the newest supported regions.
We look forward to expanding our Flink service to additional regions in the future. Stay tuned!
We’ve come a long way since our initial announcement of Flink support on Confluent Cloud. Through continued innovation and product investment, we’re able to empower users to create stream processing applications more easily and efficiently, without the need for advanced Flink expertise.
With pre-packaged, turn-key stream processing workloads called Actions, users can leverage the power of Flink with just a few clicks, without having to become Flink experts. By simplifying Flink code development, users can harness the full potential of streaming data without the steep learning curve, fostering innovation and efficiency in stream processing workflows.
Additionally, we've added several other enhancements, including Terraform support for Flink and availability on GCP and Azure. These features provide users with more flexibility, control, and convenience when deploying and monitoring their Flink workloads on Confluent Cloud.
We have an exciting journey ahead, and this is only the beginning! We look forward to adding exciting new features, such as support for programmatic APIs including UDFs and Table API, and a host of other exciting capabilities. If you want to learn more, hear from our Flink and Kafka experts at Kafka Summit London 2024!
As of today, Confluent Cloud for Apache Flink® is available for preview in select regions on AWS. In this post, learn how we’ve re-architected Flink as a cloud-native service on Confluent Cloud.
In this blog post, we will provide an overview of Confluent's new SQL Workspaces for Flink, which offer the same functionality for streaming data that SQL users have come to expect for batch-based systems.