Développez l'apprentissage automatique prédictif avec Flink | Atelier du 18 déc. | S'inscrire
Abstract: Traditional data management approaches, with their heavy centralization and batch-based nature, are struggling to keep up with the demands of today's organizations. That's why many companies are turning to new paradigms like data mesh and event-driven architecture (EDA) to address their data management needs. This post dives into how these modern approaches can be used to manage data platforms and how to take advantage of them. Specifically, we'll show you how to formally define streaming data products using descriptor files and how these files can be used to automate the entire lifecycle of a data product, from deployment to retirement. We'll use the open Data Product Descriptor Specification (DPDS) for defining descriptor files and the Open Data Mesh (ODM) Platform as an example to demonstrate how it all works in practice. So, if you're looking to improve your data management strategy, read on!
Traditionally, analytical data platforms have been characterized by a highly centralized technological and organizational architecture and by predominantly batch-integration modes. These two aspects, following the constant growth of the number of data producers and consumers, have generated several problems, including difficulty in evolving, increased management complexity, and a reduction in agility.
It is therefore not surprising that many organizations see data and its management as a strategic element, but also one of the main barriers to innovation. In particular, it is precisely the constant growth in the complexity of integrations and the resulting costs that make the current paradigm unsustainable. In this sense, there are two interesting trends that organizations face: data mesh and event-driven architectures (EDA).
Data mesh is a holistic approach that emphasizes the importance of both organizational and technological factors in building decentralized data management systems. This approach prioritizes decentralization of data ownership and management of data as products, with a focus on establishing a federated governance system to define global policies and ensure interoperability. Additionally, a self-serve platform is crucial for enabling cross-functional teams to access and utilize data in a decentralized manner.
Event-driven architectures (EDA), on the other hand, are based on the exchange of events on a common communication medium that decouples producers and consumers without the need of a centrally predetermined workflow: this allows applications to be managed independently. At the infrastructure level, streaming platforms enable the reliable, secure, and scalable collection and distribution of events in real time between different applications. This architectural pattern is widely used in the operational world and is also spreading into the analytical world, thereby promoting the unification of the two integration stacks in an architecture generally referred to as a digital integration hub (DIH). In this architecture, all data enters the integration platform in real time as events (kappa architecture) and is then transformed and rematerialized in different storage systems for different workloads.
Data mesh and event-driven architecture are not alternative choices. Many organizations are implementing a combination of the two approaches to achieve optimal results. Regardless of the specific approach taken, the idea of managing data as a product and automating as much as possible the data management activities through a self-serve platform are common pillars in all modern data strategies. Before delving into a hands-on example to see them in action, let's briefly introduce these two key concepts.
Managing data as a product means applying to the development of integration logic (i.e., collection, transformation, and distribution) the same principles and operating models that are used in the development of traditional applications. Three basic elements define a data product:
Ownership: Each data product must have a clear owner, responsible for its design, development, and maintenance.
Scope: Each data product must have a clear scope that unambiguously defines what is part of it and what is not. A data product is composed of data, metadata, application code, and infrastructural resources—everything that is required to manage it as a single independent unit of deployment.
Interfaces: The public interfaces of a data product include input and output ports that manage incoming and outgoing data. Interfaces also include other useful ports for operational management, such as control ports, discovery ports, and observability ports.
To ensure interoperability between the various data products, it is necessary to centrally define shared specifications and protocols. This is a responsibility of the federated governance team, while the product teams are responsible for complying with these standards.
To facilitate the development of data products it is useful to centralize some common services through a shared self-serve platform. This platform enforces the shared standards, makes it easy to create and manage data products, and reduces both the TCO and manual management activities. Its services can be organized into these three groups:
Utility plane: Provides services with access to the underlying infrastructural resources. The utility plane decouples its consumers (i.e., the data product experience plane) from the actual applications and resources provided by the underlying infrastructure.
Data product experience plane: Provides operations to manage the lifecycle of data products—creation, update, versioning, validation, deployment, search, and decommissioning.
Mesh experience plane: Provides a marketplace where data products can be searched, accessed, and queried.
Confluent Cloud provides all the basic elements necessary to develop (i.e., Connectors, ksqlDB, Stream Designer), validate (i.e., Schema Registry), deploy (i.e., Terraform provider), and operate (i.e., Stream Governance) a data product that manages real-time data throughout all its lifecycle. A simple example of how the individual services listed above can be used in practice to create a streaming data product is shown in this blog post on how to build data mesh with event streams and in the associated e-book.
The services exposed by Confluent Cloud are already self-serving and declarative in nature. Therefore, although not much work is needed at the utility plane level, it is still important to implement services at the data product experience plane level. This allows for easy and atomic orchestration of the underlying utility plane's services in key phases of the lifecycle of a data product such as creation and deployment.
With an example inspired by the logistics world, let’s see how to handle a stream application not as an ad hoc integration flow, but as a real data product. The example is an extremely simplified version of a shipment track and trace application that focuses on monitoring the vehicles that move goods. Each vehicle engaged in a trip periodically notifies its status through events, as shown in the following table:
tripId | tripStatus | position | timestamp |
1 | planned | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:00:00 |
1 | position-notified | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:15:00 |
1 | loading-started | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:30:00 |
1 | loading-ended | (lon: 9.189982, lat: 45.46427) | 2022-01-01 13:00:00 |
1 | departed-from-origin | (lon: 9.189982, lat: 45.46427) | 2022-01-01 13:15:00 |
1 | position-notified | (lon: 9.229982, lat: 45.42427) | 2022-01-01 14:15:00 |
1 | … | … | … |
1 | position-notified | (lon: 9.289982, lat: 45.40427) | 2022-01-01 16:15:00 |
1 | arrived-at-destination | (lon: 14.258824, lat: 40.84844) | 2022-01-02 02:00:00 |
1 | unloading-started | (lon: 14.258824, lat: 40.84844) | 2022-01-02 02:30:00 |
1 | unloading-ended | (lon: 14.258824, lat: 40.84844) | 2022-01-02 03:00:00 |
1 | completed | (lon: 14.258824, lat: 40.84844) | 2022-01-02 03:30:00 |
Some events represent an actual change in the status of the trip. Others are just an update of the vehicle's position. The track and trace application has the task of reading the events coming from the field and providing its consumers with two distinct interfaces: one to read the updated status of a trip (i.e., TripStatus) and one to access the historical record of the vehicle's positions (i.e., TripPosition).
TripStatus
TripPosition
tripId | position | timestamp |
1 | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:00:00 |
1 | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:15:00 |
1 | (lon: 9.229982, lat: 45.42427) | 2022-01-01 14:15:00 |
1 | (lon: 9.249982, lat: 45.41427) | 2022-01-01 15:15:00 |
1 | (lon: 9.289982, lat: 45.40427) | 2022-01-01 16:15:00 |
1 | (lon: 9.319982, lat: 45.39427) | 2022-01-01 17:15:00 |
1 | (lon: 9.349982, lat: 45.38427) | 2022-01-01 18:15:00 |
1 | (lon: 9.389982, lat: 45.37427) | 2022-01-01 19:15:00 |
1 | (lon: 14.258824, lat: 40.84844) | 2022-01-02 02:00:00 |
Meeting these requirements is easy with Confluent Cloud by using the following components:
Topics on Confluent Cloud Clusters, to store events
Schemas on Confluent Schema Registry, for governance purposes and to enable the usage of ksqlDB queries
ksqlDB, to transform input data into the desired outputs through a set of ksqlDB instructions
However, implementation is only one part of the solution—the application must also provide a real data product.
There are three fundamental ingredients to make an integration application a true data product:
A contract that formally describes external interfaces and what data is provided.
A self-service platform capable of maintaining these contracts, enforcing them, and automating the data product life cycle.
A person who owns the contract, provides support for its consumers, and uses the platform to manage the data product.
The contract is foundational since it guarantees a shared and clear method to describe all the elements of the data product.
First, it is necessary to define a shared specification to describe the contract exposed by a data product. While it is possible to manage the lifecycle of a data product manually, it remains error prone and inefficient. Instead, you should build a platform to automate and manage the data product’s lifecycle, including registration, validation, build, and deploy.
The platform should utilize the information within the contract (i.e., data product descriptor) to register and validate the data product, and then deploy and manage its running instance (i.e., data product container), as depicted in the image below.
In the next sections you will see:
How to define the contract of the track and trace application using the open Data Product Descriptor Specification (DPDS).
The high-level architecture of a platform for managing the lifecycle of a data product. For reference, we will use the platform we use internally in our projects, the ODM Platform, which will be open sourced in the first half of 2023.
The Data Product Descriptor Specification is an open specification that declaratively defines a data product and all its components using a JSON or YAML descriptor document. It is released under the Apache Kafka® 2.0 license and managed by the Open Data Mesh Initiative.
DPDS is technology agnostic and allows you to leverage your previous investments by specifying all the components used to build the data product, from the infrastructure up to its interfaces.
Let’s take a look at the DPDS components for our track and trace example application:
An input port that receives notification events of a change in the trip status
Two output ports, one that continuously reports the current snapshot of each trip status in real time and one that tracks all the positions recorded for the vehicle doing the trip
An internal application (i.e., ksqlDB) that transform the incoming events read from input port in the data exposed to consumers through output ports
Three infrastructural components (i.e., topics), one to store input data and two to store output data
Next, let’s take a look at how these elements relate to the descriptor (the full descriptor definition is available here).
The info field is useful for other teams to understand what the data product is for and who owns it:
The interface components describes the inputPorts and outputPorts, along with descriptions and code references for each:
All interface components, regardless of the type of port, describe the agreements between the product developer and its consumers in terms of promises (what is guaranteed by the product) and expectations (how the product should be used). Among the promises, particularly important are the APIs through which data is exposed and the SLOs provided by the data product owner.
Both input and output ports of this use case rely on AsyncAPI v 2.5.0 since it provides a comprehensive definition for streaming data interfaces. Let’s go deeper with the tripCurrentSnapshot
output port to see how it is defined:
As you can see from the descriptor, we specified that the output port is a topic that generates a stream of events, as well as the servers information about where it can be found and how to connect to it:
Furthermore, we defined the format of each message by using an Avroschema:
Finally, the internal components are further divided into two groups: infrastructural components and application components. While the infrastructural components are the muscles responsible for providing all the resources required by the data product, the application is the brain that delivers the results to the output ports. Below is part of the descriptor that defines both these components:
Let’s take a closer look at the infrastructural component definition:
It uses Terraform(provisionInfo section)to provision the required topics on Confluent Cloud. Other provisioners can be used depending on your needs (e.g., CloudFormation). This is possible by specifying in the provisionInfo section the provisioner service to use, the template to pass to the service with information on what needs to be provisioned, and optionally a set of configurations to inject into the template at provision time. The same strategy is used to make the build and deployment information of the application components independent from the specific CI/CD service used. In this case the ksqlDB application is deployed using a custom script contained in the template file of deployInfo
block.
The ODM Platform is the solution we use internally as a self-serve platform to manage the lifecycle of data products from deployment to retirement.
It’s designed to support and facilitate the transition toward the data mesh paradigm, providing data teams with automation services and high-quality standards. The platform will be open sourced in the first quarter of 2023 on the Open Data Mesh Initiative GitHub page.
The ODM Platform is made up of two primary planes: the utility plane and the product plane. Let’s take a look at the utility plane first, and then follow up with the product plane.
The utility plane exposes the main APIs of the underlying infrastructure and services. It is an abstraction layer that decouples the product plane services from the underlying infrastructure, allowing it to evolve independently with minimal impact on already implemented products.
The services exposed by the utility plane offer high-level capabilities to product teams for their self-service data product needs. Some common examples of services exposed by the utility plane include:
Policy Service, which creates, manages, and enforces global governance policies
Meta Service, which propagates the metadata associated with the data product to an external target metadata management system, such as Confluent Schema Registry, Collibra, or Blindata
Provision Service, which creates and manages the infrastructure
Build Service, which compiles and packages all the applications
Integration Service, which deploys the application artifacts
The following image illustrates how the product plane interacts with the APIs of the underlying infrastructure through the abstraction provided by the utility plane.
In our use case the Provision Service uses a Terraform adapter that, thanks to the wide range of providers supported, enables the management of the infrastructure for almost any kind of product. The Meta Service uses a Confluent Schema Registry adapter to record the schemas of events used by the data products, while the Integration Service uses a custom script adapter to publish ksqlDB queries on Confluent ksqlDB. In another context, we could use the same platform with a completely different set of adapters, such as CloudFormation, Blindata, and Jenkins. ODP’s architecture provides a very flexible platform, allowing you to integrate with the technologies of your choice.
The data product experience plane is built on top of the utility plane. It exposes the services necessary to manage the lifecycle of a data product. Among the basic services that a typical product plane must provide are certainly the Registry Service and the Deployment Service.
The Registry Service allows for publishing a new version of the data product descriptor and making it available on demand to consumers for data discovery and access. To do this, it orchestrates the services offered by the utility plane in the following way:
Validates the syntactic correctness of the descriptor.
Verifies that the new version is backward compatible with previous ones.
Verifies the descriptor's compliance with globally defined governance policies using the Policy Service (e.g., APIs that expose streaming services must be described using AsyncAPI, schemas must be in AVRO and compliant with the cloud event specification).
Saves the metadata contained in the descriptor in an external governance system using the Meta Service.
Below is an image of the detailed workflow.
The Deployment Service is responsible for managing the creation and release of the data product container, which is created using the descriptor. To do this, it orchestrates the services offered by the utility plane in the following way:
Parses the application code and, if compilation is needed, generates the executable artifact (e.g., jar, war, exe, etc.), using the Build Service.
Creates the infrastructure using the Provision Service.
Records all metadata in the designed external metadata repository using Meta Service.
Deploys applications using the Integration Service.
Note that every time there is a significant change to the status of the data product, the Policy Service must be called to verify the compliance to global policies. Below is an image of the detailed workflow.
Data management is still a major challenge for many organizations. To address this challenge, many companies are turning to modern, decentralized approaches that focus on real-time data. Data mesh and event-driven architectures are two popular options to consider. One key aspect of these approaches is to treat data as a product, which is crucial for their successful adoption. This post shows how to formally define a data product in a streaming context using the DPDS specification, and how this definition can be used to design a platform that can automate its lifecycle as much as possible. For more information about DPDS and the ODM Platform you can check the Open Data Mesh Initiative website and join the community.
Data mesh. This oft-talked-about architecture has no shortage of blog posts, conference talks, podcasts, and discussions. One thing that you may have found lacking is a concrete guide on precisely […]
Decentralized architectures continue to flourish as engineering teams look to unlock the potential of their people and systems. From Git, to microservices, to cryptocurrencies, these designs look to decentralization as […]