[Webinar] Build Your GenAI Stack with Confluent and AWS | Register Now
Confluent Cloud is a hosted platform for Apache Kafka®, Stream Designer, Connect, ksqlDB, and Schema Registry. All services that power Confluent Cloud are either packaged as a Kubernetes Helm chart or as a custom format that maps to a Custom Resource Definition (CRD). This blog post describes how we operationalized and migrated some of the most critical Helm-based services to our in-house deployment management platform.
Helm: A package manager for Kubernetes manifests. Users can package their Kubernetes manifests as Helm charts. Optionally, the manifests can be templated using go templates. Helm comes with an API and a CLI to deploy Helm charts onto Kubernetes clusters.
CRD: Kubernetes API can be extended with Custom Resource Definitions (CRD). These are resources that are specific to the user infrastructure that are not natively supported by Kubernetes. Confluent Cloud uses CRDs to promote self-healing, elastic, and highly available deployment patterns. For more details, please refer to the Kubebuilder book.
Vault: Vault is an identity based secret management system. It supports several kinds of identity providers and several kinds of secrets. It has an internal identity system that acts as a bridge between custom identities and the secret engines. Vault allows users to export this identity as JSON Web Tokens (JWT).
JSON Web Token: A tamper-proof token that includes several predefined claims like when the token expires, who is this token meant for, and an opaque identity of the token holder. These claims are later signed and encrypted by a private key. Typically, a provider of JWTs also exposes an endpoint that is used to fetch public keys for decrypting these JWTs. Additionally, the specification also allows for custom claims, which can be used by services to include additional metadata.
Service version YAML: The declarative YAML spec used by our deployment platform to manage Helm repository, Helm version, and Helm values and other notification-related metadata.
Confluent Cloud supports AWS, Azure, and GCP cloud providers and runs on their managed Kubernetes offerings. Thus Confluent Cloud needs to deploy Helm-based services on multiple Kubernetes across all cloud providers. Before we built our new deployment management platform, deployments at Confluent were known as “Big Bang Releases” and happened every month. Every release had a directly responsible individual (DRI), who was responsible for getting sign-offs and deploying the entire infrastructure, enabling Confluent Cloud through either Terraform or Helm. This was a painful, undifferentiated, and error-prone process.
We built this deployment management platform to help alleviate some of the operational toil.
This section provides a brief overview of both the requirements and the high-level system architecture.
Our platform embraces the controller pattern. Every Helm service specifies Kubernetes selection rules. During deployment time, the selection rules specified by the service are evaluated against all Kubernetes clusters and the service is queued for deployment on the selected Kubernetes instances. The controller orchestrates all necessary infrastructure changes to maintain this desired state. A reconcile loop runs periodically and attempts to deploy each service on all selected Kubernetes.
The reconcile loop encapsulates the core deployment logic and can be summarized as follows:
Additionally, the platform provides the following features:
The task was to migrate all control plane services that power Confluent Cloud to our new deployment platform. This migration was necessary to make the deployment process more secure, auditable, and less error prone.
Some of the key tenets were:
Our overall migration strategy was based on the following observation:
As long as no immutable attributes of Kubernetes manifests (like Helm-release name, Kubernetes deployment selectors, etc.) change, the migration can be done in place. In particular, we don’t need to spin up a new instance of the service in a separate namespace. Further, if there wasn’t any change in the corresponding workload’s podTemplateSpec, it would be a zero downtime migration.
This observation led to a simple migration strategy:
We also made several improvements that simplified both the YAML definition step and the verification step:
Additionally, as a way of earning trust, our team signed up to migrate the most critical Confluent Cloud service early in the migration timeline. Once we demonstrated how seamless the migration process was, it created a snowball effect and everyone signed up to migrate their services over.
Our new deployment management platform is composed of several components. Each is managed as a Kubernetes Deployment workload with a single pod. At startup, each component loads all its state into memory and serves all read requests from memory. Thus our deployment management platform was only vertically scalable.
The services that power Confluent Cloud are heavily developed and hence could pose significant load on the deployment platform. One option was to delay the migration until after the platform was made horizontally scalable. However, the migration was very critical in securing and improving the deployment experience for our internal users and delaying it was the last resort. So we identified several improvements and wanted to test the effects of making these improvements.
Together, these changes helped to reduce the memory footprint from over 3GB to about 256MB. We essentially applied a bunch of tactical fixes and performed load testing to verify that we will be able to absorb the services. Please scroll down to learn more about our future plans.
The next step was to do load simulation and verify the limits of our platform.
For any performance testing, it is helpful to come up with a load factor and important service metrics. Then, we iteratively modify various dimensions of the load factor and measure the service metrics. The point at which service metrics drop below our SLO is typically considered a tipping point.
As mentioned earlier, the core component of our architecture is a controller process. Its sole job is to ensure that all service deployments in all Kubernetes clusters are successful. As of Q2 2021, it took about 800ms to run an iteration of the controller loop. Thus, our load factor is defined as the cross product of the number of Kubernetes and the number of services deployed in each Kubernetes.
Our testing strategy was to increase the number of Kubernetes and the number of services deployed in each Kubernetes and measure both the end-to-end deployment latency and the controller runtime.
Provisioning a large number of new Kubernetes clusters for testing purposes would have been prohibitively costly. Hence we simulated lots of Kubernetes by spinning up a separate process which ran thousands of goroutines, each corresponding to a modified Kubernetes. Additionally, we also simulated fake services that would be deployed to each Kubernetes.
With this setup, we were then able to quickly iterate on testing.
We were able to prove that our platform could support up to 10x the expected new load without too much degradation in controller runtime or end-to-end deployment latency.
The deployment management platform uses Vault to perform service-to-service authentication. In particular, it uses the Identity secret engine. The basic idea is that the deployment management platform exposes an OIDC provider endpoint using Vault to export authenticated Vault-tokens as signed JSON Web Tokens (JWTs). These JWTs have custom claims that can be used to uniquely identify the caller and associate permissions. Let’s look at the user journey.
All Kubernetes agents used the same Vault authentication role. Thus a compromised Kubernetes can impersonate any other Kubernetes and fetch the manifests not meant for it.
At a very high level, we dynamically created Vault authentication roles for each new Kubernetes. We also migrated the existing Kubernetes off of the shared Vault authentication role. The biggest challenge was arranging coordination between the various actors in a backwards compatible manner.
This solution involved several actors. The following lists each of them and their role in the solution:
In addition to securing Kubernetes agent communication, we also added support for RBAC in each endpoint of the deployment management platform. This helped us restrict:
Our team is experiencing a challenging yet exciting time!
We’re beginning to see the limits of vertical scaling in our aggregation server. We’re also beginning to see some operational challenges in using Vault-based service-to-service authentication. Users are coming to us with increasingly complex services that fully embrace controller patterns and services that need target selection at runtime and across different failure domain boundaries.
This is forcing us to go back to first principles and re-architect the system from the ground up. As we go through this process, we are seeing how much our users care about our platform, which makes all of us super proud!
If managing cloud-native infrastructure at scale interests you, please reach out to Rashmi Prabhu, Decheng Dai, or Ziyang Wang—we’re hiring!
While the authors were responsible for planning and executing this migration, we received a lot of support and guidance along the way.
First and foremost, we wish to thank all Confluent teams for trusting us enough to onboard some of their critical workloads on this platform. They onboarded in a timely manner and provided tons of feedback for improving the platform as well. Second, we want to thank the founding members of this platform for their tireless efforts in developing and productionizing it. We also thank the entire leadership team and the TPM team for being enthusiastic supporters and guides. Last but not the least, we wish to thank our Kubernetes infrastructure group of teams. They worked closely with us to arrange the security fixes.
An inevitable consequence of rapid business growth is that there will be incidents: even the best, most well-planned systems will begin to fail when you expand them very quickly. The […]
Twenty years ago, the data warehouses of choice were Oracle and Teradata. Since then, growth and innovation has shifted to the cloud, and a new generation of data systems have […]