Published on 23.03.2023
Are you building your first Kubernetes extension but have no idea how to deploy and manage its lifecycle?
In this article, we will discuss the general tasks of an operator and then dive into the details of an operator’s lifecycle. Afterwards, we will evaluate different tools that have emerged in the Kubernetes ecosystem and discuss how suitable they are for handling the lifecycle of operators at scale.
First, let’s take a step back and focus on lifecycle management and what that is.
In essence, lifecycle management aims to automate tasks arising during the life of an application, like the setup and installation, as well as typical second-day operations such as updates and migrations or regular tasks like taking a backup.
In the Kubernetes context, such cases are usually handled by the operator(s) of the application.
If you don’t know what an operator is, you can check out the CNCF whitepaper on operators. In summary, their goal is to replace a human operator by providing automation for lifecycle tasks that keep the managed application in the optimal state, as it is specified by the application developer.
A Kubernetes user should only have to determine the desired state by applying or configuring a Custom Resource on the cluster – the operator is responsible for watching those Custom Resources (CRs) and reconciling the actual state to match the desired state specified by the CR. This could, for example, mean that the operator has to create a StatefulSet representing a database.
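The reconcile idea can be sketched in a few lines of Python. This is not how a real operator talks to the API server: the `DatabaseSpec` type, the `reconcile` function, and the in-memory `cluster` dictionary are hypothetical stand-ins that only illustrate comparing the desired state with the actual state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseSpec:
    """Hypothetical desired state, as a user would declare it in a CR."""
    name: str
    replicas: int


def reconcile(desired: DatabaseSpec, cluster: dict) -> str:
    """Bring the in-memory cluster state in line with the desired state.

    `cluster` maps StatefulSet names to replica counts and stands in for
    the real Kubernetes API in this sketch.
    """
    actual = cluster.get(desired.name)
    if actual is None:
        cluster[desired.name] = desired.replicas
        return "created"
    if actual != desired.replicas:
        cluster[desired.name] = desired.replicas
        return "updated"
    return "unchanged"


cluster = {}
spec = DatabaseSpec(name="orders-db", replicas=3)
first = reconcile(spec, cluster)   # nothing exists yet, so it is created
second = reconcile(spec, cluster)  # state already matches, nothing to do
```

Calling `reconcile` repeatedly with the same spec is harmless, which is the property that lets an operator run the same logic on every change event.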
But what about the operators? They must also be installed, updated, etc., while keeping their managed application running.
This problem might seem simple for one or two operators. However, the CNCF white paper states that an operator should focus on a single application, so a real-world cluster often has a multitude of operators running.
Even worse, some operators might depend on each other; for example, your web application operator might need a database operator to function correctly. So the question we will focus on in this article will be how to operate operators at scale.
But before we focus on solutions, let’s define the challenges we want to solve by going through essential phases in the life of an operator.
A typical operator comprises a set of Custom Resource Definitions (CRDs) and the operator’s deployment. A CRD describes the structure of a data type called Custom Resource (CR), for which Kubernetes then provides an API endpoint and persistence.
The operator itself is typically deployed using just a simple Deployment or ReplicaSet since all state is stored in the CRs.
Additionally, there might also be some ConfigMaps; they could, for example, be needed to specify a backup store for your application data.
For users to get started as quickly as possible, you would like a simple way for them to create the necessary ConfigMaps and Secrets in the correct format without having to read detailed documentation.
During the life of your operators, their API – the CRs and CRDs – will most likely change, be it because you want to expose more configuration options or to simplify the overall structure.
While adding new CRD versions is mostly straightforward, deprecating and removing old versions is not as easy.
The Kubernetes documentation provides detailed information on this subject. In summary, you first mark your old version as deprecated to discourage users from creating CRs with that API version and then migrate the stored entries in etcd to the new storage version, which can be done manually or using the storage version migrator tool.
Once the migration has succeeded and you are sure there is no object left in the old version, you need to manually remove the old version from ‘status.storedVersions’ (unfortunately, Kubernetes does not do this automatically) before you can finally remove the obsolete version from the CRD.
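The bookkeeping around ‘status.storedVersions’ can be illustrated with a small Python sketch. It works on a plain dictionary shaped like a CRD object rather than on a live cluster. The helper names and the example versions `v1alpha1` and `v1` are assumptions made for illustration.

```python
def storage_version(crd: dict) -> str:
    """Return the name of the version flagged with `storage: true`."""
    for version in crd["spec"]["versions"]:
        if version.get("storage"):
            return version["name"]
    raise ValueError("CRD has no storage version")


def stale_stored_versions(crd: dict) -> list:
    """List entries of status.storedVersions other than the storage version."""
    current = storage_version(crd)
    return [v for v in crd["status"]["storedVersions"] if v != current]


def can_remove_version(crd: dict, version: str) -> bool:
    """A version may only be dropped once it is no longer a stored version."""
    return version not in crd["status"]["storedVersions"]


# Dictionary mirroring the relevant parts of a CRD (example values).
crd = {
    "spec": {"versions": [
        {"name": "v1alpha1", "served": True, "storage": False, "deprecated": True},
        {"name": "v1", "served": True, "storage": True},
    ]},
    "status": {"storedVersions": ["v1alpha1", "v1"]},
}

stale = stale_stored_versions(crd)            # versions still to be cleaned up
before = can_remove_version(crd, "v1alpha1")  # still listed as stored

# After all objects were migrated, the status has to be patched by hand.
crd["status"]["storedVersions"] = [storage_version(crd)]
after = can_remove_version(crd, "v1alpha1")
```

The sketch deliberately leaves out the migration itself. It only shows why the old version cannot be removed until the status field has been cleaned up.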
As you probably have already noticed, this sounds like a tedious task for which we would like to have automation, but be careful. When deleting a CRD version, all CRs of that version will be gone too, which could be fatal for stateful applications, like a database, since usually the Persistent Volumes will also be gone.
Another aspect to consider is that it could be necessary to perform a migration task during an upgrade, like reformatting a ConfigMap of the operator, for example, by splitting it into multiple ones. Other tasks should be performed before a migration, like taking a backup of all databases. The same could be true for other lifecycle events like a deletion, where it might also be advised to take a backup before deleting all PersistentVolumes forever.
To automate these tasks, the community came up with a variety of different tools, the two most popular ones being Helm and the Operator Lifecycle Manager (or OLM).
We will take a closer look at both of these tools in the next section and consider a recent addition to the Kubernetes toolchain, namely kapp and the kapp controller included in the Carvel project.
Helm is probably the most prominent and mature tool for packaging applications for Kubernetes; one of the reasons is that it is easy to understand and use.
It provides a CLI to help you to interact with the templating mechanism that can be used to create arbitrary Kubernetes resources. Helm also allows you to specify tasks performed before or after an installation/upgrade/deletion through so-called chart hooks, making it possible to execute, for example, a job to handle migration.
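As a rough illustration of a chart hook, the following Python sketch builds a Job manifest carrying the `helm.sh/hook` annotations and prints it as JSON, which is also valid YAML. The function name, the job name, the image, and the command are hypothetical. In a real chart, the manifest would live as a template file inside the chart.

```python
import json


def migration_hook_job(name: str, image: str, command: list) -> dict:
    """Build a Job manifest that Helm would treat as a pre-upgrade hook."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "annotations": {
                # Run before the upgrade of the release is applied.
                "helm.sh/hook": "pre-upgrade",
                # Hooks with lower weights run first.
                "helm.sh/hook-weight": "0",
                # Clean up the Job once it has succeeded.
                "helm.sh/hook-delete-policy": "hook-succeeded",
            },
        },
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


job = migration_hook_job(
    name="migrate-config",                      # hypothetical job name
    image="registry.example.com/migrate:1.0",   # hypothetical image
    command=["/migrate", "--split-configmap"],  # hypothetical command
)
print(json.dumps(job, indent=2))
```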
The significant advantages of Helm are that it is easy to use, does not require any components running on the target cluster, is supported by a wide range of tools such as Terraform, ArgoCD, or Crossplane, and can handle arbitrary Kubernetes objects.
Therefore it is one of the most commonly used tools to manage extensions for Kubernetes.
Still, Helm has some significant limitations, mainly the lack of understanding of the nature of CRDs. For example, Helm does not perform any upgrades – it does not even apply a new version of a CRD – or deletions of CRDs; they are only applied once to the cluster during installation.
While you can work around this limitation in some ways – for example, by creating a separate CRD chart, thus forcing at least the installation of new CRDs – it still does not come with automation for migrating away from the old version. If you want to learn in detail about the problems with CRDs and why Helm treats them that way, you can read more in this Helm Improvement Proposal (HIP).
Helm does not include a mechanism to detect whether dependencies are already met on a cluster. For example, an operator is often meant to be installed cluster scoped, so it watches all CRs on the cluster, regardless of their namespace.
In Helm, dependencies are always applied as a new Helm release, so a new instance of the operator will be created each time it is included as a dependency.
For operators not designed to support such a deployment, this can cause serious issues. It is possible to build your own logic, as is done in the Hypper project. Still, it is desirable to have this awareness of operators already installed in the cluster built into your tooling, eliminating the risk of flaws and the need to install any Helm plugins – which are often not supported by tools like ArgoCD.
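What such a dependency check could look like is sketched below in Python. It assumes, purely for illustration, that an operator counts as installed when all CRDs it owns are present on the cluster. This is neither how Hypper nor how any other tool mentioned here actually decides. All operator and CRD names are hypothetical.

```python
def missing_operators(required: dict, installed_crds: set) -> list:
    """Return the required operators whose CRDs are not all on the cluster.

    `required` maps a (hypothetical) operator name to the CRD names it owns;
    `installed_crds` is the set of CRD names found on the cluster.
    """
    return sorted(
        operator
        for operator, crds in required.items()
        if not set(crds) <= installed_crds
    )


# CRDs that are already present on the (imaginary) cluster.
installed = {"postgresclusters.db.example.com"}

# Dependencies declared by the package we want to install.
required = {
    "postgres-operator": ["postgresclusters.db.example.com"],
    "backup-operator": ["backups.ops.example.com"],
}

to_install = missing_operators(required, installed)
```

Only the operators returned by `missing_operators` would need a new installation. The ones already present are reused instead of being deployed a second time.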
As its name suggests, the OLM was designed to handle the challenges of operating operators.
In contrast to Helm, it comes with in-cluster components responsible for automatically satisfying the dependencies of an operator – under the premise that all dependencies are available as OLM packages – installing the operator to the cluster, and keeping it up to date.
Here the user triggers the deployment by creating a Subscription CR for the particular operator. This CR also allows you to, for example, enable automatic upgrades or to provide some configuration to the operator by specifying environment variables or volume mounts.
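To give an impression of such a Subscription, the following Python sketch assembles one as a dictionary and prints it as JSON, which is also valid YAML. The helper function is hypothetical, and the package, channel, and catalog names are example values that depend on the catalogs available on your cluster.

```python
import json


def subscription(package: str, namespace: str, channel: str,
                 catalog: str, catalog_namespace: str,
                 env: dict = None, automatic: bool = True) -> dict:
    """Build a Subscription CR for the given operator package."""
    spec = {
        "name": package,
        "channel": channel,
        "source": catalog,
        "sourceNamespace": catalog_namespace,
        # "Automatic" lets the OLM apply upgrades without manual approval.
        "installPlanApproval": "Automatic" if automatic else "Manual",
    }
    if env:
        # Environment variables passed on to the operator's deployment.
        spec["config"] = {
            "env": [{"name": k, "value": v} for k, v in env.items()]
        }
    return {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        "metadata": {"name": package, "namespace": namespace},
        "spec": spec,
    }


sub = subscription(
    package="my-database-operator",   # example package name
    namespace="operators",
    channel="stable",
    catalog="operatorhubio-catalog",  # example catalog source
    catalog_namespace="olm",
    env={"LOG_LEVEL": "debug"},       # example operator setting
)
print(json.dumps(sub, indent=2))
```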
In contrast to Helm, the update mechanism also includes the update and deprecation of CRDs, including automated checks for stored versions, making it a more suitable framework for automating the whole lifecycle management of operators. If your operator was created with the Go Operator SDK, creating an OLM package is already heavily automated.
The drawback of the OLM approach is that it is only meant for operators, not tailored for different workloads like, for example, deploying DaemonSets.
Moreover, you lose the convenient templating mechanism, making it harder for end users to manage the configuration of an operator. For example, if your operator needs a secret for a backup store in a specific form, there is no way in the OLM to convey this to the user.
Additionally, the OLM currently does not come with a mechanism comparable to chart hooks, making it harder to perform migration tasks, although they can still be done, for example, by init containers of the operator.
The upgrade of the CRs to a new version has to be done manually. Once all CRs are migrated, though, OLM will update the ‘status.storedVersions’ field during the upgrade.
If you have worked with Kubernetes, you will most likely have heard of Helm, and maybe also of the Operator Lifecycle Manager. kapp and the kapp controller, on the other hand, will most likely be new to you.
They are part of the VMware Tanzu Carvel project, which has recently been made open source and is not yet widely adopted.
The kapp CLI, similarly to Helm, resides on your local computer and can be used to apply manifests to a cluster. Here you can use kustomize, ytt – Carvel’s own templating mechanism – or even plain static YAML files.
In a diff stage, the CLI first computes the changes that must be made to the cluster and, if the user confirms, applies them. In contrast to Helm, kapp can also update immutable fields, like labels in a StatefulSet, by using its replace strategy to delete the object and create a replacement.
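The general idea of a diff stage followed by a confirmed apply can be sketched as follows. This is a simplified Python illustration and not kapp’s actual algorithm. The function names and the resource identifiers are hypothetical, and the cluster is again just a dictionary.

```python
def plan_changes(desired: dict, live: dict) -> dict:
    """Compare desired and live resources and name the required action."""
    changes = {}
    for name in sorted(desired.keys() | live.keys()):
        if name not in live:
            changes[name] = "create"
        elif name not in desired:
            changes[name] = "delete"
        elif desired[name] != live[name]:
            changes[name] = "update"
    return changes


def apply_changes(desired: dict, live: dict, confirmed: bool) -> dict:
    """Return the new cluster state; nothing changes without confirmation."""
    return dict(desired) if confirmed else dict(live)


# Resources currently on the (imaginary) cluster.
live = {
    "deployment/operator": {"image": "operator:1.0"},
    "configmap/legacy": {"mode": "old"},
}
# Resources described by the manifests to be applied.
desired = {
    "deployment/operator": {"image": "operator:1.1"},
    "crd/databases": {"version": "v1"},
}

plan = plan_changes(desired, live)
unchanged = apply_changes(desired, live, confirmed=False)
applied = apply_changes(desired, live, confirmed=True)
```

Showing the plan before anything is applied gives the user a chance to spot an unintended deletion, which matters especially for CRDs and other stateful resources.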
For Carvel to lifecycle manage your application on its own, you will have to install the kapp controller to your cluster.
Similar to the OLM, users express the desire to install an application via a CR, in this case the App CR. The difference is that it also supports embedding values for templating mechanisms like Helm, ytt, and kustomize, which will be used during the deployment.
This makes customizing the operator installation a lot easier by allowing the user to set all exposed template values, and in contrast to Helm, kapp can also install and update CRDs.
Still missing, on the other hand, are a dependency management system and an open source registry like the Artifact Hub for Helm or the Operator Hub for the OLM. Another thing to consider is that the project is not widely supported by other tools like Argo, probably because it still has a small but growing community.
As always, there is not one single tool to recommend when considering the lifecycle of your Kubernetes extension. It heavily depends on what you are building, how much you plan to automate it, and how your automation will be used in the end.
If you just want to deploy simple applications to a cluster, Helm is easy to recommend and can be learned quickly.
But if you have just built an operator with the Operator SDK, you can start creating OLM packages using the provided Makefile. Here you get the possibility of having a fully automated lifecycle, keeping the deployments of your operator up to date without involving a Platform Operator.
On the other hand, if your operator comes with many configuration options and you want an easy mechanism to run tasks before and after lifecycle events, which also supports continuous deployment and automated updates, it’s worth looking at kapp.
Of course, this is not the only approach to lifecycle management of complex frameworks, nor is it a complete list of tools, and there might be other things to consider. For example, if you are using GitOps, you might be fine just sticking to YAML files and kustomize in combination with a continuous deployment pipeline in ArgoCD or Flux. But we hope you now have a better understanding of what to consider and a place to start.
What are the challenges you encountered while attempting to automate the lifecycle of your Kubernetes extensions? What tools are you using? Hit us up with your thoughts.