Kubernetes has long established itself as the de facto standard for orchestrating containers, but its influence goes far beyond that. Its architectural paradigm, centered on a declarative API and a control plane built around reconciliation loops, has fundamentally shaped how cloud-native infrastructure is designed and operated.
This control plane model, where controllers continually work to align the actual state of the world with the user’s desired state, has proven to be general and powerful enough to manage not only containers but also virtual machines, data services, networking, and more. In fact, the Kubernetes ecosystem has exploded with custom controllers and APIs managing all kinds of infrastructure.
However, while building a basic custom controller is relatively straightforward thanks to modern libraries like Kubebuilder or Operator SDK, production-grade controller design remains complex. One often-overlooked challenge is that the “real world”—the external systems your controller interacts with—can change independently of Kubernetes. In these cases, your controller must be aware of state drift, even if no Kubernetes API change has occurred.
At anynines, we’ve been building a control plane for managing the lifecycle of data services via custom controllers. Along the way, we’ve run into this very challenge: when your controller interacts with external systems, simply reacting to Kubernetes API events is not enough. The system must also detect and respond to changes in the real state of the world.
Let’s dig into the problem of handling external state drift in Kubernetes control planes, and explore several patterns and design options to address it.
The External State Problem: Why Kubernetes Controllers Fail at Self-Healing
To understand why your Kubernetes controller isn’t truly self-healing, and how to fix it, let’s start with how controllers work. A typical Kubernetes controller (custom or not) is a loop that:
- Watches Kubernetes API objects.
- Reconciles the actual system state to match the desired state described by those objects.
The reconciliation process can involve mutating other Kubernetes resources or causing side effects in external systems. For instance, a controller might translate a custom Backup resource into a Kubernetes Job, or it might directly call an external API to create a database user.
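In controller-runtime terms, this loop boils down to a Reconcile function. Here is a minimal sketch, assuming a hypothetical backupv1.Backup API type scaffolded with Kubebuilder:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	backupv1 "example.com/backup-operator/api/v1" // hypothetical API package
)

// BackupReconciler reconciles the hypothetical Backup custom resource.
type BackupReconciler struct {
	client.Client
}

// Reconcile is invoked whenever a watched Backup object changes.
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var backup backupv1.Backup
	if err := r.Get(ctx, req.NamespacedName, &backup); err != nil {
		// The object may have been deleted; nothing to do in that case.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Compare the desired state (backup.Spec) with the actual state,
	// then create or update a Job, or call an external API, as needed.
	return ctrl.Result{}, nil
}
```

Note the trigger: Reconcile only runs when the watched Kubernetes object changes (or on an explicit requeue), a detail that becomes important below.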
If your controller only mutates Kubernetes resources, the system benefits from the built-in watch and reconciliation infrastructure. Kubernetes ensures your controller is notified of changes, and reconciliation logic can fix drift.
However, when your controller manages resources outside the Kubernetes API (cloud infrastructure, databases, or legacy systems, for example), those resources may change independently without triggering any Kubernetes event. That’s a problem. If your controller only watches Kubernetes, it will never know that the external state has changed, and therefore cannot reconcile it.
Let’s walk through an example.
Example: Container Crashes Outside the Controller’s Awareness
Imagine a custom controller that launches a container (not using Kubernetes Pods, just directly using Docker, for example) in response to a custom resource. If the container crashes due to a bug or external condition, but the Kubernetes API object hasn’t changed, your controller receives no notification. It doesn’t know anything went wrong—and can’t fix it. This breaks the self-healing promise Kubernetes usually provides.
This scenario illustrates a broader truth: controllers that manage external systems need to watch more than just the Kubernetes API; they also need visibility into the real world.
Patterns and Strategies for Handling State Drift
1. Avoid Managing External State If Possible
The simplest (and best, if feasible) option is to avoid managing external state altogether. If your controller only interacts with built-in Kubernetes resources, reconciliation remains fully inside the Kubernetes ecosystem and the API server takes care of notifying you about changes.
For example, a custom controller that creates StatefulSets or Jobs doesn’t need to worry about external drift. You benefit from Kubernetes’ existing control loops.
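Reusing the hypothetical BackupReconciler from the earlier sketch, staying inside the Kubernetes API can look like this: by declaring ownership of the Jobs it creates, the controller is automatically notified whenever one of them changes.

```go
package controllers

import (
	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	backupv1 "example.com/backup-operator/api/v1" // hypothetical API package
)

// SetupWithManager wires the controller to watch both the Backup
// resource and the Jobs it owns. Because everything lives inside the
// Kubernetes API, the watch infrastructure reports all drift.
func (r *BackupReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&backupv1.Backup{}).
		Owns(&batchv1.Job{}). // reconcile whenever an owned Job changes
		Complete(r)
}
```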
However, many real-world use cases (like data service automation or infrastructure provisioning) require reaching outside the cluster.
2. Periodic Reconciliation with Idempotent Operations
If your controller’s actions are idempotent, you can periodically re-reconcile resources regardless of API changes. Kubernetes controller libraries often support periodic reconcile triggers (e.g., RequeueAfter in Kubebuilder).
This is simple to implement, and if external state has drifted, the next scheduled reconciliation can fix it. If not, the operation is a no-op.
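A minimal sketch of this pattern with controller-runtime; the hypothetical ensureExternalUser helper stands in for whatever idempotent operation your controller performs:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// DatabaseUserReconciler is a hypothetical reconciler; its Kubernetes
// and external clients are elided here.
type DatabaseUserReconciler struct{}

// ensureExternalUser stands in for an idempotent operation: it creates
// the external user if it is missing and is a no-op otherwise.
func (r *DatabaseUserReconciler) ensureExternalUser(ctx context.Context, req ctrl.Request) error {
	return nil // elided
}

func (r *DatabaseUserReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.ensureExternalUser(ctx, req); err != nil {
		return ctrl.Result{}, err
	}
	// Requeue unconditionally: if external state has drifted, the next
	// run repairs it; if not, the idempotent operation is a no-op.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
```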
Pros:
- Easy to implement
- No need to detect drift explicitly
Cons:
- Wastes compute cycles on unnecessary reconciliations
- Choosing the right interval is tricky (too frequent = waste; too infrequent = delayed healing)
3. Explicitly Poll Real State and Trigger Reconcile
Instead of reconciling blindly, you can build logic that polls external state and only triggers reconciliation if a mismatch is detected.
There are two flavors of this pattern:
- Inline Polling: As part of each reconciliation, the controller polls the real world and compares it to the desired state (see the sketch after this list).
- Sidecar Poller: A separate background process periodically checks the real state and enqueues reconciliations only when necessary.
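Here is a rough sketch of the inline-polling flavor, assuming a hypothetical dsv1.DatabaseUser resource and an ExternalClient interface to the managed system:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	dsv1 "example.com/ds-operator/api/v1" // hypothetical API package
)

// ExternalClient is an assumed interface to the managed system.
type ExternalClient interface {
	UserExists(ctx context.Context, name string) (bool, error)
	CreateUser(ctx context.Context, name string) error
}

type DatabaseUserReconciler struct {
	client.Client
	External ExternalClient
}

func (r *DatabaseUserReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var user dsv1.DatabaseUser
	if err := r.Get(ctx, req.NamespacedName, &user); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Poll the real state and act only when it diverges from the spec.
	exists, err := r.External.UserExists(ctx, user.Spec.Name)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !exists {
		// Drift detected: the user vanished outside Kubernetes.
		if err := r.External.CreateUser(ctx, user.Spec.Name); err != nil {
			return ctrl.Result{}, err
		}
	}

	// Check again later; drift between polls would otherwise go unseen.
	return ctrl.Result{RequeueAfter: 2 * time.Minute}, nil
}
```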
Pros:
- More efficient than blind periodic reconciliation
- Keeps external state checks decoupled
Cons:
- Requires additional logic and maintenance
- Polling intervals still matter
4. Listen to External Change Events (“Real State Notifications”)
If the external system supports it, your controller can subscribe to notifications directly from that system.
Example: A controller managing PostgreSQL users might subscribe to PostgreSQL’s native LISTEN/NOTIFY mechanism or use logical decoding. When it receives a change event (e.g., a user was deleted), it triggers a reconciliation for the relevant resource.
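To illustrate, here is a minimal standalone sketch using the lib/pq driver; the user_changes channel and its payload format are assumptions about how NOTIFY triggers might be set up on the database side:

```go
package main

import (
	"log"
	"time"

	"github.com/lib/pq"
)

func main() {
	// Placeholder DSN; a real controller would read this from config.
	connStr := "postgres://user:pass@localhost/db?sslmode=disable"

	listener := pq.NewListener(connStr, 10*time.Second, time.Minute,
		func(ev pq.ListenerEventType, err error) {
			if err != nil {
				log.Printf("listener event: %v", err)
			}
		})
	defer listener.Close()

	// "user_changes" is an assumed channel fed by NOTIFY triggers
	// installed on the relevant tables.
	if err := listener.Listen("user_changes"); err != nil {
		log.Fatalf("LISTEN failed: %v", err)
	}

	for n := range listener.Notify {
		if n == nil {
			// lib/pq sends nil after a reconnect: notifications may have
			// been missed, so a full resync would be appropriate here.
			continue
		}
		// n.Extra carries the NOTIFY payload, e.g. the affected user name.
		// A real controller would enqueue a reconcile request for that
		// resource instead of just logging it.
		log.Printf("external change on %s: %s", n.Channel, n.Extra)
	}
}
```

In a full controller, these notifications would be translated into reconcile requests (for example via controller-runtime’s channel source), merging the external event stream with the regular Kubernetes watches.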
Pros:
- Highly responsive
- No unnecessary polling or reconciliation
Cons:
- Requires the external system to support change events
- Adds complexity in handling dual event sources (Kubernetes + external system)
5. Use a Poller That Updates the API Status
If direct event subscriptions aren’t possible, you can deploy a dedicated poller component that compares external state to desired state and updates the .status field of the corresponding Kubernetes object.
This status change triggers a notification that the controller can respond to.
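A sketch of such a poller, again assuming the hypothetical dsv1.DatabaseUser type plus a Status.ExternalState field for recording the observed state:

```go
package poller

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	dsv1 "example.com/ds-operator/api/v1" // hypothetical API package
)

// Poller periodically compares the external system with the desired
// spec and records the observation in .status.
type Poller struct {
	Client   client.Client
	External interface {
		UserExists(ctx context.Context, name string) (bool, error)
	}
}

func (p *Poller) pollOnce(ctx context.Context, user *dsv1.DatabaseUser) error {
	exists, err := p.External.UserExists(ctx, user.Spec.Name)
	if err != nil {
		return err
	}

	observed := "Missing"
	if exists {
		observed = "Present"
	}
	if user.Status.ExternalState != observed {
		user.Status.ExternalState = observed
		// Writing status is what notifies the controller: it watches
		// the API object, not the external system.
		return p.Client.Status().Update(ctx, user)
	}
	return nil // no drift, no event, no reconciliation
}
```

Because the controller already watches the custom resource, no extra wiring is needed: the status write itself is the notification.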
Pros:
- Retains Kubernetes-native workflow (controller only watches API)
- Centralizes drift detection logic
Cons:
- Introduces an extra component to maintain
- Update frequency still needs tuning
Conclusion
In a Kubernetes-native world, reconciling desired and real state is at the heart of infrastructure automation. But when your controllers reach beyond Kubernetes to manage real-world systems, they must account for drift that Kubernetes can’t see.
If you’re building production-grade controllers for infrastructure, databases, or legacy systems, it’s not enough to watch the Kubernetes API. You need visibility into real-world state and mechanisms to trigger reconciliation when that state changes independently.
Whether through idempotent periodic reconciliations, real-world polling, or external event subscriptions, your design should ensure the system remains truly self-healing.
Being aware of this issue and choosing the right mitigation strategy is critical to building resilient, production-ready control planes.