Rebuild Instead of Fixing It
Full lifecycle management for a range of data services is hard, as every data service is different and comes with its own edge cases. The automation needs to cover basic developer needs from the beginning and can then mature over time. An essential cornerstone, therefore, is a fallback for zero-day failures: failures that have not yet been seen and resolved in the automation.
While multiple strategies may fall into this category, one has proven itself many times: rebuilding a failed system instead of fixing it.
This universal strategy has a long history. In IT, it goes back to the design of the Unix operating system. A well-known exchange between Unix developer Dennis Ritchie and Multics developer Tom Van Vleck was about using a kernel panic and reboot strategy rather than writing endless error recovery code:
"I remarked to Dennis that easily half the code I was writing in Multics was error recovery code. He said, 'We left all that stuff out. If there's an error, we have this routine called panic, and when it is called, the machine crashes, and you holler down the hall, "Hey, reboot it."'" – Tom Van Vleck
Since this discussion, the concept of a kernel panic has made it into nearly every operating system there is. Its value lies in removing wasteful tasks from development by providing a much simpler fallback strategy.
Re-Building Failed Instances
A specific occurrence of this strategy, in both infrastructure as code and container systems, is to resurrect failed VMs and containers.
This behavior belongs to the basics of cloud operations. Application containers, such as those in the Cloud Foundry Application Runtime, are stateless by application design. They can easily be restarted somewhere else in the cluster.
This is an easy yet effective self-healing mechanism, which can be applied whenever a container fails its availability test.
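The self-healing loop described above can be sketched in a few lines. This is a minimal in-memory simulation, not the scheduler of Cloud Foundry or any other platform; the names `Container`, `heal`, and `is_available` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, List

_ids = count(1)  # hypothetical source of fresh container ids


@dataclass
class Container:
    app: str
    healthy: bool = True
    id: int = field(default_factory=lambda: next(_ids))


def heal(containers: List[Container],
         is_available: Callable[[Container], bool]) -> int:
    """Replace every container that fails the availability test.

    No repair is attempted: the failed container is discarded and a
    fresh one for the same app takes its place. This only works
    because the containers are assumed to be stateless.
    """
    replaced = 0
    for i, container in enumerate(containers):
        if not is_available(container):
            containers[i] = Container(app=container.app)
            replaced += 1
    return replaced


cluster = [Container("web"), Container("web", healthy=False)]
failed_id = cluster[1].id
replaced = heal(cluster, lambda c: c.healthy)
```

Note that `heal` never inspects why a container failed. That is the point of the strategy: the fallback stays the same regardless of the cause.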
For stateful containers and VMs it is a bit harder, as state needs to be persisted across re-creation of the ephemeral container or VM. This requires a separate storage backend.
For fully automated data services, the goal is to provide a simple user interface to create and maintain data services. This can be realized by giving developers access via the Open Service Broker API. Platforms such as Cloud Foundry and Kubernetes then provide a convenient command line interface (CLI) to interact with the service broker.
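To give an impression of what the platform sends to a broker on the developer's behalf, here is a rough sketch of a provisioning request as described in the Open Service Broker API specification. The broker URL, all identifiers, and the helper name `build_provision_request` are placeholders, the API version header is only an example value, and authentication is omitted. The request is built but not sent.

```python
import json
import uuid
from typing import Optional
from urllib.request import Request


def build_provision_request(broker_url: str, service_id: str, plan_id: str,
                            org_guid: str, space_guid: str,
                            instance_id: Optional[str] = None) -> Request:
    """Build (but do not send) a request that provisions a service instance."""
    # The platform chooses the instance id; a random UUID is assumed here.
    instance_id = instance_id or str(uuid.uuid4())
    body = {
        "service_id": service_id,      # which data service, e.g. PostgreSQL
        "plan_id": plan_id,            # which plan, e.g. a 3-node cluster
        "organization_guid": org_guid,
        "space_guid": space_guid,
    }
    return Request(
        # accepts_incomplete=true allows the broker to provision
        # asynchronously, which is typical when VMs must be created.
        url=f"{broker_url}/v2/service_instances/{instance_id}"
            "?accepts_incomplete=true",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Broker-API-Version": "2.14",  # example version, an assumption
            "Content-Type": "application/json",
        },
        method="PUT",
    )


req = build_provision_request(
    "https://broker.example.com", "postgresql-service-id", "cluster-plan-id",
    "org-guid", "space-guid", instance_id="my-instance")
```

Everything behind this single request, from VM creation to cluster configuration, is the broker's responsibility and stays invisible to the developer.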
For this reason, implementing the re-build strategy for data services is a bit more complicated.
The automation implementation has to hide this complexity behind the Service Broker API. This includes the provisioning of VMs, installation, configuration, backup, update and monitoring of the data service.
Those who have operated clustered data services in production environments know the possible failure scenarios and how hard they can be to recover from.
In contrast to stateless containers, anybody who operates a clustered data service has a strong interest in keeping data safe. Why else bother with the complexity of a cluster? Consequently, nobody will happily throw away a clustered data service instance such as a RabbitMQ or PostgreSQL cluster. Regardless of how well designed a backup strategy is, a backup is usually far less current than a replica node, and falling back to a backup always implies a loss of data.
It is regularly stressed that data loss is unacceptable and will not be tolerated, but in the end there is always a tradeoff between development and operation speed, cost, and the degree of data safety.
Therefore, the mission statement for data service automation is obvious: take any reasonable and adequate measure to protect data.
But at the same time, you have to be prepared for the worst case: a cluster that has failed beyond repair. In the case of such a disaster, it is essential to keep the on-demand self-service promise:
A developer can always recover their failed data service instance without external help and without administrator privileges.
This may require a hard reset of the service instance: destroying its VMs and persistent disks, rebuilding them from scratch, and restoring a backup. This fallback strategy may not be ideal, but it works at scale for thousands of data service instances and can be applied to nearly any data service type.
It represents an automatable solution that will get an application up and running again. This is a good example of how an issue can be handled at the level of an automation framework. It also underlines the importance of a solid backup and restore strategy.
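The hard reset can be sketched as follows. This is a simplified in-memory model under stated assumptions, not the implementation of any particular automation framework: `ServiceInstance`, `Backup`, and `hard_reset` are hypothetical names, and real VMs and disks are represented by plain Python values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Backup:
    taken_at: int          # e.g. a Unix timestamp
    data: Dict[str, str]   # snapshot of the service's data


@dataclass
class ServiceInstance:
    name: str
    vms: List[str] = field(default_factory=list)
    disk: Optional[Dict[str, str]] = None   # persistent disk contents
    backups: List[Backup] = field(default_factory=list)


def hard_reset(instance: ServiceInstance, node_count: int) -> Backup:
    """Destroy VMs and disk, rebuild from scratch, restore the latest backup."""
    if not instance.backups:
        # Without a backup the rebuild would yield an empty service.
        raise RuntimeError(f"no backup available for {instance.name}")
    latest = max(instance.backups, key=lambda b: b.taken_at)

    # 1. Destroy the broken VMs and the persistent disk.
    instance.vms.clear()
    instance.disk = None
    # 2. Rebuild the VMs from scratch.
    instance.vms = [f"{instance.name}-vm-{i}" for i in range(node_count)]
    # 3. Restore the backup onto a fresh disk.
    instance.disk = dict(latest.data)
    return latest


pg = ServiceInstance(
    name="pg-1",
    vms=["pg-1-vm-0"],
    disk={"orders": "v3"},   # written after the last backup
    backups=[Backup(100, {"orders": "v1"}), Backup(200, {"orders": "v2"})],
)
restored = hard_reset(pg, node_count=3)
```

In this example the disk held `v3` before the reset and holds `v2` afterwards. Everything written after the latest backup is gone, which is why this remains a last resort.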