In the first part of the blog post series about “How to build a PostgreSQL Cloud Foundry service” we talked about the design goals and the design decision that had to be made.
Now we are going to have a look on the automation technology (BOSH) we have chosen and the requirements which led to this decision.
First we had to answer the question – which automation technology to use?
There’s a list of proven and upcoming promising technologies such as:
After using Bosh to operate open source Cloud Foundry we’ve learned a lot about the pro’s and cons of Bosh.
Even with a Chef team available within our company Bosh was clearly the first choice; which will be explained in more detail in the following.
With the given requirements and top priority of being infrastructure agnostic, an automation technology is required that comes with a clean separation of the following concerns:
VM and persistent disk management
- Initial creation but also changes to VMs and persistent disks
- Operating system independence
- With this OS independent description of software packages
- Generic description of distributed systems
- Deployment topology to specify concrete deployments
- Enhanced monitoring and self-healing capabilities
- Enhances lifecycle management over fire and forget deployment strategies
- Bosh does all of that
- Bosh’s infrastructure independence has been proven many times
- Pivotal VMware → AWS
- anynines VMware → OpenStack → AWS
Using Bosh also fulfills the requirement that all resulting data services built using the service framework, in this case the PostgreSQL, will run wherever Cloud Foundry runs. This is because wherever Cloud Foundry runs, there’s a Bosh CPI and that is the Bosh Cloud Provider Interface – where the infrastructure support is located.
Bosh and Cloud Foundry also share many architectural thoughts.
Where Cloud Foundry implements the twelve factor manifest for stateless apps, Bosh solves the harder problem for stateful applications and services still being very close to the twelve factor manifesto.
The comprehensive approach of Bosh taking care of the entire lifecycle of a distributed system with a single technology is extremely handy in the given use case of provisioning a large set of PostgreSQL clusters.
Bosh takes care of the creation of virtual machines, the creation of persistent disks and most part of the networking, enabling on-demand provisioning of PostgreSQL server VMs.
Bosh also facilitates the scale-out of a given PostgreSQL server or cluster.
Described in a single Bosh Release it’s just another Bosh manifest to be generated and deployed to make Bosh increase the corresponding virtual machines and taking care of scaling CPU, memory and – most importantly – persistent disks. In an OS and Infrastructure independent manner!
Bosh’s way to wrap software packages by using compile scripts in Bosh releases may seem a little tedious in the beginning but enables a proper OS independence. No if-else-clauses needed to support multiple operating systems.
Not only Bosh helps with the creating of virtual machines – our PostgreSQL server and cluster VMs – more than that, it also helps to keep them running.
Bosh’s introduces multiple levels of self-healing.
Whenever a PostgreSQL server process dies, the built-in Monit coming along with each Bosh provisioned VM will automatically restart PostgreSQL. All you have to do is to provide a proper monit script as part of the corresponding Bosh release.
Whenever a Monit process should die on a Bosh provisioned VM the Bosh agent running alongside will recognize this and restart the Monit.
Should a Bosh agent fail, the Bosh heal monitor will recognize the missing heart beats from the agent and restart the entire VM.
This also covers the scenario of a failed VM causing the same absence of Bosh agent heartbeats and thus also the resurrection of the corresponding VM.
So while there are multiple options to automate the provisioning of dedicated service instances, Bosh is an excellent fit!
With Bosh being the preferred automation solution, the question on whether to rely on VMs or containers to isolate PostgreSQL server VMs from another is somehow obsolete.
It is obsolete because this is now a Bosh responsibility to whether provide a proper container support or not. All the PostgreSQL service will do is talking to Bosh.
Currently this – for production – implies the usage of virtual machines provided by the infrastructure of choice.
Data Redundancy / Clustering
Along with the decision about a shared vs. dedicated deployment strategy, this is probably the second most important decision.
It is about answering the question on whether a single replica of your data is enough or you rather want to go with a data redundancy and replicate data across several nodes?
With Bosh being in place one might argue that Bosh resurrection will recreate a failed database server and this is true. Bosh does a great job when it comes to repairing a failed VM.
However, there Bosh resurrection has limitations:
Most obviously, the creation of a new VM takes multiple minutes not seconds. For some production databases, a multi-minute downtime outside of scheduled maintenance windows, is already a disaster scenario.
Also a PostgreSQL database server could be corrupted on the filesystem level. In such a case Bosh won’t be able to recover as the damage has happened on a layer outside of Bosh’s scope.
Both issues can be addressed by replacing a PostgreSQL single VM server with a multi-VM PostgreSQL cluster using data replication.
With replication multiple servers, often three, will have the same or similar data set at any given time. So whenever the master database goes away, there’s a slave server being ready to be promoted within seconds. A much more appealing time to repair.
In case the failed server has been taken down entirely, let’s say caused by the failure of a physical host, Bosh’s resurrection can be used to recover the cluster from degraded mode (two instead of three nodes), automatically.
The corrupt filesystem scenario can also be covered by switching to a slave but requires some intelligence in failure detection to work fully automatically.
As already mentioned database replication can be used to create and maintain “copies” of a database by synchronizing all data changing events to other PostgreSQL database servers.
PostgreSQL comes with a built-in replication feature and there’s variety of replication tools to support this. One of the essential decisions to be made around replication is the question about the replication method which can be either synchronous or asynchronous.
With synchronous replication a majority of cluster nodes have to confirm a transaction before the transaction is applied to the dataset. This keeps the dataset across the cluster in sync. The major and name giving characteristic of synchronous replication.
As postulated in the CAP theorem, this results in a drawback in partition tolerance. Especially in case of unexpected network latencies or network partitioning a synchronous replicating cluster cause SQL transaction to be heavily delayed or even turning the cluster into split brain situation.
Therefore, synchronous replication should be preferred in environments with a constantly low network latency as well as high data consistency requirements.
Depending on the implementation another benefit of synchronous replication can be a multi-master scenario where load balancing of reads and writes can be balanced across all database nodes.
With asynchronous replication there is a master node, the PostgreSQL database master server. All applications will write to and often read from the master. A load balancing across the cluster nodes is often not meaningful while read-only operations such as creating backups can be run against slave database servers.
In an asynchronously replicated scenario a cluster manager is required determining the current master node. This ensures that all writes go to the master whose changes then will be replicated across the number of slaves avoiding data conflicts.
As the async replicated PostgreSQL server does not have to wait for other servers to confirm, the delay of SQL transaction is very low. Consequently, writes to slaves occur later than writes to the master resulting into a difference in data – the so called replication lag.
The replication lag is the number of transactions a slave is behind its master before the asynchronous replication has catched up with the master.
This behaviour results in the disadvantage of a potential data loss in case of a failing master but also into the advantage of being able to replicate across large distances and across networks with high latency. In many situation it is better to have an old copy of the data than to have none at all.
Failure Detection, Leader Election and Automatic Failover
With both replication approaches two nodes won’t be optimal. In asynchronous replication a majority of servers is necessary to decide upon the success of a transaction and in order to receive some degree of fault-tolerance a third server is necessary to have a majority in case one node fails. Otherwise, in the case of a network partition each node might try to promote himself to a master.
This is similar for asynchronous replication where the election of a new master is also based on a majority based quorum.
The implication of using quorum based cluster strategies reaches down to the infrastructure level. A meaningful separation of infrastructure into at least three availability zones is required. This ensures that the failure of a single physical node – or even an entire availability zone – won’t take down a PostgreSQL cluster while maintaining the possibility to perform majority based leader election.
For the PostgreSQL service asynchronous replication has been chosen, as it was planned to provide a synchronously replicated RDBMs with MySQL / MariaDB Galera in a different anynines data service later.
Although PostgreSQL comes with built-in replication facilities, an additional component is used to provide advanced support for PostgreSQL’s replication, particularly when it comes to detecting and performing automatic failovers.
While there are several options the repmgr turned out to be a good choice as it is – in contrast to other cluster managers – fairly automation friendly. One of repmgr’s main responsibilities is to perform the above mentioned leader election.
The repmgr allows to configure a so called promote script to elect a new master database server.
In the context of Cloud Foundry apps the promotion of a new master leads to the problem of an ip address change as the potential new master PostgreSQL server has a different IP-address than the old one. So the challenge is how apps bound to the corresponding database will be informed – or even better – reconfigured to point to the new database master.
One way to achieve this in an infrastructure independent manner is to make use of a clustered consul acting – among other useful functions – as a highly available and REST API programmable DNS server.
This way the PostgreSQL service broker won’t return a master db server IPaddress but an internal master DB server DNS name, instead.
The promote script triggered by the repmgr will then talk to consul and update the master DB DNS record accordingly, pointing to the new master’s IP address.
In our last part of the blog post series we will give you an idea about the Cloud Foundry User Experience the architecture and also details about what we’ve learned by building a production grade PostgreSQL Cloud Foundry service. So stay tuned.