In this first part of a blog post series I want to guide you through the journey of building a production-grade PostgreSQL Cloud Foundry service. The goal of this post is to share the experience the anynines team has collected over years of building production-grade backing services. PostgreSQL in particular is a good example as it was never designed to be fully automated.
Challenges such as setting up proper replication, detecting failure scenarios and performing automatic failovers come with many pitfalls for beginners. Learn about production requirements and technical challenges, and see an exemplary solution architecture.
How we came to the problem
The anynines journey started back in 2013 with bootstrapping a Cloud Foundry based Platform as a Service (PaaS) offering called anynines. By now the company has turned into a Cloud Foundry consultancy offering training, development, consulting and operations.
With the anynines PaaS we decided to strongly emphasize the requirement of being independent from any infrastructure technology.
Therefore, the Amazon Web Services (AWS) data ecosystem has never been an option for us.
We wanted to be able to move to any infrastructure at any time and allow growing customers to go on-premise with the platform they know. In the latter case, the customer should not have to adapt to other data services but see the same or at least similar platform privately and publicly.
In the later stage of the anynines PaaS, the platform ran on an open source OpenStack installation, where we saw many infrastructure issues. The platform and data service layer must be smart, redundant and robust. No ready-made solution was in sight, so we started building our own.
A diversity of platform environments leads to a diversity in the notion of “production grade” and thus to the need for customized solutions.
A public PaaS on a less stable infrastructure is a good learning environment for building data services.
PaaS users are unknown. They might even be malicious. So tenant isolation is tested intensively.
The following requirements have been identified as relevant and taken into account during the design of the anynines PostgreSQL service.
The PostgreSQL solution may not depend on infrastructure specific functionality.
In case infrastructure-specific behaviour cannot be avoided, a pluggable, infrastructure-specific strategy is to be applied.
Reusable Service Framework
An RDBMS such as PostgreSQL is surely one of the most frequently requested data services. However, other data services need to be provided, too. The automation logic used for PostgreSQL should therefore be highly reusable and part of an anynines Service Framework instead of being an isolated solution.
High availability & Robustness / Fault tolerance / Self healing
Infrastructures heavily vary in their quality of service. High-end hardware and Infrastructure as a Service (IaaS) live-migration features should not be mandatory. The data service should be able to run on fair-enough-quality infrastructures, such as AWS.
Occasionally failing virtual machines, for example, should be recovered as quickly as possible and have a minimum impact on PostgreSQL database instances. This should cover both failures caused by the infrastructure and from within the virtual machine, such as a kernel panic.
Data service processes, e.g. the PostgreSQL server, may fail and should be restarted automatically to overcome simple-to-repair failure scenarios.
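The restart logic behind this requirement is essentially a watchdog loop. In the real service this job is delegated to a process monitor (e.g. monit as deployed by BOSH), but a minimal sketch of the idea looks like this; the command line and restart limits here are hypothetical:

```python
import subprocess
import time

def supervise(cmd, max_restarts=5, backoff=1.0):
    """Keep a data service process running: restart it whenever it exits.

    Minimal watchdog sketch; real deployments use a process monitor
    such as monit instead. `cmd` is the command line to run.
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        proc.wait()                  # block until the process dies
        restarts += 1
        if restarts > max_restarts:
            # simple restarts did not help; escalate the failure
            raise RuntimeError(f"{cmd[0]} keeps failing after {max_restarts} restarts")
        time.sleep(backoff)          # brief back-off before restarting
```

The escalation path matters: if a process keeps crashing, blindly restarting it forever hides the incident instead of surfacing it to the operator or a higher-level failover mechanism.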
Data redundancy should be applicable optionally so that the Cloud Foundry user may decide upon a tradeoff between operational costs and improved robustness.
Data redundancy also helps to decrease time-to-repair significantly when used in conjunction with automatic failure detection and failover, but comes at the price of increased infrastructure resource consumption, e.g. by using multiple virtual machines (VMs) to form a PostgreSQL master/slave cluster.
After virtual machines of a cluster have failed, the corresponding clusters should be recovered from degraded mode automatically.
Isolation & Security
On a public PaaS, fairly unknown customers are operated right next to each other. They are potentially neither known to the platform provider nor to neighbouring platform customers. For this reason, strong tenant isolation is required to contain both accidental and intentionally harmful behaviour to the scope of the corresponding customer.
From the platform provider’s perspective
The PostgreSQL service should be able to deliver database instances to a growing number of customers. Ideally, the infrastructure represents the sole bottleneck.
From the platform user’s perspective
Each PostgreSQL service instance should be scalable to grow along with the corresponding app(s).
Developer friendliness & Accessibility
A remotely managed database with no direct vm access may become a burden to developers.
Hence, some degree of remote accessibility must be provided, such as temporary port forwardings that open end-to-end encrypted tunnels to access the database instance from remote machines.
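Such a temporary port forwarding boils down to an SSH local forward through a reachable jump host. A sketch of how the forwarding command could be assembled; all host names and ports are hypothetical placeholders:

```python
def tunnel_command(jump_host, service_host, local_port=63306, remote_port=5432):
    """Build an SSH command that forwards a local port to a service instance.

    Connecting to localhost:<local_port> then reaches PostgreSQL on
    <service_host>:<remote_port> through an end-to-end encrypted tunnel.
    -N: forward only, run no remote command; -L: local port forwarding.
    """
    return ["ssh", "-N", "-L", f"{local_port}:{service_host}:{remote_port}", jump_host]
```

The resulting list can be handed to subprocess.Popen; the developer then points psql or any client at localhost:63306 as if the database were local.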
Automation & On-demand self-service
The database lifecycle management aims to be automated as far as possible in order to reduce manual operations effort to a minimum over time. A platform user should be able to provision, bind, unbind or deprovision PostgreSQL database instances at any given time using the CF service interface.
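Each of these user actions maps onto a REST call that Cloud Foundry issues against the service broker (the v2 service broker API). A rough sketch of that mapping, with request bodies omitted and the :id placeholders standing in for the actual instance and binding GUIDs:

```python
# Mapping of CF user commands to the broker's v2 REST endpoints (simplified).
LIFECYCLE = {
    "cf create-service": ("PUT",    "/v2/service_instances/:id"),
    "cf bind-service":   ("PUT",    "/v2/service_instances/:id/service_bindings/:bid"),
    "cf unbind-service": ("DELETE", "/v2/service_instances/:id/service_bindings/:bid"),
    "cf delete-service": ("DELETE", "/v2/service_instances/:id"),
}
```

Whatever automation sits behind the broker, these four calls are the entire contract the platform user ever sees.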
There is a variety of ways to design software fulfilling these requirements, but looking at the Cloud Foundry ecosystem from both a technological and a methodological perspective, many decisions fall into place.
Dedicated vs. Shared
The most important decision: what is your service instance going to be?
One cluster for all service instances vs. dedicated virtual machine(s) per service instance.
With a shared database approach, the service broker provisions a database within a single shared database server for each service instance. With a dedicated database approach, the service broker provisions a dedicated server per service instance.
In a shared scenario, a single PostgreSQL server or PostgreSQL cluster would be deployed by the PostgreSQL service broker. Service instances would be represented by separate PostgreSQL databases within the shared PostgreSQL database server/cluster. Service instance = PostgreSQL database.
Advantages:
- Easier to implement in the first place
- Uses less infrastructure resources
Drawbacks:
- Separate databases within a single PostgreSQL database server are not entirely isolated, leading to weak noisy-neighbor protection
- Visibility of neighboring databases → Privacy flaw
- Tenants share a single cache → Performance issue
- Shared disk I/O and CPU → Performance issue
- The failure of a single PostgreSQL server has impact on a large number of PostgreSQL database instances and thus on a large number of applications and platform users. → Large impact → Critical incident
When using a single shared PostgreSQL server or cluster, the number of service instances is limited by the number of databases that can be provisioned on a single server. The server in turn is limited by the vertical scale of a virtual machine.
A solution to this limitation would require the provisioning of multiple PostgreSQL database servers. But this would lead to significantly increased overhead caused by the complexity of managing the distribution of service instances across the available database servers. The task is similar to the placement challenge between apps and app runners, with the difference that the database has to deal with state.
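To make the placement challenge concrete, here is a deliberately naive sketch of the decision a multi-server shared broker would have to make on every create-service call. The inventory structure is hypothetical; a real broker would track this state persistently and also consider I/O load, not just disk:

```python
def place_database(servers, required_gb):
    """Pick the shared server with the most free capacity for a new database.

    `servers` is a hypothetical inventory:
    {name: {"capacity_gb": int, "used_gb": int}}.
    Returns the chosen server name, or None if no server has room.
    """
    candidates = [
        (name, s["capacity_gb"] - s["used_gb"])
        for name, s in servers.items()
        if s["capacity_gb"] - s["used_gb"] >= required_gb
    ]
    if not candidates:
        return None  # no server fits -> a new shared server must be provisioned
    # worst-fit heuristic: spread instances to balance load across servers
    return max(candidates, key=lambda c: c[1])[0]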
In case the requirement for small and very cheap PostgreSQL service instances is not a top priority, a dedicated approach should be considered.
With a PostgreSQL service implementing the dedicated strategy, the corresponding service broker returns entire PostgreSQL servers or PostgreSQL clusters (represented by either a single or multiple virtual machines) as service instances.
Drawback: Service instances represented by VM-based PostgreSQL servers or clusters are more costly as they consume more infrastructure resources.
Advantage: However, this apparent drawback is also the biggest advantage.
Due to the strong isolation of infrastructure virtual machines a CF user always gets a service instance with clear service levels guaranteed by the assigned infrastructure resources.
The amount of memory and disk space is clearly known. There is no bad neighborhood as the CF user is the sole user of their PostgreSQL server. The cache, memory and all available disk I/O are dedicated solely to them.
With this strong tenant isolation in place, damage caused by poorly designed or even malicious apps is contained and will only affect the corresponding database service instance. No other customers will be affected.
Drawback: The automatic provisioning of a large number of clusters introduces the challenge of handling the entire lifecycle of database servers and clusters in a highly automated manner. Automation is key to this approach and must be driven rigorously.
Advantage: Once the automated provisioning of PostgreSQL servers is implemented, there is basically no limit to how many service instances can be provisioned. The sky is the limit, or more precisely: the infrastructure is the limit.
The decision is easy. On a public platform onboarding foreign users, maintaining a service level is key to the acceptance of the platform. No approach other than dedicated PostgreSQL servers as service instances can, by design, guarantee an acceptable service level.
Therefore, the following design decisions will be based on the assumption that the strategy of dedicated service instances is applied.
This has consequences. The most obvious consequence is the challenge of when and how to provision service instances.
Pre-Provisioning vs. On-Demand Provisioning
On the question of when to provision service instances, two strategies immediately come to mind:
The pre-provisioning strategy means that service instances (PostgreSQL server/cluster VMs) are provisioned before the CF user demands them. This comes with the advantage of a fast provisioning time, which is appropriate in a continuous integration scenario, for example.
However, this approach has the obvious drawback of an increased infrastructure resource allocation, as memory and disk resources are blocked by running virtual machines that have no purpose yet.
A temporary implementation advantage of this strategy is that the service broker, once established, will work even without a service instance provisioning automation, which can then be added later during the implementation.
Such a temporary workaround of course comes at the price of limited scalability, as there is only a limited number of pre-provisioned service instances. Once these service instances have been allocated, no further service instances can be delivered without manual intervention. This is a clear violation of the on-demand self-service requirement.
An on-demand provisioning strategy means that service instance VM(s) (a PostgreSQL server/cluster) will be provisioned the moment a platform user issues a cf create-service command.
Inverting the profile of the pre-provisioning strategy, the on-demand strategy does not block infrastructure resources aside from those allocated to actual service instances, but consequently needs more time to make service instances available.
Because of the provisioning delay needed to create and install the PostgreSQL virtual machines representing the service instance, the service should naturally use Cloud Foundry's asynchronous service broker API.
The on-demand provisioning inherently requires the presence of a full provisioning automation right from the beginning but then solves the scalability challenge as described before.
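With the asynchronous service broker API, the broker answers the provisioning request with 202 Accepted and the platform polls the instance's last_operation endpoint until the deployment finishes. A sketch of that flow from the polling side; both callables here are hypothetical stand-ins for the VM deployment trigger and the last_operation lookup:

```python
import time

def provision_async(start_vm_deployment, poll_state, timeout=600, interval=5):
    """Sketch of asynchronous provisioning: kick off the VM deployment,
    then poll its state until it succeeds, fails or the timeout expires.

    `start_vm_deployment` and `poll_state` are hypothetical callables,
    e.g. a BOSH deploy trigger and a last_operation state lookup.
    """
    start_vm_deployment()            # returns immediately; VMs build in background
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = poll_state()         # e.g. "in progress", "succeeded", "failed"
        if state in ("succeeded", "failed"):
            return state
        time.sleep(interval)
    return "failed"                  # give up once the timeout is reached
```

The timeout is essential: a VM deployment that hangs forever must eventually surface as a failed provisioning instead of leaving the service instance stuck in "in progress".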
Combining the pre-provisioning and on-demand provisioning strategies covers a maximum of use cases.
For the fast creation of service instances, a certain pool of pre-provisioned service instances can be held available. Other instances can be provisioned on demand. The combined strategy could be configured per service plan and is therefore to be preferred.
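The combined strategy reduces to a small amount of bookkeeping around the provisioning automation. A sketch of the core idea, where provision_new is a hypothetical callable that deploys a fresh instance:

```python
class InstancePool:
    """Combined strategy sketch: serve create-service requests from a
    pool of pre-provisioned instances when possible, fall back to
    on-demand provisioning when the pool is empty.

    `provision_new` is a hypothetical callable deploying fresh VMs.
    """

    def __init__(self, provision_new, pool_size=2):
        self.provision_new = provision_new
        self.pool = [provision_new() for _ in range(pool_size)]

    def acquire(self):
        if self.pool:
            return self.pool.pop()    # instant: hand out a pre-built instance
        return self.provision_new()   # slower: provision on demand

    def refill(self):
        # top the pool back up, typically triggered asynchronously
        self.pool.append(self.provision_new())
```

Making pool_size a per-plan setting is what allows a "fast" plan backed by a warm pool to coexist with a purely on-demand plan in the same broker.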
With the demand for a maximum degree of automation, the choice of the most suitable automation technology becomes key.
In the next blog post I will give you an overview of the automation technology, data redundancy, clustering and replication, amongst other topics. So stay tuned.