Design for Scale
Why Scale Matters
Automation does not equal automation. Automating the operations of a single microservice-based application system is not trivial, but it has nothing to do with the challenge of developing an application platform that serves thousands of customers with tens of thousands of applications. The same applies to data service automation: scale changes everything.
Scale Changes Everything
Data service automation for a modern platform needs to cope with thousands or tens of thousands of data service instances. Most platforms require a multi-million investment. Given economies of scale, such a business requires a certain size to be economically viable. Scale is good for automation, as automation requires repetition to amortize its engineering effort.
The implications of this scale requirement set legacy DevOps scenarios, which focus on automating a few application systems, apart from platform-scale scenarios, in which many thousands of application systems are operated.
The impact of scale is illustrated in this article about the Evolution of Software Development and operations using the visual example of selling sausages. Selling a few per hour requires a very different setup than selling thousands, e.g. during an event with many visitors. The task is basically the same, selling sausages, but it is the scale that changes everything.
How Scale Influences Data Service Automation
Before the impact of scale on data service automation can be explored, the term scale has to be defined. It must be clear which system property is subject to scalability before a deeper analysis makes sense.
Dimensions of Scale
Depending on the perspective, different aspects of the system may receive varying load and may subsequently need to be scaled.
Some load scenarios may revolve around the following factors:
- Number of application requests towards a service instance
- Amount of data written to a service instance
- Number of service instances coexisting
- Number of data services (types) available in the marketplace
- Number of service broker requests
- Number of service plans offered per data service
- Number of target platforms to deliver the automation to
- Number of target infrastructures to deploy the automation to
- Number of target packaging systems to wrap the automation
- Number of environments (~ Number of customers) to deliver the automation to
This incomplete list can be categorized into:
- Service instance scalability
- Service broker automation scalability
- Automation release management scalability
Service instance scalability is essential for platform users. Developers using the data service automation want to use as few resources as possible to save money, while still experiencing the elasticity associated with cloud services. Therefore, offering a solid scaling mechanism is important.
Developers need the ability to respond quickly to increasing service instance load by performing a service instance scale-out. This in turn is a challenging task, as it should happen quickly and ideally without impact on the service instance. A valid strategy for such a low-impact scale-out is a rolling scale-out that replaces the nodes of a clustered service instance one by one.
This complies with both strategies: “Solving issues on the framework-level”, as it can be done independently of the data service, and “Re-Build instead of fixing it”, as it replaces a node with a bigger one instead of trying to modify it.
Example: A user creates a PostgreSQL database cluster with three nodes and 16 GB of memory each and later wants to scale it to nodes with 32 GB.
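The rolling scale-out from this example can be sketched in a few lines. This is a minimal illustration, not how any particular automation tool implements it: the Node class, the node names and the rolling_scale_out function are hypothetical, and the health check a real system would run between replacements is only indicated by a comment.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    memory_gb: int


def rolling_scale_out(cluster: list[Node], target_memory_gb: int) -> list[list[Node]]:
    """Replace nodes one by one; return the cluster state after each step."""
    history = []
    for index, old in enumerate(cluster):
        # Re-build instead of fixing: create a bigger node rather than
        # resizing the existing one in place.
        replacement = Node(name=old.name, memory_gb=target_memory_gb)
        # Assumption: a real automation would wait here until the new node
        # has joined the cluster and caught up before touching the next one.
        cluster[index] = replacement
        history.append(list(cluster))
    return history


cluster = [Node(f"pg-{i}", 16) for i in range(3)]
for step in rolling_scale_out(cluster, 32):
    print([node.memory_gb for node in step])
```

At every step only one node is exchanged, so the remaining nodes can keep serving requests while the cluster moves from 16 GB to 32 GB nodes.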
Service broker automation scalability addresses the ability of the data service automation itself to cope with an increasing load of service broker requests. In a production environment it is not only the number of service instances that creates the challenge. It is also the frequency with which developers perform expensive service broker commands such as creating new service instances or modifying them, e.g. performing updates, scaling them and so on.
To illustrate, consider two very different service broker load scenarios with the same total number of service instances. One scenario uses 1,000 data service instances to run hundreds of microservices. Service instances are provisioned once and have a fairly long life. Updates of service instances happen at the frequency of their releases, and as most microservices have their own lifecycle, updates naturally distribute over a larger period of time.
The other hypothetical scenario is much more volatile. The microservices and data services in this scenario are not running constantly. They represent short-lived workloads: these application systems, including their corresponding data services, are provisioned on demand when customers sign up, produce a heavy short-burst workload and are de-provisioned shortly after.
In the same time span and with the same average number of data service instances, the second scenario produces far more load than the first one. This load may cause bottlenecks at the infrastructure level, as creating VMs (more so than containers) is expensive. The automation tool itself may also start queueing requests at some point, depending on its scalability.
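The difference between the two scenarios can be made tangible with a rough calculation. The sketch below applies Little's law (arrival rate = average population / average lifetime) to estimate how many provisioning requests per day are needed to keep 1,000 instances alive on average. The lifetimes of one year and one hour are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def provisions_per_day(avg_instances: int, lifetime_hours: float) -> float:
    """Little's law: arrival rate = average population / average lifetime."""
    return avg_instances / lifetime_hours * 24


# Assumed lifetimes: one year for long-lived, one hour for short-lived instances.
long_lived = provisions_per_day(1_000, lifetime_hours=365 * 24)
short_lived = provisions_per_day(1_000, lifetime_hours=1)

print(f"long-lived:  {long_lived:.1f} provisions/day")
print(f"short-lived: {short_lived:.0f} provisions/day")
```

Under these assumptions the same average instance count translates into fewer than three provisioning requests per day in the first scenario and 24,000 in the second, before counting the matching de-provisioning requests.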
Example: A small POC environment may cause a few dozen service broker requests. In a production environment with seven data services heavily frequented by more than a thousand developers, data service instances are under constant change.
The entire data service automation chain, from the service broker through the automation tool such as BOSH to the infrastructure, e.g. OpenStack, is under constant stress. Increasing load requires the scale-out of these components, e.g. adding more service broker instances, increasing the number of BOSH workers or adding more OpenStack compute nodes.
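How far such a component has to be scaled out can be estimated with simple capacity arithmetic. The following sketch is a generic sizing rule, not a BOSH or OpenStack formula: the request rate, the task duration and the target utilization are assumed values, and workers_needed is a hypothetical helper.

```python
import math


def workers_needed(requests_per_hour: float, minutes_per_request: float,
                   target_utilization: float = 0.8) -> int:
    """Workers required so that the average load stays below the target utilization."""
    # Average number of requests being processed at the same time.
    concurrent = requests_per_hour / 60 * minutes_per_request
    return math.ceil(concurrent / target_utilization)


# Assumed load: 600 expensive requests per hour, each taking 5 minutes.
print(workers_needed(600, 5))
```

If the request rate doubles, the required worker count roughly doubles as well, which is why each link of the chain needs its own scale-out path.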
Automation release management scalability is another category of scalability for data service automation. It describes the necessity in complex platform ventures to cope with a multitude of data services that have to be delivered to various infrastructures and platforms, resulting in a large number of total environments.
Imagine a two-pizza-sized team delivering seven data services to six target infrastructures, supporting multiple target platforms, including three different packaging systems, shipping to hundreds of target environments that deploy thousands of data service instances.
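A quick multiplication shows how fast these dimensions add up. The sketch below counts the possible release variants as an upper bound, since not every combination necessarily has to be supported. The number of target platforms is an assumption (the text only says "multiple"); the other figures are taken from the scenario above.

```python
from math import prod

dimensions = {
    "data_services": 7,
    "infrastructures": 6,
    "platforms": 2,  # assumption: the scenario only says "multiple"
    "packaging_systems": 3,
}

# Upper bound on distinct release variants the team may have to build and test.
variants = prod(dimensions.values())
print(variants)  # 252
```

Each of these variants then has to be rolled out to hundreds of environments.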
This is the magnitude of scale and automation challenge at hand!