Published on 11.05.2018
With the full automation of the entire data service lifecycle as the goal, the choice of which data service to automate has a large impact on the resulting challenge, large enough to make it a critical success factor. The selection of data services rarely happens on a green field.
Often there is already a list of applications requiring certain data services, or developers have made decisions, so the choice of which data services to automate is not free of constraints.
Designing a platform for a public audience may also add the task of identifying the data services developers require and desire. Depending on the particular context, it is advisable to gather requirements from both a development and an operations perspective.
In any case, it is worth a closer look as a deeper understanding will help to make informed decisions or at least enrich internal data service discussions with strong arguments.
In order to make decisions about which data service to choose, it is beneficial to look at a list of generic data service selection criteria. These criteria may have legal and/or technological aspects concerning the platform strategy, as well as developer and operator concerns.
As mentioned before, a data service might be selected because a large number of applications rely on it. Such a constraint can rarely be ignored. In the worst case, the data service is rather hard to automate and will consequently consume much engineering effort.
Luckily, many cloud transition projects come with a willingness to change. Developers are then allowed to use new data services for new applications and sometimes to refactor an existing codebase using modern architectural patterns and contemporary data services.
Modern software development often makes use of microservice-based architectures, which allow large projects to be split into several small services. The project benefits vastly from the local autonomy, as the developers of a service are usually free to choose from a set of programming languages and data services for the task at hand.
Broadly speaking, this leads to an incomplete list of frequently requested data service categories:
The relational database management system (RDBMS) is mandatory. It can be found on every list and usually comprises databases such as PostgreSQL or MySQL.
PostgreSQL is often favored by engineers moving away from legacy enterprise databases, as it is very feature-rich.
Also, PostgreSQL comes with a list of feature-rich extensions, including the geospatial extension PostGIS. PostgreSQL supports asynchronous replication as well, but requires additional effort around cluster and replication management. Not an easy automation target, for sure.
MySQL and MariaDB are interesting as their Galera variants offer a synchronously replicating cluster.
The document database of choice is regularly MongoDB. There are many message brokers, but the most frequently demanded is still RabbitMQ. When it comes to key-value stores, Redis is the first choice of the average developer. For search and analytics, there is Elasticsearch with a solid cluster facility. As distributed systems require logging and metrics solutions, the Elastic Stack (formerly known as the ELK Stack) has become an industry standard for logging, and Prometheus is about to claim the metrics end of the spectrum.
A modern platform should have these data service categories covered.
Clearly, high availability introduces the most complexity to data service automation and hence will consume most of the engineering effort. This is because replication and cluster management often require the orchestration of multiple components, are complex to configure, and involve many failure scenarios that the automation needs to handle.
Therefore, the necessity for clusters is sometimes doubted, as it is tempting to leave the complexity out of the automation. But cutting this corner leads to a significant loss in service quality and puts data at increased risk. So, even with self-healing VMs or containers, replication is still recommended to maximize data service availability. With a properly set-up infrastructure, distributed across at least three disjoint availability zones (AZs), the time to repair of a cluster failover lies within seconds and outperforms any other self-healing mechanism.
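One reason at least three availability zones are commonly recommended is majority quorum: consensus-based clusters stay writable only while a strict majority of nodes survives, and three zones let a cluster lose any single zone without losing its majority. A minimal sketch of that arithmetic (the node counts are hypothetical):

```python
def has_quorum(total_nodes: int, surviving_nodes: int) -> bool:
    """A quorum-based cluster keeps operating only if a strict
    majority of its nodes is still reachable."""
    return surviving_nodes > total_nodes // 2

# Three nodes across three AZs: losing one AZ leaves 2 of 3 -> quorum holds.
assert has_quorum(3, 2)
# Only two AZs with one node each: losing one AZ leaves 1 of 2 -> no majority.
assert not has_quorum(2, 1)
```

This is also why two-zone deployments of quorum-based data services are fragile: a single zone outage already costs the majority.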
Replication is also the foundation for multi-datacenter and multi-region concepts. These in particular are scenarios a data service should support natively, as adding multi-datacenter functionality at the operational level quickly becomes complex and context-specific.
Surprisingly, a clustered data service is not necessarily robust. Robustness has a broader meaning: while a data service may be clustered to survive the outage of an entire availability zone, it may still collapse with data loss once its disk space is entirely consumed.
Therefore, it is important to know how a data service behaves under extreme conditions, and at least the most common failure scenarios should be covered with tests. These tests need to ensure that the automation handles such situations gracefully and makes the best of them.
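Such failure-scenario tests can be sketched as ordinary unit tests against a simulated instance. The `FakeDataService` below and its behavior are entirely hypothetical; the point is the assertion style: provoke the extreme condition and verify graceful degradation instead of data loss:

```python
class FakeDataService:
    """Hypothetical stand-in for a data service that should degrade
    gracefully when its disk fills up: reject new writes rather than
    corrupt or lose already-stored data."""

    def __init__(self, disk_capacity: int):
        self.disk_capacity = disk_capacity
        self.used = 0
        self.records = []

    def write(self, record: str) -> bool:
        if self.used + len(record) > self.disk_capacity:
            return False  # graceful rejection, no data loss
        self.records.append(record)
        self.used += len(record)
        return True

def test_disk_full_rejects_writes_without_data_loss():
    svc = FakeDataService(disk_capacity=10)
    assert svc.write("12345")           # fits within capacity
    assert not svc.write("67890abc")    # would overflow: rejected
    assert svc.records == ["12345"]     # earlier data stays intact

test_disk_full_rejects_writes_without_data_loss()
```

In practice, the same pattern is applied against real instances with fault injection (filling disks, killing nodes, partitioning networks) rather than a fake.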
Automation can sometimes fill a gap, but in the end it is the data service itself that sets the service quality, and it should therefore be investigated before committing to it.
Another surprise is that clustered data services do not necessarily scale well. Asynchronous replication in an RDBMS is a good example. The primary database node in such an async scenario receives all write requests and often even all read requests, as read-write splitting is rarely supported by database drivers. Luckily this trend is about to change, and databases are getting better at making use of this redundancy.
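Since drivers rarely provide read-write splitting, it often ends up in the application or in a proxy layer. A simplified, hypothetical sketch of such a router, reduced to routing decisions over plain node labels:

```python
import itertools

class ReadWriteRouter:
    """Hypothetical sketch: send writes to the primary node and spread
    plain reads round-robin across the replicas. Real implementations
    must also handle transactions, replication lag, and failover."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, statement: str):
        # Naive classification: only statements starting with SELECT
        # are treated as reads; everything else goes to the primary.
        if statement.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
assert router.route("INSERT INTO t VALUES (1)") == "primary"
assert router.route("SELECT * FROM t") == "replica-1"
assert router.route("SELECT * FROM t") == "replica-2"
```

The naive SELECT check also hints at why drivers avoid this: a `SELECT ... FOR UPDATE` or a read inside a write transaction must not go to a lagging replica.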
However, it is essential that the scalability of a data service is closely researched. It should be clear whether a data service scales horizontally – by adding more cluster nodes – or needs vertical treatment – by replacing nodes with larger ones that have more CPU and memory.
Looking at scalability also comprises differentiating between load and capacity scale-outs with respect to which ability is to be scaled: coping with more requests or storing more data.
The latter is supported either by adding more nodes horizontally or by sharding. With sharding, data is not replicated to all nodes but only to a subset of them. This often increases the deployment complexity, which in turn becomes a time-consuming task in automation.
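The core of sharding is a deterministic mapping from a key to the subset of nodes holding it. A minimal sketch using a stable hash (the key names and shard count are hypothetical):

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Map a key deterministically onto one of `shard_count` shards.

    A stable hash (SHA-256 here) keeps a key on the same shard across
    processes and restarts, unlike Python's randomized built-in hash().
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

# The same key always lands on the same shard...
assert shard_for("user:42", 4) == shard_for("user:42", 4)
# ...and every key maps into the valid shard range.
assert 0 <= shard_for("user:42", 4) < 4
```

Note that changing `shard_count` generally reshuffles most keys, which is one reason resharding is an expensive lifecycle operation and why production systems often use consistent hashing instead of a plain modulo.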
The performance aspect of a data service should be investigated in the context of the application requirements. As stated in the paragraph about scalability, some data services can be scaled out to increase their ability to cope with growing loads.
A different aspect of performance arises when a data service is specifically designed to maximize performance for a certain use case but is put into suboptimal operational conditions by the automation. This may be a data service with strong hardware affinity, for example.
OpenStack Swift, for example, is a data service that is actually more of an infrastructure service than a platform service and requires specialized hardware. Special requirements, including the number and topology of disks, are just one of many aspects that need attention to maximize its performance. Putting such a bare-metal-affine data service into a virtual machine may conceptually work but misses the actual purpose.
The lifecycle of a data service consisting of multiple components is often complex. Additionally, some services are designed with manual operations in mind and assume expert knowledge and a readiness to manually attend to operational lifecycle events such as updates.
These complex data services with a strong affinity towards manual attention are the most painful to automate and set the bar for any automation framework and automation team.
Automating a data service often places the resulting instances in a shared network. While automated network segmentation is desirable, not every platform environment is equipped with it. A number of data services have been developed in the context of physically isolated networks.
These data services do not have a user management and authentication mechanism, which puts a burden on the automation. The automation then has to create a non-obtrusive front that adds this essential functionality and shields data service instances from unauthorized access.
This shield not only protects against other platform tenants but also includes the isolation of different data service instance bindings, e.g. in a microservice environment with multiple applications sharing a set of data service instances such as a RabbitMQ cluster. Each application should have a separate binding to a particular service instance, and this binding should represent a unique set of access credentials. Therefore, data services with basic user and access management are clearly to be preferred.
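What an automation layer does at binding time can be sketched as follows. The function name, the field layout, and the instance/application names are hypothetical; the point is that every binding receives its own revocable credentials:

```python
import secrets

def create_binding(instance: str, app: str) -> dict:
    """Hypothetical sketch: issue a unique credential set for one
    application binding against a shared data service instance, so a
    single binding can be revoked without affecting other apps."""
    return {
        "instance": instance,
        "username": f"{app}-{secrets.token_hex(4)}",
        "password": secrets.token_urlsafe(24),
    }

checkout = create_binding("rabbitmq-prod", "checkout")
billing = create_binding("rabbitmq-prod", "billing")

# Two apps sharing one instance never share credentials.
assert checkout["username"] != billing["username"]
assert checkout["password"] != billing["password"]
```

For data services with built-in user management, the automation would also create the user on the instance; for those without, this is exactly the functionality the shielding front has to emulate.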
Another aspect is protecting data in transit. A data service should support the encryption of data streams on both routes: between the application and the data service instance as well as among the cluster nodes of a data service instance. If a data service does not support this, the functionality is particularly hard to retrofit at the operational level. Simplified, a VPN between the communicating nodes has to be established, which may require modifications to the application environment that are usually totally out of the scope of the data service automation team.
This topic will be described in a further article.
Some freemium data services offer a community version with optional enterprise counterparts. Considering such a data service should include a close look at the functionality of both the free and the commercial editions. Some freemium-licensed data services lack essential features such as high availability in their community editions. There should be clarity about whether the enterprise edition will be required for later productive use and how much additional budget has to be allocated for it.
More than that, data service vendors are sometimes quite territorial about their enterprise versions. In such cases, the automated provisioning of enterprise editions may be prohibited because the vendor claims the exclusive right to distribute the software, or the pricing model and/or license mechanism may be incompatible with automated on-demand provisioning. So picking a data service backed by a freemium business model should include a closer look at both functionality and legal aspects.
This term is relatively new. Without striving for a precise definition, it is safe to say that a cloud-native data service has been developed with the changes in software development and operations of the past decade in mind. Within the boundaries of the CAP theorem, it provides a solid trade-off among the aforementioned qualities and is designed to be operated in a fully automated environment.
These cloud-native data services will be the new trendsetters and already influence the roadmaps of established data services. Digital transformation is everywhere, and data services can be disrupted, too.
It is therefore to be expected that the full automation of the entire data service lifecycle will become easier over time, as less complexity has to be handled by the automation. The complexity will move where it belongs: into the data service.
In the next article of this series we will focus on Scale.
Check out the full series here: Principles and Strategies of Data Service Automation