Evolution of Software Development and Operations – Part 4

This entry is part 4 of 5 in the series Evolution of Software Development and Operations

Application Platforms

After the walkthrough on how to build clusters, we have seen how many moving parts even a simple application cluster may have: load balancers with a cluster manager, application servers, database servers with replication and a database cluster manager, an object store, just to name a few.

Contemporary microservice-based applications tend to be even more demanding. What used to be a single full-stack app is now a collection of smaller web services, each of which may use one or more relational or NoSQL databases, a search server, and a message queue for communication.

Because building a cluster is costly and complex and requires a large DevOps team, economies of scale come into play. This leads to the idea of modern application platforms: applying economies of scale to a large number of applications by gathering them on a common, clustered and highly automated platform.

Platform Scale

To achieve economies of scale, a platform needs to reach a certain scale itself. This is therefore not about a few hosts or a small cluster. It is about building a platform for thousands of developers and many thousands of applications.

Scale changes everything

The following little story is meant to visualize scale. It uses a physical example, but similar effects can be observed with software. Imagine the business is selling sausages. As a true believer in the Lean Startup paradigm, the owner carves a Minimum Viable Product (MVP) from the product vision. This applies Occam’s Razor to the business idea: cut everything from it until it no longer works, then re-add the last piece taken. Theoretically, this leads to the simplest possible solution. For the sausage business, this means starting with the smallest setup that can roast a sausage.

This grill is small, cheap and can be operated by a single person. Most importantly, it roasts sausages: white sausages, red sausages or even both. It’s fantastic. This is what you get when asking for a small-budget sausage grill. The metaphor carries over to software that solves a particular problem at a small scale. It fulfills the specification "roasting a sausage" but has limited throughput, as a wild guess around one sausage every 5 minutes. This works perfectly in the local town, where only occasionally someone from a modest-sized crowd asks for a sausage.

Now the business owner wants to grow. He has landed a fantastic opportunity to sell sausages at the funfair. He stands in front of a much bigger audience, and his first experience is devastating. Instead of multiplying his earnings, he finds himself facing a raging crowd, unable to keep up the sausage quality or to process all incoming sausage requests gracefully. The same thing may occur with software: a system that worked well for a particular user load may fail if the load increases. Anybody who has experienced such a situation knows that a crowd of users in front of a non-responding application is no better than a hungry crowd demanding the sausages they ordered.

The solution for our sausage business is simple but not trivial. We have to re-invent the sausage grill to work at a greater scale. Instead of a grill mounted to a human body, we go for a custom-built trailer.

This setup allows two to three people to work in parallel, multiplying the sausage throughput. You can imagine that increasing production even further may require additional modifications. Similar rules apply to software. Re-design, and sometimes even re-invention, is necessary to scale a system so that it copes with rapidly increasing workloads over time.

If you look into the code history of Cloud Foundry, you will see multiple fundamental re-designs. The application container logic in Cloud Foundry has been written at least three times: pre-DEA, DEA and Diego. The point is: everything is different if you want to do it at scale.

Anatomy of an Application Platform

We’ve seen by example how clusters can be built to make applications highly available. It’s generally about creating redundancy and allowing horizontal scaling wherever possible.

Multi-Tenancy Isolation

An application platform such as Cloud Foundry takes this to a new level. Not only will countless applications run on such a platform; its users are also likely to lack any trust relationship with one another. In the worst case, there are malicious users on the platform who try to harm others. This makes multi-tenancy with reliable tenant isolation a priority.

Cloud Foundry, for example, separates tenants into so-called organizations. Each organization may have one or more users, allowing groups of developers with different roles to access the organization. Spaces segment organizations even further. This structure is essential for organizing a large group of people within a tenant.
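
The organization, space and role structure described above can be pictured as a small data model. The following is only a minimal sketch: the class names, the role names and the `can_deploy` check are assumptions made for illustration and are not Cloud Foundry's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Space:
    name: str
    # Maps a user name to the role held in this space (role names are illustrative).
    roles: Dict[str, str] = field(default_factory=dict)


@dataclass
class Organization:
    name: str
    spaces: Dict[str, Space] = field(default_factory=dict)

    def add_space(self, space_name: str) -> Space:
        # Create the space on first use, otherwise return the existing one.
        return self.spaces.setdefault(space_name, Space(space_name))


def can_deploy(org: Organization, space_name: str, user: str) -> bool:
    """Return True only if the user holds the 'developer' role in that space."""
    space = org.spaces.get(space_name)
    return space is not None and space.roles.get(user) == "developer"


# Hypothetical tenant with two spaces and two users.
acme = Organization("acme")
acme.add_space("production").roles["alice"] = "developer"
acme.add_space("staging").roles["bob"] = "auditor"
```

Because every permission check goes through the organization and the space, a user of one tenant never appears in the lookup path of another tenant.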

Dealing with Application Heterogeneity

Speaking of countless applications: applications are different. They use different languages and frameworks. There are many different application servers such as Apache HTTPD, Apache Tomcat, NGINX, etc. Some programming languages and their frameworks tend to use Apache HTTPD, others prefer NGINX, and in the end each developer has their own preferences.

So the application platform should not restrict the choice of the programming language or framework as this freedom is an essential part of modern software development efficiency: choosing the right tool for the right job.

Applications may come written in Ruby, Java, Node.js, you name it. The application platform has to deal with it.

Commonly, this language heterogeneity is abstracted either by applying buildpacks or by using container images. A buildpack is a piece of code that takes source or binary code and turns it into something that can be started as a process and run in a container. In Cloud Foundry, this process is called application staging; it roughly corresponds to the build stage described in the 12-factor manifesto. Container images do not require a staging phase, as the equivalent work has already been done when the container image was built.
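
To make the buildpack idea more concrete, here is a heavily simplified sketch of staging: each buildpack inspects the uploaded files, and the first one that recognizes the application decides how it will be started. The `Buildpack` type, the detection rules and the start commands are hypothetical; real buildpacks also download runtimes and resolve dependencies.

```python
from typing import Callable, List, NamedTuple, Optional, Set


class Buildpack(NamedTuple):
    name: str
    detect: Callable[[Set[str]], bool]  # decides from the file names whether it applies
    start_command: str                  # command the container would run after staging


# Hypothetical buildpacks, checked in order.
BUILDPACKS: List[Buildpack] = [
    Buildpack("ruby", lambda files: "Gemfile" in files, "bundle exec rackup"),
    Buildpack("nodejs", lambda files: "package.json" in files, "npm start"),
    Buildpack("java", lambda files: any(f.endswith(".jar") for f in files), "java -jar app.jar"),
]


def stage(files: Set[str]) -> Optional[str]:
    """Return the start command of the first buildpack that detects the app."""
    for buildpack in BUILDPACKS:
        if buildpack.detect(files):
            return buildpack.start_command
    return None  # no buildpack matched, so staging would fail
```

With this sketch, `stage({"Gemfile", "config.ru"})` selects the Ruby buildpack, while an upload that no buildpack recognizes yields `None`.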

Containers

In the early days of shared hosting, isolating applications from one another was a true pain, mostly due to Linux’ lack of means to properly isolate processes. Other operating systems, Solaris and OpenSolaris for example, had extraordinary container functionality called Solaris Zones, which in combination with the ZFS filesystem was pretty awesome, years before Linux added corresponding kernel features.

With the introduction of Linux namespaces and cgroups, Linux container support reached a new level.

The availability of these Linux kernel features allows the creation of containers and subsequently led to the creation of several container formats.

Containers help to isolate processes from one another. This isolation prevents one process from accessing the information and consuming the resources of others. Application instances of different sizes with individual amounts of allocated memory, CPU shares, network and disk I/O as well as disk space can be created programmatically and coexist peacefully.
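
As a rough illustration of how such per-instance limits might be expressed, the sketch below translates an instance size into cgroup settings. It assumes the cgroup v2 interface, where `memory.max` holds a byte limit and `cpu.weight` a relative share between 1 and 10000; the `InstanceLimits` type and the function name are made up for this example, and nothing is actually written to the cgroup filesystem.

```python
from typing import Dict, NamedTuple


class InstanceLimits(NamedTuple):
    memory_mb: int
    cpu_weight: int  # relative share; cgroup v2 accepts 1..10000, default 100


def cgroup_v2_settings(limits: InstanceLimits) -> Dict[str, str]:
    """Translate limits into cgroup v2 interface file names and the values to write."""
    if not 1 <= limits.cpu_weight <= 10000:
        raise ValueError("cpu_weight must be between 1 and 10000")
    return {
        "memory.max": str(limits.memory_mb * 1024 * 1024),  # hard memory limit in bytes
        "cpu.weight": str(limits.cpu_weight),
    }
```

A container runtime would write these values into the cgroup directory of the container, which is what lets instances of different sizes share one host.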

Container Orchestration

Using a large number of containers leads to the container orchestration challenge. The resulting list of requirements is very similar to that of building infrastructure layers. Containers need to be spread across multiple hosts to form a scalable platform, so placement algorithms that prevent cluster fragmentation are needed. Software-defined networking (SDN) is required as well: creating network routes among container hosts, restricting access, … the entire SDN story, replayed.
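
One possible placement heuristic, shown here only as a sketch and not as the algorithm of any particular platform, is to spread the instances of an app across hosts and to break ties with a best-fit rule so that large free blocks stay available. The `Host` class and `place` function are hypothetical names.

```python
from typing import List, Optional


class Host:
    def __init__(self, name: str, free_mb: int) -> None:
        self.name = name
        self.free_mb = free_mb
        self.apps: List[str] = []  # one entry per placed instance


def place(hosts: List[Host], app: str, memory_mb: int) -> Optional[Host]:
    """Place one instance; return the chosen host or None if nothing fits."""
    candidates = [h for h in hosts if h.free_mb >= memory_mb]
    if not candidates:
        return None
    # Fewest instances of the same app first (spread), then tightest fit (best fit).
    best = min(candidates, key=lambda h: (h.apps.count(app), h.free_mb - memory_mb))
    best.free_mb -= memory_mb
    best.apps.append(app)
    return best
```

Spreading keeps an app available when a single host fails, while the best-fit tie-break is one simple way to limit fragmentation of the remaining capacity.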

Load Balancing Revisited

In any case, an application platform needs to know where the instances of individual applications reside. Such an application registry is important because a huge cluster is subject to constant change.
With thousands of developers deploying even more apps, something is changing at any given time. Whenever applications are started or stopped, the application router has to be aware of the change instantly. This is very different from traditional load balancing, where the load balancer targets barely changed. Application routing therefore needs special attention, especially regarding performance and reliability.
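
A minimal sketch of such a registry is shown below, assuming that instances announce themselves when they start and are removed when they stop. The `RouteRegistry` class, its methods and the round-robin selection are illustrative choices, not the design of a specific router.

```python
from typing import Dict, List, Optional


class RouteRegistry:
    """Keeps track of which backend addresses currently serve a hostname."""

    def __init__(self) -> None:
        self._backends: Dict[str, List[str]] = {}
        self._cursors: Dict[str, int] = {}

    def register(self, hostname: str, address: str) -> None:
        # Called when an application instance has started.
        backends = self._backends.setdefault(hostname, [])
        if address not in backends:
            backends.append(address)

    def unregister(self, hostname: str, address: str) -> None:
        # Called when an application instance has stopped or failed.
        backends = self._backends.get(hostname, [])
        if address in backends:
            backends.remove(address)

    def pick(self, hostname: str) -> Optional[str]:
        """Return the next backend in round-robin order, or None if there is none."""
        backends = self._backends.get(hostname)
        if not backends:
            return None
        index = self._cursors.get(hostname, 0) % len(backends)
        self._cursors[hostname] = index + 1
        return backends[index]
```

The important property is that `register` and `unregister` take effect on the very next `pick`, so the router never depends on a manually maintained list of targets.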

Delegating Application State

Some application platforms delegate application state. Some don’t.

Cloud Foundry

In Cloud Foundry, applications are stateless. This is inherited from the 12-factor manifesto and enables container disposability. It makes self-healing fast and easy, as the placement algorithm doesn’t have to care about the application’s data. All it has to do is re-create failed application instances somewhere in the container cluster.
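
Self-healing of stateless instances can be thought of as comparing a desired state with the actual state and starting whatever is missing. The function below is a sketch of that idea under simplified assumptions (instance counts per app name); the name `reconcile` and its signature are hypothetical.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, int], running: Dict[str, int]) -> List[str]:
    """Return the app names to start, one entry per missing instance."""
    to_start: List[str] = []
    for app, wanted in sorted(desired.items()):
        missing = wanted - running.get(app, 0)
        to_start.extend([app] * max(missing, 0))
    return to_start
```

Since no instance carries data of its own, each entry in the result can be started on any host with enough free capacity.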

The obvious drawback of any 12-factor based platform is: applications need to be 12-factor compliant. They need to follow certain architectural rules. Otherwise they simply cannot be operated on the application platform.

This is a huge burden if you are an enterprise sitting on thousands of legacy apps. For applications under active development, this shouldn’t be too much of a problem, unless they are badly designed in the first place or do not belong on an application platform at all and should instead be operated in a BOSH-like environment. That is a long story with lots of potential for discussion.

Data Services store Application State

Stateless applications are easier to scale horizontally. Just add more instances. They are also easier to heal. Just start a new instance. Sounds wonderful but there is a gotcha. Somebody has to do the heavy lifting and store state: Data services.

A data service is anything that is specifically designed to maintain state. It can be anything from a relational database management system (RDBMS) such as PostgreSQL or MySQL, a document-based database like MongoDB, a search server such as Elasticsearch, a key-value store such as Redis, a messaging system à la RabbitMQ or even a composite of several services like an ELK stack.
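
From the application's point of view, delegating state means looking up the connection details of a data service at runtime instead of storing data locally. In Cloud Foundry, bound services are exposed through the `VCAP_SERVICES` environment variable as JSON. The sketch below reads credentials from it; the helper name, the service label, the instance name and the URI are made up for illustration.

```python
import json
from typing import Dict, Mapping, Optional


def find_credentials(environ: Mapping[str, str], instance_name: str) -> Optional[Dict[str, object]]:
    """Look up the credentials of a bound service instance in VCAP_SERVICES."""
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == instance_name:
                return instance.get("credentials")
    return None


# Hypothetical binding; label, instance name and URI are assumptions.
example_env = {
    "VCAP_SERVICES": json.dumps(
        {
            "postgresql": [
                {
                    "name": "orders-db",
                    "credentials": {"uri": "postgres://user:secret@10.0.0.5:5432/orders"},
                }
            ]
        }
    )
}
```

In a running application, `os.environ` would be passed instead of `example_env`, so a re-created instance finds the same data service without any local state.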

The data service solution of a platform needs to comply with the same requirements as the platform itself. This includes on-demand self-service, scalability, robustness, ease of use, full lifecycle automation and so on.

The assumption of application platforms, such as Cloud Foundry, is that someone else takes care of data service automation. The reason for this is obvious. Maintaining state causes many challenges and there are many ways to solve them. This creates a large number of data stores each acting differently when it comes to operations. They replicate data differently, provide varying levels of consistency, let you model and query data in various ways.

As designing an application platform is hard enough, the data service challenge is often left aside.

But no application platform can be operated at scale without a production-grade, fully automated approach to data services.
