Published on 10.11.2014
Thursday, August 14th, we attended the London PaaS User Group (LOPUG) meetup at OpenCredo and gave a talk while we were at it. But only after Richard Davies, CEO and co-founder of ElasticHosts Ltd, talked a bit about Linux containers as used in tools such as LXC and Docker, their use in IaaS and PaaS platforms, and their benefits over traditional virtualized servers.
Next it was up to me to share something on combining Cloud Foundry and OpenStack. Speaking from experience – anynines runs a public Cloud Foundry on a self-hosted OpenStack infrastructure – I shared some of the challenges and benefits of running on OpenStack, and touched on virtual machine availability, failover methods and the implications for service design. And I pointed at things.
The anynines stack is built on rented hardware in a datacenter, with (initially) VMware and Cloud Foundry on top of that. We then migrated from rented VMware to a self-hosted OpenStack (because of reasons).
Our OpenStack-run Cloud Foundry has been in production for more than six months. We started looking into OpenStack about two years ago, when the current major release was Diablo. We learned a lot along the way. Before Grizzly, OpenStack was not ready for production: the update process involved a lot of manual work, and there were no automated (script-driven) upgrades. With manual database schema migrations and configuration file changes, the risk of breaking things was tremendously high. We would usually just wipe all VMs, install the upgrades and hope for the best.
With Grizzly things changed, and our sysops were optimistic that we could run Cloud Foundry on top of OpenStack. We still ran our OpenStack setup alongside the VMware setup to make sure everything ran smoothly. The switch from Havana to Icehouse was the next upgrade on our list – and the first one in production, which is exciting. We used Chef to roll out Icehouse, including its configuration changes, and the upgrade was well tested on a separate multi-server OpenStack staging system.
Rolling upgrades are supported from Icehouse on. The promise is that you don't have to shut down VMs to perform updates – no downtime of the entire cloud.
OpenStack is not VMware, and we have seen some VMs die. Pivotal ran into a similar problem when it moved from VMware to AWS (for PWS). VMware's high-availability features are pretty neat. So what kills VMs? In our case: random kernel panics (a kernel bug) and hardware outages.
OpenStack comes with a concept called availability zones. You build disjoint networks, racks, etc., and each disjoint set of hardware is an availability zone. You tell OpenStack about these availability zones, and whenever you provision a virtual machine you can choose the availability zone for it – and build your BOSH deployments accordingly.
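To illustrate, the zone can be chosen at boot time via the nova CLI's `--availability-zone` flag. This is just a sketch: the zone name, image and VM name below are made up.

```shell
# Boot a VM into a specific availability zone
# ("az-1", "ubuntu-14.04" and "my-runner-vm" are hypothetical names)
nova boot --flavor m1.medium --image ubuntu-14.04 \
  --availability-zone az-1 my-runner-vm
```

Spreading the VMs of one deployment across several such zones is what keeps a single rack or network outage from taking the whole platform down.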
OpenStack host aggregates are similar to the availability zones concept, although the intention is not failover but selecting hosts with certain attributes (e.g. an SSD aggregate). Where availability zones make sure outages don't escalate too far, aggregates help you place VMs on hosts with specific attributes.
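A rough sketch of how such an SSD aggregate could be set up with the nova CLI – aggregate, host and flavor names are invented for the example:

```shell
# Create a host aggregate for SSD-backed hypervisors and tag it
# ("ssd-aggregate" and "compute-node-01" are hypothetical names)
nova aggregate-create ssd-aggregate
nova aggregate-set-metadata ssd-aggregate ssd=true
nova aggregate-add-host ssd-aggregate compute-node-01

# Expose the aggregate via a flavor whose extra spec matches the
# aggregate metadata: VMs using this flavor are scheduled onto SSD hosts
nova flavor-create ssd.medium auto 4096 40 2
nova flavor-key ssd.medium set ssd=true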
OpenStack's load balancer (LBaaS) is not inherently clustered at the moment, and is thus a single point of failure. But LBaaS failover can be realized using pacemaker/corosync and GlusterFS.
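As a very rough sketch of the pacemaker side of such a setup – assuming a HAProxy-based load balancer; the virtual IP and all resource names below are made up, and a real configuration needs fencing and a working corosync ring underneath:

```shell
# Floating virtual IP for the load balancer (IP is hypothetical)
crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
  params ip=10.0.0.100 cidr_netmask=24 \
  op monitor interval=10s

# The HAProxy service itself, monitored by pacemaker
crm configure primitive p_haproxy ocf:heartbeat:haproxy \
  op monitor interval=15s

# Keep VIP and HAProxy together, so both fail over to the same node
crm configure group g_lb p_vip p_haproxy
```

If the active node dies, pacemaker moves the VIP and HAProxy to the surviving node; GlusterFS keeps any shared state available on both.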
In contrast to VMware, where we could rely on highly available virtual machines, we wanted – much like Amazon – to use less expensive hardware and accept the probability that it would go down. To harden our systems on all layers, we defined three VM failover strategies:
Wardenized services (community services) are cute for pet projects, yet not suitable for production. The implementations are often outdated, and more importantly: one size doesn't fit all. There's no production-ready Cloud Foundry without high-quality, clusterable services.