Maybe you heard my talk at Cloud Foundry Summit Europe 2019. This is a written-down version of it, with no additional information, so if you prefer you can watch the video at the end of this article instead without missing out.
How did we get there?
Pretty simple. The a9s Data Service Framework deploys dedicated instances, so each of the 1,200 deployments we have running today on that one single AWS production BOSH director is a backing service for customers using Cloud Foundry.
We also have around 600 deployments on the associated staging system, but more on that later. That system went live in January 2018, and we reached 1,000 deployments after nearly a year. Along the way we encountered some issues, and I will explain some of them and how we and the BOSH core team fixed them.
How do we set up an environment?
As you might imagine, we want a similar setup in each environment. We bootstrap all environments via either Terraform or CloudFormation; we avoid manual changes because you will sooner or later stumble over them. We set up 3 BOSH directors per environment, which all manage different things. We also changed the architecture in the meantime, so newer environments are a bit different. The new setup looks like this:
In the old setup we did not have a dedicated CredHub shared between the directors; the Underbosh used the CredHub instance running on the Overbosh for credentials management, which has the disadvantage that the Underbosh suddenly depends on the Overbosh. Originally the Overbosh was also deployed with create-env, but we decided that one create-env director is enough and safer.
Why do you need 3 directors?
When we started, we did not find information on how far BOSH scales, so we decided to keep deployments separate. Nothing is worse than a broken runtime that cck cannot get to because there are 50 service tasks queued ahead of it. The monitoring is kept separate in the same way.
Now that we had set up our directors and deployed the Runtime and Data Service Framework on their respective directors, it was time to run it all. Each IaaS you deploy to is different.
Amazon has a pretty good cloud, but you pay for everything separately, and those hidden costs can stack up. Azure has some interesting ideas about configuration and is for people who like to wait 10 minutes for a VM recreate. vSphere is very low-level, but once you get it to run, it runs. Alibaba Cloud has a good chunk of its documentation and error messages only in Chinese, so you need a translator handy.
What the IaaS automation and BOSH achieve is to abstract those differences away from users and deployments. This still means that you as an operator have to know how the infrastructure works to provide a good-quality service. Let us look at an example:
AWS EC2 Credits
AWS has 3 types of credits: network, disk and CPU credits. Network credits cannot really be tracked and seem to follow the scheme that the bigger the instance, the more burst you get. You cannot monitor them.
CPU credits are something you encounter when using T2/T3 instances. One credit represents 1 minute of full load on a single CPU core. T2 and T3 instances earn them at a steady rate, and most of the time you are fine. Still, keep an eye on them, or just go unlimited, where you can consume more CPU credits than you earn and pay a few cents per CPU hour you went over.
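As a back-of-the-envelope sketch: since one credit is one minute of full load on one core, you can derive the sustainable baseline load of a burstable instance from its earn rate. The earn rate below is an assumption based on AWS's published T3 numbers, not from this article; check the docs for your instance type.

```shell
#!/bin/sh
# CPU credit math for burstable instances.
# Assumption: a t3.medium has 2 vCPUs and earns 24 credits/hour.
VCPUS=2
EARN_PER_HOUR=24

# Full load burns 60 credits per vCPU per hour (1 credit = 1 core-minute).
BURN_PER_HOUR=$((VCPUS * 60))

# Sustainable baseline utilization in percent: earn rate / burn rate.
BASELINE_PCT=$((EARN_PER_HOUR * 100 / BURN_PER_HOUR))

echo "burns ${BURN_PER_HOUR} credits/hour at full load"
echo "sustainable baseline: ${BASELINE_PCT}% CPU"
```

So a t3.medium under those assumptions can sustain about 20% CPU indefinitely; anything above that drains the balance toward zero.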
Disk IO credits are a beast, because you do not expect them before they cause issues for the first time. They only affect gp2 and io1 volumes, which is why I try to avoid those, although AWS makes that harder by deprecating the standard type and making Magnetic volumes (limited by data transfer rate in MB/s) available only in larger sizes.
How do Disk IO credits work and why are they so tricky?
Disk IO in gp2 works by giving you 3 IOPS per provisioned GB each second; what you do not use pools up as credits. If you read more than your baseline, you draw those credits down. Volumes start with millions of IOPS credits, so in the beginning you might not notice issues. Let us say you have a database that needs many reads and writes on a small data set.
BOSH, for example. We have not seen the RDS data disk go over 5 GB used (by the way, you get no disk IO metrics for RDS disks). So at first you might think that 50 GB should be luxuriously enough for that small DB. Until you cannot do anything anymore, because the credits on the disk ran out. We have one AWS director running on gp2 and one on the deprecated “slow” standard.
To get roughly the same performance, the gp2 RDS needs 1 TB of gp2 while the standard one sits at 50 GB, a cost difference of factor 40. At some size it even becomes cheaper to just go to a magnetic disk (worth it above around 250 GB of gp2), while io1, unless you hammer it with a lot of random IOPS, is just overpriced for your needs.
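The gp2 numbers make the trap concrete. The constants below are AWS's published gp2 behavior as I understand it (3 IOPS per GB baseline with a floor of 100, bursts up to 3,000 IOPS, a credit bucket of 5.4 million credits), so treat this as a sketch rather than a billing-grade calculation:

```shell
#!/bin/sh
# Sketch of gp2 burst math for a small volume.
SIZE_GB=50
BUCKET=5400000   # assumed initial/maximum IO credit bucket
BURST=3000       # assumed maximum burst IOPS

# Baseline: 3 IOPS per GB, with an assumed floor of 100 IOPS.
BASELINE=$((SIZE_GB * 3))
[ "$BASELINE" -lt 100 ] && BASELINE=100

# A full burst drains (BURST - BASELINE) credits per second.
SECONDS_OF_BURST=$((BUCKET / (BURST - BASELINE)))

echo "${SIZE_GB} GB gp2: baseline ${BASELINE} IOPS"
echo "full burst lasts about ${SECONDS_OF_BURST} seconds"
```

For a 50 GB volume that works out to a 150 IOPS baseline and roughly half an hour of full burst; after that, a busy database is pinned to 150 IOPS, which is why the 1 TB volume (3,000 IOPS baseline) behaves so differently.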
Another comfort feature we missed was a general alert: you have to set up a separate CloudWatch alarm for each disk and delete it again when the disk is deleted, which is a tad annoying. But enough bashing of single cloud providers (I know I went soft on Azure, but they have their own issues, like confusing docs).
Some Issues we encountered
These are some of the issues we encountered on our way to 1,000 deployments. They all occurred on AWS, as it is our fastest-growing and largest environment.
This was the first time we ran into real issues. We woke up to a BOSH director that took minutes to answer, sometimes not responding at all. We scaled the BOSH, we scaled the RDS, but to no avail, although an m4.2xlarge DB brought some relief. Investigating the issue, we saw that the director was just sending the same request to the database over and over.
Wondering if we had found some sort of endless loop, we talked to the BOSH core team and found out that we were not the first to run into this, and that the latest BOSH director release had a fix for it. So we updated. After the update we scaled everything down again and it worked. So, what went wrong?
We used the Prometheus BOSH exporter, which works by regularly running bosh vms --vitals. Every 2 minutes it enqueued 670 tasks, and for each of those 670 tasks, BOSH tried to find the deployment configs, which are not even part of the bosh vms output. Another fix was to make the director less blocking by using Puma as the web server, which allowed the director to answer multiple requests in parallel, reducing the number of connections piling up on the NGINX in front of it.
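A quick rate calculation shows why this melted the director (numbers from the incident above):

```shell
#!/bin/sh
# Task load generated by the Prometheus BOSH exporter:
# 670 tasks enqueued every 2 minutes.
TASKS_PER_SCRAPE=670
SCRAPE_INTERVAL_MIN=2

TASKS_PER_HOUR=$((TASKS_PER_SCRAPE * 60 / SCRAPE_INTERVAL_MIN))
TASKS_PER_DAY=$((TASKS_PER_HOUR * 24))

echo "${TASKS_PER_HOUR} tasks/hour, ${TASKS_PER_DAY} tasks/day"
```

That is over 480,000 tasks per day, each one triggering the redundant deployment-config lookup, before a single human-initiated deploy is counted.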
Fortunately for us, this solved the issue in less than a day, so there is not too much that further happened, apart from us writing a Root Cause Analysis (RCA) for the customer.
This time we again woke up to a very slow director, which turned unresponsive whenever we tried to start a task. We were also unable to upload anything to the director. A quick bosh vms --vitals revealed no real cause: the CPU was idling happily, half of the RAM was free, and the disk was only 50% full. Guessing that we might have a read-only file system, we checked on the VM and got the error that the disk was full.
Checking the disk, we saw that while the disk was half-full, the inodes on it were exhausted. For people who do not know what an inode is: it is not some Apple protocol, it is how UNIX file systems store file information. When you create a file, it gets an inode, and the directory entry adds a reference to it. If you add a hard link to the file, another reference is added. Once an inode has no more references, it is reclaimed by the file system. We had so many files that we ran out of inodes before reaching the end of the file system's capacity. On investigation, we saw that we had nearly 1.8 million folders from tasks on the persistent disk.
These were caused by the Prometheus BOSH exporter starting 900 tasks every 5 minutes and BOSH not clearing them fast enough. Each of these folders had 0 (logs already deleted), 1 (logs already zipped) or 3 (CPI, debug and CLI logs unpacked) files in it.
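A minimal check for this failure mode needs nothing beyond standard coreutils; the default path below is illustrative, not necessarily where your director keeps its task logs:

```shell
#!/bin/sh
# Show inode usage for a filesystem and count entries below a directory.
# Usage: ./inode-check.sh /var/vcap/store   (path is illustrative)
DIR="${1:-.}"

# With -P (POSIX format) column 5 of `df -i` is IUse% on Linux:
# the percentage of inodes consumed, regardless of free bytes.
df -iP "$DIR" | awk 'NR==2 {print "inode usage:", $5}'

# Count directory entries below DIR; millions here means trouble.
COUNT=$(find "$DIR" | wc -l)
echo "entries under ${DIR}: ${COUNT}"
```

If disk usage looks fine but IUse% is at 100%, you are in exactly the situation described above.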
To get out of this, we first removed most of the old logs and kept only the last 10,000. We also scaled up the disk. To prevent it from recurring, we notified the BOSH core team, which introduced more aggressive purging of old tasks; nowadays this keeps our task folders on disk at a manageable 18,000. We also finalized the transition to the Graphite HM plugin for Prometheus, which is much lighter on BOSH than the BOSH exporter.
We were running an update and had reached the services when suddenly the Underbosh locked up and stopped working properly. For minutes at a time we got no response from it. Whenever a service was deployed or updated, BOSH stalled for minutes. Knowing that this issue came with the update, we checked what the update changed.
In the previous version, the a9s Data Service Framework used the BOSH CLI to follow the deploy task and determine whether a task was still running. But this had some issues. First, it would break on a network hiccup. Second, it was not that light on the director's CPU when 6 brokers each followed n tasks.
So to check the status, they instead just ran bosh tasks -r to determine the last status. The only downside was that the BOSH director was a bit lazy at task purging and had 3.5 million tasks in the database. The query used by BOSH was also not tuned for speed:

SELECT * FROM "tasks" WHERE ("deployment_name" = 'd27eda6') ORDER BY "timestamp" DESC LIMIT 30

If the deployment does not have 30 tasks, this runs backward through a table with 3.5 million entries, which takes a while. And while BOSH waited for the query to return, the broker usually timed out.
So how did we tackle this?
First, we changed the -r to -r=1 to query for only one task. To make sure one was found, we created one deploy task for each deployment. We opened an issue, and the BOSH core team first put an index on task types and then implemented more aggressive purging, 5 times as fast as before, shrinking the millions of tasks to 1,100 tasks in the database.
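For illustration, this is the shape of index that makes a query like the one above cheap. The actual fix shipped inside the BOSH director itself, so the statement below is a hypothetical sketch based only on the column names visible in the query, not the real migration:

```shell
#!/bin/sh
# Hypothetical composite index matching the WHERE + ORDER BY of the
# slow tasks query; printed here rather than applied, since the real
# fix landed as a BOSH director change.
cat <<'SQL'
CREATE INDEX tasks_deployment_name_timestamp_idx
    ON tasks (deployment_name, timestamp DESC);
SQL
```

With an index covering both the filter column and the sort order, the database can read the newest rows for one deployment directly instead of walking millions of entries backward.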
There are things you probably monitor from the start; I assume the availability of apps, the API and Diego capacity are among the first that come to mind. Here are some metrics we found are also very important:
- Network IP exhaustion. While it is IaaS- and use-case-dependent, in our case the provisioning of services is triggered by end users, so running out of IPs is suboptimal as it breaks the user workflow.
- Disk IOPS, depending on the IaaS provider
- Quota limitations. Depending on the IaaS, increases can take minutes, days or even weeks in the case of on-premise, where new hardware has to be ordered.
- CPU credits on important instances like BOSH, the API or the UAA, if unlimited mode is not an option
- Disk inode usage
- Certificate expiration
- Checks that validate that all metrics are still coming in
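Certificate expiration in particular is cheap to check. A minimal sketch using openssl's -checkend option; the default path is illustrative:

```shell
#!/bin/sh
# Warn if a PEM certificate expires within 30 days.
# The default path is illustrative, not a guaranteed director location.
CERT="${1:-/var/vcap/jobs/director/config/ssl/server.crt}"
DAYS=30

# -checkend exits non-zero if the cert expires within the given seconds.
if openssl x509 -checkend $((DAYS * 86400)) -noout -in "$CERT"; then
    echo "${CERT}: OK, valid for more than ${DAYS} days"
else
    echo "${CERT}: WARNING, expires within ${DAYS} days"
fi
```

Run from cron or your monitoring agent, this turns "the UAA cert silently expired" into an alert weeks in advance.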
What did 1200 deployments teach us?
Of course, we learned something while using BOSH far away from the usual settings. Here are some of our findings.
If you manage to bring the BOSH director to a lockup, the BOSH core team will fix the issue rather fast. BOSH is also rather stable: most issues stem from lazy cleanup, not from the director getting into state issues or stopping midway through a task.
For small to medium environments, the BOSH director does not need to be big: a t2.large or your IaaS equivalent (2 CPUs and 8 GB RAM) will carry you through the first 100 deployments. If you go larger, an m5.xlarge or m5.2xlarge will be enough, as at that point disk IO and network speed, not CPU, will be the main issue.
I also have some advice:
Don’t overdo it on workers: our biggest director has only 9 workers, and the others run with 3-4. Keep enough CPU around to keep the director stable when all task slots fill up at the same time; most of the time, people can wait another 5 minutes.