This is the second part of our series about Swift.
In case you missed part one – have a look here.
Let’s get more technical
Now let’s walk through the process line and what is done at which step. This part is aimed at technical readers rather than business-oriented ones.
First of all, you have finished your conceptual work, including all cluster requirements. The next step is pretty simple: install your cluster in a staging environment. In our case that means the following:
- 2 Proxy nodes
- 3 Storage nodes
with all the final settings we planned in part 1.
These settings include:
- network related
- Swift configs
- kernel settings
- load balancer (for proxy)
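By kernel settings we mainly mean TCP tuning on the proxy and storage nodes. A typical fragment for `/etc/sysctl.conf` might look like this — the values are examples only and have to be tuned for your own load:

```
# allow reuse of sockets in TIME_WAIT for new outgoing connections
net.ipv4.tcp_tw_reuse = 1
# release closed sockets faster
net.ipv4.tcp_fin_timeout = 20
# widen the local port range for the many connections Swift opens
net.ipv4.ip_local_port_range = 10000 65000
```

Apply the settings with `sysctl -p` and verify them before the first test run.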
At the end you should have your final setup running in a staging environment. Now it is time for the first test run, which should be designed almost entirely before you build the environment. Good starting points for tests are:
- concurrency (parallel GET / PUT / POST / DELETE requests)
- load (many fast HTTP calls)
- large file uploads/downloads
- in our case, TempURL
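A TempURL test is easy to script, because the signature is just an HMAC-SHA1 over the method, expiry time and object path. Here is a minimal sketch in Python — the host name, account, container, object and key are placeholders for illustration:

```python
import hmac
import time
from hashlib import sha1

# All values below are placeholders for illustration.
key = b"MYSECRETKEY"                  # the X-Account-Meta-Temp-URL-Key set on the account
method = "GET"
expires = int(time.time()) + 3600     # link valid for one hour
path = "/v1/AUTH_test/container/object"

# TempURL signature: HMAC-SHA1 over "METHOD\nEXPIRES\nPATH"
body = f"{method}\n{expires}\n{path}".encode()
sig = hmac.new(key, body, sha1).hexdigest()

url = f"https://swift.example.com{path}?temp_url_sig={sig}&temp_url_expires={expires}"
print(url)
```

Fetching the printed URL with curl against your staging proxy is a quick end-to-end check that the TempURL middleware is configured correctly.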
While running these tests, you should monitor the whole cluster to spot any issues or bottlenecks, especially while the replicator and updater processes are running. In particular, the number of TIME_WAIT connections should be checked on each storage node.
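Counting TIME_WAIT sockets doesn’t need any extra tooling; on Linux you can read them straight out of `/proc/net/tcp`, where state `06` is TIME_WAIT. A small sketch:

```python
# Count sockets in TIME_WAIT by reading /proc/net/tcp (Linux only);
# state code "06" is TIME_WAIT in the kernel's socket state encoding.
def count_time_wait(proc_file="/proc/net/tcp"):
    with open(proc_file) as f:
        lines = f.readlines()[1:]          # skip the header line
    # field 3 of each row is the socket state
    return sum(1 for line in lines if line.split()[3] == "06")

print("TIME_WAIT sockets:", count_time_wait())
```

Running this periodically on each storage node during the load tests shows you immediately whether connections are piling up.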
Inexperienced users should now take time to play with the rings: how does a ring work, what happens if a ring is changed with new nodes, or if you want to disable a node? In our case, we tested the rings too, because we wanted to know whether different versions of Swift storage nodes work together correctly.
The question was: how would everything behave if we have data on two storage nodes running an older Swift version, then increase the replica count by 1 and add new storage hosts? Normally it should work perfectly, because the ring hash doesn’t change. However, our tests revealed some issues with the account data, because the old account databases weren’t mapped to any storage policy.
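The reason the ring hash doesn’t change is that the partition an object lands on is a pure function of its path and the ring’s part power — the replica count only decides how many devices are assigned per partition. A simplified sketch of Swift’s partition mapping (the real ring additionally mixes in a per-cluster hash path prefix/suffix):

```python
import struct
from hashlib import md5

# Simplified sketch of how Swift maps an object path to a ring partition.
PART_POWER = 10                # example part power -> 2**10 partitions
PART_SHIFT = 32 - PART_POWER

def partition(account, container, obj):
    path = f"/{account}/{container}/{obj}".encode()
    digest = md5(path).digest()
    # take the first 4 bytes of the MD5 and shift down to the part power
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT

# The partition depends only on the path and the part power;
# adding replicas or devices does not move existing partitions.
p = partition("AUTH_test", "photos", "cat.jpg")
print("partition:", p)
```

This is why bumping the replica count from 2 to 3 does not reshuffle data — it only tells the ring to place one more copy of each partition.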
Storage policies are quite new in Swift, but normally every account/container/object is mapped to one.
This migration is buggy in version 2.2.2 (Ubuntu Cloud Archive package) when the replicator can’t find a policy. You can solve the issue as described in this bug report and by restarting your processes. It works perfectly and without any risk.
Once all your tests are done and you have found and solved the issues, you can start the production rollout. We began with our proxy nodes and switched over to them first.
As we mentioned a couple of paragraphs above, our old proxy processes were running on the same machines as all the storage processes. Switching to separate proxy nodes therefore decreases the load on the old nodes — and we need all of their capacity for the replication and rebalancing that happens once the new storage nodes are added.
In the next step, change the ring to add the new storage nodes and increase the replica count to 3. After that, roll out the new ring files and restart all related services. If your existing cluster is under medium or heavy load, you should do this in several steps: first the container and account rings, then one storage node together with the replica increase to 3, and finally the remaining two storage nodes. To conclude, remove the old storage nodes — or simply upgrade them to a newer Swift version if you still need them.
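The ring changes above boil down to a few `swift-ring-builder` calls. A sketch, assuming an `object.builder` file and example IPs, devices and weights — adapt every value to your own cluster:

```sh
# increase the replica count and add the new storage nodes (example values)
swift-ring-builder object.builder set_replicas 3
swift-ring-builder object.builder add r1z1-10.0.0.10:6000/sdb 100
swift-ring-builder object.builder add r1z2-10.0.0.11:6000/sdb 100
swift-ring-builder object.builder rebalance
# then copy the resulting object.ring.gz to every node and restart the Swift services
```

The same procedure applies to the account and container builders; do the rebalances stepwise if the cluster is under load.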
One hint for the end of this blog post: definitely use configuration management tools like Puppet or Chef. Otherwise this game is not much fun, and you will create problems for yourself through misconfiguration!