The Quest For Balance

Gluster
2012-05-10

One of the key features of GlusterFS, or any horizontally scalable system like it, is the ability to rebalance data as servers are added or removed. How is that done? Come to think of it, what does “balance” even mean in such a system, and why is it so important to have it? Intuitively, balance has to do with the idea of each node doing a fair share of the total work. If you’re in balance, each node is doing very close to its fair share. If you’re out of balance, some nodes are doing far more while others are doing far less. For us, “work” means hosting files – more files on a server means more work for that server. In steady state, with a large number of files over a long period of time, consistent hashing does a remarkably good job of achieving this balance.

However, just after a server has been added or removed, we’re way out of balance. The new server has no files, and the DHT layouts won’t put any there. Thus, we need to adjust the layouts to include a hash range for the new server so that files will be placed there. In some cases that’s sufficient, and we can let allocation of new files take care of restoring balance, but in most cases we’ll need to move files there more proactively. In any case, the big question becomes how to assign the new server’s hash range. To do that, we need to consider two things.

  • A metric by which we determine the ideal size for each server’s hash range – and therefore whether it’s currently over- or under-loaded.
  • A method by which we determine the ideal position for the new server’s hash range.

For the rebalancing metric, we have several choices. The simplest is to say that each server has the same value, so each server gets an equally sized hash range. (This actually all happens per “brick” rather than per server, but I’m using the more familiar term to avoid having this be unnecessarily GlusterFS-specific.) Another option would be to assign ranges proportional to each server’s total storage capacity. A third option is to assign ranges proportional to each server’s free storage capacity. This has the effect of driving new file allocations toward (relatively) empty servers. For archival workloads where deletes are very uncommon, this will cause the system to converge “naturally” on a near-optimal balance without the need for proactive data migration. There is a slight downside to having the layouts not match the actual current distribution of files, making lookups ever so slightly slower, but I doubt the effect would even be noticeable. More importantly, for workloads where files are being deleted and recreated frequently, this approach could lead to the new servers becoming overloaded, requiring a migration of data back to the old servers. In general, it would be unwise to use this metric without recalculating the layouts periodically.
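To make the metrics concrete, here’s a quick sketch (plain Python, not GlusterFS code; the brick counts and capacity numbers are made up) of how a per-brick metric turns into hash-range sizes:

```python
# Minimal sketch: divide the hash space proportionally to a per-brick metric.
# GlusterFS actually uses a 32-bit hash space; 100 keeps the arithmetic readable.

HASH_SPACE = 100

def range_sizes(metrics, space=HASH_SPACE):
    """Return a hash-range size for each brick, proportional to its metric."""
    total = sum(metrics)
    sizes = [space * m // total for m in metrics]
    sizes[-1] += space - sum(sizes)  # hand any rounding remainder to the last brick
    return sizes

print(range_sizes([1, 1, 1, 1]))      # "equal" metric       -> [25, 25, 25, 25]
print(range_sizes([20, 20, 40, 20]))  # total capacity (TB)  -> [20, 20, 40, 20]
print(range_sizes([5, 10, 25, 10]))   # free capacity (TB)   -> [10, 20, 50, 20]
```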

For the rebalancing method, we have two slightly conflicting goals. One is to achieve perfect balance in the long term. The other is to minimize data motion in the short term. As it turns out, most methods tend to favor one goal over the other. For example, the simplest method is just to assign the new server’s range at one end of the total range, and shift every other server’s range to make room. This gives “perfect” distribution according to our rebalance metric, but often involves horrendous amounts of unnecessary data motion. To see why this is the case, consider a system as it goes from four servers to five. For simplicity we’ll assume the servers’ rebalance metrics are equal and that we have 100 possible hash values.

  • Server A starts with range 0-24, and ends with 0-19, so 5% of the total range is reassigned from A to B.
  • Server B starts with range 25-49, and ends with 20-39, so 10% of the total range is reassigned from B to C.
  • Server C starts with range 50-74, and ends with 40-59, so 15% of the total range is reassigned from C to D.
  • Server D starts with range 75-99, and ends with 60-79, so 20% of the total range is reassigned from D to E.
  • Server E starts with nothing, and ends with 80-99.

Notice how the percent-moved number keeps increasing at every step? That’s the direct result of adding at the end, leading to 50% data motion as we add 25% of the original capacity.
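That 50% figure is easy to check mechanically. Here’s a throwaway sketch (plain Python, not GlusterFS code) that expresses each layout as half-open ranges over our 100-value hash space and counts how many hash values change owners:

```python
# Sketch: data motion measured as the fraction of hash values that change owner.
# A layout maps each server to a half-open (start, end) range over 0-99.

def owner_map(layout, space=100):
    """Expand {server: (start, end)} into a per-hash-value list of owners."""
    owners = [None] * space
    for server, (start, end) in layout.items():
        for h in range(start, end):
            owners[h] = server
    return owners

def moved_fraction(old, new, space=100):
    """Fraction of hash values whose owner differs between two layouts."""
    before, after = owner_map(old, space), owner_map(new, space)
    return sum(b != a for b, a in zip(before, after)) / space

old    = {"A": (0, 25), "B": (25, 50), "C": (50, 75), "D": (75, 100)}
append = {"A": (0, 20), "B": (20, 40), "C": (40, 60),
          "D": (60, 80), "E": (80, 100)}
print(moved_fraction(old, append))  # 0.5 -- half of all the data moves
```

Let’s contrast this with another “optimal” method that yields the exact same range sizes but tries to choose a minimally disruptive place to add the new range (which happens to be in the middle this time).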

  • Server A starts with range 0-24, and ends with range 0-19, so 5% of the total range is reassigned from A to B.
  • Server B starts with range 25-49, and ends with range 20-39, so 10% of the total range is reassigned from B to E (not C).
  • Server C starts with range 50-74, and ends with range 60-79, so 10% of the total range is reassigned from C to E.
  • Server D starts with range 75-99, and ends with range 80-99, so 5% of the total range is reassigned from D to C.

This gives us exactly the same final range sizes as before – and therefore the same overall balance – but this time only 30% of the data moves. That could be hundreds of terabytes in many GlusterFS installations, so it’s a pretty big deal.
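Plugging the insert layout into the same moved_fraction sketch from above confirms the arithmetic:

```python
# Same final range sizes as the append layout, but E lands in the middle.
insert = {"A": (0, 20), "B": (20, 40), "E": (40, 60),
          "C": (60, 80), "D": (80, 100)}
print(moved_fraction(old, insert))  # 0.3 -- only 30% of the data moves
```

But wait, we can do better. If we wanted to minimize data movement even more, even at the expense of a slightly less optimal final distribution, we could try a couple of other approaches.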

  • “Split one” picks the most overloaded current server and splits its hash range proportionally according to our metric for the two servers involved (one old plus one new); a simplified sketch follows this list. This has an almost fractal behavior of dividing the hash space first into halves, then into quarters, etc. The difference between the largest and smallest range can be as much as 2x, but data migration is very low (approximately 1/2^n).
  • “Split two” picks the most overloaded two current servers with adjacent ranges, and splits the combined range proportionally according to our metric for the three servers involved (two old plus one new). This gives more nearly optimal final placement, with hardly any increase in data movement.
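Continuing the earlier sketch, here’s a simplified version of “split one” – hypothetical, not the actual GlusterFS code, and reduced to equal metrics so that “most overloaded” just means “largest range” and the split is an even halving:

```python
def split_one(layout, new_server):
    """Carve the new server's range out of the single most loaded brick."""
    victim = max(layout, key=lambda s: layout[s][1] - layout[s][0])
    start, end = layout[victim]
    mid = (start + end) // 2  # equal metrics -> split the victim's range in half
    out = dict(layout)
    out[victim] = (start, mid)
    out[new_server] = (mid, end)
    return out

split = split_one(old, "E")        # A's 0-24 becomes A: 0-11 and E: 12-24
print(moved_fraction(old, split))  # 0.13 -- very little data moves...
print(split)                       # ...but range sizes now vary from 12 to 25
```

Going from four bricks to five this way moves only 13% of the data, but the smallest range ends up about half the size of the largest – exactly the trade-off described above.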

There are many more possibilities, and more still if we consider the algorithms one might use for adding more than one server at a time, but this should give some idea of the design space. We already have three metrics and four methods, as follows.

  • Metrics: equal, total size, free size
  • Methods: append range, insert range, split one, split two

The good news is that we have nowhere to go but up. GlusterFS currently uses equal+append, which combines the weakest metric with the weakest method. I already have code to use total or free size as the metric, and insert range or split two as the method. In the not-so-distant future I hope to finish the work that’s necessary to make these available as options in GlusterFS itself (currently the code only does calculations and is not integrated with the actual fix-layout code). I think that most people would be best off using either free+insert or total+split2 most of the time, depending on whether their workload matches the “archival” pattern mentioned above, then using total+insert less often (e.g. during planned downtime or predicted slack periods) to fix up the accumulated imperfections in the resulting layouts. However, the real point of both the code and this article is to let users make their own choices about how best to determine what “balance” means for them and how to restore it in a way that best suits their needs.
