The Gluster Blog


Multi Ring Hashing


Ring-based consistent hashing is one of my favorite algorithms. It’s an elegant solution to a common set of problems, and many times I’ve seen people’s eyes light up when they realize that finding data among a set of servers doesn’t have to involve central directories or expensive lookups. Systems based on this capability are still emerging and evolving rapidly, because it’s so powerful. That said, once you get into actually implementing such a system, you start to realize that there are still some secondary problems that have to be resolved before you have a working system. One of these is the assignment of virtual node IDs or tokens within the hashing ring. Sam Overton at Acunu has recently written up an excellent overview of this issue, rightly pointing out the problems of both one token-per-server and infinite-token approaches. Of the three approaches he considers, random token assignment with more than one token per server really does seem like the best choice.

However, I think there’s another way that’s even better than that: multiple hash rings. Instead of each node having several tokens in one ring, it has one token in multiple rings. Part of the hash for a key is used to select a ring, then the remainder of the hash is used to find the key in that ring. This is mathematically equivalent to a certain way of allocating multiple tokens within a single ring, such that each of N contiguous ring segments contains a token for each node, as shown in the following diagram (colors represent nodes, three tokens per node).
[Figure: ring equivalence — a single ring with interleaved per-node tokens alongside the equivalent set of multiple rings]
Mathematical equivalence isn’t the whole story, though. There are dozens of programming languages that are all Turing-complete and thus mathematically equivalent, and yet some are still better than others either for specific purposes or in general. In this case, one very important difference between the two approaches is that with multiple rings each ring closes itself while with the single ring each segment leads into the next. To see why this matters, consider how such hashing rings are often used to choose replicas as well as initial locations by simply moving around the ring (usually clockwise). It’s highly desirable in such cases for each of a node’s N tokens to have different predecessors and successors, so that when the node fails its load is distributed instead of being concentrated on one alternate. When the replication level R is more than two, it’s desirable for R-1 predecessors and successors to be different, which is even more difficult.
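The two-level lookup and clockwise replica walk described above can be sketched in a few lines. This is a minimal illustration under assumed names and token layout (`build_rings` and `lookup` are hypothetical helpers, not code from Gluster or any other system): part of the key's hash selects a ring, the remainder positions the key within that ring, and replicas are found by continuing clockwise. Because each ring holds exactly one token per node, consecutive replicas within a ring automatically land on distinct nodes.

```python
import bisect
import hashlib

def build_rings(nodes, num_rings):
    """Give each node one token in every ring; each ring is a sorted
    list of (token, node) pairs. Tokens here are just hashes of the
    node name plus the ring index, standing in for random assignment."""
    rings = []
    for r in range(num_rings):
        ring = sorted(
            (int.from_bytes(
                hashlib.sha1(f"{node}:{r}".encode()).digest()[:4], "big"),
             node)
            for node in nodes
        )
        rings.append(ring)
    return rings

def lookup(key, rings, replicas=1):
    """Use part of the key's hash to pick a ring, the remainder to
    find a position in it, then walk clockwise for the replicas."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
    ring = rings[h % len(rings)]           # low-order part selects the ring
    pos = (h // len(rings)) % (1 << 32)    # remainder locates the key
    i = bisect.bisect_left(ring, (pos,))   # first token clockwise from pos
    return [ring[(i + k) % len(ring)][1] for k in range(replicas)]
```

With four nodes and three rings, `lookup("some-key", build_rings(["blue", "red", "green", "purple"], 3), replicas=3)` returns three distinct nodes, and the replica set for a given key never spills into another node's token group.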

With that in mind, look at the single ring in the diagram above and consider what happens when R=3. A replica group starting at the purple 7:00 segment would be back on purple at 9:00. Similarly, a group starting on red at 11:00 ends up on red again at 1:00. Now try to fix that, without introducing a similar problem elsewhere. Just for extra fun, consider where you’d have to add the three tokens for a fifth node to preserve proper replication behavior. Token assignment can become quite a puzzle even at this size. It gets considerably more difficult as the node count increases, and the possibility of assigning tokens without dividing them into one-per-node sets doesn’t help (it almost seems to make things worse). By contrast, this problem never even exists in the multi-ring alternative, because the replica sets never overflow from one set of per-node tokens to another set. It’s still necessary to make sure that the rings are in different orders, but even that turns out to be dirt simple. Just assign tokens to each node based on a different “stride width” in each ring, skipping over already-assigned nodes. Here’s how it worked out for the example above.

  • Basic order is: blue, red, green, purple.
  • First ring, stride=1. Just follow the basic order.
  • Second ring, stride=2. The first slot is blue; counting two through the order lands on green. Another two wraps back around to blue, which is already taken, so we skip ahead to red. From there we resume counting by two, which gets us purple.
  • Third ring, stride=3. The first slot is blue again; counting three lands on purple. Three more wraps around to green, and finally red.
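The stride-width assignment above is easy to express in code. Here's a minimal sketch (`stride_order` is a hypothetical helper, not from any real implementation): walk the basic order in steps of the stride, skipping over any node already placed in the current ring.

```python
def stride_order(nodes, stride):
    """Walk the basic order in steps of `stride`, skipping nodes that
    are already placed in this ring, until every node has a slot."""
    n = len(nodes)
    order = []
    idx = 0
    while len(order) < n:
        while nodes[idx] in order:      # slot already taken: skip ahead one
            idx = (idx + 1) % n
        order.append(nodes[idx])
        idx = (idx + stride) % n        # resume counting by `stride`
    return order

basic = ["blue", "red", "green", "purple"]
rings = [stride_order(basic, s) for s in (1, 2, 3)]
# rings[0] == ["blue", "red", "green", "purple"]
# rings[1] == ["blue", "green", "red", "purple"]
# rings[2] == ["blue", "purple", "green", "red"]
```

Note that each ring comes out as a different permutation, so each node's neighbors differ from ring to ring.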

Ideally, the stride widths (after the first) should be relatively prime both to each other and to the total node count, and greater than the maximum replica count, but that’s not possible with small numbers of nodes. That’s why we end up with failover imbalances such as 2/3 of purple’s load failing over to blue, but there’s never a case where two replicas (even with R=4) end up on the same node in normal operation. In the end, the additional token-assignment flexibility of the single ring provides little practical benefit, while the multi-ring alternative makes life a lot simpler in the quite common case of using the ring(s) to place replicas as well as initial copies.
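Both claims are easy to check mechanically. Writing out the three ring orders from the stride walk-through literally, purple's clockwise successor is blue in two rings out of three, and no contiguous replica set ever repeats a node:

```python
from collections import Counter

# The three ring orders produced by the stride assignment described above.
rings = [
    ["blue", "red", "green", "purple"],   # stride 1
    ["blue", "green", "red", "purple"],   # stride 2
    ["blue", "purple", "green", "red"],   # stride 3
]

# When a node fails, its clockwise successor in each ring absorbs its load.
successors = Counter(
    ring[(ring.index("purple") + 1) % len(ring)] for ring in rings
)
# successors counts blue twice and green once: 2/3 of purple's load
# fails over to blue.

# Even at R=4, a replica set (a contiguous clockwise run within one ring)
# never lands on the same node twice, because each ring closes on itself.
for ring in rings:
    for start in range(len(ring)):
        replicas = [ring[(start + k) % len(ring)] for k in range(4)]
        assert len(set(replicas)) == 4
```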

