Different Forms of Quorum

Gluster
2012-11-09

Several of us had a discussion on IRC about the almost-a-year-old GlusterFS quorum enforcement feature, and what quorum enforcement is likely to look like in the future. I tried to explain briefly some plans that had previously been discussed in email among some of the developers, but hadn’t circulated beyond there, and then I realized I should just write a blog post about that.

First, let’s review what quorum is for. The basic idea is simple: if different sets of nodes in a distributed data-storage system become isolated from one another, it becomes possible for multiple mutually oblivious groups to make conflicting updates to the same object. The system therefore has two choices: allow this to occur and deal with the conflicts somehow, or prevent it. (Insert all of the usual verbiage about the CAP theorem/conjecture, eventual consistency, etc. here yourself. I’m bored with it.) One way to prevent such conflicts is to enforce some kind of quorum rule, so that only one still-connected group of nodes can apply updates. The rest are either limited to reading possibly-stale data, or can’t do anything at all. The key here is that There Can Be Only One such group. It must do something that no other group can do simultaneously. Here are some possible quorum rules:

  • Whichever group can set up disk fencing first has quorum. I know most people see fencing and quorum as separate, but experience has taught me that they’re not. That way lies madness.
  • Whichever group (if any) contains a majority of nodes has quorum. This is by far the most common rule.
  • Whichever group (if any) contains a weighted majority has quorum.
  • Whichever group (if any) has a majority of anchor nodes has quorum. Mango used this rule, not that anybody cares.

Yes, I realize that the simple-majority and anchor rules are special cases of the weighted-majority rule. The important thing, as I said, is that all of these rules satisfy the TCBOO condition. If your group has a majority, then no other group does... just like our recently concluded election. Even with simple majority quorum, though, there are still some questions that stretch the election analogy even further.
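Since those rules all reduce to weighted majority, here’s a minimal sketch of that reduction (Python, with names of my own choosing; none of this is GlusterFS code):

```python
def has_weighted_quorum(reachable, weights):
    """True if the reachable nodes hold a strict majority of total weight."""
    total = sum(weights.values())
    mine = sum(weights[n] for n in reachable)
    return mine * 2 > total

def has_majority_quorum(reachable, all_nodes):
    """Simple majority: every node weighted equally."""
    return has_weighted_quorum(reachable, {n: 1 for n in all_nodes})

def has_anchor_quorum(reachable, all_nodes, anchors):
    """Anchor quorum: only anchor nodes carry weight."""
    return has_weighted_quorum(
        reachable, {n: 1 if n in anchors else 0 for n in all_nodes})

nodes = ["a", "b", "c", "d", "e"]
print(has_majority_quorum(["a", "b", "c"], nodes))                    # True: 3 of 5
print(has_anchor_quorum(["b", "c"], nodes, anchors={"a", "b", "c"}))  # True: 2 of 3 anchors
```

Requiring a strict majority of one fixed total is what guarantees TCBOO: two disjoint groups can’t both clear the bar. Note too that with two equally weighted nodes, a strict majority means both of them, which is where the N=2 question comes from.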

  • Who does the counting? Clients or servers?
  • What kind of objects do you count? Servers, bricks, replica sets?
  • What filter do you apply to those objects? Only those relevant to a particular volume, or all in the cluster?
  • What do you do when N=2, so that the only way to meet quorum is to have all relevant objects present?

Right now the only quorum we have works like this: clients count all bricks within a replica set, and we do nothing special when N=2. That’s useful, certainly better than having no quorum and hitting massive split-brain problems all the time, but we can do better.
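As a concrete example of the current rule, here’s a sketch (hypothetical brick names; my own illustration, not the actual AFR code). With a two-brick replica set, a strict majority means both bricks, so a single failure costs quorum:

```python
# Sketch of the current client-side rule: count bricks in one replica set.
replica_set = ["server-a:/export/brick", "server-b:/export/brick"]
reachable = ["server-a:/export/brick"]   # server-b just went down

have_quorum = len(reachable) * 2 > len(replica_set)
print(have_quorum)   # False: 1 of 2 is not a strict majority
```

So what will future quorum look like? It turns out that it’s going to be totally different.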

  • Who counts? Servers.
  • What kind of objects are counted? Servers.
  • What filter do we apply? Right now, none; all servers throughout the cluster count toward quorum.
  • What do we do if N=2? Again, nothing.
The above is already represented in a patch. I think it’s awesome, but I’d still like to build on it in two ways. The first has to do with filters. I think it’s unacceptable that a volume which only exists on A+B can go read-only because C+D+E are down. Whole-cluster quorum is fine for management operations, but not for the data path. Therefore, once Pranith’s patch gets merged, I’ll probably submit another on top of it to implement an “only servers with an interest in this volume” filter.
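Such a filter might look something like this sketch (hypothetical names, not the actual glusterd code): count only the servers that host a brick of the volume in question.

```python
def servers_with_interest(volume_bricks):
    """A server has an interest in a volume if it hosts one of its bricks."""
    return {brick.split(":")[0] for brick in volume_bricks}

def volume_has_quorum(volume_bricks, reachable_servers):
    interested = servers_with_interest(volume_bricks)
    up = interested & set(reachable_servers)
    return len(up) * 2 > len(interested)

# A volume that lives only on A and B; C, D, and E are irrelevant to it.
bricks = ["A:/export/v1", "B:/export/v1"]
print(volume_has_quorum(bricks, {"A", "B"}))   # True, even with C+D+E down
print(volume_has_quorum(bricks, {"A"}))        # False: 1 of 2 interested servers
```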

The second issue is trickier. What should we do when N=2? In some cases, allowing a single failure to make a volume read-only (the current behavior) is fine. In others it’s not. One idea would be to “promote” from all bricks/servers in a replica set, to all in a volume, to all in the cluster. Unfortunately, that gets us nowhere in a two-server two-brick cluster, which is very common, especially in the critical case of people trying GlusterFS for the first time and seeing how it responds to failures. The other idea is arbiter nodes, which hold no data but pump up the quorum number for the cluster (or for a volume, if that’s how we’re counting). Thus, if we have a volume on two nodes in a cluster with more than two, a third node will be (automatically?) designated as having an interest in that volume, so that the effective quorum is two out of three. Since tracking servers’ interests in volumes will already have to be part of the second patch I’ve mentioned, adding an arbiter is just a simple matter of manipulating that information, so it should be a pretty straightforward third patch.
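Building on the same hypothetical filter, the arbiter idea might look like this sketch (again, names of my own invention): the arbiter is simply added to the volume’s interested set, so the effective quorum becomes two out of three.

```python
def volume_has_quorum_with_arbiters(volume_bricks, arbiters, reachable):
    """Arbiters hold no data but count toward the volume's quorum."""
    interested = {b.split(":")[0] for b in volume_bricks} | set(arbiters)
    up = interested & set(reachable)
    return len(up) * 2 > len(interested)

bricks = ["A:/export/v1", "B:/export/v1"]
# With arbiter C, losing B still leaves 2 of 3, so the volume stays writable.
print(volume_has_quorum_with_arbiters(bricks, {"C"}, {"A", "C"}))   # True
# Without an arbiter, the same failure leaves 1 of 2: no quorum.
print(volume_has_quorum_with_arbiters(bricks, set(), {"A"}))        # False
```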

So that’s where we are with quorum today, and where we’re likely to be going. Please feel free to add any further questions or suggestions in the comments.
