Replication is the most necessarily complex part of GlusterFS – even more than distribution, which would probably be the most common guess. It’s also one of the things that sets GlusterFS apart from most of its obvious competitors. Many of them simply require that you implement RAID (to protect against disk failures) and heartbeat/failover (to protect against node failures) externally. Thanks, guys. The worst thing about that approach is that it depends on shared storage at least between failover sets, and the whole idea of a scale-out filesystem is supposed to be that it can run on cheap commodity hardware. A few alternatives do implement their own replication, and kudos to them, but I’m not here to describe how other projects do things. In GlusterFS, the approach is based on using extended attributes to mark files as potentially “dirty” while they’re being modified, so that they can be repaired if the modification fails on one replica partway through. I posted a description of these extended attributes a while back, so here I’ll focus more on the algorithms that use them. The conceptual process of writing to a file consists of the following steps.

1. Take out a lock on the file, on every replica, so that concurrent writers can’t interleave with this one.
2. Mark the operation as pending (the “pre-op”) by incrementing the changelog counters, stored in extended attributes, on every replica.
3. Issue the write itself to every replica.
4. Clear the changelog counters (the “post-op”) on the replicas where the write succeeded, so that any counter left non-zero points at a replica that might have missed the write.
5. Release the lock.
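To make that sequence concrete, here is a minimal Python sketch of the idea. This is not GlusterFS code: the Replica class and every name in it are invented for illustration, and the real implementation does all of this over the network using extended attributes rather than in-memory counters.

```python
from collections import defaultdict

class Replica:
    """Toy in-memory stand-in for one brick; purely for illustration."""
    def __init__(self, name):
        self.name = name
        self.data = bytearray()
        self.pending = defaultdict(int)   # changelog counters, keyed by peer name

    def lock(self, offset, length):       # a real brick takes a byte-range lock
        pass

    def unlock(self, offset, length):
        pass

    def write(self, offset, buf):
        end = offset + len(buf)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf
        return True                       # pretend the write always succeeds


def replicated_write(replicas, offset, buf):
    # 1. Lock the file on every replica.
    for r in replicas:
        r.lock(offset, len(buf))
    # 2. Pre-op: each replica bumps the counter it keeps for every *other*
    #    replica, marking the operation as pending ("dirty") everywhere.
    for r in replicas:
        for other in replicas:
            if other is not r:
                r.pending[other.name] += 1
    # 3. The write itself, sent to every replica.
    ok = {r.name: r.write(offset, buf) for r in replicas}
    # 4. Post-op: on every replica where the write succeeded, clear the
    #    counters pointing at peers where it also succeeded.  A counter
    #    left non-zero is the evidence that self-heal will use later.
    for r in replicas:
        if ok[r.name]:
            for other in replicas:
                if other is not r and ok[other.name]:
                    r.pending[other.name] -= 1
    # 5. Unlock.
    for r in replicas:
        r.unlock(offset, len(buf))
```

If you run this over a few Replica objects and make one of them fail its write, the counters that stay non-zero on the survivors all point at the failed replica, which is exactly the breadcrumb the self-heal logic described below relies on.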
Note that this is just the conceptual view. In reality, many optimizations can be applied. For example:

* If another write to the same file is known to be coming right behind this one (for instance because it’s queued up in write-behind), the changelog decrement for this write and the increment for the next one cancel out, so both can be skipped (“changelog piggybacking”).
* Likewise, instead of unlocking after one write and re-locking for the next, the lock can simply be held across the whole run of writes (“eager locking”).
With these optimizations, the number of network round trips per write can go from an execrable five (or more) down to one – just the write. Even better, with write-behind enabled, the access patterns that allow this become much more likely. Unfortunately, many workloads either can’t allow write-behind or don’t provide an access pattern that it can optimize, so these optimizations won’t apply, and IMO treating them as the default when measuring performance is tantamount to cheating on benchmarks.
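As a back-of-the-envelope check on that claim, here is a tiny helper, invented for this post, that counts round trips for a run of back-to-back writes, assuming the five-step breakdown above and assuming the lock and changelog operations can be amortized across the whole run.

```python
def round_trips(num_writes, optimized=False):
    """Count network round trips for a run of back-to-back writes.

    Unoptimized, every write pays for lock + pre-op + write + post-op +
    unlock: five round trips.  If the lock and pre-op can be paid once at
    the start of the run and the post-op and unlock once at the end, the
    steady-state cost is just the write itself.
    """
    if not optimized:
        return 5 * num_writes
    return 2 + num_writes + 2


print(round_trips(100))                  # 500
print(round_trips(100, optimized=True))  # 104, i.e. roughly one per write
```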
All of this might seem complex, but the real complexity is in how we use these changelog values to do self-heal. Here we encounter some more of GlusterFS’s unique terminology, based on which replicas have non-zero changelog entries for which others. If X has a non-zero entry for Y, we say that X “accuses” Y (of having incomplete operations), and this leads to the following “characters” for each replica.

* innocent: accuses no one, not even itself
* fool: accuses itself, i.e. it has (or appears to have) incomplete operations of its own
* wise: does not accuse itself, but does accuse at least one other replica
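Here is a small Python sketch of that classification. The dict-of-dicts representation and the function name are made up for this post; in the real system the counters live in each brick’s trusted.afr extended attributes.

```python
def characters(pending):
    """Classify each replica from its changelog ("accusation") counters.

    pending[x][y] is the counter replica x holds for replica y; this
    representation is invented for illustration only.
    """
    roles = {}
    for x, row in pending.items():
        accuses_self = row.get(x, 0) != 0
        accuses_others = any(c != 0 for y, c in row.items() if y != x)
        if accuses_self:
            roles[x] = "fool"        # has incomplete operations of its own
        elif accuses_others:
            roles[x] = "wise"        # clean itself, but blames someone else
        else:
            roles[x] = "innocent"    # accuses nobody at all
    return roles


# For example, if brick0 finished writes that never reached brick1:
# characters({"brick0": {"brick1": 2}, "brick1": {"brick0": 0}})
#   -> {"brick0": "wise", "brick1": "innocent"}
```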
The algorithm for determining which way to self-heal is really complicated, so I’ll just hit some of the highlights. You might recall that a replica does not have a changelog entry for itself[1], so how does it accuse itself (i.e. become a fool)? The secret is that an implicit self-accusation is made under some conditions – I confess that even I don’t fully understand exactly when the code does this. The key is that this separates the aggregation of state from decisions about state, allowing the decision logic to work the same way in a whole bunch of different conditions. Some of the most common or important cases are:

* If every replica is innocent, there is nothing to heal.
* If there are wise replicas and none of them accuses another wise replica, the wise ones become sources and whatever they accuse becomes a sink, to be repaired from the sources.
* If the wise replicas accuse each other, that’s split-brain; there is no safe way to pick a source automatically, so manual intervention is required.
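The sketch below shows roughly how those cases fall out of the accusation data. It is a toy reconstruction using the same made-up representation as the earlier sketches, not the real afr_build_sources logic, which covers many more situations.

```python
def pick_sources(pending):
    """Toy self-heal decision over pending[x][y] counters (made-up format).

    Returns (sources, sinks); raises on split-brain.  The real code in
    afr_build_sources and friends handles far more cases than this.
    """
    replicas = list(pending)

    def accuses(x, y):
        return pending[x].get(y, 0) != 0

    wise = [x for x in replicas
            if not accuses(x, x) and any(accuses(x, y) for y in replicas if y != x)]
    if not wise:
        # Either everyone is innocent (nothing to heal) or everyone is a
        # fool, which the real code handles as a special case.
        return [], []

    # A wise replica that no other wise replica accuses can serve as a source.
    sources = [y for y in wise if not any(accuses(x, y) for x in wise if x != y)]
    if not sources:
        raise RuntimeError("split-brain: the wise replicas accuse each other")

    # Whatever the sources accuse becomes a sink, repaired from the sources.
    sinks = sorted({y for x in sources for y in replicas if y != x and accuses(x, y)})
    return sources, sinks
```

Feeding it the two-brick example from the previous sketch yields brick0 as the source and brick1 as the sink, while two wise replicas that accuse each other raise the split-brain error.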
There’s a lot more – feel free to explore the vast forest of calls into and out of afr_build_sources if you really want to appreciate how complicated this is – but this is getting long already and we still need to discuss how self-heal is actually triggered. This has evolved quite a bit over time. The first idea was that it would be done “behind the scenes” when a file is looked up, and users wouldn’t have to worry about it. Not too surprisingly, people who actually deployed GlusterFS were uncomfortable with having a potentially large and – more importantly – unknown number of “vulnerable” files after a failure. Thus, it became common for people to run a full scan across the entire volume (using “find | stat” or “ls -alR”) to force a quicker self-heal after a failure. Recently GlusterFS has started to do this automatically through its own self-heal daemon, and even more recently code was added to log which files need self-heal instead of requiring a full scan, which can take days or even weeks. (This is the same basic idea I had demonstrated back in November of 2010, but it is implemented quite differently.) In GlusterFS 3.3 or 3.4, the result will be an automatic and (reasonably) efficient self-heal process, which might be one of the most significant improvements since the new cluster- and volume-management framework was added in 3.1.
I was going to write about some emergent properties of this approach, and some directions for the future, but this post has gotten quite long enough. I’ll save that for next time.
[1] Update March 13: Avati points out that bricks do have a changelog entry for themselves now, and I’ve verified this to be the case. Mystery solved.