The Gluster Blog


GlusterFS Algorithms: Replication (future)


In my last post, I promised to talk a bit about some emergent properties of the current replication approach (AFR), and some directions for the future. The biggest issue is latency. If you will recall, there are five basic steps to doing a write (or other modifying operation): lock, increment changelog, write, decrement changelog, unlock. For many workloads, certain optimizations apply. “Changelog piggybacking” can skip the decrement/increment for batched writes. “Eager locking” can do likewise for the unlock/lock. Thus, for streaming writes you might see: lock, increment, very many writes, decrement, unlock. Pretty cool, huh?
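The sequence above, and the way the two optimizations collapse it for batched writes, can be modeled like this (a sketch of the message flow with invented names, not actual GlusterFS code):

```python
# Model of AFR's per-write message sequence and how changelog piggybacking
# and eager locking collapse repeated steps for batched writes.
# Hypothetical names; this sketches the protocol, not GlusterFS internals.

def afr_write_sequence(num_writes, piggyback=False, eager_lock=False):
    """List the network operations for num_writes back-to-back writes."""
    ops = []
    for i in range(num_writes):
        first, last = i == 0, i == num_writes - 1
        if first or not eager_lock:
            ops.append("lock")
        if first or not piggyback:
            ops.append("increment-changelog")
        ops.append("write")
        if last or not piggyback:
            ops.append("decrement-changelog")
        if last or not eager_lock:
            ops.append("unlock")
    return ops

# Plain AFR: five operations for every single write.
print(afr_write_sequence(1))
# With both optimizations, a stream of writes becomes:
# lock, increment, many writes, decrement, unlock.
print(afr_write_sequence(3, piggyback=True, eager_lock=True))
```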

Unfortunately, it doesn’t work so well for random synchronous writes, such as we’d see for virtual-machine-image workloads or applications using lightweight embedded databases. In a simple implementation, both changelog piggybacking and eager locking apply only when we can actually see that there’s a second request in progress (or at least in our queue) when the first completes. Sure, we could implement a Nagle-ish approach of waiting for some amount of time before we send the decrement/unlock, just in case another write comes along, but that can get complicated pretty quickly. How long should we wait? If we don’t wait long enough, we might be incurring extra context switches to handle the timeout and send the deferred cleanup messages. If we wait too long, then we might be blocking out other clients for too long (for the deferred unlock) or expanding the window during which a client disconnection would cause an actually-unnecessary repair on the file (for the deferred decrement). Wouldn’t it be better if we didn’t have to rely so much on optimizations to a fundamentally inefficient full-serialization approach?
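For illustration, the timeout approach being argued against might look something like this toy sketch (every name here is invented; nothing like this exists in GlusterFS, and picking `window` is exactly the problem described above):

```python
import threading
import time

# A toy "Nagle-ish" deferral: after a write completes, hold the lock and
# changelog for a short window in case another write arrives. All names
# and the window length are illustrative.

class DeferredCleanup:
    def __init__(self, window, flush):
        self.window = window      # seconds to wait for a follow-on write
        self.flush = flush        # sends the deferred decrement + unlock
        self._timer = None
        self._guard = threading.Lock()

    def write_completed(self):
        # Each completed write re-arms the timer; a back-to-back write
        # cancels the pending cleanup and piggybacks on the held lock.
        with self._guard:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.window, self.flush)
            self._timer.start()

flushed = []
dc = DeferredCleanup(window=0.01, flush=lambda: flushed.append("decrement+unlock"))
dc.write_completed()
dc.write_completed()      # lands inside the window, so only one flush fires
time.sleep(0.1)
print(flushed)            # ['decrement+unlock']
```

The costs called out above are visible even in the sketch: a timer fires (a context switch) whenever the guess is wrong, and the lock is held for up to `window` seconds longer than the write actually needed it.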

That brings us to High Speed Replication, or HSR. (Hey, if the old stuff is “advanced” then the new stuff can be “high speed” IMO.) The basic idea here is to reduce latency, in several steps. First, we eliminate the “pessimistic” locking and serialization in favor of a more “optimistic” model of detecting and resolving conflicts. I played around with some pretty complicated vector-clock schemes, but eventually settled on a method using a simple per-file version number at each server. The version acts as a predicate: if the version the client presents (i.e. the last one it knows about) doesn’t match the one at the server, the write is rejected. If a write is rejected by any server, the client simply retries it at all of them. This doesn’t provide any fairness or forward-progress guarantees, obviously, but that kind of contended workload isn’t the point. This isn’t supposed to be something that you could use to simulate shared disks, or for highly concurrent I/O as seen in the MPI world (go use PVFS). AFR won’t perform well in those cases either. Its serialization is there mostly to ensure that all servers execute writes in the same order even if they receive them out of order, to prevent permanent inconsistency between the copies. The conflict resolution and retry in HSR serve the same need; if two requests hit two servers in opposite orders, at least one will get a version conflict and be retried, so the copies will converge.
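A toy model of the version-predicate scheme makes the convergence argument concrete (illustrative names only, not the actual HSR wire protocol):

```python
# Each replica keeps a per-file version; a write carries the last version
# the client saw and is rejected on any mismatch, after which the client
# retries at all replicas. Names are invented for illustration.

class Replica:
    def __init__(self):
        self.version = 0
        self.data = b""

    def write(self, expected_version, data):
        if expected_version != self.version:
            return False          # predicate failed: client is out of date
        self.version += 1
        self.data = data
        return True

r1, r2 = Replica(), Replica()

# Clients A and B race, reaching the two replicas in opposite orders:
a1 = r1.write(0, b"A")   # A reaches r1 first: accepted
b2 = r2.write(0, b"B")   # B reaches r2 first: accepted
a2 = r2.write(0, b"A")   # rejected: r2 has already moved to version 1
b1 = r1.write(0, b"B")   # rejected: r1 has already moved to version 1

# A saw a rejection, so it refreshes the version and retries at *all*
# replicas; the copies converge even though they briefly disagreed.
r1.write(1, b"A")
r2.write(1, b"A")
print(r1.data == r2.data == b"A")   # True
```

Note how at least one of the racing clients is guaranteed a conflict somewhere, which is all the scheme needs to keep the copies from diverging permanently.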

What if a server or client dies in the middle? For that, we still have the changelogs. Yes, we still use them, and they still mean the exact same thing. This was a very deliberate decision, so that the existing self-heal mechanisms could still work (most of my fancy vector-clock schemes required completely separate repair procedures). However, while the bits on disk mean the same things, the way they’re updated is a bit different. We do the increment on the server side as part of the write, and we do the decrement asynchronously. Thus, one network round trip is eliminated and one is moved out of the latency path, but if we’re interrupted in the middle of a write then the existing self-heal will still take care of it. That means we have to do some things to make sure that the existing self-heal does in fact still work even if it’s run concurrently with our writes, but I don’t need to go into that level of detail. The important thing is that a non-contending user write now consists of only two parts: a write over the network before we return, and a changelog decrement that’s done after. The results are pretty satisfying.
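The resulting write path, with the increment folded into the write and the decrement off the latency path, can be sketched like so (invented names; a model of the described flow, not HSR source):

```python
# Sketch of the HSR write path: the changelog increment happens on the
# server as part of the write itself, and the decrement is sent only
# after the write has returned to the application.

class HsrServer:
    def __init__(self):
        self.changelog = 0     # nonzero means "a write may be in flight"
        self.data = b""

    def write(self, data):
        self.changelog += 1    # folded into the write: no extra round trip
        self.data = data

    def decrement(self):
        self.changelog -= 1    # arrives later, off the latency path

def hsr_write(servers, data):
    # The only synchronous network step: one write per replica.
    for s in servers:
        s.write(data)
    # From here the application could already have its completion; the
    # decrement goes out asynchronously. A crash before it lands leaves
    # the changelog nonzero, which is what tells self-heal to inspect
    # the file -- the bits on disk mean the same thing as under AFR.
    for s in servers:
        s.decrement()

servers = [HsrServer(), HsrServer()]
hsr_write(servers, b"x")
print([s.changelog for s in servers])   # [0, 0]
```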

[Graph: 4KB random synchronous write IOPS for the “afr default”, “afr write-behind”, “afr fast-path”, and “hsrepl” configurations]

Those are 4KB random synchronous writes using two of Storm On Demand’s 12GB SSD servers and a single client. I chose those because they represent a realistic configuration where performance is bound by network latency (plain old GigE AFAICT) rather than storage; I can actually get even more dramatic numbers in more artificial configurations, but that wouldn’t be useful. The “afr default” line is AFR as it exists today. Clearly it is in fact network-bound, since these machines are capable of doing about 30K IOPS each. The “afr write-behind” line was supposed to measure the effect of eager locking, but after I’d run the tests and been surprised by the results I realized that the real difference came from enabling write-behind (which is required for eager locking). Personally I don’t consider that a valid configuration, because if the user asked for synchronous writes they shouldn’t be queued in memory for write-behind, but it’s informative just to see the effect of that change. The “afr fast-path” line represents AFR with all locking and changelog updates disabled. This is a totally unsafe and invalid configuration, but it establishes a “speed of light” for replicated writes. No matter how well we optimize, replicated writes won’t be faster than that number on this hardware (especially on this network). The real point of the graph is the “hsrepl” line – twice as fast as plain AFR (the only other valid configuration IMO), or 2/3 of the speed of light, depending on how you look at it.

That’s exactly the kind of improvement I was hoping for. Our performance on this most demanding kind of workload has been a stumbling block for many, especially those who want to use GlusterFS for hosting virtual-machine images, and this just might bring those workloads “into the fold” for us. 3000 IOPS wouldn’t be that impressive for a local SSD, but this isn’t local storage. This is distributed storage, so that data survives even when nodes fail, so local-storage numbers don’t apply. It’s also synchronous I/O, so “throw everything in a huge unordered batch and tell me about my million IOPS later” doesn’t apply either. Scale this up to twenty servers on 10GbE, with more clients to feed them, and you’re probably talking about at least 200K globally available crash-safe IOPS for much less than the only real alternatives – SAN or other scale-out NAS – would cost.

Before I wrap up, I’d like to reiterate that this does not displace AFR. HSR still passes every request other than simple writes through to AFR, which must be present, to be handled the same way as before. HSR also relies on the AFR self-heal infrastructure, including all past and future improvements there. Users whose performance is not limited by network latency, either at the low end because they’re bound by slow disks or at the high end because they have a fast network, probably won’t see much benefit from HSR. Neither will users whose workloads are already well served by the recent improvements to AFR. In those cases, they might as well stick with AFR to reduce the number of moving parts. HSR is simply about extending our replication strategy to cover new use cases, not displacing AFR in old ones where it has already served quite well.

