GlusterÂ blog stories provide high-level spotlights on our users all over the world
A lot of people seem to be curious about how GlusterFS works, not just in the sense of effects but in terms of internal algorithms etc. as well. Hereâ€™s an example from this morning. The documentation at this level really is kind of sparse, so I might as well start filling some of the gaps. Today Iâ€™ll talk about DHT, which is the real core of how GlusterFS aggregates capacity and performance across multiple servers. Its responsibility is to place each file on exactly one of its subvolumes â€“ unlike either replication (which places copies on all of its subvolumes) or striping (which places pieces onto all of its subvolumes). Itâ€™s a routing function, not splitting or copying.
The basic method used in DHT is consistent hashing. Each subvolume (brick) is assigned a range within a 32-bit hash space, covering the entire range with no holes or overlaps. Then each file is also assigned a value in that same space, by hashing its name. Exactly one brick will have an assigned range including the fileâ€™s hash value, and so the file â€śshouldâ€ť be on that brick. However, there are many cases where that wonâ€™t be the case, such as when the set of bricks (and therefore the range assignment of ranges) has changed since the file was created, or when a brick is nearly full. Much of the complexity in DHT involves these special cases, which weâ€™ll discuss in a moment. First, though, itâ€™s worth making a couple more observations about the basic algorithm.
So, those are the basics. How about those special cases? Itâ€™s probably easiest to look at the â€śreadâ€ť path first, where weâ€™re trying to find a file that we expect to be there. Hereâ€™s the sequence of operations.
Whatâ€™s a link file, then? Have you ever looked on one of your bricks and seen zero-length files with weird permissions (sticky bit set)? Those are link files. If you look closer, youâ€™ll also see that they have trusted.dht.linkfile xattrs with brick names in them. Thatâ€™s how we avoid the â€śbroadcastâ€ť mentioned above. On subsequent lookups, if we find a link file we just follow it to the real brick. Considering that we only go through this lookup procedure once per file per client anyway (location information is cached), the cost of â€śguessing wrongâ€ť is therefore pretty minimal. I once implemented a scheme where we do an exponentially expanding search instead of an immediate broadcast, hoping to achieve a better balance of lookup latency vs. network traffic, but in the end it just didnâ€™t seem to make a difference so the extra complexity wasnâ€™t worth it. Now, letâ€™s look at the file-creation path.
This brings us to rebalancing, which is one of the key challenges â€“ and therefore one of the most interesting research areas IMO â€“ in this kind of system. The first thing to know about GlusterFS rebalancing is that itâ€™s not automatic. If you add a new brick, even new files wonâ€™t be put on it until you do the â€śfix-layoutâ€ť part of rebalance, and old files wonâ€™t be put on it until you do the â€śmigrate-dataâ€ť part. What do these do?
In my personal opinion, there are problemsenhancement opportunities in both of these areas. Letâ€™s take these in reverse order. Migrate-data is slow. What it should do is run in parallel on all of the bricks, with each brick either â€śpushingâ€ť data that is currently local but needs to be elsewhere or â€śpullingâ€ť data thatâ€™s the other way around. What it does instead is run on one node, potentially moving files for which it is neither source nor destination. This is a big deal, because it causes rebalance to take days when it should take hours â€“ or weeks when it should take days, on larger installations. The amount of I/O involved is also why you donâ€™t necessarily want this to be an automatic process.
While the migrate-data issue is at the level of mechanics and implementation, the fix-layout issue is at more of a conceptual level. To put it simply, when we add a new brick we should reallocate approximately 1/new_brick_count hash values. Because our layout calculations are naive, we will usually reallocate much more than that â€“ exacerbating the migrate-data problem because reallocated hash values correspond to moved files. Time for a picture.
The outer ring represents the state with just three bricks â€“ hash value zero at the top, split into three equal ranges. The inner ring represents the state after adding a fourth brick. Any place where the inner and outer rings are different colors represents a range that has been reassigned from one brick to another â€“ implying a migration of data likewise. If you look carefully, youâ€™ll see that weâ€™re moving half of the data when it should be only a quarter â€“ 8% blue to orange, 17% orange to yellow, and 25% yellow to green. What could we do thatâ€™s better? Not much, if we stay within the limitation of a single brick claiming a single range, but there really doesnâ€™t need to be such a limitation. Instead, we could borrow from Dynamo and assign multiple â€śvirtual node IDsâ€ť for brick four, giving it a total of 25% drawn equally from bricks one through three. (If you look at this as a clock, thatâ€™s one hour each at three, seven, and eleven oâ€™clock). Taking this approach too far can cause layout tables to get unreasonably large, so sometimes it makes more sense to forego this optimization or even reverse it by combining/swapping ranges even though that will cause â€śunnecessaryâ€ť data migration. Done sensibly, this can keep the table size under control without resorting to the scale-limiting â€śhash bucketâ€ť approach of some Dynamo derivatives. There are even more sophisticated ways to address this problem, but thatâ€™s a whole different post and this should be enough to convey the general problem space.
Thatâ€™s how DHT works today, and some thoughts about how it might evolve in the future. The next article will focus on replication.
mountpoint Weâ€™ve published the program schedule for mountpoint! Registration available at: http://mountpoint.io/ 4.1 Released: our 4.1 release came out in June! Have you installed it? Do you have thoughts on how our releases could be improved? Our 4.1 retrospective is out! https://www.gluster.org/4-1-retrospective/ Community Meeting postponed due to July...
The Gluster community is pleased to announce the release ofÂ 4.1, our latest long term supported release. This is a major release that includes a range of features enhancing management, performance, monitoring, and providing newer functionality like thin arbiters, cloud archival, time consistency. It also contains several bug fixes. A selection...