In a distributed hash table lookup, like that used by GlusterFS, misses are expensive. Let’s look at how it works and why misses are “bad”.
When you open() a file, the distribute translator is given one piece of information to find your file: the filename. To determine where that file lives, the translator runs the filename through a hashing algorithm, turning the filename into a number.
#!/usr/bin/env python
import ctypes
import sys

glusterfs = ctypes.cdll.LoadLibrary("libglusterfs.so.0")

def gf_dm_hashfn(filename):
    return ctypes.c_uint32(glusterfs.gf_dm_hashfn(filename, len(filename)))

if __name__ == "__main__":
    print hex(gf_dm_hashfn(sys.argv[1]).value)
You can then calculate the hash for a filename:
# python gf_dm_hash.py camelot.blend
0x99d1b6fL
From this the distribute translator looks to see if it has the mappings for that directory cached. If it doesn’t, it queries all the distribute subvolumes for the dht mappings for that directory. Those mappings are stored in extended attributes and look like:
# getfattr -n trusted.glusterfs.dht -e hex */models/silly_places
# file: a/models/silly_places
trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff

# file: b/models/silly_places
trusted.glusterfs.dht=0x0000000100000000000000003ffffffe

# file: c/models/silly_places
trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd

# file: d/models/silly_places
trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
The trusted.glusterfs.dht value ends in two uint32 values. These are the start and end of the hash range that belongs to that directory on that brick. In this example, 0x00000000 <= 0x099d1b6f <= 0x3ffffffe, so the file belongs on brick b.
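To make that range check concrete, here's a short Python sketch (my own helper names, not GlusterFS source) that parses the last two uint32s out of those xattr values and picks the brick whose range contains our hash:

```python
def parse_dht_range(xattr_hex):
    """Return (start, end) from a trusted.glusterfs.dht hex value."""
    raw = xattr_hex[2:] if xattr_hex.startswith("0x") else xattr_hex
    # The last 16 hex digits are two uint32s: range start, then range end.
    return int(raw[-16:-8], 16), int(raw[-8:], 16)

def find_brick(file_hash, ranges):
    """ranges: {brick_name: xattr_hex}; return the brick owning file_hash."""
    for brick, xattr in ranges.items():
        start, end = parse_dht_range(xattr)
        if start <= file_hash <= end:
            return brick
    return None

# The xattr values from the getfattr output above.
ranges = {
    "a": "0x0000000100000000bffffffdffffffff",
    "b": "0x0000000100000000000000003ffffffe",
    "c": "0x00000001000000003fffffff7ffffffd",
    "d": "0x00000001000000007ffffffebffffffc",
}
print(find_brick(0x099d1b6f, ranges))  # "b": 0x00000000 <= hash <= 0x3ffffffe
```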
Now the lookup is sent to brick b. If the file is there, great. That was pretty quick and efficient.
If the file’s not there, hopefully there’s a file there with the same filename: zero bytes, mode 1000, with the extended attribute “trusted.glusterfs.dht.linkto”. This is what we call the sticky-pointer, or more correctly the dht link pointer. It tells the distribute translator, “yes, the file should be here based on its hash, but it’s actually at…”. This happens, for instance, when a file is renamed. Rather than use a bunch of network resources moving the file, a pointer is created where the new filename hashes out to, pointing to where the file actually is. Two network calls, no big deal.
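Here's a rough sketch of how a client could recognize one of those link pointers (this is my illustration, not the actual GlusterFS check): zero bytes, a mode of exactly 1000 (just the sticky bit, no permission bits), and the linkto xattr naming the real subvolume.

```python
import stat

def looks_like_linkfile(size, mode, xattrs):
    """Heuristic check for a dht link pointer, given stat and xattr data."""
    return (size == 0
            and stat.S_IMODE(mode) == 0o1000   # ---------T: sticky bit only
            and "trusted.glusterfs.dht.linkto" in xattrs)

# A regular file (0o100000) with mode 1000 and the linkto xattr set.
print(looks_like_linkfile(0, 0o101000,
                          {"trusted.glusterfs.dht.linkto": b"vol-client-2"}))
```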
If, however, the file doesn’t exist there at all, the client calls dht_lookup_everywhere. As you might suspect from the name, this sends a lookup to each distribute subvolume. In my little 4×3 volume, that means 4 lookups out of distribute, and 3 lookups each out of replicate, for a total of 12 lookups. These are done essentially in parallel (the serialized network connection prevents true parallelism), but that’s still a lot of overhead.
If your application frequently looks for files that don’t exist, this adds a lot of wasted lookups, as the client queries every distribute subvolume every time a file doesn’t exist. If this is, for instance, your average PHP app, there’s commonly a long include path that gets searched for each of 1000 includes. It’s not uncommon for 30000 non-existent files to be referenced for a single page load.
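The back-of-the-envelope arithmetic looks like this. The 4×3 volume and the per-page numbers come from the text above; the 30-entry include path is an assumed figure that yields the 30,000 misses mentioned.

```python
distribute_subvolumes = 4        # 4-way distribute
replica_count = 3                # 3-way replicate under each subvolume
lookups_per_miss = distribute_subvolumes * replica_count  # dht_lookup_everywhere fan-out

include_path_entries = 30        # assumed include_path length (illustrative)
includes_per_page = 1000
missed_opens = include_path_entries * includes_per_page   # stats that find nothing

print(lookups_per_miss)                 # 12
print(missed_opens)                     # 30000
print(missed_opens * lookups_per_miss)  # 360000 network lookups per page load
```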
The Gluster developers are working on mitigating that. Jeff Darcy created a sample Python plugin translator that caches negative lookups, saving all those wasted queries by replying that the file wasn’t there a second ago, so it’s still not there.
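The core idea can be sketched in a few lines of Python. This is a toy illustration of a negative-lookup cache, not Jeff Darcy's actual translator: remember names whose lookup failed, and for a short window answer "still not there" without touching the network at all.

```python
import time

class NegativeCache:
    def __init__(self, ttl=1.0):
        self.ttl = ttl
        self.misses = {}                      # name -> time of failed lookup

    def lookup(self, name, real_lookup):
        stamp = self.misses.get(name)
        if stamp is not None and time.time() - stamp < self.ttl:
            return None                       # cached miss: it wasn't there a
                                              # second ago, so it's still not there
        result = real_lookup(name)            # the expensive cluster-wide lookup
        if result is None:
            self.misses[name] = time.time()   # remember the miss
        else:
            self.misses.pop(name, None)       # file appeared; drop stale entry
        return result
```

Wrapped around something like dht_lookup_everywhere, the second and later misses for the same name cost nothing until the entry expires.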