Version 3.3 introduced a new structure to the bricks, the .glusterfs directory. So what is it?
The GFID
As you’re probably aware, GlusterFS stores its metadata in extended attributes. One of these bits of metadata is the “trusted.gfid” attribute. This is, for a…
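The excerpt is truncated here; for a quick look at that attribute, it can be read straight off a brick with getfattr (the file path below is only a placeholder):

getfattr -n trusted.gfid -e hex /path/to/brick/some/file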
This sets up a GlusterFS Unified File and Object (UFO) server on a single node (single brick) Gluster server using the RPMs contained in my YUM repo at http://repos.fedorapeople.org/repos/kkeithle/glusterfs/. This repo contains RPMs for Fedora 16, Fedora 17, and RHEL 6. Alternatively you may use the glusterfs-3.4.0beta1 RPMs from the GlusterFS YUM repo at http://download.gluster.org/pub/gluster/glusterfs/qa-releases/3.4.0beta1/ …
GlusterFS spreads load using a distribute hash translation (DHT) of filenames to its subvolumes. Those subvolumes are usually replicated to provide fault tolerance as well as some load handling. The advanced file replication translator (AFR) departs f…
On Sunday, March 18th, Fan Yong committed a patch against ext4 to “return 32/64-bit dir name hash according to usage type”. Prior to that, ext2/3/4 would return a 32-bit hash value from telldir()/seekdir() as NFSv2 wasn’t designed to accommodate anything…
A GlusterFS user from IRC asked me about my puppet management of KVM in RHEL/CentOS and how it works. I started to write this post two weeks ago and had to stop because although it works great, I figured that wasn’t the answer he was looking for. I loo…
With the addition of automated self-heal in GlusterFS 3.3, a new hidden directory structure was added to each brick: “.glusterfs”. This complicates split-brain resolution as you now not only have to remove the “bad” file from the brick, but its counte…
Starting with GlusterFS 3.3, one change has been the check to see whether a directory (or any of its ancestors) is already part of a volume. This is causing many support questions in #gluster.
This was implemented because if you remove a brick from a volum…
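The workaround that usually comes up in #gluster (a sketch, assuming you really do intend to reuse the path as a brick; $brick is a placeholder for the brick directory) is to clear the marker attributes and the .glusterfs directory before re-adding it:

setfattr -x trusted.glusterfs.volume-id $brick
setfattr -x trusted.gfid $brick
rm -rf $brick/.glusterfs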
The release of GlusterFS 3.3.0 by the Gluster Community marks a major milestone in Clustered File Storage. GlusterFS is the leading open source solution for the dramatically increasing volume of unstructured data. It is a software-only, highly available, scale-out, centrally managed storage pool that can be backed by any POSIX filesystem that supports extended attributes, such as Ext3/4, XFS, BTRFS and many more.
As an example of Red Hat’s goal of building strong, independent, open source communities, GlusterFS 3.3.0 marks the first release as an “upstream” project with its own release schedule. This release addresses many of the most commonly requested features including proactive self-healing, quorum enforcement, and granular locking for self-healing, as well as many more bug fixes and enhancements.
Some of the more noteworthy features include:
Visit Gluster.org to download. Packages are available for most distributions, including Ubuntu, Debian, Fedora, RHEL, and CentOS.
Get involved! Join us on #gluster on freenode, join our mailing list, ‘like’ our Facebook page, follow us on twitter, or check out our LinkedIn group.
GlusterFS is an open source project sponsored by Red Hat®, who uses it in its line of Red Hat Storage products.
Over the last couple of days in #gluster, users have come in complaining that their application can’t open a file, but that if they try accessing the file from the shell as the same user, it works fine. This was reported with Apache’s Tomcat and mod_fcgid, and with Courier IMAP.
My first thought on this, and it still would be, is SELinux. SELinux’s role is to prevent things from doing what they’re not expected to do, and it can make an application unable to access a file that every other test says it should be able to reach. Always check this first if you’re experiencing unexpected access issues.
But in this case, it turned out to be the application itself. The users were running 32-bit apps on 64-bit platforms. As it turns out, the applications were tracking the inode numbers of files: they would call stat(), which returns a 64-bit inode, and copy the result of that stat call into their own structure. Because the apps were built for 32-bit platforms, Apache’s struct only had a 32-bit field, so the 64-bit result wouldn’t fit. Apache tested for that and would error out with the ambiguous error message:
Syntax error on line ## of {filename}
Wrapper {filename} cannot be accessed: (70008)Partial results are valid but processing is incomplete
What it really meant was that the 64-bit inode overflowed the 32-bit field allocated for storing it.
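As an illustration of the failure mode (a sketch only, not Apache’s actual code; the struct and field names are made up), copying a 64-bit st_ino into a 32-bit field silently truncates it:

#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical application structure with a 32-bit inode field,
 * as an app built for a 32-bit ABI might carry. Not Apache's code. */
struct app_fileinfo {
    uint32_t inode;
};

int main(int argc, char *argv[])
{
    struct stat sb;
    struct app_fileinfo fi;

    if (argc != 2 || stat(argv[1], &sb) == -1) {
        perror("stat");
        return 1;
    }

    /* The 64-bit st_ino gets truncated to 32 bits here. */
    fi.inode = (uint32_t) sb.st_ino;

    printf("st_ino = %llu, stored as %u (%s)\n",
           (unsigned long long) sb.st_ino, fi.inode,
           (uint64_t) fi.inode == (uint64_t) sb.st_ino ? "fits" : "overflowed");
    return 0;
}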
To identify whether this is the problem, run
file $FILENAME
where $FILENAME is the binary that’s producing the error. If the output contains “32-bit”, that’s a pretty good indication this might be the problem.
The best solution, of course, is to use 64 bit applications on your 64 bit clients. This wasn’t possible for this user.
To work around it, we enabled the 32-bit inode translation option on the volume:
gluster volume set $VOLUME nfs.enable-ino32 on
Then we mounted via NFS instead of using the FUSE client.
This presented 32-bit inodes to the application, eliminating the overflow. It worked for the Apache programs, but not for the 32-bit Courier IMAP: GlusterFS 3.2 doesn’t support NFS locks, and since Courier requires them, this approach wouldn’t work there.
Redundancy was maintained by installing the GlusterFS server package on the client, starting glusterd, and adding the client to the peer group for the volume. This starts the NFS daemon on the client, allowing it to do an NFS mount from localhost. That NFS daemon then handles connecting to the brick servers, maintaining redundancy.
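A rough sketch of that setup (package, service, and option names are assumptions for a RHEL/CentOS-style install; $CLIENT, $VOLUME, and the mount point are placeholders):

# on the client
yum install glusterfs-server
service glusterd start

# from a server already in the pool
gluster peer probe $CLIENT

# back on the client, mount the volume from its own NFS server
mount -t nfs -o vers=3 localhost:/$VOLUME /mnt/$VOLUME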
In a very unscientific test, I was curious about how much of an effect GlusterFS’ self-heal check has on lstat. I wrote probably the first C program I’ve written in 20 years to find out.
I looped lstat() calls for 60 seconds against three targets: a file on my local disk (which is not the same type or speed as my bricks, although that shouldn’t matter since this should all be handled in cache anyway), a raw image from within a KVM instance, and a file on a FUSE-mounted Gluster volume. This was the result:
Iterations | Calculated Latency | Store
90330916   | 0.66 microseconds  | Local
56497255   | 1.06 microseconds  | Raw VM Image
32860989   | 1.83 microseconds  | GlusterFS
Again, this is probably the worst test I could do: it’s not at all scientific, has way too many differences between the tests, was performed on a replica 3 volume with one replica down, was run on 3.1.7 (for which AFR should perform the same as in 3.2.6), and is just overall a waste of blog space, imho. But who knows, someone else might at least get inspired to do a real test.
As you can see, it’s pretty significant. An almost 64% latency hit for this dumb test over local is really to be expected, considering we’re adding network latency on top of everything, but the 41% drop from the VM image to the GlusterFS mount probably represents the latency hit from the self-heal checks a smidgen more accurately.
Here’s the C source:
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    struct stat sb;
    time_t seconds;
    uint64_t count;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("stat");
        exit(EXIT_FAILURE);
    }

    seconds = time(NULL);
    count = 0;
    /* Hammer lstat() on the same path for 60 seconds and count the calls. */
    while ( seconds + 60 > time(NULL) ) {
        lstat(argv[1], &sb);
        count++;
    }

    fprintf(stdout, "Performed %" PRIu64 " lstat() calls in 60 seconds.\n", count);
    return EXIT_SUCCESS;
}
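To try it yourself, compile it and point it at a file on each store (the output name and test paths below are just examples):

gcc -o lstat-loop lstat-loop.c
./lstat-loop /tmp/localfile
./lstat-loop /mnt/gluster/testfile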
More often than I would like, someone with twenty or more web servers servicing tens of thousands of page hits per hour comes into #gluster asking how to get the highest performance out of their storage system. They’ve only just now come to the realiza…
Since GlusterFS is FUSE-based, it can be mounted by a standard user without too much difficulty.
On a server:
gluster volume set $VOLUME allow-insecure on
On the client as root:
echo user_allow_other >> /etc/fuse.conf
To mount the volume, you…
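The excerpt cuts off here; for illustration only (the server name, volume name, and mount point are placeholders, and this assumes the user can access /dev/fuse), a non-root mount generally takes a form like:

glusterfs --volfile-server=$SERVER --volfile-id=$VOLUME $HOME/mnt/$VOLUME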
Frequently I have new users come into #gluster with their first ever GlusterFS volume being a stripe volume. Why? Because they’re sure that’s the right way to get better performance.
That ain’t necessarily so. The stripe translator was designed to allo…
Nixpanic has created a Wireshark decoder for GlusterFS/Red Hat Storage. This should help immensely in debugging and tuning!
This is a quick and dirty script I threw together to list files with dirty flags from a GlusterFS brick.
#!/usr/bin/env python
#
# (C) 2011, Joe Julian
#
# License: GPLv2 http://www.gnu.org/licenses/gpl-2.0.html
#
import os,socket,xattr,sys,time
from sta…
One of the questions that I come across in IRC and other places often is how to obtain a list of gluster volumes that can be mounted from a client machine. NFS provides showmount which helps in figuring out the list of exports from a server amongst other things. GlusterFS currently does not have an […]
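The excerpt ends here, but one possibility (an assumption on my part, not necessarily the approach the post goes on to describe; $SERVER is a placeholder) is to point the gluster CLI at a remote server and list its volumes:

gluster --remote-host=$SERVER volume info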