[From an email to the gluster-devel mailinglist] today I have merged support for GlusterFS 3.2 and 3.3 into one Wireshark ‘dissector’. The packages with date 20120516 in the version support both the current stable 3.2.x version, and the latest 3….
Over the last couple days, in #gluster, users have come in complaining that their application can’t open a file, but that if they try accessing the file from the shell as the same user, it works fine. This was reported with apache’s tomcat and mod_fcgid and courier imap.
My first thought on this, and it still would be, is SELinux. SELinux’s role is to prevent a process from doing anything it’s not expected to do. It can leave some applications unable to even access a file that every other test proves should work. Always check this first if you’re experiencing unexpected access issues.
But in this case, it turned out to be the application itself. The users were running 32-bit apps on 64-bit platforms. As it turns out, the applications were tracking the inode numbers of files. They would call stat(), which would return a 64-bit inode. The apache programs would copy the results of that stat call into their own structure. If the apps were built on 32-bit platforms, apache’s struct would have a 32-bit field. The 64-bit result wouldn’t fit. Apache tested for that and would error out with the ambiguous error message:
Syntax error on line ## of {filename}
Wrapper {filename} cannot be accessed: (70008)Partial results are valid but processing is incomplete
What it really meant was that the 64 bit inode overflowed the 32 bit field it allocated for storing it.
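The overflow described above can be sketched in a few lines of C. This is only an illustration of the range check, not Apache's actual code; the helper name inode_fits_32 is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from Apache or APR): try to store a
 * 64-bit inode number in a 32-bit field. Returns false, leaving *out
 * untouched, when the value would not fit.
 */
static bool inode_fits_32(uint64_t ino, uint32_t *out)
{
    if (ino > UINT32_MAX) {
        return false;           /* would overflow the 32-bit field */
    }
    *out = (uint32_t)ino;       /* safe: value is within range */
    return true;
}
```

A caller that gets false back has the same choice the 32-bit applications had: report an error, or carry on with an incomplete result.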
To identify whether this is the problem, run
file $FILENAME
where $FILENAME is the binary that’s producing the error. If it contains “32-bit” then that’s a pretty good indication that this might be a problem.
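For ELF binaries, the “32-bit” that file reports comes from the class byte in the ELF header (the fifth byte, EI_CLASS: 1 for 32-bit, 2 for 64-bit). A minimal sketch of the same check on a buffer holding the first bytes of the binary follows; the function name elf_bits is hypothetical, and this covers only ELF, not other executable formats.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the first bytes of a file, return 32 or 64
 * if it is an ELF image of that class, or 0 if it is not ELF or the
 * class byte is unrecognised.
 */
static int elf_bits(const unsigned char *hdr, size_t len)
{
    /* ELF magic is 0x7f 'E' 'L' 'F'; the literal is split so that the
       hex escape does not swallow the following 'E'. */
    if (len < 5 || memcmp(hdr, "\x7f" "ELF", 4) != 0) {
        return 0;
    }
    switch (hdr[4]) {           /* EI_CLASS */
    case 1:  return 32;         /* ELFCLASS32 */
    case 2:  return 64;         /* ELFCLASS64 */
    default: return 0;
    }
}
```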
The best solution, of course, is to use 64 bit applications on your 64 bit clients. This wasn’t possible for this user.
To solve this problem, we configured the volume to enable the 32-bit inode workaround:
gluster volume set $VOLUME nfs.enable-ino32 on
Then we mounted via nfs instead of using the fuse client.
This passed 32 bit inode translations to the application, eliminating the overflow. This worked for the apache programs, but not the 32 bit courier imap. GlusterFS 3.2 doesn’t support nfs locks. Since courier requires those, this wouldn’t work.
Redundancy was maintained by installing the server package for GlusterFS on the client, starting glusterd, and adding the client to the peer group for the volume. This starts the nfs daemon on the client, allowing the client to do an nfs mount from localhost. The nfs daemon then handles connecting to the brick servers, maintaining redundancy.
My standing search for “glusterfs” on Twitter got me into an interesting discussion with Dr. Shawn Tan about an interesting GlusterFS configuration for several workstations. At first my reaction was panic, because I could see potential for data loss in that configuration, but “you’re using it wrong” is rarely a productive response from a developer. […]
As part of a continuing effort to broaden the potential uses of GlusterFS, the license for (most of) GlusterFS has been changed as follows (some details might vary between files). /* Copyright (c) 2008-2012 Red Hat, Inc. <http://www.redhat.com> This file is part of GlusterFS. This file is licensed to you under your choice of […]
One of the key features of GlusterFS, or any horizontally scalable system like it, is the ability to rebalance data as servers are added, removed, etc. How is that done? Come to think of it, what does “balance” even mean in such a system, and why is it so important to have it? Intuitively, balance […]
In a very unscientific test, I was curious about how much of an effect GlusterFS’ self-heal check has on lstat. I wrote probably the first C program I’ve written in 20 years to find out.
I looped lstat calls for 60 seconds against three targets: my local disk, which is not the same type or speed as my bricks (although it shouldn’t matter as this should all be handled in cache anyway); a raw image from within a KVM instance; and a file on a fuse mounted gluster volume. This was the result:
Iterations | Calculated Latency | Store |
90330916 | 0.66 microseconds | Local |
56497255 | 1.06 microseconds | Raw VM Image |
32860989 | 1.83 microseconds | GlusterFS |
Again, this is probably the worst test I could do, it’s not at all scientific, has way too many differences in the tests, is performed on a replica 3 volume with a replica down, is run on 3.1.7 (for which afr should perform the same as 3.2.6) and is just overall a waste of blog space, imho, but who knows. Someone else might at least get inspired to do a real test.
As you can see, it’s pretty significant. An almost 64% drop in iterations (roughly 2.8 times the per-call latency) for this dumb test over local which, really, should be expected considering we’re adding network latency on top of everything, but the 41% drop from VM Image to GlusterFS mount probably represents the latency hit for the self-heal checks a smidgeon more accurately.
Here’s the C source:
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    struct stat sb;
    time_t seconds;
    uint64_t count;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Fail early if the path cannot be stat'ed at all. */
    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    /* Count how many lstat() calls complete in 60 seconds. */
    seconds = time(NULL);
    count = 0;
    while ( seconds + 60 > time(NULL) ) {
        lstat(argv[1], &sb);
        count++;
    }

    /* PRIu64 is the portable printf format for uint64_t. */
    fprintf(stdout, "Performed %" PRIu64 " lstat() calls in 60 seconds.\n", count);
    return EXIT_SUCCESS;
}
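The latency column and the percentages quoted above follow directly from the iteration counts. A small sketch of that arithmetic, assuming a 60 second run as in the test; the helper names latency_us and percent_drop are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: average time per call in microseconds, given how
   many calls completed in the given number of seconds. */
static double latency_us(uint64_t iterations, double seconds)
{
    return seconds * 1e6 / (double)iterations;
}

/* Hypothetical helper: percentage drop in iterations from a baseline
   count to a slower count. */
static double percent_drop(uint64_t baseline, uint64_t slower)
{
    return 100.0 * (1.0 - (double)slower / (double)baseline);
}
```

For example, latency_us(90330916, 60.0) gives roughly 0.66, and percent_drop(56497255, 32860989) gives a little under 42.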
My first time playing with heroku was very cool, but mystifying – it wasn’t clear how or why it was that I needed to run “git init”, and why I was “pushing” code to heroku. As a java developer, I’m used to setting up a tomcat server, dropping a …
Vagrant can build, and destroy, your entire dev setup in a matter of minutes. It’s a powerful tool for achieving a cleanroom engineering deployment setup. Vagrant allows you to set up a personalized VM on any machine in a matter of minute…
More often than I would like, someone with twenty or more web servers servicing tens of thousands of page hits per hour comes into #gluster asking how to get the highest performance out of their storage system. They’ve only just now come to the realiza…
Your code is only as good as its worst library. The lamest thing in the world is getting a “NoSuchMethodException” because you deploy an executable which puts the wrong version of the right libraries on the classpath… Or alternatively, becaus…
In case anyone’s interested, I have some puppet modules I’ve created.
Almost every other puppet module out there is ubuntu-centric and there are very few geared toward RHEL/CentOS/Fedora. Mine are, though I’ve tried to add structure to allow other dist…
Since GlusterFS is fuse based, it can be mounted as a standard user without too much difficulty.
On a server:
gluster volume set $VOLUME allow-insecure on
On the client as root:
echo user_allow_other >> /etc/fuse.conf
To mount the volume, you…
Frequently I have new users come into #gluster with their first ever GlusterFS volume being a stripe volume. Why? Because they’re sure that’s the right way to get better performance.
That ain’t necessarily so. The stripe translator was designed to allo…
I’ve been using other people’s maven repos for years. Emailing jars around, pushing them into drop boxes, checking out source code just to build binaries, etc. etc. etc…. And this was all AFTER maven existed. Why? Because I never realized HOW EA…
Every once in a while, hadoop goes totally haywire when I play with it in pseudo-distributed mode. Problems include: 1) Data not being replicated to nodes (i.e. you do a namenode format, and the data nodes are now out of sync). 2) No connection ava…
My recent work on High Speed Replication is not the only thing I’ve done to improve GlusterFS performance recently. In addition to that 2x improvement in synchronous/replicated write performance, here are some of the other changes in the pipeline. Patch 3005 is a more reliable way to make sure we use a local copy of […]
In my last post, I promised to talk a bit about some emergent properties of the current replication approach (AFR), and some directions for the future. The biggest issue is latency. If you will recall, there are five basic steps to doing a write (or other modifying operation): lock, increment changelog, write, decrement changelog, unlock. […]
When you’re editing files on the fly, you need a tool like VIM. I use VIM almost exclusively for clojure and python. However, to really be efficient, you can’t rely on the arrow keys -> you need to know the shortcuts. So here are my favorites: Inside…
Okay so… in the last post, I tried to build a non-trivial map/r from scratch, and ran it on my machine. I ran into some issues involving the “glue” that held my map/reduce jobs together. For example, configuring the classes declared…
Today I’m writing a new map/r job, from scratch, trying to minimally copy code from other jobs. The principles behind hadoop are simple: you separately map records into key->value[] arrays, and then you convert those keys to integers, and dis…