Technical FAQ
From GlusterDocumentation
What is a brick?
Each storage node in the cluster is called a "brick".
What is GlusterFS scheduler?
The GlusterFS scheduler handles load balancing and high availability in clustered mode. You select the scheduler of your choice in your "unify" volume. Check this link for more information about the types of schedulers, their options, and the benefits of using each.
What interconnects (transport interfaces) are supported?
- ib-verbs: Uses Infiniband verbs layer for RDMA communication. Fastest interface.
- ib-sdp: Uses Infiniband SDP (sockets direct protocol) for RDMA communication.
- tcp: Uses regular TCP/IP or IPoIB interconnect.
Do I need to LD_PRELOAD libsdp.so for Infiniband SDP transport?
No, you should not. The GlusterFS ib-sdp transport talks directly to the SDP driver.
Does 'ls' query all servers in parallel to deliver the logical summation of the namespace?
No. The unify translator currently supports a namespace child, from which it gets the directory information (dirent structures).
'ls' performance
Q: How many files per second can you stat with "ls" and how many stats per second if I am querying on last-modified state? This relates to how a client finds a particular file.
A: I will explain three possible cases and how they are optimized. I will also send you benchmark results soon.
# "ls"
"ls" issues an opendir, multiple readdir calls, and a closedir. These are issued in parallel, with the multiple readdir calls handled in a single fetch operation. There is no last-modified state. A plain "ls" does not trigger a stat call. It should be fast because it is served only by the namespace child.
# "ls -l"
The "-l" option triggers a stat call for each file in the directory on each storage brick, which has a clear performance cost. However, the query goes to all bricks only on the first call. From the second call onwards, the stat call is forwarded only to the nodes where the file exists.
# Finding a file
When an open call is issued, we already know the whereabouts of the file from the 'lookup()' fop, so the open call is forwarded only to the node where the file exists.
Where is the metadata stored?
There is no metadata server in GlusterFS. Metadata is handled by the underlying file system. There is also nothing central in GlusterFS: it is truly distributed and has no single point of failure.
Does each client maintain a cache of the summed namespace?
No, we do not maintain any client-side cache, because it would lead to severe cache coherency and scalability issues.
We had a "location-hint" facility (caching only previously accessed entries) with a very short lifetime, but we removed it when the "stat-prefetch" translator was introduced. We no longer have "stat-prefetch" either, because unify now supports the namespace feature. A namespace is maintained centrally and shared by all the clients.
What is a GlusterFS translator?
A translator is a very powerful mechanism provided by GlusterFS to extend its file system capabilities through a well-defined interface. The server-side and client-side translator interfaces are compatible, which means you can load the same translator on either side. Translators are binary shared objects (.so) loaded at run time based on the volume specification. In GlusterFS, even performance enhancements, extended features and debugging tools are implemented as translators.
The idea of translators is borrowed from the GNU/Hurd operating system. See also Hurd translators.
Which translators are available?
Refer to GlusterFS Translators v1.3 for a list of translators.
How do I know in which order the translators should be stacked?
Configure the translators according to your requirements. Among the cluster/* translators, unify comes at the top level. Among the performance translators, io-threads should be the bottom-most. Beyond this there is no specific recommendation.
What is volume specification?
Refer to GlusterFS Volume Specification for an explanation with an example.
How is locking handled?
File-level locking is handled in a distributed way across the bricks by the features/posix-locks translator. GlusterFS supports both fcntl() and flock() calls.
NOTE: Custom FUSE release doesn't support flock() calls. So, if you want support for flock() function, use GlusterFS patched FUSE, available from our download page.
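One way to check whether flock() is honoured on a given mount is to hold an exclusive lock and then attempt a second, non-blocking lock on the same file. The sketch below uses the flock(1) utility from util-linux. The LOCK_FILE path is a placeholder (it defaults to a temporary file here) and should be pointed at a file on your GlusterFS mount. Note that this exercises only a single client; to check locking across clients, the second attempt would have to run on another machine.

```shell
# Placeholder: set LOCK_FILE to a file on your GlusterFS mount.
# It defaults to a temporary file so the snippet runs anywhere.
LOCK_FILE=${LOCK_FILE:-$(mktemp)}

# Hold an exclusive flock() while the inner command runs. The inner command
# opens the file again and tries a second, non-blocking exclusive lock,
# which must be refused if flock() is enforced.
RESULT=$(flock -x "$LOCK_FILE" sh -c '
  if flock -n -x "$1" true; then
    echo "second lock acquired: flock() is not enforced"
  else
    echo "second lock refused: flock() is enforced"
  fi
' sh "$LOCK_FILE")

echo "$RESULT"
```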
Do clients communicate with each other?
No, clients do not communicate with each other.
Is GlusterFS like parallel NFS?
From the user's point of view, yes, but the internal design is very different. The NFS protocol is brain damaged, and it is hard to fix or improve. A parallel NFS client is in any case incompatible with the existing NFS protocol. That's why we moved ahead with a new file system implementation in GlusterFS. GlusterFS has superior features and performance compared with NFS. See this link for GlusterFS vs NFS.
Is GlusterFS compatible with NFS or SAMBA?
No. However, you can re-export a mounted GlusterFS volume through NFS or Samba (CIFS).
How do you re-export NFS on top of GlusterFS?
Export the mounted volume of GlusterFS just the same way you export any other directory. No special options for NFS or GlusterFS are required.
It seems that an fsid option (for example fsid=10) is needed to export GlusterFS storage using NFS. "10" can be replaced by any other two-digit number. There are also problems if your kernel uses the anticipatory I/O scheduler. If you get "write error: not owner", this could be the cause; use the normal I/O scheduler instead.
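As a minimal sketch of what such an export entry could look like, the snippet below builds an /etc/exports line with an explicit fsid. The mount point and client network are assumptions made for illustration, and the rw and sync options are ordinary NFS export options rather than anything GlusterFS requires.

```shell
# Hypothetical values: adjust to your own mount point and client network.
MOUNT_POINT=/mnt/glusterfs
CLIENTS=192.168.0.0/24
FSID=10   # must not collide with the fsid of any other export

# Build the /etc/exports entry with an explicit filesystem ID.
EXPORT_LINE="$MOUNT_POINT $CLIENTS(rw,sync,fsid=$FSID)"
echo "$EXPORT_LINE"

# As root, append the entry and reload the export table:
#   echo "$EXPORT_LINE" >> /etc/exports
#   exportfs -ra
```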
How do I remove a brick without losing data?
Q: One hard disk sounds like it might need to be replaced, and I may or may not be using replication. How do I decommission a hard drive and be confident that all my data exists elsewhere before removing the drive from the configuration? Is there something that tells GlusterFS to move files off a brick if it is not replicated?
A: As of now, if you are not using the AFR feature, GlusterFS won't have a backup when a hard drive goes bad. But if you are using the unify feature, new files won't be created with the same name.
What happens if a GlusterFS brick crashes?
You treat it like any other storage server. The underlying file system will run fsck and recover from the crash. With a journaling file system such as ext3 or XFS, recovery is much faster and safer. When the brick comes back, GlusterFS brings it up to date through its self-heal feature.
What about deletion self/auto healing?
With auto-healing (self-healing), only file creation is healed. If a brick is missing because of a disk crash, re-creating the files is fine. But if the brick was missing because of a temporary network problem, deletions must be synchronized as well.
Q: How do I heal/synchronize file deletion?
A: TODO
How can I increase storage reliability?
We typically recommend twelve 500 GB SATA-II (RAID or Enterprise edition) disks per server (RAID6 across 11 disks plus 1 hot spare) for the best reliability, performance and price. RAID6 can withstand 2 simultaneous disk failures. If performance and reliability matter more to you than price, SAS or Ultra320 SCSI is preferable. High-end SATA-II disks nowadays offer an MTBF close to that of SAS, but at 15k RPM SAS is unbeatable in performance, provided you are willing to pay a high price for a smaller capacity.
What happens in case of hardware or GlusterFS crash?
You don't risk any corruption. However, if the crash happens while your application is writing data, the data in transit may be lost. All file systems are vulnerable to such losses.
Metadata Storage - When using striping (unify), how does the file data get split?
Individual files are never split and stored across multiple bricks. Instead, the scheduling algorithm you specify determines which brick each file is stored on.
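To make the "whole file per brick" behaviour concrete, here is a minimal sketch of a round-robin placement policy. It is an illustration only and is not taken from the GlusterFS scheduler code; the brick names, file names and the schedule_rr helper are all made up for this example.

```shell
# Hypothetical brick names; a real unify volume lists its bricks as subvolumes.
BRICKS="brick1 brick2 brick3"
NEXT=0

# Choose the brick for a new file by cycling through the brick list.
# The whole file goes to the chosen brick; nothing is split.
schedule_rr() {
  set -- $BRICKS                 # positional parameters = brick list
  idx=$((NEXT % $# + 1))         # 1-based index of the next brick
  NEXT=$((NEXT + 1))
  eval "CHOSEN=\${$idx}"
}

PLACEMENT=""
for f in a.txt b.txt c.txt d.txt; do
  schedule_rr
  PLACEMENT="${PLACEMENT:+$PLACEMENT }$f:$CHOSEN"
  echo "$f -> $CHOSEN"
done
```

With three bricks, the fourth file wraps around to the first brick again. Other schedulers differ only in how they choose the brick, not in the fact that each file lands on exactly one brick.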
Metadata Storage - When using striping (unify), how/where is the metadata kept?
The metadata is stored on the namespace brick.
How to make GlusterFS secure?
As of now, GlusterFS supports only IP/port-based authentication. You specify a range of IP addresses, separately for clients and for management nodes, that are allowed access. The client-side port is always restricted to less than 1024 to ensure that only root can perform management operations, including mount/umount. New GNU TLS (secure certificate) based authentication is under development. We are also planning to implement an encryption translator in the upcoming release. Until then, you can tunnel GlusterFS connections through stunnel.
Here is one article about setting up Encrypted Network between client and server.
How do I mount/umount GlusterFS?
Refer to Mounting a GlusterFS Volume.
Do I need to synchronize UIDs and GIDs on all servers using GlusterFS?
No. Only client machines need to be synchronized, since access control is done on the client side.
Do I need to synchronize time on all servers using GlusterFS?
Yes. You can use an NTP (Network Time Protocol) client to do this.
Simple example:
/usr/sbin/ntpdate pool.ntp.org
How do I add a new node to an already running GlusterFS cluster?
You can add more bricks to your volume specification file and restart GlusterFS (re-mount). Its schedulers (such as alu) are designed to balance the file system data as you grow.
For releases after 1.3.0-pre5
Just add the extra node to unify's subvolumes list and restart GlusterFS. The directory structure is automatically replicated on the new server. The much-desired self-heal property of unify removes the burden of manually maintaining an identical directory structure on all the servers before mounting.
For releases before 1.3.0-pre5
- Preparing new bricks for addition:
GlusterFS requires that all servers' exported volumes have exactly the same skeleton directory structure. When starting a fresh cluster, GlusterFS ensures consistency. But when you add a new node, that node must contain the same structure as well. For now, recreating the directory structure on the new node has to be done out-of-band (manually). One of the easier ways to replicate the directory structure is to cpio it from one of the already running servers and extract it on the new server, so that permissions and ownerships are preserved. Sample commands:
Assume server1 is a node from the existing cluster, and server2 is the new node you want to add.
server1:~# cd /path/exported
server1:/path/exported# find . -type d | cpio -o > /tmp/skeleton.cpio
Now take /tmp/skeleton.cpio to server2 (the new server)
server2:~# cd /path/exported
server2:/path/exported# cat /tmp/skeleton.cpio | cpio -i
Now it is safe to add server2 to the unify section of the client spec files and remount. Make sure that no changes are made to the directory structure during the process, so that server2 ends up with exactly the same directory structure as server1.
Note: We are planning to add on-the-fly addition of storage bricks in our next release. The above steps will then be taken care of automatically.
How do I add a new AFR namespace brick to an already running cluster?
Loop mounting image files stored in glusterFS file system
To mount an image file stored on a GlusterFS file system, you have to disable direct-io in the GlusterFS mount. To do this, use the following command:
#glusterfs -f <your_spec_file> -d disable /<mount_path>
After that, you can use the GlusterFS file system mounted on /<mount_path> to store your images. With direct-io disabled, you can also use GlusterFS to store the virtual block devices of Xen virtual machines as files. Xen with live migration works fine using the option above.
How do you allow more than one IP in auth.ip?
Q: If you can only have one auth.ip line in a config, how do you allow 127.0.0.1 as well as a 192.168.* range?
A: Make your auth.ip.<volumename>.allow look like this:
option auth.ip.<volumename>.allow 127.0.0.1,192.168*
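The following sketch shows how a comma-separated allow list like the one above can be read, assuming shell-style wildcard matching (which is what the 192.168* form suggests). It is not GlusterFS code; the ip_allowed helper and the sample addresses are made up for illustration.

```shell
# Return 0 if address $1 matches any pattern in the comma-separated list $2.
ip_allowed() {
  addr=$1
  allowed=1            # shell convention: non-zero means "no match"
  old_ifs=$IFS
  IFS=,                # split the allow list on commas
  set -f               # keep "*" from being expanded against filenames
  for pattern in $2; do
    case $addr in
      $pattern) allowed=0; break ;;
    esac
  done
  set +f
  IFS=$old_ifs
  return $allowed
}

ALLOW='127.0.0.1,192.168*'
for ip in 127.0.0.1 192.168.4.20 10.0.0.5; do
  if ip_allowed "$ip" "$ALLOW"; then
    echo "$ip allowed"
  else
    echo "$ip denied"
  fi
done
```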
Stripe behavior not working as expected
Q: Striping doesn't work well. I made a file of 4MB with 'option block-size 2MB', but on my two servers the file is distributed like this:
PC1: file = 2MB
PC2: file = 4MB
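A: Check the file sizes with 'du' rather than 'ls'. The striped files contain holes (they are sparse), and 'ls' does not account for holes: it reports the apparent size, not the space actually used.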
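The effect can be reproduced on any local file system that supports sparse files. The sketch below writes 2 MB of data at an offset of 2 MB, roughly what the second stripe server sees for a 4 MB file with a 2 MB block size, and then compares the apparent size with the allocated size. The scratch file and variable names are arbitrary.

```shell
# Scratch file on a local file system (assumed to support sparse files).
F=$(mktemp)

# Write 2 MB of real data at offset 2 MB, leaving a 2 MB hole at the start.
dd if=/dev/zero of="$F" bs=1M count=2 seek=2 2>/dev/null

APPARENT=$(wc -c < "$F")          # size as "ls -l" shows it, hole included
USED_KB=$(du -k "$F" | cut -f1)   # space actually allocated, as "du" shows it

echo "apparent size: $APPARENT bytes"
echo "disk usage:    $USED_KB KB"
rm -f "$F"
```

The apparent size is the full 4 MB, while the disk usage covers only the blocks that were actually written.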
Duplicate volume name specification
Q: Is it possible to use the same brick name several times in the same glusterfs-server.vol like in the example below?
volume brick
  type storage/posix
  option directory /dfslarge
end-volume

volume brick
  type storage/posix
  option directory /dfssmall
end-volume
A: No, each volume name must be unique within the volume specification file.

