GlusterFS Translators v1.3
Performance translators
All of the performance translators should work fine when loaded on both the client side and server side. Keep in mind that 'io-threads' behaves as expected when loaded below all the performance translators.
Read Ahead Translator
read-ahead pre-fetches a sequence of blocks in advance based on its predictions. While your application is busy crunching the data it has read, glusterfs can pre-read the next batch of data and keep it ready, so consecutive reads are faster. It also behaves as a read-aggregator, i.e., smaller I/O read operations are combined internally into fewer, larger read operations to reduce network and disk load. page-size sets the block size and page-count sets the number of blocks to pre-fetch.
volume readahead
  type performance/read-ahead
  option page-size 128kB # 256KB is the default option
  option page-count 4 # 2 is default option
  option force-atime-update off # default is off
  subvolumes <x>
end-volume
NOTE: This translator is most useful with the IB-verbs transport. With FastEthernet and GigE interfaces, one can reach the link maximum even without read-ahead.
Write Behind Translator
In general, write operations are slower than reads. The write-behind translator improves write performance significantly by using an "aggregated background write" technique: multiple smaller write operations are aggregated into fewer, larger write operations and written in the background (non-blocking). On the client side, write-behind aggregates small writes into larger ones, reducing the network packet count. On the server side, it helps when writes arrive in very small chunks by reducing disk head seek time.
aggregate-size determines the block size up to which write data is aggregated. Tune this value according to your interconnect, RAM size and workload profile. The default value of 128KB works well for most users. Increasing or decreasing this value beyond certain ranges will reduce performance, so always benchmark with an increasing range of aggregate-size values and analyze the results to choose an optimal one.
The flush-behind option improves performance when handling lots of small files. With this option, close()/flush() can be pushed to the background, allowing the client to process the next request. It is off by default.
volume writebehind
  type performance/write-behind
  option aggregate-size 1MB # default is 0bytes
  option flush-behind on # default is 'off'
  subvolumes <x>
end-volume
Note: Currently the protocol translator can transfer at most 4MB of data in one request/reply packet. Hence if you use write-behind on the client side (as in most cases) with an aggregate-size greater than 4MB, it will fail to send the larger packet.
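The aggregation idea can be sketched in a few lines of Python. This is a toy model and not GlusterFS code: the `WriteAggregator` class, its `flush_fn` callback and the `MAX_PACKET` constant are hypothetical names chosen for illustration; only the aggregate-size concept and the 4MB limit come from the text above. Unlike a real implementation, the sketch does not split a buffer that overshoots the aggregate size.

```python
MAX_PACKET = 4 * 1024 * 1024  # 4MB per-packet limit mentioned in the note above


class WriteAggregator:
    """Toy model: buffer small writes and flush them as one larger write."""

    def __init__(self, aggregate_size, flush_fn):
        if aggregate_size > MAX_PACKET:
            raise ValueError("aggregate-size must not exceed the 4MB packet limit")
        self.aggregate_size = aggregate_size
        self.flush_fn = flush_fn          # called with one bytes object per flush
        self.buffer = bytearray()

    def write(self, data):
        self.buffer.extend(data)
        # Flush once enough data has accumulated.
        if len(self.buffer) >= self.aggregate_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(bytes(self.buffer))
            self.buffer.clear()


sent = []                                 # stands in for the network
wb = WriteAggregator(aggregate_size=1024, flush_fn=sent.append)
for _ in range(10):
    wb.write(b"x" * 300)                  # ten small 300-byte writes
wb.flush()                                # final flush, as on close()
```

Here ten small writes reach `sent` as only a few larger chunks, which is the effect write-behind aims for on the wire.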
Threaded I/O Translator
AIO adds asynchronous (background) read and write functionality. By loading this translator, you can utilize the server idle blocked time to handle new incoming requests. CPU, memory or network is not utilized when the server is blocked on read or write calls while DMA'ing disk. This translator makes best use of all the resources under load and improves concurrent I/O performance.
volume iothreads
  type performance/io-threads
  option thread-count 4 # default is 1
  option cache-size 32MB #64MB
  subvolumes <x>
end-volume
NOTE:
- The io-threads translator is useful when loaded over unify, or just below the server protocol on the server side. It has no effect when loaded between unify and the namespace brick, as there is no file I/O over the namespace brick.
- It is advised to set 'thread-count' to a value less than or equal to the number of CPUs you have.
IO-Cache Translator
The IO-Cache translator reduces the load on the server (when loaded on the client side) if the client accesses some files only for reading and the files are not modified on the server between two reads. For example, header files are read repeatedly while compiling a kernel.
volume io-cache
  type performance/io-cache
  option cache-size 64MB # default is 32MB
  option page-size 1MB # 128KB is default option
  option priority *.h:3,*.html:2,*:1 # default is '*:0'
  option force-revalidate-timeout 2 # default is 1
  subvolumes <x>
end-volume
Booster Translator
As GlusterFS is a userspace filesystem that uses the FUSE module to get the fops, many users ask, "Is there no way to avoid the overhead caused by FUSE?" Though the overhead caused by FUSE is small compared to network overhead, avoiding it does help for large file I/O. Hence the Gluster team came up with the Booster translator as a way to avoid that overhead for file I/O. Using the booster translator, one can achieve higher throughput. It can be loaded on either the client or the server side.
NOTE: booster translator needs to have LD_PRELOADed "glusterfs-booster.so".
volume booster
  type performance/booster
  # option transport-type tcp # Default is 'unix', which is only used when booster is loaded on client side.
  # When used on server side, it takes all the options of client protocol and server protocol.
  subvolumes <x>
end-volume
NOTE: Booster is currently not advised for small files. Once it has been tested with small files, it will be recommended for them as well.
Clustering Translators
Automatic File Replication Translator (AFR)
AFR provides RAID-1 like functionality. AFR replicates files and directories across the subvolumes. Hence if AFR has four subvolumes, there will be four copies of all files and directories. AFR provides HA (high availability): if one of the subvolumes goes down (e.g., a server crash or a network disconnection), AFR will still service requests from the redundant copies.
AFR also provides self-healing functionality. When crashed servers come back up, the outdated files and directories are updated with the latest versions. AFR uses the extended attributes of the backend file system to track the versions of files and directories for the self-heal feature.
NOTE: The previously supported "option replicate *html:2,*txt:1" pattern matching feature has been moved out of AFR. Unify's switch.case scheduler can be used to implement this feature.
volume afr-example
  type cluster/afr
  subvolumes brick1 brick2 brick3
  # option debug on # turns on detailed debug messages in the log; debugging is off by default
  # option self-heal off # turns off self-healing; default is on
  # option read-subvolume brick2 # by default reads are scheduled from all subvolumes
end-volume
The above example spec file will replicate all directories and files on brick1, brick2 and brick3. The subvolumes can be another translator (storage/posix or protocol/client).
NOTE: AFR needs extended attribute support in the underlying FS.
Refer to Understanding AFR Translator to see more volume spec files, and understand the design.
Stripe Translator
The striping translator splits input files into blocks of the given block-size (default value is 128k) and distributes them across its subvolumes (or child nodes), depending on the pattern specified.
NOTE: Stripe needs extended attribute support in the underlying FS.
volume stripe
  type cluster/stripe
  option block-size *:1MB
  subvolumes brick1 brick2 brick3 brick4
end-volume
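To make the block-size idea concrete, here is a small Python sketch of how a byte offset could map to a subvolume. It assumes that consecutive blocks are distributed round-robin across the subvolumes, which this page does not state explicitly; `stripe_target` is a hypothetical helper, not part of GlusterFS.

```python
def stripe_target(offset, block_size, subvolumes):
    """Return (subvolume, block number) for a byte offset.

    Assumption: consecutive blocks go to consecutive subvolumes, round-robin.
    """
    block = offset // block_size
    return subvolumes[block % len(subvolumes)], block


BRICKS = ["brick1", "brick2", "brick3", "brick4"]
MB = 1024 * 1024

# With a 1MB block size, the second megabyte of a file lands on brick2.
second_block = stripe_target(1 * MB, MB, BRICKS)
```

Under this assumption a large sequential read touches all four bricks in turn, which is where striping gets its throughput.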
Unify Translator
Unify translator combines multiple storage bricks into one big fast storage server. For I/O scheduling, you can bind your preferred I/O scheduler module to the unify volume. You have a variety of I/O schedulers to pick from, based on your application requirements.
Refer to the Understanding Unify Translator page to learn more about the unify translator.
volume unify
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7 brick8
  option namespace brick-ns # should be a node which is not present in 'subvolumes'
  option scheduler rr # simple round-robin scheduler
end-volume
NOTE: From release '1.3.0-pre5' onwards the unify translator has an 'option namespace'. The namespace can be an empty export; unify's self-heal will rebuild it with the required data.
GlusterFS Schedulers
The scheduler decides how to distribute new file-creation operations across the clustered filesystem based on load, availability and other determining factors. Here is a list of I/O schedulers you can pick from.
ALU Scheduler
ALU stands for "Adaptive Least Usage". It is the most advanced scheduler available in GlusterFS. It balances the load across volumes, taking several factors into account. It adapts itself to changing I/O patterns, according to its configuration. When properly configured, it can eliminate the need for regular tuning of the filesystem to keep volume load nicely balanced.
The ALU scheduler is composed of multiple least-usage sub-schedulers. Each sub-scheduler keeps track of a certain type of load, for each of the subvolumes, getting the actual statistics from the subvolumes themselves. The sub-schedulers are these:
- disk-usage - the used and free disk space on the volume
- read-usage - the amount of reading done from this volume
- write-usage - the amount of writing done to this volume
- open-files-usage - the number of files currently opened from this volume
- disk-speed-usage - the speed at which the disks are spinning. This is a constant value and therefore not very useful.
The ALU scheduler needs to know which of these sub-schedulers to use, and in which order to evaluate them. This is done through the "option alu.order" configuration directive.
Each sub-scheduler needs to know two things: when to kick in (the entry-threshold), and how long to stay in control (the exit-threshold). For example: when unifying three disks of 100GB, keeping an exact balance of disk-usage is not necessary. Instead, there could be a 1GB margin, which can be used to nicely balance other factors, such as read-usage. The disk-usage scheduler can be told to kick in only when a certain threshold of discrepancy is passed, such as 1GB. When it assumes control under this condition, it will write all subsequent data to the least-used volume. If it is doing so, it is unwise to stop right after the values are below the entry-threshold again, since that would make it very likely that the situation will occur again very soon. Such a situation would cause the ALU to spend most of its time disk-usage scheduling, which is unfair to the other sub-schedulers. The exit-threshold therefore defines the amount of data that needs to be written to the least-used disk, before control is relinquished again.
In addition to the sub-schedulers, the ALU scheduler also has "limits" options. These can stop the creation of new files on a volume once values drop below a certain threshold. For example, setting "option alu.limits.min-free-disk 5GB" will stop the scheduling of files to volumes that have less than 5GB of free disk space, leaving the files on that disk some room to grow.
The actual values you assign to the thresholds for sub-schedulers and limits depend on your situation. If you have fast-growing files, you'll want to stop file-creation on a disk much earlier than when hardly any of your files are growing. If you care less about disk-usage balance than about read-usage balance, you'll want a bigger disk-usage scheduler entry-threshold and a smaller read-usage scheduler entry-threshold.
For thresholds defining a size, percentage of free space is allowed. For example: "option alu.limits.min-free-disk 5%".
ALU Scheduler Volume example:
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5
  option alu.read-only-subvolumes brick5 # This option makes brick5 read-only; no new files are created on it.
  option scheduler alu # use the ALU scheduler
  option alu.limits.min-free-disk 5% # Don't create files on a volume with less than 5% free disk space
  option alu.limits.max-open-files 10000 # Don't create files on a volume with more than 10000 files open
  # When deciding where to place a file, first look at the disk-usage, then at
  # read-usage, write-usage, open files, and finally the disk-speed-usage.
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
  option alu.disk-usage.entry-threshold 2GB # Kick in if the discrepancy in disk-usage between volumes is more than 2GB
  option alu.disk-usage.exit-threshold 60MB # Don't stop writing to the least-used volume until the discrepancy is 1988MB
  option alu.open-files-usage.entry-threshold 1024 # Kick in if the discrepancy in open files is 1024
  option alu.open-files-usage.exit-threshold 32 # Don't stop until 992 files have been written to the least-used volume
  # option alu.read-usage.entry-threshold 20% # Kick in when the read-usage discrepancy is 20%
  # option alu.read-usage.exit-threshold 4% # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
  # option alu.write-usage.entry-threshold 20% # Kick in when the write-usage discrepancy is 20%
  # option alu.write-usage.exit-threshold 4% # Don't stop until the discrepancy has been reduced to 16%
  # option alu.disk-speed-usage.entry-threshold # NEVER SET IT. SPEED IS CONSTANT!!!
  # option alu.disk-speed-usage.exit-threshold # NEVER SET IT. SPEED IS CONSTANT!!!
  option alu.stat-refresh.interval 10sec # Refresh the statistics used for decision-making every 10 seconds
  # option alu.stat-refresh.num-file-create 10 # Refresh the statistics used for decision-making after creating 10 files
end-volume
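The entry/exit threshold behaviour described above is a form of hysteresis, and a short Python sketch may make it easier to follow. This is a toy model under stated assumptions, not the ALU implementation: `SubScheduler` and `parse_size_mb` are hypothetical names, and the rule "relinquish control once the discrepancy has dropped to entry minus exit" is inferred from the comments in the example volume (2GB entry, 60MB exit, stop at 1988MB).

```python
def parse_size_mb(text):
    """Convert '2GB' or '60MB' to megabytes (only these two units are handled)."""
    units = {"GB": 1024, "MB": 1}
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    raise ValueError(f"unsupported size: {text}")


class SubScheduler:
    """Toy hysteresis model of one ALU sub-scheduler."""

    def __init__(self, entry_threshold, exit_threshold):
        self.entry = entry_threshold
        self.exit = exit_threshold
        self.active = False

    def update(self, discrepancy):
        """Feed the current discrepancy; return True while in control."""
        if not self.active and discrepancy > self.entry:
            self.active = True            # kick in
        elif self.active and discrepancy <= self.entry - self.exit:
            self.active = False           # relinquish control
        return self.active


# Values taken from the example volume: entry 2GB, exit 60MB.
disk_usage = SubScheduler(parse_size_mb("2GB"), parse_size_mb("60MB"))
stop_at_mb = disk_usage.entry - disk_usage.exit
```

Because the exit point sits below the entry point, the sub-scheduler does not flap on and off when the discrepancy hovers around the entry threshold.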
NUFA Scheduler
The Non-Uniform Filesystem (NUFA) scheduler is similar in spirit to the NUMA (http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access) memory design. It is mainly used in HPC environments where you are required to run the filesystem server and client within the same cluster. In such an environment, the NUFA scheduler gives the local system higher priority for file creation than other nodes.
volume posix1
  type storage/posix # POSIX FS translator
  option directory /home/export # Export this directory
end-volume

volume bricks
  type cluster/unify
  subvolumes posix1 brick2 brick3 brick4
  option scheduler nufa
  option nufa.local-volume-name posix1
  option nufa.limits.min-free-disk 5%
end-volume
NOTE: NUFA now supports more than one local volume.
Random Scheduler
Random scheduler randomly scatters file creation across storage bricks.
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler random
  option random.limits.min-free-disk 5%
end-volume
Round-Robin Scheduler
The Round-Robin (RR) scheduler creates files in a round-robin fashion. Each client has its own round-robin loop. When your files are mostly similar in size and I/O access pattern, this scheduler is a good choice. The RR scheduler now checks the free disk space of the server before scheduling, so you can tell when to add another server brick. The default value of min-free-disk is 5%, and it is checked every 10 seconds (by default) if any create call is happening.
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler rr
  option rr.read-only-subvolumes brick4 # No files will be created in 'brick4'
  option rr.limits.min-free-disk 5% # Unit in %
  option rr.refresh-interval 10 # Check server brick after 10s.
end-volume
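The placement logic described above can be sketched in Python as follows. This is a toy model, not the scheduler's source: `RoundRobinScheduler` and the `free_pct` mapping are hypothetical, and the periodic refresh of free-space statistics is left out.

```python
from itertools import cycle


class RoundRobinScheduler:
    """Toy model of round-robin file placement across bricks."""

    def __init__(self, subvolumes, read_only=(), min_free_pct=5):
        self.min_free_pct = min_free_pct
        # Read-only subvolumes never receive new files.
        self.candidates = [s for s in subvolumes if s not in read_only]
        self._ring = cycle(self.candidates)

    def schedule(self, free_pct):
        """Pick the next brick with enough free space.

        free_pct maps brick name -> free disk space in percent.
        """
        for _ in range(len(self.candidates)):
            brick = next(self._ring)
            if free_pct.get(brick, 0) >= self.min_free_pct:
                return brick
        raise RuntimeError("no brick has enough free disk space")


rr = RoundRobinScheduler(["brick1", "brick2", "brick3", "brick4"],
                         read_only=["brick4"])
free = {"brick1": 40, "brick2": 3, "brick3": 60, "brick4": 90}
picks = [rr.schedule(free) for _ in range(4)]
```

In this run brick2 is skipped because it is below the 5% limit and brick4 is never considered because it is read-only, so new files alternate between brick1 and brick3.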
Switch Scheduler
The Switch scheduler is the latest addition to the GlusterFS code base. It schedules files according to the filename patterns specified, as the example below shows.
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
  option scheduler switch
  option switch.case *jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6
  option switch.read-only-subvolumes brick7
end-volume
Above is an example of the unify translator with the switch scheduler. Files matching the pattern '*jpg' will be created in brick1 and brick2, files matching '*mpg' will be created in brick3, and all other files will be created in brick4, brick5 and brick6.
brick7 will be a read-only subvolume, from which data can be read, but no changes are permitted.
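The `switch.case` string above can be read as an ordered list of pattern-to-bricks rules. The Python sketch below parses it and looks up the candidate bricks for a filename. It assumes glob-style matching and that the first matching pattern wins; neither is confirmed by this page, and `parse_switch_case` and `bricks_for` are hypothetical helpers.

```python
from fnmatch import fnmatchcase


def parse_switch_case(spec):
    """Parse '*jpg:brick1,brick2;*mpg:brick3' into [(pattern, [bricks]), ...]."""
    rules = []
    for case in spec.split(";"):
        pattern, bricks = case.split(":")
        rules.append((pattern, bricks.split(",")))
    return rules


def bricks_for(filename, rules):
    """Return the candidate bricks of the first rule whose pattern matches."""
    for pattern, bricks in rules:
        if fnmatchcase(filename, pattern):
            return bricks
    return []


RULES = parse_switch_case("*jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6")
jpg_bricks = bricks_for("photo.jpg", RULES)
```

The catch-all `*` rule is listed last so that the more specific patterns are tried first.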
Debug translators
These translators are useful when you want to debug the filesystem.
trace
The trace translator produces extensive trace information for debugging purposes. The debug information is written to the GlusterFS log file, which by default is found in /var/log/gluster/glusterfs.log. A trace volume can be inserted or layered on top of any volume that needs to be debugged. All calls are logged with their arguments and values.
### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix # POSIX FS translator
  option directory /home/export # Export this directory
end-volume

### Trace storage/posix translator.
volume trace
  type debug/trace
  subvolumes brick
  # option include open,close,create,readdir,opendir,closedir
  # option exclude lookup,read,write
end-volume
NOTE: To trace only a few calls through the trace translator, use "option include <fopslist>". If most calls need to be traced and only a few are not required, use "option exclude <fopslist>".
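The include/exclude behaviour in the note above amounts to a simple membership test, sketched here in Python. `should_trace` is a hypothetical helper, and the rule that `include` takes precedence when both lists are given is an assumption made for the sketch.

```python
def should_trace(fop, include=None, exclude=None):
    """Decide whether a file operation (fop) is traced.

    include: if given, only these fops are traced.
    exclude: if given, every fop except these is traced.
    """
    if include is not None:
        return fop in include
    if exclude is not None:
        return fop not in exclude
    return True  # neither option set: trace everything


# Mirrors the commented-out options in the example volume above.
INCLUDE = {"open", "close", "create", "readdir", "opendir", "closedir"}
EXCLUDE = {"lookup", "read", "write"}
```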
Extra Features Translators
filter
Advanced filtering translator based on filenames and/or attributes. Currently it supports only a read-only export option.
volume brick-readonly
  type features/filter
  subvolumes brick
end-volume
posix-locks
This translator provides storage independent POSIX record locking support (fcntl locking). Typically you'll want to load this on the server side, just below the POSIX storage translator. Using this translator you can get both advisory locking and mandatory locking support.
volume locks
  type features/posix-locks
  subvolumes brick
end-volume
Note: Consider a file that does not have its mandatory locking bits (+setgid, -group execution) turned on. Assume that this file is now opened by a process on a client that has the write-behind xlator loaded. The write-behind xlator does not cache anything for files which have mandatory locking enabled, to avoid incoherence. Let's say that mandatory locking is now enabled on this file through another client. The former client will not know about this change, and write-behind may erroneously report a write as being successful when in fact it would fail due to the region it is writing to being locked.
There seems to be no easy way to fix this. To work around this problem, it is recommended that you never enable the mandatory bits on a file while it is open.
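The "mandatory locking bits" mentioned in the note (setgid set, group execute cleared) can be checked from a file's mode. The Python sketch below shows the bit test only; `mandatory_locking_enabled` is a hypothetical helper, and it says nothing about whether the underlying filesystem is mounted with mandatory locking support.

```python
import stat


def mandatory_locking_enabled(mode):
    """True if setgid is set and group-execute is clear in a file mode."""
    return bool(mode & stat.S_ISGID) and not (mode & stat.S_IXGRP)


# Example modes: 0o2644 has setgid without group execute; 0o2754 has both.
marked = mandatory_locking_enabled(0o2644)
```

For a real file, the mode would come from `os.stat(path).st_mode`.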
trash
This translator provides a 'libtrash'-like feature (some users may call it a recycle bin). It is best utilized when loaded on the server side.
volume trash
  type features/trash
  option trash-dir /.trashcan
  subvolumes brick
end-volume
fixed-id
This translator provides a feature where all the calls passing through this layer will be from a fixed UID and GID.
volume fixed
  type features/fixed-id
  option fixed-uid 1000
  option fixed-gid 100
  subvolumes brick
end-volume
Storage translators
posix
GlusterFS relies on disk filesystems (such as ext2, ext3, xfs, reiserfs, etc.) to handle block device management. The POSIX translator binds the GlusterFS server to the underlying file system.
volume posix1
  type storage/posix # POSIX FS translator
  option directory /home/export # Export this directory
end-volume
Protocol Translators
server
Server translator allows you to export volumes over the network. This translator implements transport modules for various interconnects.
### Add network serving capability to above brick.
volume server
  type protocol/server
  option transport-type tcp/server # For TCP/IP transport
  # option transport-type ib-sdp/server # For Infiniband transport
  # option transport-type ib-verbs/server # For Infiniband Verbs transport
  # option ib-verbs-work-request-recv-size 1048576 # Higher performance if it's equal to read-ahead size
  # option ib-verbs-work-request-recv-count 16
  # option ib-verbs-work-request-send-size 1048576 # Higher performance if it's equal to write-behind size
  # option ib-verbs-work-request-send-count 16
  # option bind-address 192.168.1.10 # Default is to listen on all interfaces
  # option listen-port 6996 # Default is 6996
  # option client-volume-filename /etc/glusterfs/glusterfs-client.vol
  subvolumes brick1 brick2
  option auth.ip.brick1.allow 192.168.* # Allow access to "brick1" volume
  option auth.ip.brick2.allow 192.168.* # Allow access to "brick2" volume
end-volume
Available transport modules for server protocol are:
- tcp/server: TCP/IP based interconnects.
- ib-sdp/server: Infiniband Sockets Direct Protocol transport interface.
- ib-verbs/server: Infiniband Verbs transport interface.
Authenticate modules
To allow multiple IP addresses or subnets, list the IP addresses one after the other, separated by commas, as shown below.
option auth.ip.brick1.allow 192.168.1.10,192.168.1.20,192.168.2.*
NOTE: Valid for versions above 1.3.7.
Because security is a growing need when storing data, and GlusterFS is a network filesystem, authenticating clients before they connect is very important. GlusterFS currently supports two authentication modules:
- ip
- login
auth.ip
This module authenticates clients based on the IP address of the connecting machine. The option provided is:
option auth.ip.<VOLUMENAME>.allow <List of IP addrs> # separated by comma ','
This option is required only in protocol/server volume.
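As an illustration of how a comma-separated allow list with wildcards might be evaluated, here is a short Python sketch. It assumes that `*` behaves like a shell glob, which this page does not confirm; `ip_allowed` is a hypothetical helper and not the server's actual matching code.

```python
from fnmatch import fnmatchcase


def ip_allowed(client_ip, allow_option):
    """Check a client IP against a comma-separated allow list.

    Assumption: '*' in a pattern matches any run of characters (glob style).
    """
    patterns = [p.strip() for p in allow_option.split(",")]
    return any(fnmatchcase(client_ip, p) for p in patterns)


# The allow list from the multiple-address example above.
ALLOW = "192.168.1.10,192.168.1.20,192.168.2.*"
local_ok = ip_allowed("192.168.2.77", ALLOW)
```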
auth.login
This module provides username/password authentication.
Options in protocol/server:
option auth.login.<VOLUMENAME>.allow <list of users> # separated by comma
option auth.login.<USERNAME>.password <PASSWORD>
Options in protocol/client:
option username <USERNAME>
option password <PASSWORD>
client
Client translator allows you to attach to remote volumes exported by GlusterFS servers.
### Add client feature and attach to remote subvolume of server1
volume client1
  type protocol/client
  option transport-type tcp/client # for TCP/IP transport
  # option transport-type ib-sdp/client # for Infiniband transport
  # option transport-type ib-verbs/client # For Infiniband Verbs transport
  # option ib-verbs-work-request-recv-size 1048576 # Higher performance if it's equal to read-ahead size
  # option ib-verbs-work-request-recv-count 16
  # option ib-verbs-work-request-send-size 1048576 # Higher performance if it's equal to write-behind size
  # option ib-verbs-work-request-send-count 16
  option remote-host 192.168.1.10 # IP address of the remote brick
  # option remote-port 6996 # default server port is 6996
  # option transport-timeout 30 # seconds to wait for a response from server for each request
  option remote-subvolume brick # name of the remote volume
end-volume
Available transport modules for client protocol are:
- tcp/client: TCP/IP based interconnects.
- ib-sdp/client: Infiniband Sockets Direct Protocol transport interface.
- ib-verbs/client: Infiniband Verbs transport interface.
Encryption Translators
rot-13
ROT-13 is a toy translator that can "encrypt" and "decrypt" file contents using the ROT-13 algorithm. ROT-13 is a trivial algorithm that rotates each letter of the alphabet by thirteen places. Thus, 'A' becomes 'N', 'B' becomes 'O', and 'Z' becomes 'M'.
It goes without saying that you shouldn't use this translator if you need _real_ encryption (a future release of GlusterFS will have real encryption translators).
- encrypt-write [on|off] (default: on): whether to encrypt on write
- decrypt-read [on|off] (default: on): whether to decrypt on read
Example:
volume rot-13
  type encryption/rot-13
  option encrypt-write on # [on|off], default is on
  option decrypt-read on # [on|off], default is on
  subvolumes brick
end-volume
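The rotation itself is easy to reproduce outside GlusterFS. The Python sketch below uses the standard library's `rot_13` text codec to show the mapping described above; the `rot13` wrapper is just an illustrative name and is unrelated to the translator's own code.

```python
import codecs


def rot13(text):
    """Rotate each ASCII letter by thirteen places; other characters are unchanged."""
    return codecs.encode(text, "rot_13")


# Applying ROT-13 twice returns the original text, so the same
# operation serves as both "encrypt" and "decrypt".
round_trip = rot13(rot13("GlusterFS"))
```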
Example Client Volume Specification File
Here is a simple spec file that uses all the clustering translators. You can remove the translators you don't need; when you do, update the 'subvolumes' option of the translator above the removed one accordingly.
volume client1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.1
  option remote-subvolume ra
end-volume

volume client2
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.2
  option remote-subvolume ra
end-volume

volume client3
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.3
  option remote-subvolume ra
end-volume

volume client4
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.4
  option remote-subvolume ra
end-volume

volume client5
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.5
  option remote-subvolume ra
end-volume

volume client6
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.6
  option remote-subvolume ra
end-volume

volume client7
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.7
  option remote-subvolume ra
end-volume

volume client8
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.8
  option remote-subvolume ra
end-volume

volume client-ns
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.10.1
  option remote-subvolume brick-ns
end-volume

volume stripe1
  type cluster/stripe
  subvolumes client1 client2
  option block-size *:10KB
end-volume

volume stripe2
  type cluster/stripe
  subvolumes client3 client4
  option block-size *:10KB
end-volume

volume stripe3
  type cluster/stripe
  subvolumes client5 client6
  option block-size *:10KB
end-volume

volume stripe4
  type cluster/stripe
  subvolumes client7 client8
  option block-size *:10KB
end-volume

volume afr1
  type cluster/afr
  subvolumes stripe1 stripe2
  option replicate *:2
end-volume

volume afr2
  type cluster/afr
  subvolumes stripe3 stripe4
  option replicate *:2
end-volume

volume unify0
  type cluster/unify
  subvolumes afr1 afr2
  option namespace client-ns
  option scheduler rr
  option rr.limits.min-free-disk 5
end-volume

volume iot
  type performance/io-threads
  subvolumes unify0
  option thread-count 8
end-volume

volume wb
  type performance/write-behind
  subvolumes iot
end-volume

volume ra
  type performance/read-ahead
  subvolumes wb
end-volume

volume ioc
  type performance/io-cache
  subvolumes ra
end-volume
Example Server Volume Specification File
# Namespace posix
volume brick-ns
  type storage/posix
  option directory /tmp/export-ns
end-volume

volume brick
  type storage/posix
  option directory /tmp/export
end-volume

volume posix-locks
  type features/posix-locks
  option mandatory on
  subvolumes brick
  # subvolumes trash # enable this if you need trash can support (NOTE: not present in 1.3.0-pre5+ releases)
end-volume

volume io-thr
  type performance/io-threads
  subvolumes posix-locks
end-volume

volume wb
  type performance/write-behind
  subvolumes io-thr
end-volume

volume ra
  type performance/read-ahead
  subvolumes wb
end-volume

volume server
  type protocol/server
  subvolumes ra brick-ns
  option transport-type tcp/server
  option client-volume-filename /etc/glusterfs/glusterfs-client.vol
  option auth.ip.ra.allow *
  option auth.ip.brick-ns.allow *
end-volume

