Translators v1.4


The options of each translator are listed at this link.

FIXME: Work in progress..

Storage translators

These translators hook up to the exported directory and give the filesystem its POSIX compliance.

posix

GlusterFS relies on disk filesystems (such as ext2, ext3, xfs, reiserfs, etc.) to handle block device management. The POSIX translator binds the GlusterFS server to the underlying filesystem.

volume posix1
  type storage/posix               # POSIX FS translator
  option directory /home/export    # Export this directory
end-volume

Berkeley DB

The Berkeley DB (BDB) translator aims to solve the problem of keeping millions or billions of small files inside a single directory. The files are not saved as individual files inside each directory; instead, they are saved inside a db file (glusterfs-storage.db). The directory tree is still present, with one db per directory. BDB's recovery and transaction handling are taken care of.

volume bdb
  type storage/bdb
  option directory /tmp/bdb-export # Export Point, also HOME for DB_ENV
  #option transaction off   # default is on
  #option cache on          # default is off
  #option access-mode btree # default will be hash
  option checkpoint-timeout 10 # default is 30 seconds
  #option file-mode 0644  # default is 0644
  #option dir-mode 0755   # default is 0755
  option lru-limit 200    # default is 100
  #option errfile /tmp/bdberrlog # default is /dev/null?
  #option logdir /tmp/dbd-logdir # default is <dir> in 'option directory <dir>'
end-volume

Requirements

  • GlusterFS requires the latest version of Berkeley DB, 4.7.25. It may work with older versions, but Oracle itself advises using the latest BDB in production, and the GlusterFS team encountered many known issues with older DB versions, which proved unusable.

Protocol Translators

server

The server translator allows you to export volumes over the network. It implements transport modules for various interconnects.

### Add network serving capability to above brick.
volume server
  type protocol/server
  option transport-type tcp/server       # For TCP/IP transport
# option transport-type ib-sdp/server    # For Infiniband transport
# option transport-type ib-verbs/server # For Infiniband Verbs transport
# option ib-verbs-work-request-recv-size   1048576  # Higher performance if it's equal to read-ahead size
# option ib-verbs-work-request-recv-count  16
# option ib-verbs-work-request-send-size   1048576  # Higher performance if it's equal to write-behind size
# option ib-verbs-work-request-send-count  16
# option bind-address 192.168.1.10       # Default is to listen on all interfaces
# option listen-port 6996                # Default is 6996
# option client-volume-filename /etc/glusterfs/glusterfs-client.vol
  subvolumes brick1 brick2
  option auth.addr.brick1.allow 192.168.* # Allow access to "brick1" volume
  option auth.addr.brick2.allow 192.168.* # Allow access to "brick2" volume
end-volume

Available transport modules for server protocol are:

  • tcp: TCP/IP based interconnects.
  • ib-sdp: Infiniband Sockets Direct Protocol transport interface.
  • ib-verbs: Infiniband Verbs transport interface.

Authenticate modules

To allow multiple IP addresses or subnets, list them one after the other, separated by commas, as shown below.

option auth.addr.brick1.allow 192.168.1.10,192.168.1.20,192.168.2.*

NOTE: Valid for versions above 1.3.7.

Because GlusterFS is a network filesystem and security is a growing need when storing data, authenticating clients before they connect is very important. GlusterFS currently supports two authentication modules:

  • addr
  • login

auth.addr

This module authenticates clients based on the IP address of the connecting machine. The options provided are:

option auth.addr.<VOLUMENAME>.allow <List of IP addrs> # separated by comma ','
option auth.addr.<VOLUMENAME>.reject <List of IP addrs> # separated by comma ','

These options are required only in the protocol/server volume.
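
The address check can be pictured with a short Python sketch. It is only an illustration: it assumes shell-style wildcard matching and that a reject match takes precedence over an allow match, neither of which is stated on this page. The helper name is_client_allowed is hypothetical and not part of GlusterFS.

```python
from fnmatch import fnmatchcase


def is_client_allowed(client_ip, allow, reject=""):
    """Check a client IP against comma-separated allow/reject patterns.

    Assumption (not from the GlusterFS docs): reject wins over allow.
    """
    allow_patterns = [p.strip() for p in allow.split(",") if p.strip()]
    reject_patterns = [p.strip() for p in reject.split(",") if p.strip()]
    # A client matching any reject pattern is refused outright.
    if any(fnmatchcase(client_ip, p) for p in reject_patterns):
        return False
    # Otherwise the client must match at least one allow pattern.
    return any(fnmatchcase(client_ip, p) for p in allow_patterns)
```

With the allow list from the example above, is_client_allowed("192.168.2.77", "192.168.1.10,192.168.1.20,192.168.2.*") accepts the client, while an address outside those patterns is refused.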

auth.login

This module provides username/password authentication.

Options in protocol/server:

option auth.login.<VOLUMENAME>.allow <list of users> # separated by comma
option auth.login.<USERNAME>.password <PASSWORD> 

Options in protocol/client:

option username <USERNAME>
option password <PASSWORD>


client

The client translator allows you to attach to remote volumes exported by GlusterFS servers.

### Add client feature and attach to remote subvolume of server1
volume client1
  type protocol/client
  option transport-type tcp/client       # for TCP/IP transport
# option transport-type ib-sdp/client    # for Infiniband transport
# option transport-type ib-verbs/client # For Infiniband Verbs transport
# option ib-verbs-work-request-recv-size   1048576  # Higher performance if it's equal to read-ahead size
# option ib-verbs-work-request-recv-count  16
# option ib-verbs-work-request-send-size   1048576  # Higher performance if it's equal to write-behind size
# option ib-verbs-work-request-send-count  16
  option remote-host 192.168.1.10        # IP address of the remote brick
# option remote-port 6996                # default server port is 6996
# option transport-timeout 30            # seconds to wait for a response 
                                         # from server for each request
  option remote-subvolume brick          # name of the remote volume
end-volume

Available transport modules for client protocol are:

  • tcp: TCP/IP based interconnects.
  • ib-sdp: Infiniband Sockets Direct Protocol transport interface.
  • ib-verbs: Infiniband Verbs transport interface.


Clustering Translators

The clustering translators take more than one subvolume.

DHT (Distributed Hash Table) Translator

The DHT translator, or simply the hash translator, is designed for O(1) scalability. It doesn't need a namespace translator, so it is a significant improvement for applications that use lots of small files.

volume bricks
  type cluster/dht
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
end-volume

See Understanding DHT Translator for more technical details.

NUFA Translator

The NUFA (Non Uniform File Access) translator is designed to give higher preference to the local drive in HPC-type environments.

volume bricks
  type cluster/nufa
  option local-volume-name brick1
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
end-volume

Refer to the NUFA_with_single_process example for a proper usage scenario with NUFA.

Automatic File Replication Translator (AFR)

AFR provides RAID-1-like functionality. AFR replicates files and directories across its subvolumes. Hence, if AFR has four subvolumes, there will be four copies of all files and directories. If one of the subvolumes goes down (e.g. server crash, network disconnection), AFR will still service requests from the redundant copies.

AFR also provides self-healing functionality. When the crashed servers come back up, the outdated files and directories are updated with the latest versions. AFR uses the extended attributes of the backend filesystem to track pending activity on files and directories, which is what makes self-heal possible.

volume afr-example
  type cluster/afr
  subvolumes brick1 brick2 brick3
end-volume

The above example volfile will replicate all directories and files on brick1, brick2 and brick3. The subvolumes can themselves be other translators.

NOTE: AFR needs extended attribute support in the underlying FS, and also the 'posix-locks' translator loaded over the posix translator.

Refer to Understanding AFR Translator to see more volume files, and understand the design.

Stripe Translator

The stripe translator splits input files into blocks of the given block-size (default value is 128k) and distributes them across its subvolumes (or child nodes).

NOTE: Stripe needs extended attribute support in the underlying FS.

volume stripe
  type cluster/stripe
  option block-size 1MB
  subvolumes brick1 brick2 brick3 brick4
end-volume
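
To make the effect of block-size concrete, here is a minimal Python sketch of how a byte offset could map to a subvolume. It assumes blocks are placed round-robin across the subvolumes in the order listed, which this page does not state; stripe_location is a hypothetical helper, not a GlusterFS function.

```python
def stripe_location(offset, block_size, subvolumes):
    """Return (subvolume, block_index) for the byte at `offset`.

    Assumption: block N lives on subvolume N modulo the subvolume count.
    """
    block_index = offset // block_size
    return subvolumes[block_index % len(subvolumes)], block_index


MB = 1024 * 1024
bricks = ["brick1", "brick2", "brick3", "brick4"]
first = stripe_location(0, 1 * MB, bricks)
later = stripe_location(5 * MB + 10, 1 * MB, bricks)
```

Under that assumption, with the 1MB block-size from the example above, the first byte of a file lands on brick1 and a byte a little past the 5MB mark lands on brick2.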

HA Translator

The HA (High Availability) translator provides a failover mechanism between two volumes. The two volumes can be two servers exporting a big clustered volume, or the same server reached over two different interfaces (IB and TCP).

volume ha
  type cluster/ha
  subvolumes interface1 interface2
end-volume

Unify Translator

If your setup is fresh, the hash translator will suit it better. If you already have some existing data, use unify. The unify translator combines multiple storage bricks into one big, fast storage server. For I/O scheduling, you can bind your preferred I/O scheduler module to the unify volume. You have a variety of I/O schedulers to pick from, based on your application requirements.

Refer to the Understanding Unify Translator page to learn more about the unify translator.

volume unify
   type cluster/unify
   subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7 brick8
   option namespace brick-ns # should be a node which is not present in 'subvolumes'
   option scheduler rr    # simple round-robin scheduler
end-volume


GlusterFS Schedulers

A scheduler decides how to distribute new file-creation operations across the clustered filesystem based on load, availability and other determining factors. Here is a list of I/O schedulers you can pick from:

ALU Scheduler

ALU stands for "Adaptive Least Usage". It is the most advanced scheduler available in GlusterFS. It balances the load across volumes, taking several factors into account. It adapts itself to changing I/O patterns, according to its configuration. When properly configured, it can eliminate the need for regular tuning of the filesystem to keep volume load nicely balanced.

The ALU scheduler is composed of multiple least-usage sub-schedulers. Each sub-scheduler keeps track of a certain type of load, for each of the subvolumes, getting the actual statistics from the subvolumes themselves. The sub-schedulers are these:

  • disk-usage - the used and free disk space on the volume
  • read-usage - the amount of reading done from this volume
  • write-usage - the amount of writing done to this volume
  • open-files-usage - the number of files currently opened from this volume
  • disk-speed-usage - the speed at which the disks are spinning. This is a constant value and therefore not very useful.

The ALU scheduler needs to know which of these sub-schedulers to use, and in which order to evaluate them. This is done through the "option alu.order" configuration directive.

Each sub-scheduler needs to know two things: when to kick in (the entry-threshold), and how long to stay in control (the exit-threshold). For example: when unifying three disks of 100GB, keeping an exact balance of disk-usage is not necessary. Instead, there could be a 1GB margin, which can be used to nicely balance other factors, such as read-usage. The disk-usage scheduler can be told to kick in only when a certain threshold of discrepancy is passed, such as 1GB. When it assumes control under this condition, it will write all subsequent data to the least-used volume. If it is doing so, it is unwise to stop right after the values are below the entry-threshold again, since that would make it very likely that the situation will occur again very soon. Such a situation would cause the ALU to spend most of its time disk-usage scheduling, which is unfair to the other sub-schedulers. The exit-threshold therefore defines the amount of data that needs to be written to the least-used disk, before control is relinquished again.
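
The entry/exit behaviour described above is a form of hysteresis, and a small Python model may help to picture it. This is an illustrative sketch only, not the ALU implementation: the class name SubScheduler is hypothetical, and it assumes the sub-scheduler lets go once the discrepancy has shrunk to entry-threshold minus exit-threshold.

```python
class SubScheduler:
    """Toy model of one least-usage sub-scheduler with entry/exit thresholds."""

    def __init__(self, entry_threshold, exit_threshold):
        self.entry = entry_threshold
        self.exit = exit_threshold
        self.active = False

    def pick(self, usage):
        """usage maps volume name -> load. Returns a volume, or None to pass."""
        discrepancy = max(usage.values()) - min(usage.values())
        if not self.active and discrepancy > self.entry:
            self.active = True    # imbalance passed the entry-threshold
        elif self.active and discrepancy <= self.entry - self.exit:
            self.active = False   # enough was rebalanced; relinquish control
        # While in control, always direct new files to the least-used volume.
        return min(usage, key=usage.get) if self.active else None


# Disk usage in MB, with a 2GB (2048MB) entry and 60MB exit threshold.
disk = SubScheduler(entry_threshold=2048, exit_threshold=60)
```

Once this model kicks in at a discrepancy above 2048MB, it keeps choosing the least-used volume until the discrepancy has fallen to 1988MB, and it does not kick in again until the discrepancy exceeds 2048MB once more.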

In addition to the sub-schedulers, the ALU scheduler also has "limits" options. These can stop the creation of new files on a volume once values drop below a certain threshold. For example, setting "option alu.limits.min-free-disk 5GB" will stop the scheduling of files to volumes that have less than 5GB of free disk space, leaving the files on that disk some room to grow.

The actual values you assign to the thresholds for sub-schedulers and limits depend on your situation. If you have fast-growing files, you'll want to stop file-creation on a disk much earlier than when hardly any of your files are growing. If you care less about disk-usage balance than about read-usage balance, you'll want a bigger disk-usage scheduler entry-threshold and a smaller read-usage scheduler entry-threshold.

For thresholds defining a size, percentage of free space is allowed. For example: "option alu.limits.min-free-disk 5%".

  • ALU Scheduler Volume example
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5
  option alu.read-only-subvolumes brick5 # This option makes brick5 read-only, so no new files are created on it.
  option scheduler alu   # use the ALU scheduler
  option alu.limits.min-free-disk  5%      # Don't create files on a volume with less than 5% free diskspace
  option alu.limits.max-open-files 10000   # Don't create files on a volume with more than 10000 files open
  
  # When deciding where to place a file, first look at the disk-usage, then at  
  # read-usage, write-usage, open files, and finally the disk-speed-usage.
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
  option alu.disk-usage.entry-threshold 2GB   # Kick in if the discrepancy in disk-usage between volumes is more than 2GB
  option alu.disk-usage.exit-threshold  60MB   # Don't stop writing to the least-used volume until the discrepancy is 1988MB 
  option alu.open-files-usage.entry-threshold 1024   # Kick in if the discrepancy in open files is 1024
  option alu.open-files-usage.exit-threshold 32   # Don't stop until 992 files have been written to the least-used volume
# option alu.read-usage.entry-threshold 20%   # Kick in when the read-usage discrepancy is 20%
# option alu.read-usage.exit-threshold 4%   # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
# option alu.write-usage.entry-threshold 20%   # Kick in when the write-usage discrepancy is 20%
# option alu.write-usage.exit-threshold 4%   # Don't stop until the discrepancy has been reduced to 16%
# option alu.disk-speed-usage.entry-threshold # NEVER SET IT. SPEED IS CONSTANT!!!
# option alu.disk-speed-usage.exit-threshold  # NEVER SET IT. SPEED IS CONSTANT!!!
  option alu.stat-refresh.interval 10sec   # Refresh the statistics used for decision-making every 10 seconds
# option alu.stat-refresh.num-file-create 10   # Refresh the statistics used for decision-making after creating 10 files
end-volume

NUFA Scheduler

The NUFA (Non-Uniform Filesystem) scheduler is similar to the NUMA (http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access) memory design. It is mainly used in HPC environments where you are required to run the filesystem server and client within the same cluster. In such an environment, the NUFA scheduler gives the local system higher priority for file creation than other nodes.

volume posix1
  type storage/posix               # POSIX FS translator
  option directory /home/export    # Export this directory
end-volume 

volume bricks
  type cluster/unify
  subvolumes posix1 brick2 brick3 brick4
  option scheduler nufa
  option nufa.local-volume-name posix1
  option nufa.limits.min-free-disk 5%
end-volume

NOTE: NUFA now supports specifying more than one local volume.

Random Scheduler

Random scheduler randomly scatters file creation across storage bricks.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler random
  option random.limits.min-free-disk 5%
end-volume

Round-Robin Scheduler

The Round-Robin (RR) scheduler creates files in a round-robin fashion. Each client has its own round-robin loop. When your files are mostly similar in size and I/O access pattern, this scheduler is a good choice. The RR scheduler now checks the free disk space of the server before scheduling, so you can tell when to add another server brick. The default value of min-free-disk is 5%, and it is checked every 10 seconds (by default) if any create call is happening.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler rr
  option rr.read-only-subvolumes brick4  # No files will be created in 'brick4'
  option rr.limits.min-free-disk 5%          # Unit in %
  option rr.refresh-interval 10               # Check server brick after 10s.
end-volume
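
The selection loop can be modelled in a few lines of Python. This is a simplified sketch rather than the scheduler's actual code: make_rr_scheduler is a hypothetical name, the free-space figures are supplied by the caller instead of being refreshed from the servers, and returning None when every brick is below the limit is an assumption.

```python
from itertools import cycle


def make_rr_scheduler(subvolumes, read_only=(), min_free_pct=5):
    """Return a function that picks the subvolume for the next new file."""
    candidates = [s for s in subvolumes if s not in read_only]
    ring = cycle(candidates)

    def schedule(free_pct):
        # free_pct maps subvolume name -> percentage of free disk space.
        for _ in range(len(candidates)):
            vol = next(ring)
            if free_pct[vol] >= min_free_pct:
                return vol
        return None  # no writable subvolume has enough free space

    return schedule


schedule = make_rr_scheduler(
    ["brick1", "brick2", "brick3", "brick4"], read_only={"brick4"}
)
free = {"brick1": 50, "brick2": 3, "brick3": 40, "brick4": 90}
picks = [schedule(free) for _ in range(4)]
```

Here brick4 never receives new files because it is read-only, and brick2 is skipped because it has less than 5% free space, so creation alternates between brick1 and brick3.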


Switch Scheduler

The Switch scheduler is the latest addition to the GlusterFS code base. It schedules files according to the filename patterns specified, as the example below illustrates.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
  option scheduler switch
  option switch.case *jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6
  option switch.read-only-subvolumes brick7
end-volume

The above is a snapshot of just the unify translator in a spec file. Here, files matching the pattern '*jpg' are created on brick1 and brick2, files matching '*mpg' are created on brick3, and all other files are created on brick4, brick5 and brick6. brick7 is a read-only subvolume, from which data can only be read.
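
As a rough illustration of how such a switch.case string could be interpreted, the Python sketch below parses the rule list and looks up the candidate bricks for a filename. It assumes shell-style pattern matching and that the first matching pattern wins; both are assumptions, and parse_switch_case and candidate_bricks are hypothetical helpers, not GlusterFS functions.

```python
from fnmatch import fnmatchcase


def parse_switch_case(spec):
    """Split 'pattern:brick,brick;pattern:brick' into (pattern, bricks) rules."""
    rules = []
    for clause in spec.split(";"):
        pattern, _, bricks = clause.partition(":")
        rules.append((pattern.strip(), [b.strip() for b in bricks.split(",")]))
    return rules


def candidate_bricks(filename, rules):
    """Return the bricks of the first rule whose pattern matches filename."""
    for pattern, bricks in rules:
        if fnmatchcase(filename, pattern):
            return bricks
    return []


rules = parse_switch_case("*jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6")
```

For example, candidate_bricks("photo.jpg", rules) yields brick1 and brick2, while a name that matches neither of the first two patterns falls through to the catch-all '*' rule.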

Performance translators

All of the performance translators should work fine when loaded on both the client side and server side.

Read Ahead Translator

read-ahead pre-fetches a sequence of blocks in advance based on its predictions. While your application is busy crunching the data it has read, glusterfs can pre-read the next batch of data and keep it ready. That way consecutive reads are faster. It also behaves as a read-aggregator, i.e. smaller I/O read operations are combined internally into fewer, larger read operations to reduce network and disk load. page-size describes the block size and page-count describes the number of blocks to pre-fetch.

volume readahead
  type performance/read-ahead
  option page-size 128kB        # 256KB is the default option
  option page-count 4           # 2 is default option
  option force-atime-update off # default is off
  subvolumes <x>
end-volume

NOTE: This translator is most useful with the IB-verbs transport or a 10 Gigabit Ethernet interface. With Fast Ethernet and GigE interfaces, one can achieve the maximum link throughput without read-ahead.

Write Behind Translator

In general, write operations are slower than reads. The write-behind translator improves write performance significantly by using an "aggregated background write" technique. That is, multiple smaller write operations are aggregated into fewer, larger write operations and written in the background (non-blocking). Write-behind on the client aggregates small writes into larger ones, reducing network packet counts. On the server side, it helps when writes arrive in very small chunks by reducing disk-head seek time.

aggregate-size determines the block size till which write data should be aggregated. Depending upon your interconnect, RAM size and work load profile, you should tune this value. The default value of 128KB works well for most users. Increasing or decreasing this value beyond certain ranges will bring down your performance. You should always benchmark with an increasing range of aggregate-size and analyze the results to choose an optimal value.
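
The aggregation idea can be sketched in Python as a buffer that forwards data only once aggregate-size bytes have accumulated. This is a deliberately simplified model, not the translator's code: WriteAggregator is a hypothetical class, it ignores file offsets and background threads, and it treats every write as contiguous with the previous one.

```python
class WriteAggregator:
    """Collect small writes and forward them as fewer, larger writes."""

    def __init__(self, aggregate_size, send):
        self.aggregate_size = aggregate_size
        self.send = send          # callable that performs the real write
        self.buffer = bytearray()

    def write(self, data):
        self.buffer += data
        # Forward only once enough data has accumulated.
        if len(self.buffer) >= self.aggregate_size:
            self.flush()

    def flush(self):
        # Push out whatever is pending, e.g. on close().
        if self.buffer:
            self.send(bytes(self.buffer))
            self.buffer.clear()


sent = []
wb = WriteAggregator(aggregate_size=8, send=sent.append)
for chunk in (b"abc", b"def", b"gh"):
    wb.write(chunk)
```

In this run the three small writes reach the underlying layer as a single 8-byte write, which is the packet-count reduction described above.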

The flush-behind option is also provided to improve performance when handling lots of small files. With this option, the close()/flush() call can be pushed to the background, allowing the client to process the next request. It is off by default.

volume writebehind
  type performance/write-behind
  option aggregate-size 1MB # default is 0bytes
  option window-size 3MB    # default is 0bytes
  option flush-behind on    # default is 'off'
  subvolumes <x>
end-volume

Note: Currently the protocol translator has an upper limit of 4MB of data per request/reply packet. Hence, if you use write-behind on the client side (as in most cases) with an aggregate-size greater than 4MB, it will fail to send the bigger packet.

Threaded I/O Translator

IO-threads adds asynchronous (background) filesystem operations. By loading this translator, you can use the time the server would otherwise spend idle and blocked to handle new incoming requests. CPU, memory and network are not utilized while the server is blocked on read or write calls during disk DMA. This translator makes the best use of all resources under load and improves concurrent I/O performance.

volume iothreads
  type performance/io-threads
  option thread-count 8  # default is 1
  subvolumes <x>
end-volume

IO-Cache Translator

The IO-Cache translator is useful on both the client and server sides of a connection.

If this translator is loaded on the client side, it may help reduce the load on both the network and the server when the client is accessing files just for reading (and the files are not edited on the server between reads). This would, for example, greatly speed up the compilation of a kernel, where header files are accessed over and over.

If this translator is loaded on the server side, it will allow the server to keep data that is being accessed from multiple clients simultaneously fresh in its cache.


A sample IO-Cache config:

volume io-cache
  type performance/io-cache
  option cache-size 64MB             # default is 32MB
  option page-size 1MB               # 128KB is default option
  option priority *.h:3,*.html:2,*:1 # default is '*:0'
  option cache-timeout 2             # default is 1 second
  subvolumes <x>
end-volume

The cache-size parameter determines the amount of memory dedicated to the cache. The page-size is the smallest chunk of data cached for a file. The cache-timeout is only used to determine when to update file attributes from the server. File data is always verified against the server to ensure the cache has the latest copy.

Extra Features Translators

locks

This translator provides storage independent POSIX record locking support (fcntl locking). Typically you'll want to load this on the server side, just above the POSIX storage translator. Using this translator you can get both advisory locking and mandatory locking support. This also implements more locking mechanisms required for GlusterFS itself.

volume locks
  type features/locks
  subvolumes brick
end-volume

Note: Consider a file that does not have its mandatory locking bits (+setgid, -group execution) turned on. Assume that this file is now opened by a process on a client that has the write-behind xlator loaded. The write-behind xlator does not cache anything for files which have mandatory locking enabled, to avoid incoherence. Let's say that mandatory locking is now enabled on this file through another client. The former client will not know about this change, and write-behind may erroneously report a write as being successful when in fact it would fail due to the region it is writing to being locked.

There seems to be no easy way to fix this. To work around this problem, it is recommended that you never enable the mandatory bits on a file while it is open.

filter

An advanced filtering translator based on user ID or group ID. It currently implements root-squashing and uid-mapping options.

volume brick-readonly
  type features/filter
  option root-squashing enable
  option translate-uid 501-1000=10000,1000-1999=10001
  subvolumes brick
end-volume
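
One way to read the translate-uid value is as a list of UID ranges, each mapped to a single target UID. The Python sketch below follows that reading. It is an assumption-laden illustration: it treats ranges as inclusive, lets the first matching range win (the example's ranges overlap at 1000), and leaves unmatched UIDs unchanged. parse_uid_map and translate_uid are hypothetical helpers, not GlusterFS functions.

```python
def parse_uid_map(spec):
    """Parse 'low-high=target,low-high=target' into (low, high, target) rules."""
    rules = []
    for clause in spec.split(","):
        id_range, _, target = clause.partition("=")
        low, _, high = id_range.partition("-")
        rules.append((int(low), int(high), int(target)))
    return rules


def translate_uid(uid, rules):
    """Map uid through the first inclusive range that contains it."""
    for low, high, target in rules:
        if low <= uid <= high:
            return target
    return uid  # assumption: UIDs outside every range pass through unchanged


uid_rules = parse_uid_map("501-1000=10000,1000-1999=10001")
```

Under these assumptions, UID 750 maps to 10000 and UID 1500 maps to 10001.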

trash

This translator provides a 'libtrash'-like feature (some users may prefer to call it a recycle bin). It is best utilized when loaded on the server side.

volume trash
  type features/trash
  option trash-dir /.trashcan
  subvolumes brick
end-volume

path-converter

This translator enables one to convert the path internally.

quota

Enables a basic quota to make sure a GlusterFS export doesn't grow beyond a certain size, or that it can grow only until the last 'min-disk-free' percent of the disk is free.

Debug translators

These translators are useful when you want to debug the filesystem.

trace

The trace translator produces extensive trace information for debugging purposes. The debug information is written to the GlusterFS log file, which by default is found in /var/log/gluster/glusterfs.log. A trace volume can be inserted or layered on top of any volume that needs to be debugged. All calls are logged along with their arguments/values.

### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/export        # Export this directory
end-volume

### Trace storage/posix translator.
volume trace
  type debug/trace
  subvolumes brick
#  option include open,close,create,readdir,opendir,closedir
#  option exclude lookup,read,write
end-volume

NOTE: To trace only a few calls through the trace translator, use "option include <fopslist>". If most calls need to be traced and only a few are not required, use "option exclude <fopslist>".

Encryption Translators

rot-13

ROT-13 is a toy translator that can "encrypt" and "decrypt" file contents using the ROT-13 algorithm. ROT-13 is a trivial algorithm that rotates each letter of the alphabet by thirteen places. Thus, 'A' becomes 'N', 'B' becomes 'O', and 'Z' becomes 'M'.

It goes without saying that you shouldn't use this translator if you need _real_ encryption (a future release of GlusterFS will have real encryption translators).
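
The rotation itself is easy to try out in Python, whose standard library ships a rot_13 text codec. The sketch below only demonstrates the algorithm on a string; it says nothing about how the translator processes file contents, and the rot13 wrapper is a name chosen for this example.

```python
import codecs


def rot13(text):
    """Rotate each ASCII letter by thirteen places; other characters pass through."""
    return codecs.encode(text, "rot_13")


encoded = rot13("ABZ")
round_trip = rot13(rot13("GlusterFS"))
```

Because the alphabet has 26 letters, applying the rotation twice returns the original text, which is why the same operation serves for both "encryption" and "decryption".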

`encrypt-write [on|off] (on)' Whether to encrypt on write

`decrypt-read [on|off] (on)' Whether to decrypt on read

Example:

volume rot-13
  type encryption/rot-13
  option encrypt-write on   # [on|off], default is on
  option decrypt-read on    # [on|off], default is on
  subvolumes brick
end-volume
