Understanding AFR Translator
AFR provides RAID-1-like functionality by replicating files and directories across its subvolumes. If AFR has four subvolumes, there will be four copies of every file and directory. AFR also provides high availability (HA): if one of the subvolumes goes down (e.g. a server crash or a network disconnection), AFR continues to service requests from the redundant copies. Be sure to read the section on split brain below to understand some of the potential pitfalls of this mode of operation.
AFR also provides healing functionality, in case of inconsistent files and directories across subvolumes. This healing allows for recovery after the network or a node was unavailable. During healing, the outdated files and directories will be updated with the latest versions. AFR uses extended attributes of the backend file system to track the versioning of files and directories to provide the healing feature.
Configuration
This sample configuration replicates all directories and files across brick1, brick2 and brick3. Each subvolume can be another translator (storage/posix or protocol/client).
volume afr-example
  type cluster/afr
  option replicate *:3                 # required for rev <= 1.3.7
  subvolumes brick1 brick2 brick3
end-volume
Note: For releases 1.3.7 and older, "option replicate *html:2,*txt:1" pattern matching feature is required in order for replication to work. Releases after 1.3.7 have moved this pattern matching feature out of AFR. It should be implemented using unify's switch.case scheduler.
read() operations are scheduled across the subvolumes for load balancing.
If one of the subvolumes is local storage, it is advantageous to do all
reads from that subvolume. In that case, set
"option read-subvolume brick2", where brick2 is the local storage subvolume.
"option read-subvolume *" schedules reads across all the children, which
is the default behavior. A given file is always read from the same
subvolume, so that reads take advantage of caching in the servers.
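The read scheduling described above can be pictured with a small Python sketch. It is only an illustration: the text does not say how AFR maps a file to a subvolume, so the CRC32 hash used here and the function name pick_read_subvolume are assumptions, not the translator's actual logic.

```python
import zlib


def pick_read_subvolume(path, subvolumes, read_subvolume="*"):
    """Return the subvolume that would serve reads for `path`.

    `read_subvolume` mirrors "option read-subvolume": a specific name pins
    every read to that subvolume, while "*" spreads files across all children.
    """
    if read_subvolume != "*":
        if read_subvolume not in subvolumes:
            raise ValueError(f"unknown subvolume: {read_subvolume}")
        return read_subvolume
    # Assumption: a stable hash of the path. The same path always maps to the
    # same child, so that server's cache stays warm for this file.
    index = zlib.crc32(path.encode("utf-8")) % len(subvolumes)
    return subvolumes[index]


bricks = ["brick1", "brick2", "brick3"]
print(pick_read_subvolume("/data/foo.txt", bricks))            # same child on every call
print(pick_read_subvolume("/data/foo.txt", bricks, "brick2"))  # always brick2
```

Different files can land on different children, which gives the load balancing, while any single file keeps returning to the same one.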
Healing
AFR has a healing feature that replaces outdated copies of files and directories with the most recent versions. The built-in healing uses a lazy algorithm: files and directories are not healed until they are requested.
Extended Attributes
To support healing, AFR uses extended attributes of the backend file system to track the versioning of files and directories. It is therefore important that your backend filesystem is compiled with extended attribute support and that your subvolume mounts enable extended attributes when using the AFR translator.
For example consider the following config:
volume afr-example
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick1 brick2
end-volume
File healing:
If we now create a file foo.txt on afr-example, the file is created on both brick1 and brick2. The file has two extended attributes associated with it in the backend filesystem: trusted.glusterfs.createtime and trusted.glusterfs.version. The trusted.glusterfs.createtime xattr holds the create time (in seconds since the epoch), and trusted.glusterfs.version is a number that is incremented each time the file is modified. The increment happens during close(), provided a write was done before the close.
If brick1 goes down and we edit foo.txt, the version on brick2 is incremented. When brick1 comes back up and we open() foo.txt, AFR checks whether the versions of the copies are the same. If they differ, the outdated copy is replaced by the latest copy and its version is updated. After the sync, the open() proceeds in the usual manner and the calling application can continue to access the file.
Suppose instead that brick1 goes down and we delete foo.txt and then create a new file with the same name. When brick1 comes back up, the version on brick1 may well be higher than the version on brick2. This is where the createtime extended attribute helps in deciding which copy is outdated. Hence both createtime and version must be considered to decide on the latest copy.
The version attribute is incremented during the close() call, and only if a write() was done. If the fd passed to close() was obtained from a create() call, the createtime extended attribute is also created at that point.
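The file healing decision can be sketched in a few lines of Python. This is a minimal illustration, not AFR's code: the Copy class and the function names are hypothetical, and the ordering (newer createtime first, then higher version) is an assumption inferred from the delete-and-recreate case described above.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    subvolume: str
    createtime: int  # trusted.glusterfs.createtime, seconds since epoch
    version: int     # trusted.glusterfs.version


def latest_copy(copies):
    # Assumption: a newer createtime wins outright; the version only ranks
    # copies that stem from the same creation.
    return max(copies, key=lambda c: (c.createtime, c.version))


def outdated_copies(copies):
    best = latest_copy(copies)
    return [c for c in copies
            if (c.createtime, c.version) != (best.createtime, best.version)]


# brick1 was down while foo.txt was deleted and recreated on brick2.
# brick1 still holds the old file, with a higher version but an older createtime.
brick1 = Copy("brick1", createtime=1000, version=7)
brick2 = Copy("brick2", createtime=2000, version=1)

print(latest_copy([brick1, brick2]).subvolume)                   # brick2
print([c.subvolume for c in outdated_copies([brick1, brick2])])  # ['brick1']
```

Comparing on version alone would pick brick1 here, which is exactly the mistake the createtime attribute is meant to prevent.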
Directory healing:
Suppose brick1 goes down, we delete foo.txt, and brick1 comes back up. We should not recreate foo.txt on brick2; instead we should delete foo.txt on brick1. This situation is handled by keeping createtime and version attributes on the directory, just as for files. When lookup() is done on the directory, AFR compares the createtime/version attributes of the copies, works out which files need to be deleted, deletes them, and updates the extended attributes of the outdated directory copy. Each time a directory is modified (a file or a subdirectory is created or deleted inside it) while one of the subvolumes is down, the directory's version is incremented.
lookup() is a call initiated by the kernel on a file or directory just before any access to that file or directory. In glusterfs, by default, lookup() will not be called in case it was called in the past one second on that particular file or directory.
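As a rough illustration of the directory case, the following Python sketch computes which entries would be removed from an outdated directory copy. The DirCopy class and plan_directory_heal function are hypothetical names, and ranking copies by (createtime, version) is an assumption; only the deletion step described above is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class DirCopy:
    subvolume: str
    createtime: int
    version: int
    entries: set = field(default_factory=set)


def plan_directory_heal(copies):
    """Map each outdated subvolume to the entries that would be deleted from it."""
    latest = max(copies, key=lambda c: (c.createtime, c.version))
    plan = {}
    for copy in copies:
        if (copy.createtime, copy.version) == (latest.createtime, latest.version):
            continue  # already up to date
        # Entries present on the outdated copy but gone from the latest copy.
        plan[copy.subvolume] = sorted(copy.entries - latest.entries)
    return plan


# brick1 was down while foo.txt was deleted; brick2's directory version was bumped.
brick1 = DirCopy("brick1", createtime=1000, version=1, entries={"foo.txt", "bar.txt"})
brick2 = DirCopy("brick2", createtime=1000, version=2, entries={"bar.txt"})

print(plan_directory_heal([brick1, brick2]))  # {'brick1': ['foo.txt']}
```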
The extended attributes can be seen in the backend filesystem using the getfattr command. (getfattr -n trusted.glusterfs.version <file>)
Pitfalls with healing:
Since the healing mechanism relies on timestamps to decide which subvolumes have the latest and most up-to-date version of files/directories, it is important that all nodes in a cluster have a tightly synchronized clock. One mechanism to achieve this is to use NTP. If node times are not in sync it is possible for healing to make the wrong decisions as to which objects are the latest.
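The effect of clock skew can be seen in a small Python sketch. It assumes, purely for illustration, that createtime is taken from the clock of whichever node performs the create and that copies are ranked by (createtime, version); the helper names are hypothetical.

```python
def recorded_createtime(true_time, clock_offset):
    # What a node with the given clock error would write into createtime.
    return true_time + clock_offset


def latest(copies):
    return max(copies, key=lambda c: (c["createtime"], c["version"]))


# foo.txt is created at true time 1000 by a node with a correct clock.
old = {"subvolume": "brick1", "createtime": recorded_createtime(1000, 0), "version": 5}

# brick1 goes down. foo.txt is deleted and recreated at true time 1060
# by a node whose clock runs 300 seconds slow.
new_skewed = {"subvolume": "brick2", "createtime": recorded_createtime(1060, -300), "version": 1}
print(latest([old, new_skewed])["subvolume"])  # brick1: the stale copy wins

# With synchronized clocks the recreated file is correctly seen as newer.
new_synced = {"subvolume": "brick2", "createtime": recorded_createtime(1060, 0), "version": 1}
print(latest([old, new_synced])["subvolume"])  # brick2
```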
Preemptive Self-Heal
Currently AFR does not perform active, preemptive self-heal; it is limited to lazy healing. That is, it does not fix all inconsistencies automatically but instead fixes them when a file is opened. To make sure that all AFR'd copies are in sync, the following command, which opens every file under the mount point, may help.
$ find /mnt/glusterfs -type f -exec head -n 1 {} \;
A faster alternative reads only the first byte of each file and discards the output:
$ find /mnt/glusterfs -type f -print0 | xargs -0 head -c1 >/dev/null
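The same idea can be expressed in Python when a script is more convenient than a shell pipeline. This is a sketch under the assumption that opening a file and reading one byte is enough to trigger the lazy heal, as with the head commands above; the function name touch_all_files is hypothetical, and the demo runs against a temporary directory rather than a real mount.

```python
import os
import tempfile


def touch_all_files(root):
    """Open every regular file under `root` and read one byte.

    Returns (opened, failed) counts. Symlinks are skipped, like find -type f.
    """
    opened = failed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    handle.read(1)  # the open() is what matters; the byte is discarded
                opened += 1
            except OSError:
                failed += 1
    return opened, failed


# Demo on a throwaway directory tree with two files.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.txt"), "wb") as f:
        f.write(b"hello")
    with open(os.path.join(root, "sub", "b.txt"), "wb") as f:
        f.write(b"")
    demo_result = touch_all_files(root)

print(demo_result)  # (2, 0)
```

On a real client the call would be touch_all_files("/mnt/glusterfs"), using the mount point from the commands above.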
Split Brain
Split brain operation occurs when some nodes of a cluster read from and write to a different dataset than other nodes, while both sets of nodes believe they are writing to the same authoritative dataset. Imagine a simple cluster with two servers, A and B, and two clients, Foo and Bar. If the network is temporarily partitioned so that client Foo can only see server A and client Bar can only see server B, the cluster is in a split brain situation, and inconsistent data may be written to server A and server B. When the network returns to normal, files and directories are healed by replicating the copy with the highest version number of each changed object to the other nodes, which may or may not yield valid application data. Moreover, if a file or directory was written the same number of times on each side of the partition, it will have the same version number on each subvolume and will not undergo any healing, leaving an inconsistent, unhealable view of the data on each subvolume.
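The two outcomes described above can be reproduced with a toy model in Python. It is only a sketch: replicas are plain dictionaries, the helper names are hypothetical, and healing is reduced to comparing version numbers.

```python
def write(replica, data):
    # Each write replaces the content and bumps the version, as on close().
    replica["data"] = data
    replica["version"] += 1


def heal_source(a, b):
    """Return the name of the replica that would overwrite the other, or None."""
    if a["version"] > b["version"]:
        return a["name"]
    if b["version"] > a["version"]:
        return b["name"]
    return None  # equal versions: nothing is healed


# Case 1: one write on each side of the partition.
server_a = {"name": "A", "version": 3, "data": "common"}
server_b = {"name": "B", "version": 3, "data": "common"}
write(server_a, "written by Foo")
write(server_b, "written by Bar")
print(heal_source(server_a, server_b))         # None
print(server_a["data"] == server_b["data"])    # False: contents differ, yet no heal

# Case 2: Foo writes twice, Bar once. A has the higher version, so Bar's write is lost.
server_c = {"name": "A", "version": 3, "data": "common"}
server_d = {"name": "B", "version": 3, "data": "common"}
write(server_c, "Foo 1")
write(server_c, "Foo 2")
write(server_d, "Bar 1")
print(heal_source(server_c, server_d))         # A
```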
The current AFR implementation can also show split-brain-like symptoms without any network or node failure. Currently there is no mechanism to ensure that each AFR subvolume receives writes from every client in the same order. Different writers writing to the same section of the same file at the same time may therefore have their writes applied in different orders on different subvolumes, which can result in inconsistent views of the same file on different subvolumes. GlusterFS v1.4 has addressed this split brain issue.
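A short Python sketch shows why write ordering matters. The apply_writes helper is hypothetical and models each subvolume's copy as a byte buffer; two clients write to the same region, and the two subvolumes happen to apply the writes in opposite orders.

```python
def apply_writes(initial, writes):
    """Apply (offset, data) writes in order to a copy of `initial`."""
    buf = bytearray(initial)
    for offset, data in writes:
        buf[offset:offset + len(data)] = data
    return bytes(buf)


write_from_foo = (0, b"AAAA")
write_from_bar = (0, b"BBBB")

# Both subvolumes receive both writes, but in a different order.
brick1 = apply_writes(b"....", [write_from_foo, write_from_bar])
brick2 = apply_writes(b"....", [write_from_bar, write_from_foo])

print(brick1)            # b'BBBB'
print(brick2)            # b'AAAA'
print(brick1 == brick2)  # False
```

Both copies went through the same number of writes, so in this model a version counter alone would not reveal the difference.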
Existing Data
It is possible to begin using AFR with preexisting mirrored copies of data on each subvolume. This, however, is a hack and is not the recommended way to use AFR. For more information, see: AFR with Existing Data
Examples
Single Brick and Single Mirror
Master Brick:
### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/export        # Export this directory
end-volume

volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
end-volume
Mirror Brick:
### Export volume "brick" with the contents of "/home/mirror-export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/mirror-export # Export this directory
end-volume

volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
end-volume
Client:
volume brick
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

volume brick-afr
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.11      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

### Add AFR feature to brick
volume afr
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick brick-afr
end-volume
Clustered Mode
A clustered file system of two bricks, each mirroring the other through AFR. The two server files are the same except that Brick 1 also exports the brick-ns namespace volume used by unify.
Brick 1:
### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/export        # Export this directory
end-volume

### Export volume "brick-afr" with the contents of "/home/afr-export" directory.
volume brick-afr
  type storage/posix                   # POSIX FS translator
  option directory /home/afr-export    # Export this directory
end-volume

### Export volume "brick-ns" with the contents of "/home/namespace" directory.
volume brick-ns
  type storage/posix
  option directory /home/namespace
end-volume

### Add network serving capability to above bricks.
volume server
  type protocol/server
  subvolumes brick brick-afr brick-ns
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
  option auth.ip.brick-afr.allow *     # access to "brick-afr" volume
  option auth.ip.brick-ns.allow *      # access to "brick-ns" volume
end-volume
Brick 2:
### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/export        # Export this directory
end-volume

### Export volume "brick-afr" with the contents of "/home/afr-export" directory.
volume brick-afr
  type storage/posix                   # POSIX FS translator
  option directory /home/afr-export    # Export this directory
end-volume

### Add network serving capability to above bricks.
volume server
  type protocol/server
  subvolumes brick brick-afr
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
  option auth.ip.brick-afr.allow *     # access to "brick-afr" volume
end-volume
Client:
### Add client feature and attach to remote subvolume "brick" of server 1
volume brick1
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

### Add client feature and attach to remote subvolume "brick-afr" of server 1
volume brick1-afr
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick-afr    # name of the remote volume
end-volume

### Add client feature and attach to remote subvolume "brick" of server 2
volume brick2
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.11      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

### Add client feature and attach to remote subvolume "brick-afr" of server 2
volume brick2-afr
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.11      # IP address of the remote brick
  option remote-subvolume brick-afr    # name of the remote volume
end-volume

### Add AFR feature to brick1, mirrored on server 2
volume afr1
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick1 brick2-afr
end-volume

### Add AFR feature to brick2, mirrored on server 1
volume afr2
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick2 brick1-afr
end-volume

### Namespace volume for unify
volume brick-ns
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick-ns     # name of the remote volume
end-volume

### Add unify feature to cluster the servers. Associate an
### appropriate scheduler that matches your I/O demand.
volume bricks
  type cluster/unify
  subvolumes afr1 afr2
  option scheduler rr
  option namespace brick-ns
  option rr.limits.min-free-disk 5%
end-volume

