Understanding AFR Translator

AFR provides RAID-1-like functionality: it replicates files and directories across its subvolumes. Hence, if AFR has four subvolumes, there will be four copies of every file and directory. AFR provides high availability (HA), i.e., if one of the subvolumes goes down (e.g., a server crash or network disconnection), AFR will still service requests from the redundant copies. Be sure to read the section on split brain below, though, to understand some of the potential pitfalls of this operation.

AFR also provides healing functionality, in case of inconsistent files and directories across subvolumes. This healing allows for recovery after the network or a node was unavailable. During healing, the outdated files and directories will be updated with the latest versions. AFR uses extended attributes of the backend file system to track the versioning of files and directories to provide the healing feature.

Configuration

This sample configuration will replicate all directories and files on brick1, brick2 and brick3. Each subvolume can be another translator (for example, storage/posix or protocol/client).

volume afr-example
  type cluster/afr
  option replicate *:3                 # required for rev <= 1.3.7
  subvolumes brick1 brick2 brick3
end-volume

Note: For releases 1.3.7 and older, "option replicate *html:2,*txt:1" pattern matching feature is required in order for replication to work. Releases after 1.3.7 have moved this pattern matching feature out of AFR. It should be implemented using unify's switch.case scheduler.


read() operations are scheduled across the subvolumes for load balancing. If one of the subvolumes is local storage, it is advantageous to do all reads from that subvolume; in that case, set "option read-subvolume brick2", where brick2 is the local storage subvolume. "option read-subvolume *" schedules reads across all the children, which is the default behavior. A given file will always be read from the same subvolume, so that reads take advantage of caching on the servers.
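
The read scheduling described above can be modelled with a short Python sketch. This is only an illustration: the function name pick_read_subvolume and the use of a CRC32 hash of the path are assumptions made here, not AFR's actual selection method. The sketch only shows the two properties stated above: a pinned subvolume serves every read, and with "*" a given file always maps to the same subvolume.

```python
import zlib


def pick_read_subvolume(path, subvolumes, read_subvolume="*"):
    """Return the subvolume that serves reads for `path` (illustrative model)."""
    if read_subvolume != "*":
        # A pinned subvolume (for example the local one) serves every read.
        if read_subvolume not in subvolumes:
            raise ValueError(f"unknown subvolume: {read_subvolume}")
        return read_subvolume
    # Assumption: a stable hash of the path picks the child, so the same
    # file is always read from the same subvolume.
    index = zlib.crc32(path.encode("utf-8")) % len(subvolumes)
    return subvolumes[index]


bricks = ["brick1", "brick2", "brick3"]
first = pick_read_subvolume("/docs/foo.txt", bricks)
second = pick_read_subvolume("/docs/foo.txt", bricks)
pinned = pick_read_subvolume("/docs/foo.txt", bricks, read_subvolume="brick2")
print(first == second, pinned)
```

Any deterministic mapping from file to subvolume would give the same caching benefit; the hash is just the simplest one to show.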


Healing

AFR has a healing feature, which replaces outdated file and directory copies with the most recent versions. The built-in healing feature of AFR uses a lazy healing algorithm: files and directories are not healed until they are requested.

Extended Attributes

In order to support healing, AFR uses extended attributes of the backend file system to track the versioning of files and directories. It is thus important to make sure that your backend filesystem is compiled with extended attribute support and that your subvolume mounts enable extended attributes when using the AFR translator.

For example consider the following config:

volume afr-example
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick1 brick2
end-volume

File healing:

Now if we create a file foo.txt on afr-example, the file will be created on brick1 and brick2. The file will have two extended attributes associated with it in the backend filesystem: trusted.glusterfs.createtime and trusted.glusterfs.version. The trusted.glusterfs.createtime xattr holds the create time (in seconds since the epoch), and trusted.glusterfs.version is a number that is incremented each time the file is modified. This increment happens during close() (in case any write was done before the close).

Suppose brick1 goes down and we edit foo.txt, so the version on brick2 gets incremented. When brick1 comes back up and we open() foo.txt, AFR checks whether the versions of the copies are the same. If they are not, the outdated copy is replaced by the latest copy and its version is updated. After the sync, the open() proceeds in the usual manner and the application calling open() can continue to access the file.

Now suppose brick1 goes down, and we delete foo.txt and then create a new file with the same name, foo.txt. When brick1 comes back up, there is a chance that the version on brick1 is higher than the version on brick2. This is where the createtime extended attribute helps in deciding which copy is outdated. Hence we need to consider both createtime and version to decide on the latest copy.

The version attribute is incremented during the close() call. The version is not incremented if no write() was done. If the fd passed to close() was obtained from a create() call, we also set the createtime extended attribute.
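
The two scenarios above can be sketched in Python. This is a simplified model, not AFR's code: the Copy class and the function names are hypothetical, and the ordering rule (compare createtime first, then version) is an assumption drawn from the statement that both attributes must be considered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    subvolume: str
    createtime: int  # models trusted.glusterfs.createtime (seconds since epoch)
    version: int     # models trusted.glusterfs.version


def stamp(copy):
    # Assumed ordering: a newer createtime wins; version breaks ties.
    return (copy.createtime, copy.version)


def latest_copy(copies):
    """Return the copy treated as the most recent one."""
    return max(copies, key=stamp)


def outdated_copies(copies):
    """Return the copies whose attributes differ from the latest copy's."""
    newest = stamp(latest_copy(copies))
    return [c for c in copies if stamp(c) != newest]


# Scenario 1: brick1 was down while foo.txt was edited on brick2.
edited = [Copy("brick1", 1000, 3), Copy("brick2", 1000, 4)]

# Scenario 2: brick1 was down while foo.txt was deleted and recreated.
# brick1 still holds the old file with a higher version number.
recreated = [Copy("brick1", 1000, 7), Copy("brick2", 2000, 1)]

print(latest_copy(edited).subvolume)                     # brick2
print(latest_copy(recreated).subvolume)                  # brick2
print([c.subvolume for c in outdated_copies(recreated)])  # ['brick1']
```

In the second scenario, comparing versions alone would pick the stale copy on brick1; the createtime comparison is what prevents that.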

Directory healing:

Suppose brick1 goes down, we delete foo.txt, and brick1 comes back up. Now we should not recreate foo.txt on brick2; instead we should delete foo.txt on brick1. We handle this situation by keeping createtime and version attributes on the directory, as we do for files. When lookup() is done on the directory, we compare the createtime/version attributes of the copies, determine which files need to be deleted, delete them, and update the extended attributes of the outdated directory copy. Each time a directory is modified (a file or a subdirectory is created or deleted inside it) while one of the subvolumes is down, we increment the directory's version.
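
The directory case can be sketched the same way. The model below is an assumption-laden illustration: the function heal_directory and the dictionary layout are hypothetical, and only the deletion step described above is modelled (entries missing from an outdated copy are not recreated here).

```python
def heal_directory(copies):
    """Remove stale entries from outdated directory copies.

    `copies` maps a subvolume name to a dict with the keys "createtime",
    "version" and "entries" (a set of names). Returns a dict mapping each
    healed subvolume to the sorted list of entries deleted from it.
    """
    def stamp(name):
        return (copies[name]["createtime"], copies[name]["version"])

    newest_name = max(copies, key=stamp)
    newest = copies[newest_name]
    deleted = {}
    for name, copy in copies.items():
        if stamp(name) == stamp(newest_name):
            continue  # this copy is already up to date
        stale = copy["entries"] - newest["entries"]
        copy["entries"] -= stale
        # Bring the outdated copy's attributes in line with the latest copy.
        copy["createtime"] = newest["createtime"]
        copy["version"] = newest["version"]
        deleted[name] = sorted(stale)
    return deleted


# brick1 was down while foo.txt was deleted, so brick2's directory version
# was incremented and brick1 still lists foo.txt.
directory = {
    "brick1": {"createtime": 1000, "version": 1, "entries": {"foo.txt", "bar.txt"}},
    "brick2": {"createtime": 1000, "version": 2, "entries": {"bar.txt"}},
}
result = heal_directory(directory)
print(result)  # {'brick1': ['foo.txt']}
```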

lookup() is a call initiated by the kernel on a file or directory just before any access to that file or directory. In glusterfs, by default, lookup() will not be called in case it was called in the past one second on that particular file or directory.

The extended attributes can be seen in the backend filesystem using the getfattr command. (getfattr -n trusted.glusterfs.version <file>)

Pitfalls with healing:

Since the healing mechanism relies on timestamps to decide which subvolumes have the latest and most up-to-date version of files/directories, it is important that all nodes in a cluster have a tightly synchronized clock. One mechanism to achieve this is to use NTP. If node times are not in sync it is possible for healing to make the wrong decisions as to which objects are the latest.

Preemptive Self-Heal

Currently AFR does not do active, preemptive self-heal; it is limited to lazy healing. That is, it won't fix all inconsistencies automatically; instead, it fixes the inconsistencies of a file when that file is opened. Hence, to make sure all AFR'd copies are in sync, the following command may help.

$ find /mnt/glusterfs -type f -exec head -n 1 {} \;

A faster healing solution could be

$ find /mnt/glusterfs -type f -print0 | xargs -0 head -c1 >/dev/null
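
The same idea, opening every regular file and reading one byte, can be written as a Python sketch for cases where per-file error reporting is wanted. The function name trigger_self_heal is hypothetical, and the mount point /mnt/glusterfs is taken from the commands above; whether a one-byte read is enough to trigger healing depends on the open() behaviour described earlier.

```python
import os


def trigger_self_heal(mount_point):
    """Open every regular file under mount_point and read one byte.

    Returns (number of files read, list of (path, error message) failures).
    """
    touched = 0
    failed = []
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Like "find -type f": skip symlinks and non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    handle.read(1)
                touched += 1
            except OSError as err:
                failed.append((path, str(err)))
    return touched, failed


if __name__ == "__main__":
    count, errors = trigger_self_heal("/mnt/glusterfs")
    print(f"read {count} files, {len(errors)} failures")
    for path, message in errors:
        print(f"  {path}: {message}")
```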


Split Brain

Split brain operation is when some nodes of a cluster are reading from and writing to a different dataset than other nodes, but both sets of nodes believe that they are writing to the same authoritative dataset. Imagine a simple cluster scenario with two servers, A & B, and two clients, Foo & Bar. If this cluster becomes temporarily network segregated, so that client Foo can only see server A and client Bar can only see server B, it will be a split brain situation. This split brain situation may lead to inconsistent data being written to server A and server B. When the network returns to normal, files/directories will be healed by replicating the highest version # of each changed object to the other nodes, which may or may not lead to valid application data results. Also, if a file/directory was written to the same number of times by each segregated node set, it will have the same version # on each subvolume and will not undergo any healing, leading to an inconsistent, unhealable view of the data on each subvolume.
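
Both outcomes can be reproduced with a small Python model. It is a sketch only: the Replica class and the heal function are hypothetical, the createtime attribute is left out for brevity, and the starting version 5 is an arbitrary value.

```python
class Replica:
    """One subvolume's copy of a file, with its version counter."""

    def __init__(self, name, content, version):
        self.name = name
        self.content = content
        self.version = version

    def write_and_close(self, content):
        # The version is incremented on close() after a write.
        self.content = content
        self.version += 1


def heal(replicas):
    """Copy the highest-version content to every replica.

    Returns the name of the source replica, or None (changing nothing)
    when all versions are already equal.
    """
    source = max(replicas, key=lambda r: r.version)
    if all(r.version == source.version for r in replicas):
        return None
    for replica in replicas:
        replica.content = source.content
        replica.version = source.version
    return source.name


# Case 1: Foo writes twice to A, Bar writes once to B during the partition.
a, b = Replica("A", "base", 5), Replica("B", "base", 5)
a.write_and_close("foo-1")
a.write_and_close("foo-2")
b.write_and_close("bar-1")
case1_source = heal([a, b])
case1_contents = (a.content, b.content)

# Case 2: each side writes exactly once, so the versions end up equal.
c, d = Replica("A", "base", 5), Replica("B", "base", 5)
c.write_and_close("foo-1")
d.write_and_close("bar-1")
case2_source = heal([c, d])
case2_contents = (c.content, d.content)

print(case1_source, case1_contents)  # A ('foo-2', 'foo-2')
print(case2_source, case2_contents)  # None ('foo-1', 'bar-1')
```

In the first case Bar's write is silently discarded; in the second the copies stay different and nothing flags them as needing a heal.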

There are some other potential ways for the current AFR implementation to incur 'split brain' like symptoms even without any network/node failures. Currently there is no mechanism to ensure that each AFR subvolume receives writes from each client writer in the same order. This could lead to different writers writing to the same section of the same file at the same time, with their writes completing in different orders on different subvolumes. This can potentially result in inconsistent views of the same file on different subvolumes. GlusterFS v1.4 has addressed this split brain issue.
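
The ordering problem is easy to see in a minimal Python sketch. The function apply_writes and the sample payloads are made up for illustration; the sketch models each subvolume as a byte buffer that applies overlapping writes in the order it receives them.

```python
def apply_writes(initial, writes):
    """Apply (offset, payload) writes in order and return the final bytes."""
    data = bytearray(initial)
    for offset, payload in writes:
        data[offset:offset + len(payload)] = payload
    return bytes(data)


# Two clients write to the same section of the same file at the same time.
write_foo = (0, b"AAAA")
write_bar = (0, b"BBBB")

# Without ordering guarantees, each subvolume may see a different order.
brick1 = apply_writes(b"----", [write_foo, write_bar])
brick2 = apply_writes(b"----", [write_bar, write_foo])

print(brick1, brick2)    # b'BBBB' b'AAAA'
print(brick1 == brick2)  # False
```

Both subvolumes received the same two writes, and in this model each would record the same number of modifications, yet their contents differ.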

Existing Data

It is possible to begin using AFR with preexisting mirrored copies of data on each subvolume. This however is a hack and is not the recommended method of using AFR. For more info, see: AFR with Existing Data


Examples

Single Brick and Single Mirror

Master Brick:

### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/export        # Export this directory
end-volume

volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
end-volume

Mirror Brick:

### Export volume "brick" with the contents of "/home/mirror-export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/mirror-export # Export this directory
end-volume

volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
end-volume

Client:

volume brick
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

volume brick-afr
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.11      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

### Add AFR feature to brick
volume afr
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick brick-afr
end-volume

Clustered Mode

A clustered file system of two bricks, each mirroring the other with AFR. (The files are the same except for the commented IP address.)

Brick 1:

### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/export        # Export this directory
end-volume

### Export volume "brick-afr" with the contents of "/home/afr-export" directory.
volume brick-afr
  type storage/posix                   # POSIX FS translator
  option directory /home/afr-export    # Export this directory
end-volume

volume brick-ns
  type storage/posix
  option directory /home/namespace
end-volume

### Add network serving capability to above brick.
volume server
  type protocol/server
  subvolumes brick brick-afr brick-ns
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
  option auth.ip.brick-afr.allow *     # access to "brick-afr" volume
  option auth.ip.brick-ns.allow *      # access to "brick-ns" volume
end-volume

Brick 2:

### Export volume "brick" with the contents of "/home/export" directory.
volume brick
  type storage/posix                   # POSIX FS translator
  option directory /home/export        # Export this directory
end-volume

### Export volume "brick-afr" with the contents of "/home/afr-export" directory.
volume brick-afr
  type storage/posix                   # POSIX FS translator
  option directory /home/afr-export    # Export this directory
end-volume

### Add network serving capability to above brick.
volume server
  type protocol/server
  subvolumes brick brick-afr
  option transport-type tcp/server     # For TCP/IP transport
  option auth.ip.brick.allow *         # access to "brick" volume
  option auth.ip.brick-afr.allow *     # access to "brick-afr" volume
end-volume

Client:

### Add client feature and attach to remote subvolume of server1
volume brick1
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

### Add client feature and attach to remote subvolume "brick-afr" of server1
volume brick1-afr
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick-afr    # name of the remote volume
end-volume

### Add client feature and attach to remote subvolume "brick" of server2
volume brick2
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.11      # IP address of the remote brick
  option remote-subvolume brick        # name of the remote volume
end-volume

### Add client feature and attach to remote subvolume "brick-afr" of server2
volume brick2-afr
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.11      # IP address of the remote brick
  option remote-subvolume brick-afr    # name of the remote volume
end-volume

### Add AFR feature to brick1
volume afr1
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick1 brick2-afr
end-volume

### Add AFR feature to brick2
volume afr2
  type cluster/afr
  option replicate *:2                 # required for rev <= 1.3.7
  subvolumes brick2 brick1-afr
end-volume

### Namespace volume for unify
volume brick-ns
  type protocol/client
  option transport-type tcp/client     # for TCP/IP transport
  option remote-host 192.168.1.10      # IP address of the remote brick
  option remote-subvolume brick-ns     # name of the remote volume
end-volume

### Add unify feature to cluster the servers. Associate an
### appropriate scheduler that matches your I/O demand.
volume bricks
  type cluster/unify
  subvolumes afr1 afr2
  option scheduler rr
  option namespace brick-ns
  option rr.limits.min-free-disk 5%
end-volume