High-availability storage using server-side AFR
From GlusterDocumentation
Introduction
In this howto we will set up an HA cluster using two storage nodes and two clients; however, because this configuration scales easily, the number of storage and client nodes can be increased with little effort.
A basic knowledge of DNS, Linux administration, and network topology is assumed.
High-Availability
The idea of a high-availability cluster is simple: as long as at least one of the storage nodes is functional, the data should remain accessible to the clients. In a Gluster-based environment, this can be accomplished using a combination of server-side data replication and round-robin DNS addressing. For performance purposes, storage-related traffic can be moved to a physically separate high-speed network - an obvious enhancement to any HA environment.
Network
Starting from the bottom and working our way up is the best approach to understanding this configuration; thus, we shall begin with a description of the physical network. In a typical environment, the storage cluster will be used solely for storing and serving data to clients, while the clients may engage in any number of diverse activities. Commonly, the clients are themselves responsible for delivering other network services, such as web, email, or FTP.
Given this premise, it is easy to imagine a scenario in which a number of (say) web servers need to ensure that:
- Requests to and from the storage cluster are fast and responsive
- Normal web-serving traffic is not affected by the storage requests
Storage Network
One natural solution to this problem is to segregate the different types of traffic onto different physical networks. Each of the web servers would therefore have two network interfaces: one for the "storage network", and another for regular traffic (HTTP, SSH for administration, etc.). Consider the following graphic:
In the above scenario, regular traffic is carried via an (unspecified) "regular network" on eth0 using the 10.0.0.0 address space, and storage traffic is carried via a dedicated storage network on eth1 using the 192.168.0.0 address space. None of the machines are set up as bridges, so there is no way for the two networks to talk to each other and they remain segregated. The storage network can be set up using commodity Gigabit Ethernet hardware, or Fibre Channel for those with the budget and inclination.
In order to keep things tidy, the addressing scheme of the storage network mimics that of the regular network: for example, the first web server (www1) is assigned 10.0.0.10 on the regular network, and 192.168.0.10 on the storage network.
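The mirroring convention can be written down as a one-line transformation. The sketch below is only an illustration: it assumes both networks are /24 and uses the two prefixes from this howto, while the function name storage_addr and the sample address are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a regular-network address to its storage-network
# counterpart by keeping the last octet and swapping the network prefix.
# Assumes both networks are /24, as in this howto.
storage_addr() {
    host_octet=${1##*.}                  # everything after the final dot
    printf '192.168.0.%s\n' "$host_octet"
}

# Sample (hypothetical) regular-network address.
storage_addr 10.0.0.42
```

Keeping the host octet identical on both networks means an administrator can tell which machine a storage address belongs to without consulting DNS.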
DNS
The storage network requires basic DNS resolution, and thus a private zone would be set up for just this purpose. The DNS server responsible for the zone needs to be accessible by the members of the storage network (obviously), but need not be attached to the storage network itself. In this howto, the storage network zone is called "storagenet.gfs", with each member of the storage network assigned the same hostname as on the general network; for example, the www1(.general.net) server on the general network remains www1(.storagenet.gfs) on the storage network.
Thus, while querying the properly-configured DNS server, one would see the following results:
    $ host www1.general.net
    www1.general.net has address 10.0.0.10
    $ host www1.storagenet.gfs
    www1.storagenet.gfs has address 192.168.0.10
Further discussion on the setup and configuration of a basic DNS server is outside of the scope of this document, and left as an exercise to the reader.
Round-Robin DNS
A key component of the HA setup described by this document is round-robin DNS (RRDNS). Though it is used in only one instance, it is a critical function - one which helps to ensure that the data can be served continuously even in the event that one of the storage servers becomes inaccessible. In a basic Gluster configuration the clients are told to access servers via their IP addresses; while functional, this has the drawback of causing the data to become inaccessible if the IP address cannot be reached (e.g. the server dies). This problem is mitigated by using a single hostname for both of the servers, as in the diagram to the right.
Consider the following results:
    $ host storage1.storagenet.gfs
    storage1.storagenet.gfs has address 192.168.0.110
    $ host storage2.storagenet.gfs
    storage2.storagenet.gfs has address 192.168.0.111
    $ host cluster.storagenet.gfs
    cluster.storagenet.gfs has address 192.168.0.110
    cluster.storagenet.gfs has address 192.168.0.111
    $ dig cluster.storagenet.gfs | grep -A 2 "ANSWER SECTION"
    ;; ANSWER SECTION:
    cluster.storagenet.gfs. 3600 IN A 192.168.0.110
    cluster.storagenet.gfs. 3600 IN A 192.168.0.111
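One way to confirm that the round-robin name covers every storage node is to compare the addresses it returns with those of the individual servers. The following is a minimal sketch rather than part of the original setup: the helper names addresses and rr_covers_nodes are hypothetical, and the parsing assumes the "has address" output format of host shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `host` output on stdin and print the
# addresses it contains, one per line, sorted.
addresses() {
    awk '/has address/ { print $NF }' | sort
}

# Hypothetical helper: succeed only when the round-robin name ($1)
# resolves to exactly the same set of addresses as the node names
# given in the remaining arguments.
rr_covers_nodes() {
    rr=$1
    shift
    rr_set=$(host "$rr" | addresses)
    node_set=$(for n in "$@"; do host "$n"; done | addresses)
    [ -n "$rr_set" ] && [ "$rr_set" = "$node_set" ]
}

# Usage (requires a resolver that serves the storagenet.gfs zone):
#   rr_covers_nodes cluster.storagenet.gfs \
#       storage1.storagenet.gfs storage2.storagenet.gfs && echo "RRDNS OK"
```

Sorting both sets before comparing them matters, because a round-robin DNS server may return its records in a different order on each query.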
Briefly stated, the Gluster clients will be aware of multiple servers (in this case, two) instead of just one. In this fashion, when one of the storage nodes becomes inaccessible, the clients will use the other automatically - exactly how this works is explored in the Gluster section below. For now, consider the following diagram, which shows the network with some additional DNS-level information:
Gluster
Now that the network and DNS architectures are well understood, we can move on to the Gluster configuration. As with the previous sections, the Gluster configuration files are relatively straightforward; the key element to be aware of is the Automatic File Replication (or "AFR") translator, which will be discussed below.
The basic premise is that replication and related functions are handled entirely by the storage servers: each server runs the AFR translator itself, and the clients simply mount the exported volume. This is important to note, as many of the examples available on the wiki at large put these functions on the clients instead (about which much discussion has been generated on the mailing list).
AFR
The AFR translator replicates files and directories automatically across the volumes named on its subvolumes line, so that each of those subvolumes holds an identical copy of the data. In this scenario, AFR is used to ensure that both of the storage servers contain the same data at all times.
Server Config
The server configuration files on storage1 and storage2 are nearly identical to each other.
TODO : discuss transport-timeout
storage1
    [user@storage1 ~]$ cat /etc/glusterfs/glusterfs-server.vol
    ##############################################
    ### GlusterFS Server Volume Specification ##
    ##############################################

    # dataspace on storage1
    volume gfs-ds
      type storage/posix
      option directory /opt/gfs-ds
    end-volume

    # posix locks
    volume gfs-ds-locks
      type features/posix-locks
      subvolumes gfs-ds
    end-volume

    # dataspace on storage2
    volume gfs-storage2-ds
      type protocol/client
      option transport-type tcp/client
      option remote-host 192.168.0.111         # storage network
      option remote-subvolume gfs-ds-locks
      option transport-timeout 10              # value in seconds; it should be set relatively low
    end-volume

    # automatic file replication translator for dataspace
    volume gfs-ds-afr
      type cluster/afr
      subvolumes gfs-ds-locks gfs-storage2-ds  # local and remote dataspaces
    end-volume

    # the actual exported volume
    volume gfs
      type performance/io-threads
      option thread-count 8
      option cache-size 64MB
      subvolumes gfs-ds-afr
    end-volume

    # finally, the server declaration
    volume server
      type protocol/server
      option transport-type tcp/server
      subvolumes gfs
      # storage network access only
      option auth.ip.gfs-ds-locks.allow 192.168.0.*,127.0.0.1
      option auth.ip.gfs.allow 192.168.0.*
    end-volume
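Because each volume builds on the ones named in its subvolumes line, it can help to print the stack in condensed form. The sketch below is an illustration only: volfile_stack is a hypothetical helper, not a GlusterFS tool, and it assumes the simple one-keyword-per-line layout used in this howto.

```shell
#!/bin/sh
# Hypothetical helper: summarise a volfile as one line per volume,
# in the form "name (type) -> subvolumes".
volfile_stack() {
    awk '
        $1 == "volume"     { name = $2; type = ""; subs = "" }
        $1 == "type"       { type = $2 }
        $1 == "subvolumes" {
            subs = ""
            # collect names up to the end of line or a trailing comment
            for (i = 2; i <= NF && $i !~ /^#/; i++)
                subs = subs (subs ? " " : "") $i
        }
        $1 == "end-volume" {
            printf "%s (%s)%s\n", name, type, (subs ? " -> " subs : "")
        }
    ' "$1"
}

# Summarise the server volfile if it exists on this machine.
if [ -r /etc/glusterfs/glusterfs-server.vol ]; then
    volfile_stack /etc/glusterfs/glusterfs-server.vol
fi
```

Read from top to bottom, the summary shows the local dataspace wrapped by posix locks, paired with the remote dataspace under AFR, and finally exported through io-threads and the server volume.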
storage2
    [user@storage2 ~]$ cat /etc/glusterfs/glusterfs-server.vol
    ##############################################
    ### GlusterFS Server Volume Specification ##
    ##############################################

    # dataspace on storage2
    volume gfs-ds
      type storage/posix
      option directory /opt/gfs-ds
    end-volume

    # posix locks
    volume gfs-ds-locks
      type features/posix-locks
      subvolumes gfs-ds
    end-volume

    # dataspace on storage1
    volume gfs-storage1-ds
      type protocol/client
      option transport-type tcp/client
      option remote-host 192.168.0.110         # storage network
      option remote-subvolume gfs-ds-locks
      option transport-timeout 10              # value in seconds; it should be set relatively low
    end-volume

    # automatic file replication translator for dataspace
    volume gfs-ds-afr
      type cluster/afr
      subvolumes gfs-ds-locks gfs-storage1-ds  # local and remote dataspaces
    end-volume

    # the actual exported volume
    volume gfs
      type performance/io-threads
      option thread-count 8
      option cache-size 64MB
      subvolumes gfs-ds-afr
    end-volume

    # finally, the server declaration
    volume server
      type protocol/server
      option transport-type tcp/server
      subvolumes gfs
      # storage network access only
      option auth.ip.gfs-ds-locks.allow 192.168.0.*,127.0.0.1
      option auth.ip.gfs.allow 192.168.0.*
    end-volume
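Since the two server volfiles are meant to differ only where they refer to the peer, a quick comparison can catch copy-and-paste mistakes. The following is a rough sketch; the helper name volfile_delta and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines that differ between two
# volfiles, prefixed with "<" (first file) or ">" (second file).
volfile_delta() {
    diff "$1" "$2" | grep '^[<>]'
}

# Usage, assuming both server volfiles have been copied to one machine
# under these (hypothetical) names:
#   volfile_delta storage1-server.vol storage2-server.vol
```

For the listings in this howto, the differences should be confined to the two "dataspace on ..." comments, the name of the protocol/client volume, its remote-host address, and the subvolumes line of the AFR volume.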
Client Config
The client configuration is very simple and, in fact, identical on each client. It is in this configuration that the RRDNS hostname comes into play - the remote-host is, in this case, defined as cluster.storagenet.gfs. When the Gluster client process does a lookup on cluster.storagenet.gfs, it will store both responses in its cache, then randomly choose one to actually use. If that server becomes inaccessible, the Gluster client will wait for the period of time defined by transport-timeout, then automatically attempt to use the other response in the cache. See this thread from the Gluster mailing list for more information.
In this fashion, the client performs a failover from the non-functional server to the functional one, thus ensuring that services are not interrupted for long. Whether the storage cluster has two nodes (as in this example), or two hundred (oh my!), the failover process is identical.
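The failover behaviour described above can be modelled in a few lines of shell. This is an illustrative model only, not GlusterFS code: first_reachable and storage1_down are hypothetical names, and the probe stands in for a connection attempt that waits at most transport-timeout seconds.

```shell
#!/bin/sh
# Illustrative model only; this is not GlusterFS code.

# first_reachable PROBE START ADDR...
# PROBE is a command that succeeds if the given address answers.
# START is the index of the cached address to try first; the real
# client is described as choosing it at random.
first_reachable() {
    probe=$1
    start=$2
    shift 2
    # Rotate the cached list so that it begins at index START.
    i=0
    while [ "$i" -lt "$start" ]; do
        first_addr=$1
        shift
        set -- "$@" "$first_addr"
        i=$((i + 1))
    done
    # Walk the list and report the first address that answers.
    for addr in "$@"; do
        if "$probe" "$addr"; then
            printf '%s\n' "$addr"
            return 0
        fi
    done
    return 1    # none of the cached addresses answered
}

# Hypothetical probe that simulates storage1 (192.168.0.110) being down.
storage1_down() { [ "$1" != "192.168.0.110" ]; }

# Whichever address is tried first, the client ends up on storage2.
first_reachable storage1_down 0 192.168.0.110 192.168.0.111
first_reachable storage1_down 1 192.168.0.110 192.168.0.111
```

Because the loop simply walks the cached list, the same logic applies unchanged to any number of addresses behind the round-robin name.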
It is worth noting that this process is totally automatic, which is a good thing when it happens at 04:00 on Sunday morning!
www1
    [user@www1 ~]$ cat /etc/glusterfs/glusterfs-client.vol
    #############################################
    ## GlusterFS Client Volume Specification ##
    #############################################

    # the exported volume to mount                    # required!
    volume cluster
      type protocol/client
      option transport-type tcp/client
      option remote-host cluster.storagenet.gfs       # RRDNS
      option remote-subvolume gfs                     # exported volume
      option transport-timeout 10                     # value in seconds, should be relatively low
    end-volume

    # performance block for cluster                   # optional!
    volume writeback
      type performance/write-behind
      option aggregate-size 131072
      subvolumes cluster
    end-volume

    # performance block for cluster                   # optional!
    volume readahead
      type performance/read-ahead
      option page-size 65536
      option page-count 16
      subvolumes writeback
    end-volume
Conclusion
Questions or comments should be directed to the GlusterFS mailing list.
Phrawzty 07:37, 2 May 2008 (PDT)




