You are operating a busy GlusterFS cluster and for whatever reason the volume data gets corrupted. Luckily, you have been backing up the underlying bricks so you are able to restore the bricks to a usable state, but now how do you run a comparison to determine the differences between the “corrupted” volume and the restored volume?
It turns out this particular scenario doesn’t appear to be well documented anywhere else and having recently walked through this process doing some testing with GlusterFS I wanted to share my solution and solicit any feedback from the GlusterFS community on it.
Now, on to the solution!
The solution has several steps that I will outline here in the next several paragraphs. Excluding the first set of steps that specifically call out the need to individually address each server, each set of instructions will need to be implemented on each server.
For now I will assume the servers are offline following a restore. Pick a single server and power it on (important: do not power on the other server yet). Once the server has booted, stop the GlusterFS services:
# service glusterd stop
Stopping glusterd: [ OK ]
# chkconfig glusterd off
# chkconfig --list | grep gluster
glusterd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
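Depending on your version, stopping the glusterd service may leave brick (glusterfsd) and client (glusterfs) processes running. A quick check before proceeding (pgrep/pkill are standard tools; adjust the patterns to your environment):

# pgrep -lf gluster
# pkill glusterfsd
# pkill glusterfs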
Once the GlusterFS services are stopped and prevented from starting on boot for the first server, boot the second server and perform the same steps.
From now on each step will need to be performed on each server individually and can be done in parallel or serially.
In my setup I used LVM to create the GlusterFS bricks. If this does not apply to your configuration you may skip the LVM steps.
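For context, here is a minimal sketch of how such a brick might be laid out with LVM. The device, volume group, logical volume name, and size below are illustrative, not taken from my actual setup (the -i size=512 inode size is the commonly recommended XFS setting for Gluster bricks):

# pvcreate /dev/sdb1
# vgcreate vg_gluster /dev/sdb1
# lvcreate -n lv_home_01 -L 100G vg_gluster
# mkfs.xfs -i size=512 /dev/vg_gluster/lv_home_01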
First we need to see the LVM status so we can be sure what volumes we are working with.
# vgscan
Reading all physical volumes. This may take a while...
Found duplicate PV ukZdmURSMtZjMu0AdSwl8MBU07kQsSKk: using /dev/sdg1 not /dev/sdb1
Found duplicate PV gOQvGcZimfuQD397EYH1pOlgf6M55qN8: using /dev/sdh1 not /dev/sdc1
Found duplicate PV Iu9Rwf1jzfcyNJxFoteyw3DDpI9yyi6N: using /dev/sdi1 not /dev/sdd1
Found duplicate PV 3IMiwT35LKfnGaKKFJDseYkBxdE8rEO0: using /dev/sdj1 not /dev/sdf1
Found duplicate PV 5pWfSBpC4xlgLenJnPJR2vdA5pXlEBkj: using /dev/sdk1 not /dev/sde1
Found volume group "vg_userbackup" using metadata type lvm2
Found volume group "vg_gluster" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
In my case, /dev/sdg through /dev/sdk are the restored brick devices, and /dev/sdb through /dev/sdf are the original (corrupted) brick devices. Also note that in this case I was using LVM on partitions, but the same process applies if you are using whole devices as physical volumes.
These duplicate physical volume messages are a result of the cloned bricks. We will take care of this in Step 4. However, we can see we have vg_userbackup and vg_gluster that will need to be renamed (on the original physical volumes). Before we can rename them, we will need to filter the new bricks out of the LVM config in order to be able to interact with only the original bricks.
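Before touching the configuration, it can help to capture the current device-to-volume-group mapping for later reference (pvs is standard LVM; the columns chosen here are just one option):

# pvs -o pv_name,vg_name,pv_uuid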
First we will need to unmount the brick volumes (since we have disabled and stopped GlusterFS we can move directly to unmounting the brick volumes).
# for f in /brick/*; do umount $f; done
# mount
/dev/mapper/vg_sys-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg_sys-lv_tmp on /tmp type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
Now that we can see the bricks are unmounted we can move on to the next step.
Now we need to tell LVM to ignore the restored brick devices so that we interact only with the original (corrupted) brick devices. This is accomplished by editing the /etc/lvm/lvm.conf config file as follows:
-filter = [ "a/.*/" ]
+filter = [ "a|/dev/sd[abcdef]|", "r/.*/" ]
Note: the + represents a line added and the - represents a line removed (or commented out) in the configuration.
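If you would rather script these lvm.conf edits than hand-edit the file, something like the following should work, assuming a single active filter line (back the file up first):

# cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.bak
# sed -i 's@^[[:space:]]*filter = .*@    filter = [ "a|/dev/sd[abcdef]|", "r/.*/" ]@' /etc/lvm/lvm.conf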
Now that we have changed the lvm.conf file we will need to re-scan the LVM config on the machine:
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg_userbackup" using metadata type lvm2
Found volume group "vg_gluster" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
Now we can see the duplicate physical volume messages are gone and the three volume groups we are expecting are found.
In this step we are now going to rename the volume groups for the original (corrupted) GlusterFS bricks. This is done with the vgrename command:
# vgrename vg_gluster vg_gluster_original
Volume group "vg_gluster" successfully renamed to "vg_gluster_original"
# vgrename vg_userbackup vg_userbackup_original
Volume group "vg_userbackup" successfully renamed to "vg_userbackup_original"
Now we can see the change using vgscan:
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg_userbackup_original" using metadata type lvm2
Found volume group "vg_gluster_original" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
In this step we need to generate new UUIDs for the volume groups, physical volumes, and filesystems on each restored brick (the clones currently share UUIDs with the originals).
Once again, we need to modify the devices LVM detects for our volume group using the /etc/lvm/lvm.conf file.
-filter = [ "a|/dev/sd[abcdef]|", "r/.*/" ]
+filter = [ "a|/dev/sd[aghijk]|", "r/.*/" ]
Next, deactivate the logical volumes:
# vgchange -an vg_gluster
0 logical volume(s) in volume group "vg_gluster" now active
# vgchange -an vg_userbackup
0 logical volume(s) in volume group "vg_userbackup" now active
Next, we need to update the UUIDs on each brick’s physical volume. This is done using pvchange:
# pvchange --uuid /dev/sdg1
Physical volume "/dev/sdg1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdh1
Physical volume "/dev/sdh1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdi1
Physical volume "/dev/sdi1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdj1
Physical volume "/dev/sdj1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdk1
Physical volume "/dev/sdk1" changed
1 physical volume changed / 0 physical volumes not changed
Note: confirm that each command reports 1 physical volume changed and 0 physical volumes not changed; this indicates the command completed successfully.
Next, we need to update the volume groups with new UUIDs:
# vgchange --uuid vg_gluster
Volume group "vg_gluster" successfully changed
# vgchange --uuid vg_userbackup
Volume group "vg_userbackup" successfully changed
Next, we can re-activate the logical volumes:
# vgchange -ay vg_gluster
6 logical volume(s) in volume group "vg_gluster" now active
# vgchange -ay vg_userbackup
2 logical volume(s) in volume group "vg_userbackup" now active
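To confirm the logical volumes are active again, lvs can show their state (in the lv_attr string, an "a" in the fifth position means active):

# lvs -o lv_name,vg_name,lv_attr vg_gluster vg_userbackup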
Next, we update the UUIDs of each of the XFS file systems on each brick:
# xfs_admin -U generate /dev/vg_gluster/lv_home_01
Clearing log and setting UUID
writing all SBs
new UUID = 094f399d-4f74-4504-9345-88f300a7075f
# xfs_admin -U generate /dev/vg_gluster/lv_home_02
Clearing log and setting UUID
writing all SBs
new UUID = 4922d0a1-35d3-4ee2-8ecd-7884f42ddb83
# xfs_admin -U generate /dev/vg_gluster/lv_home_03
Clearing log and setting UUID
writing all SBs
new UUID = b2bc4685-8232-41f2-b6f2-2104fb041cfd
# xfs_admin -U generate /dev/vg_gluster/lv_home_04
Clearing log and setting UUID
writing all SBs
new UUID = 03d5f728-9df1-41dd-9712-d1c0ceec4a2a
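You can spot-check the new filesystem UUIDs with xfs_admin's lowercase -u option (or with blkid), for example:

# xfs_admin -u /dev/vg_gluster/lv_home_01
UUID = 094f399d-4f74-4504-9345-88f300a7075f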
Now, remove all the LVM filters applied earlier. This is accomplished by editing the /etc/lvm/lvm.conf config file as follows:
-filter = [ "a|/dev/sd[aghijk]|", "r/.*/" ]
+filter = [ "a/.*/" ]
Afterwards, run vgscan to display the list of volume groups available:
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg_userbackup" using metadata type lvm2
Found volume group "vg_gluster" using metadata type lvm2
Found volume group "vg_userbackup_original" using metadata type lvm2
Found volume group "vg_gluster_original" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
Here we can see we now have both the original ("corrupted") volume groups and the restored volume groups.
Before we can create the Gluster volume for the original bricks we need to mount all the original bricks.
First, we need to create the brick mount points:
# mkdir -p /brick/home_01_original
# mkdir -p /brick/home_02_original
# mkdir -p /brick/home_03_original
# mkdir -p /brick/home_04_original
It is now safe, and recommended, to reboot the server to validate that it is ready for the final steps (a reboot also rebuilds the device mapper entries in /dev).
Next, add the following lines to the /etc/fstab file:
# Temporary for DR compare
/dev/vg_gluster_original/lv_home_01 /brick/home_01_original xfs defaults
/dev/vg_gluster_original/lv_home_02 /brick/home_02_original xfs defaults
/dev/vg_gluster_original/lv_home_03 /brick/home_03_original xfs defaults
/dev/vg_gluster_original/lv_home_04 /brick/home_04_original xfs defaults
# End DR Compare
# mount -a -t xfs
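A quick sanity check that all four original bricks mounted, assuming the mount points created above:

# mount -t xfs | grep _original
# df -h /brick/home_0*_original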
At this point you can begin any brick-level comparison or repairs (see the example below) prior to the creation of the Gluster volume and the use of Gluster's self-healing to repair the volume. Once you are ready to proceed (and have brought both servers to this point) move on to Step 6.
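For example, one useful brick-level check before creating the volume is to compare Gluster's extended attributes on corresponding files across the two servers (getfattr ships with the attr package; the file path here is purely illustrative):

# getfattr -d -m . -e hex /brick/home_01_original/data/somefile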
The process for creating a new volume from the data of an existing set of bricks (taken from an existing volume) does not appear to be documented anywhere.
In order to create the new volume, I applied the same command line as the original create (with modified paths), adding the force keyword:
# gluster volume create home_original replica 2 \ fs-z1-t01-gfs:/brick/home_01_original/data \ fs-z1-t02-gfs:/brick/home_01_original/data \ fs-z1-t01-gfs:/brick/home_02_original/data \ fs-z1-t02-gfs:/brick/home_02_original/data \ fs-z1-t01-gfs:/brick/home_03_original/data \ fs-z1-t02-gfs:/brick/home_03_original/data \ fs-z1-t01-gfs:/brick/home_04_original/data \ fs-z1-t02-gfs:/brick/home_04_original/data force
Then start the volume:
# gluster volume start home_original
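At this point it is worth confirming that the volume came up with all bricks online:

# gluster volume info home_original
# gluster volume status home_original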
After starting the new volume, mount it on another server (a mount sketch follows below) and walk the entirety of the volume recursively. This is required for the volume to report the correct size and for Gluster to fully recognize all files on the volume. This is done by executing the following command at the root of the Gluster volume:
# du -hc --max-depth=1
This process will take a while depending on the size of the volume.
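For reference, a minimal sketch of the client-side mount used for the traversal above (the mount point matches the rsync paths in the next step, and the server name is taken from my volume create command; adjust both to your environment):

# mkdir -p /gluster/home_original
# mount -t glusterfs fs-z1-t01-gfs:/home_original /gluster/home_original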
I used rsync to generate a comparison between the volumes:
# rsync -i -anxv --omit-dir-times --numeric-ids --delete /gluster/home_original/ /gluster/home/ > /tmp/home_compare.txt 2>&1 &

Note that -n makes this a dry run (nothing is actually copied or deleted) and -i itemizes each difference rsync finds, so the output file contains the full comparison.
And there you go. This is the process I used to successfully restore and compare Gluster volumes (that were busy with a calculated set of files) with no data loss (other than writes made after the snapshot) in a simulated failure scenario (an infrastructure snapshot taken during smallfile generation).