You are operating a busy GlusterFS cluster and for whatever reason the volume data gets corrupted. Luckily, you have been backing up the underlying bricks so you are able to restore the bricks to a usable state, but now how do you run a comparison to determine the differences between the “corrupted” volume and the restored volume?
It turns out this particular scenario doesn’t appear to be well documented anywhere else and having recently walked through this process doing some testing with GlusterFS I wanted to share my solution and solicit any feedback from the GlusterFS community on it.
Now, on to the solution!
The solution has several steps that I will outline here in the next several paragraphs. Excluding the first set of steps that specifically call out the need to individually address each server, each set of instructions will need to be implemented on each server.
For now I will assume the servers are offline following a restore. Pick a single server and power it on (important: do not power on the other server yet). Once the server has booted, stop the GlusterFS services:
# service glusterd stop
Stopping glusterd: [ OK ]
# chkconfig glusterd off
# chkconfig --list | grep gluster
glusterd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
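Depending on your version, stopping the glusterd service may leave brick (glusterfsd) and client (glusterfs) processes running. A quick check before proceeding (pgrep/pkill are standard tools; adjust the patterns to your environment):

# pgrep -lf gluster
# pkill glusterfsd
# pkill glusterfs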
Once the GlusterFS services are stopped and prevented from starting on boot for the first server, boot the second server and perform the same steps.
From now on each step will need to be performed on each server individually and can be done in parallel or serially.
In my setup I used LVM to create the GlusterFS bricks. If this does not apply to your configuration you may skip the LVM steps.
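For context, here is a minimal sketch of how such a brick might be laid out with LVM. The device, volume group, logical volume name, and size below are illustrative, not taken from my actual setup (the -i size=512 inode size is the commonly recommended XFS setting for Gluster bricks):

# pvcreate /dev/sdb1
# vgcreate vg_gluster /dev/sdb1
# lvcreate -n lv_home_01 -L 100G vg_gluster
# mkfs.xfs -i size=512 /dev/vg_gluster/lv_home_01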
First we need to see the LVM status so we can be sure what volumes we are working with.
# vgscan
Reading all physical volumes. This may take a while...
Found duplicate PV ukZdmURSMtZjMu0AdSwl8MBU07kQsSKk: using /dev/sdg1 not /dev/sdb1
Found duplicate PV gOQvGcZimfuQD397EYH1pOlgf6M55qN8: using /dev/sdh1 not /dev/sdc1
Found duplicate PV Iu9Rwf1jzfcyNJxFoteyw3DDpI9yyi6N: using /dev/sdi1 not /dev/sdd1
Found duplicate PV 3IMiwT35LKfnGaKKFJDseYkBxdE8rEO0: using /dev/sdj1 not /dev/sdf1
Found duplicate PV 5pWfSBpC4xlgLenJnPJR2vdA5pXlEBkj: using /dev/sdk1 not /dev/sde1
Found volume group "vg_userbackup" using metadata type lvm2
Found volume group "vg_gluster" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
In my case, /dev/sdg through /dev/sdk are the restored brick devices, and /dev/sdb through /dev/sdf are the original (corrupted) brick devices. Also note that in this case I was using LVM on partitions, but the same process applies if you are using whole devices as physical volumes.
These duplicate physical volume messages are a result of the cloned bricks. We will take care of this in Step 4. However, we can see we have vg_userbackup and vg_gluster that will need to be renamed (on the original physical volumes). Before we can rename them, we will need to filter the new bricks out of the LVM config in order to be able to interact with only the original bricks.
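Before touching the configuration, it can help to capture the current device-to-volume-group mapping for later reference (pvs is standard LVM; the columns chosen here are just one option):

# pvs -o pv_name,vg_name,pv_uuid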
First we will need to unmount the brick volumes (since we have disabled and stopped GlusterFS we can move directly to unmounting the brick volumes).
# for f in /brick/*; do umount $f; done
# mount
/dev/mapper/vg_sys-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg_sys-lv_tmp on /tmp type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
Now that we can see the bricks are unmounted we can move on to the next step.
Now we need to tell LVM to ignore the restored brick devices so that we interact only with the original (corrupted) brick devices. This is accomplished by editing the /etc/lvm/lvm.conf config file as follows:
-filter = [ "a/.*/" ]
+filter = [ "a|/dev/sd[abcdef]|", "r/.*/" ]
Note: the + represents a line added and the - represents a line removed (or commented out) in the configuration.
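If you would rather script these lvm.conf edits than hand-edit the file, something like the following should work, assuming a single active filter line (back the file up first):

# cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.bak
# sed -i 's@^[[:space:]]*filter = .*@    filter = [ "a|/dev/sd[abcdef]|", "r/.*/" ]@' /etc/lvm/lvm.conf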
Now that we have changed the lvm.conf file we will need to re-scan the LVM config on the machine:
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg_userbackup" using metadata type lvm2
Found volume group "vg_gluster" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
Now we can see the duplicate physical volume messages are gone and the three volume groups we are expecting are found.
In this step we are now going to rename the volume groups for the original (corrupted) GlusterFS bricks. This is done with the vgrename command:
# vgrename vg_gluster vg_gluster_original
Volume group "vg_gluster" successfully renamed to "vg_gluster_original"
# vgrename vg_userbackup vg_userbackup_original
Volume group "vg_userbackup" successfully renamed to "vg_userbackup_original"
Now we can see the change using vgscan:
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg_userbackup_original" using metadata type lvm2
Found volume group "vg_gluster_original" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
In this step we need to generate new UUIDs for the volume groups, physical volumes, and filesystems on each restored brick (the clones currently share UUIDs with the originals).
Once again, we need to modify the devices LVM detects for our volume group using the /etc/lvm/lvm.conf file.
-filter = [ "a|/dev/sd[abcdef]|", "r/.*/" ]
+filter = [ "a|/dev/sd[aghijk]|", "r/.*/" ]
Next, deactivate the logical volumes:
# vgchange -an vg_gluster
0 logical volume(s) in volume group "vg_gluster" now active
# vgchange -an vg_userbackup
0 logical volume(s) in volume group "vg_userbackup" now active
Next, we need to update the UUIDs on each brick’s physical volume. This is done using pvchange:
# pvchange --uuid /dev/sdg1
Physical volume "/dev/sdg1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdh1
Physical volume "/dev/sdh1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdi1
Physical volume "/dev/sdi1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdj1
Physical volume "/dev/sdj1" changed
1 physical volume changed / 0 physical volumes not changed
# pvchange --uuid /dev/sdk1
Physical volume "/dev/sdk1" changed
1 physical volume changed / 0 physical volumes not changed
Note: confirm that each command reports 1 physical volume changed and 0 physical volumes not changed; this indicates the command completed successfully.
Next, we need to update the volume groups with new UUIDs:
# vgchange --uuid vg_gluster
Volume group "vg_gluster" successfully changed
# vgchange --uuid vg_userbackup
Volume group "vg_userbackup" successfully changed
Next, we can re-activate the logical volumes:
# vgchange -ay vg_gluster
6 logical volume(s) in volume group "vg_gluster" now active
# vgchange -ay vg_userbackup
2 logical volume(s) in volume group "vg_userbackup" now active
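To confirm the logical volumes are active again, lvs can show their state (in the lv_attr string, an "a" in the fifth position means active):

# lvs -o lv_name,vg_name,lv_attr vg_gluster vg_userbackup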
Next, we update the UUIDs of each of the XFS file systems on each brick:
# xfs_admin -U generate /dev/vg_gluster/lv_home_01
Clearing log and setting UUID
writing all SBs
new UUID = 094f399d-4f74-4504-9345-88f300a7075f
# xfs_admin -U generate /dev/vg_gluster/lv_home_02
Clearing log and setting UUID
writing all SBs
new UUID = 4922d0a1-35d3-4ee2-8ecd-7884f42ddb83
# xfs_admin -U generate /dev/vg_gluster/lv_home_03
Clearing log and setting UUID
writing all SBs
new UUID = b2bc4685-8232-41f2-b6f2-2104fb041cfd
# xfs_admin -U generate /dev/vg_gluster/lv_home_04
Clearing log and setting UUID
writing all SBs
new UUID = 03d5f728-9df1-41dd-9712-d1c0ceec4a2a
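You can spot-check the new filesystem UUIDs with xfs_admin's lowercase -u option (or with blkid), for example:

# xfs_admin -u /dev/vg_gluster/lv_home_01
UUID = 094f399d-4f74-4504-9345-88f300a7075f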
Now, remove all the LVM filters applied earlier. This is accomplished by editing the /etc/lvm/lvm.conf config file as follows:
-filter = [ "a|/dev/sd[aghijk]|", "r/.*/" ]
+filter = [ "a/.*/" ]
Afterwards, run vgscan to display the list of volume groups available:
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg_userbackup" using metadata type lvm2
Found volume group "vg_gluster" using metadata type lvm2
Found volume group "vg_userbackup_original" using metadata type lvm2
Found volume group "vg_gluster_original" using metadata type lvm2
Found volume group "vg_sys" using metadata type lvm2
Here we can see we now have both the original ("corrupted") volume groups and the restored volume groups.
Before we can create the Gluster volume for the original bricks we need to mount all the original bricks.
First, we need to create the brick mount points:
# mkdir -p /brick/home_01_original
# mkdir -p /brick/home_02_original
# mkdir -p /brick/home_03_original
# mkdir -p /brick/home_04_original
It is now safe, and recommended, to reboot the server to validate that it is ready for the final steps (a reboot also rebuilds the device mapper entries in /dev).
Next, add the following lines to the /etc/fstab file:
# Temporary for DR compare
/dev/vg_gluster_original/lv_home_01 /brick/home_01_original xfs defaults
/dev/vg_gluster_original/lv_home_02 /brick/home_02_original xfs defaults
/dev/vg_gluster_original/lv_home_03 /brick/home_03_original xfs defaults
/dev/vg_gluster_original/lv_home_04 /brick/home_04_original xfs defaults
# End DR Compare
# mount -a -t xfs
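A quick sanity check that all four original bricks mounted, assuming the mount points created above:

# mount -t xfs | grep _original
# df -h /brick/home_0*_original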
At this point you can begin any brick-level comparison or repairs (see the example below) prior to the creation of the Gluster volume and the use of Gluster's self-healing to repair the volume. Once you are ready to proceed (and have brought both servers to this point) move on to Step 6.
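For example, one useful brick-level check before creating the volume is to compare Gluster's extended attributes on corresponding files across the two servers (getfattr ships with the attr package; the file path here is purely illustrative):

# getfattr -d -m . -e hex /brick/home_01_original/data/somefile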
The process for creating a new volume from the data of an existing set of bricks (taken from an existing volume) does not appear to be documented anywhere.
In order to create the new volume, I applied the same command line as the original create (with modified paths), adding the force keyword:
# gluster volume create home_original replica 2 \ fs-z1-t01-gfs:/brick/home_01_original/data \ fs-z1-t02-gfs:/brick/home_01_original/data \ fs-z1-t01-gfs:/brick/home_02_original/data \ fs-z1-t02-gfs:/brick/home_02_original/data \ fs-z1-t01-gfs:/brick/home_03_original/data \ fs-z1-t02-gfs:/brick/home_03_original/data \ fs-z1-t01-gfs:/brick/home_04_original/data \ fs-z1-t02-gfs:/brick/home_04_original/data force
Then start the volume:
# gluster volume start home_original
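At this point it is worth confirming that the volume came up with all bricks online:

# gluster volume info home_original
# gluster volume status home_original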
After starting the new volume, mount it on another server (a mount sketch follows below) and walk the entirety of the volume recursively. This is required for the volume to report the correct size and for Gluster to fully recognize all files on the volume. This is done by executing the following command at the root of the Gluster volume:
# du -hc --max-depth=1
This process will take a while depending on the size of the volume.
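For reference, a minimal sketch of the client-side mount used for the traversal above (the mount point matches the rsync paths in the next step, and the server name is taken from my volume create command; adjust both to your environment):

# mkdir -p /gluster/home_original
# mount -t glusterfs fs-z1-t01-gfs:/home_original /gluster/home_original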
I used rsync to generate a comparison between the volumes:
# rsync -i -anxv --omit-dir-times --numeric-ids --delete /gluster/home_original/ /gluster/home/ > /tmp/home_compare.txt 2>&1 &

Note that -n makes this a dry run (nothing is actually copied or deleted) and -i itemizes each difference rsync finds, so the output file contains the full comparison.
And there you go. This is the process I used to successfully restore and compare Gluster volumes (that were busy with a calculated set of files) with no data loss (other than writes made after the snapshot) in a simulated failure scenario (an infrastructure snapshot taken during smallfile generation).