Block device translator (BD xlator) is a translator recently added to GlusterFS that provides a block backend for GlusterFS. It replaces the existing bd_map translator, which provided similar but very limited functionality. GlusterFS normally expects the underlying brick to be formatted with a POSIX-compatible file system. BD xlator changes that and allows bricks to be raw block devices, such as LVM volumes, which needn’t have any file system on them. Hence with BD xlator, it becomes possible to build a GlusterFS volume comprising bricks that are logical volumes (LV).
BD xlator maps underlying LVs to files, so the LVs appear as files to GlusterFS clients. Though a BD volume externally appears very similar to the usual POSIX volume, not all operations are supported or possible for the files on a BD volume. Only those operations that make sense for a block device are supported, and the exact semantics are described in subsequent sections.
While a POSIX volume takes a file system directory as its brick, a BD volume needs a volume group (VG) as its brick. In the usual use case, a file created on a BD volume results in an LV being created in the brick VG. In addition to a VG, a BD volume also needs a file system directory, which must be specified at volume creation time. This directory is necessary for supporting directories and the directory hierarchy on the BD volume. Metadata about LVs (size, mapping info) is stored in this directory.
BD xlator was mainly developed to use block devices directly as VM images when GlusterFS is used as storage for KVM virtualization. Its salient points, covered in the walkthrough that follows, include support for both linear and thin LVs and file-level snapshots and clones.
Though BD xlator is primarily intended to be used with block devices, it does provide full POSIX xlator compatibility for files that are created on a BD volume but are not backed by or mapped to a block device. Such files, which don’t have a block device mapping, live in the POSIX directory that is specified during BD volume creation.
BD xlator, developed by M. Mohan Kumar, was committed into GlusterFS git in November 2013 and is expected to be part of the upcoming GlusterFS-3.5 release.
BD xlator needs the lvm2 development library. The --enable-bd-xlator option can be passed to the ./configure script to explicitly enable the BD translator. The following snippet from the output of the configure script shows that BD xlator is enabled for compilation.
GlusterFS configure summary
===================
…
Block Device xlator : yes
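A build script could check for that summary line before going any further. Here is a minimal sketch, assuming the summary line has the form shown above; bd_xlator_enabled is a hypothetical helper name of mine, not something shipped with GlusterFS.

```shell
# Hypothetical helper: succeed if the configure summary read from stdin
# reports the BD xlator as enabled. Assumes the
# "Block Device xlator : yes" line format shown in the summary above.
bd_xlator_enabled() {
    grep -Eq '^[[:space:]]*Block Device xlator[[:space:]]*:[[:space:]]*yes[[:space:]]*$'
}

# Usage (from the GlusterFS source tree):
#   ./configure --enable-bd-xlator | bd_xlator_enabled && echo "BD xlator will be built"
```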
BD supports hosting both linear LVs and thin LVs within the same volume. However, I will show them separately in the following instructions. As noted above, the prerequisite for a BD volume is a VG, which I am creating here from a loop device, but any other block device can be used too.
– Create a loop device
[root@bharata ~]# dd if=/dev/zero of=bd-loop count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop
– Prepare a brick by creating a VG
[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg /dev/loop0
– Create the BD volume
Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta
It is recommended that this directory is created on an LV in the brick VG itself so that both data and metadata live together on the same device.
Create and mount the volume
[root@bharata ~]# gluster volume create bd bharata:/bd-meta?bd-vg force
The general syntax for specifying the brick is host:/posix-dir?volume-group-name where “?” is the separator.
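The brick syntax is easy to get wrong when it is assembled in a script, so here is a small sketch that builds and splits it. The function names are hypothetical, and the splitting assumes the VG name is everything after the last "?".

```shell
# Hypothetical helpers for the host:/posix-dir?volume-group-name syntax.

# Build a brick specification from host, POSIX directory and VG name.
bd_brick_spec() {
    printf '%s:%s?%s\n' "$1" "$2" "$3"
}

# Extract the VG name: everything after the last "?" separator.
bd_brick_vg() {
    printf '%s\n' "${1##*\?}"
}

# Extract the POSIX directory: drop the "host:" prefix and "?vg" suffix.
# The body runs in a subshell so that "rest" does not leak to the caller.
bd_brick_dir() (
    rest=${1#*:}
    printf '%s\n' "${rest%\?*}"
)

bd_brick_spec bharata /bd-meta bd-vg    # prints bharata:/bd-meta?bd-vg
```

Since "?" is also a shell glob character, quoting the brick argument when passing it to gluster is a cheap precaution, for example gluster volume create bd "$(bd_brick_spec bharata /bd-meta bd-vg)" force.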
[root@bharata ~]# gluster volume start bd
[root@bharata ~]# gluster volume info bd
Volume Name: bd
Type: Distribute
Volume ID: cb042d2a-f435-4669-b886-55f5927a4d7f
Status: Started
Xlator 1: BD
Capability 1: offload_copy
Capability 2: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta
Brick1 VG: bd-vg
[root@bharata ~]# mount -t glusterfs bharata:/bd /mnt
– Create a file that is backed by an LV
[root@bharata ~]# ls /mnt
[root@bharata ~]#
Since the volume is empty now, so is the underlying VG.
[root@bharata ~]# lvdisplay bd-vg
[root@bharata ~]#
Creating a file that is mapped to an LV is a two-step operation. First the file is created on the mount point, and then a specific extended attribute is set to map the file to an LV.
[root@bharata ~]# touch /mnt/lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "lv" /mnt/lv
An LV has now been created in the brick VG, and the file /mnt/lv maps to this LV. Any read/write to this file ends up as a read/write to the underlying LV.
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name 6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name bd-vg
LV UUID PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status available
# open 0
LV Size 4.00 MiB
Current LE 1
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:6
The file is created with the default LV size, which is 1 LE (4 MiB in this case).
[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 4.0M Nov 26 16:15 /mnt/lv
truncate can be used to set the required file size.
[root@bharata ~]# truncate /mnt/lv -s 256M
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name 6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name bd-vg
LV UUID PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status available
# open 0
LV Size 256.00 MiB
Current LE 64
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:6
[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 256M Nov 26 16:15 /mnt/lv
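The jump from "Current LE 1" to "Current LE 64" is just the file size divided by the extent size, since LVM allocates whole extents. A quick sketch of that arithmetic follows; lv_extents is a hypothetical helper, and the 4 MiB default is simply the extent size seen in this setup, so other VGs may differ.

```shell
# Hypothetical helper: number of logical extents needed for a size given
# in MiB, rounded up to a whole extent. The extent size defaults to the
# 4 MiB observed in this setup. The body runs in a subshell so the
# variables stay local.
lv_extents() (
    size_mib=$1
    extent_mib=${2:-4}
    echo $(( (size_mib + extent_mib - 1) / extent_mib ))
)

lv_extents 256    # prints 64, matching "Current LE 64" above
lv_extents 1      # prints 1: anything up to 4 MiB occupies one extent
```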
The size of the file/LV can be specified during creation/mapping time itself like this:
setfattr -n "user.glusterfs.bd" -v "lv:256MB" /mnt/lv
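The two steps and the optional size can be folded into one small wrapper. This is only a sketch: bd_map_value and bd_create_file are hypothetical names, the accepted types are limited to the two used in this article ("lv" and "thin"), and I have not checked which other size formats the xlator accepts.

```shell
# Hypothetical helper: print the value for the user.glusterfs.bd
# attribute. $1 is the LV type ("lv" or "thin"), $2 an optional size
# such as 256MB.
bd_map_value() {
    case $1 in
        lv|thin) ;;
        *) echo "unsupported LV type: $1" >&2; return 1 ;;
    esac
    if [ -n "${2:-}" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Hypothetical wrapper: create the file, then map it to an LV.
# $1 = path on the mounted BD volume, $2 = LV type, $3 = optional size.
# It needs a mounted BD volume, so it is defined here but not invoked.
bd_create_file() {
    value=$(bd_map_value "$2" "${3:-}") || return 1
    touch "$1" && setfattr -n user.glusterfs.bd -v "$value" "$1"
}

bd_map_value lv 256MB    # prints lv:256MB
```

With a BD volume mounted on /mnt, bd_create_file /mnt/lv lv 256MB would then replace the touch and setfattr pair.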
The thin LV case follows the same sequence of steps.
– Create a loop device
[root@bharata ~]# dd if=/dev/zero of=bd-loop-thin count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop-thin
– Prepare a brick by creating a VG and thin pool
[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg-thin /dev/loop0
Create a thin pool
[root@bharata ~]# lvcreate --thin bd-vg-thin -L 1000M
Rounding up size to full physical extent 4.00 MiB
Logical volume "lvol0" created
lvdisplay shows the thin pool
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name lvol0
VG Name bd-vg-thin
LV UUID HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID 0
LV Pool metadata lvol0_tmeta
LV Pool data lvol0_tdata
LV Pool chunk size 64.00 KiB
LV Zero new blocks yes
LV Status available
# open 0
LV Size 1000.00 MiB
Allocated pool data 0.00%
Allocated metadata 0.88%
Current LE 250
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:9
– Create the BD volume
Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta-thin
Create and mount the volume
[root@bharata ~]# gluster volume create bd-thin bharata:/bd-meta-thin?bd-vg-thin force
[root@bharata ~]# gluster volume start bd-thin
[root@bharata ~]# gluster volume info bd-thin
Volume Name: bd-thin
Type: Distribute
Volume ID: 27aa7eb0-4ffa-497e-b639-7cbda0128793
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta-thin
Brick1 VG: bd-vg-thin
[root@bharata ~]# mount -t glusterfs bharata:/bd-thin /mnt
– Create a file that is backed by a thin LV
[root@bharata ~]# ls /mnt
[root@bharata ~]#
Creating a file that is mapped to a thin LV is a two-step operation. First the file is created on the mount point, and then a specific extended attribute is set to map the file to a thin LV.
[root@bharata ~]# touch /mnt/thin-lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "thin:256MB" /mnt/thin-lv
Now /mnt/thin-lv is a thin-provisioned file that is backed by a thin LV.
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name lvol0
VG Name bd-vg-thin
LV UUID HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID 1
LV Pool metadata lvol0_tmeta
LV Pool data lvol0_tdata
LV Pool chunk size 64.00 KiB
LV Zero new blocks yes
LV Status available
# open 0
LV Size 1000.00 MiB
Allocated pool data 0.00%
Allocated metadata 0.98%
Current LE 250
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:9
--- Logical volume ---
LV Path /dev/bd-vg-thin/081b01d1-1436-4306-9baf-41c7bf5a2c73
LV Name 081b01d1-1436-4306-9baf-41c7bf5a2c73
VG Name bd-vg-thin
LV UUID coxpTY-2UZl-9293-8H2X-eAZn-wSp6-csZIeB
LV Write Access read/write
LV Creation host, time bharata, 2013-11-26 16:43:19 +0530
LV Pool name lvol0
LV Status available
# open 0
LV Size 256.00 MiB
Mapped size 0.00%
Current LE 64
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:10
As can be seen above, creating the file resulted in a thin LV being created in the brick, alongside the thin pool lvol0.
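The lvdisplay output is long, so when checking this from a script it helps to reduce it to one line per LV. Here is a minimal sketch, assuming the "LV Name" and "LV Size" field labels shown above; lv_sizes is a hypothetical helper name.

```shell
# Hypothetical helper: reduce lvdisplay output (read from stdin) to one
# "name size unit" line per LV. Relies on "LV Name" appearing before
# "LV Size" within each record, as in the listings above.
lv_sizes() {
    awk '
        $1 == "LV" && $2 == "Name" { name = $3 }
        $1 == "LV" && $2 == "Size" { print name, $3, $4 }
    '
}

# Usage:
#   lvdisplay bd-vg-thin | lv_sizes
```

For the listing above this should give one line for the pool lvol0 (1000.00 MiB) and one for the thin LV (256.00 MiB).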
BD xlator uses LVM snapshot and clone capabilities to provide file-level snapshots and clones for files on a GlusterFS volume. Snapshots and clones work only for files that have already been mapped to an LV. In other words, snapshots and clones aren’t available for POSIX-only files that exist on a BD volume.
Say we are interested in taking a snapshot of a file /mnt/file that already exists and has been mapped to an LV.
[root@bharata ~]# ls -l /mnt/file
-rw-r--r--. 1 root root 268435456 Nov 27 10:16 /mnt/file
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name bd-vg
LV UUID HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV Status available
# open 0
LV Size 256.00 MiB
Current LE 64
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:6
Snapshot creation is a two-step process.
– Create a snapshot destination file first
[root@bharata ~]# touch /mnt/file-snap
– Then take the actual snapshot
In order to create the actual snapshot, we need to know the GFID of the snapshot file.
[root@bharata ~]# getfattr -n glusterfs.gfid.string /mnt/file-snap
getfattr: Removing leading '/' from absolute path names
# file: mnt/file-snap
glusterfs.gfid.string="bdf74e38-dc96-4b26-94e2-065fe3b8bcc3"
Use this GFID string to create the actual snapshot
[root@bharata ~]# setfattr -n snapshot -v bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 /mnt/file
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name bd-vg
LV UUID HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV snapshot status source of bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 [active]
LV Status available
# open 0
LV Size 256.00 MiB
Current LE 64
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:6
--- Logical volume ---
LV Path /dev/bd-vg/bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
LV Name bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
VG Name bd-vg
LV UUID 9XH6xX-Sl64-uNhk-7OiH-f91m-DaMo-6AWiBD
LV Write Access read/write
LV Creation host, time bharata, 2013-11-27 10:20:35 +0530
LV snapshot status active destination for abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Status available
# open 0
LV Size 256.00 MiB
Current LE 64
COW-table size 4.00 MiB
COW-table LE 1
Allocated to snapshot 0.00%
Snapshot chunk size 4.00 KiB
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:7
As can be seen from the lvdisplay output, /mnt/file-snap now is the snapshot of /mnt/file.
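The snapshot steps above can be chained so that the GFID never has to be copied by hand. This is a sketch under stated assumptions: the function names are hypothetical, and the parsing relies on the glusterfs.gfid.string="..." line format that getfattr printed above.

```shell
# Hypothetical helper: extract the GFID from getfattr output on stdin.
gfid_from_getfattr() {
    sed -n 's/^glusterfs\.gfid\.string="\(.*\)"$/\1/p'
}

# Hypothetical wrapper for the two-step snapshot procedure.
# $1 = source file (already mapped to an LV), $2 = snapshot destination.
# It needs a mounted BD volume, so it is defined here but not invoked.
bd_snapshot() {
    touch "$2" || return 1
    # getfattr writes its "Removing leading '/'" notice to stderr.
    gfid=$(getfattr -n glusterfs.gfid.string "$2" 2>/dev/null | gfid_from_getfattr)
    [ -n "$gfid" ] || { echo "could not read GFID of $2" >&2; return 1; }
    setfattr -n snapshot -v "$gfid" "$1"
}

# Usage on a mounted BD volume:
#   bd_snapshot /mnt/file /mnt/file-snap
```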
Creating a clone is similar to creating a snapshot, except that the attribute name “clone” should be used instead of “snapshot”.
setfattr -n clone -v <gfid-of-clone-file> <path-to-source-file>
A clone on a BD volume is essentially a full copy of the file that is offloaded to the server.
As you have seen, creating a block-device-backed file on a BD volume and creating snapshots and clones involve non-standard steps, including setting extended attributes. These steps can be cumbersome for an end user, and there are plans to encapsulate them in APIs that users can use easily.
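Because snapshot and clone differ only in the attribute name, one dry-run helper can cover both. The sketch below only prints the setfattr command instead of running it; bd_offload_cmd is a hypothetical name, and the GFID argument is whatever getfattr reported for the destination file.

```shell
# Hypothetical helper: print (rather than run) the setfattr command for
# a snapshot or clone. $1 = snapshot|clone, $2 = GFID of the destination
# file, $3 = path to the source file.
bd_offload_cmd() {
    case $1 in
        snapshot|clone) ;;
        *) echo "unknown operation: $1" >&2; return 1 ;;
    esac
    printf 'setfattr -n %s -v %s %s\n' "$1" "$2" "$3"
}

bd_offload_cmd clone '<gfid-of-clone-file>' /mnt/file
# prints: setfattr -n clone -v <gfid-of-clone-file> /mnt/file
```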