
UNMAP/DISCARD support in QEMU-GlusterFS

Gluster
2013-08-07

In my last blog post on QEMU-GlusterFS, I described the integration of QEMU with GlusterFS using libgfapi. In this post, I give an overview of the recently added discard support in QEMU’s GlusterFS back-end and how it can be used. Newer SCSI devices support the UNMAP command, which returns unused/freed blocks back to the storage. This command is most useful when the storage is thin provisioned, such as a thin provisioned SCSI LUN. In response to a file deletion, the host device driver can send an UNMAP command to the SCSI target, instructing it to free the relevant blocks from the thin provisioned LUN. This leads to much better utilization of the storage.

Linux support for discard

In Linux, SCSI UNMAP is supported via the generic discard framework, which I believe is also used to support the ATA TRIM command. The ATA TRIM command, typically used with SSDs, is not the topic of this blog. There are multiple ways in which discard functionality is invoked or used in Linux.

  • For block devices, the BLKDISCARD ioctl can be used directly to release the unused blocks.
  • File systems like EXT4 support file-level discard using the FALLOC_FL_PUNCH_HOLE option of the fallocate system call (a minimal sketch of both calls follows this list).
  • For releasing the unused blocks at the file system level, the fstrim command can be used.
  • Finally, file systems like EXT4 also support a ‘discard’ mount option that controls whether the file system should issue discard requests to the underlying block device when blocks are freed.
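
To make the first two items concrete, here is a minimal C sketch (assuming Linux with glibc) that punches a hole in a regular file with fallocate() and discards a byte range on a block device with the BLKDISCARD ioctl. The helper names are illustrative and error handling is kept minimal.

#define _GNU_SOURCE            /* for fallocate() and FALLOC_FL_* */
#include <fcntl.h>
#include <linux/fs.h>          /* BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Punch a hole in a regular file: the freed blocks are returned to the
   file system and, with '-o discard', eventually reach the device as UNMAP/TRIM. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);
}

/* Discard a byte range directly on a block device via the BLKDISCARD ioctl. */
int discard_range(int fd, uint64_t offset, uint64_t len)
{
    uint64_t range[2] = { offset, len };
    return ioctl(fd, BLKDISCARD, range);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <regular-file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0 || punch_hole(fd, 0, 1 << 20) < 0) {   /* punch the first 1 MB */
        perror("discard");
        return 1;
    }
    close(fd);
    return 0;
}

Compiled with gcc and run against a file on a file system mounted with ‘-o discard’, the effect can be observed with stat, much like the steps later in this post.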

QEMU support for discard

UNMAP is primarily useful in two ways for KVM virtualization.

  • When a file is deleted in the VM, the resulting UNMAP in the guest is passed down to the host, which then sends a discard request to the thin provisioned SCSI device. The blocks consumed by the deleted file are thus returned to the SCSI storage. The effect is the same when the guest issues an explicit discard request using either the ioctl or fallocate methods listed in the previous section.
  • When a VM image is deleted, the freed blocks can potentially be returned to the storage by sending the UNMAP command to the SCSI storage.

Guest UNMAP requests will end up in QEMU only if the guest is using a scsi or virtio-scsi device, and not a virtio-blk device. QEMU will forward this request further down to the host (device or file system) only if the ‘discard=on’ or ‘discard=unmap’ drive flag is used for the device on the QEMU command line.

Example1: qemu-system-x86_64 -drive file=/images/vm.img,if=scsi,discard=on
Example2: qemu-system-x86_64 -device virtio-scsi-pci -drive if=none,discard=on,id=rootdisk,file=gluster://host/volume/image -device scsi-hd,drive=rootdisk

The way a discard request is passed further down in the host is determined by the block driver inside QEMU that serves the disk image type. While QEMU uses fallocate(FALLOC_FL_PUNCH_HOLE) for the raw file back-end and ioctl(BLKDISCARD) for the block device back-end, other back-ends use their own interfaces to pass down the discard request.

GlusterFS support for discard

GlusterFS, starting from version 3.4, supports discard functionality through a new API, glfs_discard(glfs_fd, offset, size), which is available as part of the libgfapi library. On the GlusterFS server side, a discard request is handled differently for the posix back-end and the Block Device (BD) back-end.
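
As a quick illustration of the API, here is a minimal libgfapi sketch that discards a range of a file on a GlusterFS volume. The volume name, server host and file path below are placeholders (they match the example volume created later in this post), the header path and the -lgfapi link flag may vary by distribution, and error handling is kept to a minimum.

#include <stdio.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>   /* libgfapi; link with -lgfapi */

int main(void)
{
    /* Placeholder volume name, server and file path. */
    glfs_t *fs = glfs_new("discard");
    if (!fs)
        return 1;
    glfs_set_volfile_server(fs, "tcp", "llmvm02", 24007);
    if (glfs_init(fs) < 0) {
        fprintf(stderr, "glfs_init failed\n");
        return 1;
    }

    glfs_fd_t *fd = glfs_open(fs, "/file", O_RDWR);
    if (fd) {
        /* Ask GlusterFS to release the first 1 MB of the file. */
        if (glfs_discard(fd, 0, 1 << 20) < 0)
            perror("glfs_discard");
        glfs_close(fd);
    }
    glfs_fini(fs);
    return 0;
}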

For the posix back-end, fallocate(FALLOC_FL_PUNCH_HOLE) is used to release the blocks back to the file system. If the posix brick has been mounted with the ‘-o discard’ option, the discard request will eventually reach the SCSI storage, provided the storage device supports UNMAP.

Support for the BD back-end is planned to follow once the ongoing development work on the newer, feature-rich BD translator lands upstream. The BD translator is expected to use ioctl(BLKDISCARD) to UNMAP the blocks.

Discard support in GlusterFS back-end of QEMU

As described in my earlier blog post, QEMU, starting from version 1.3, supports a GlusterFS back-end using libgfapi, which is the non-FUSE way of accessing GlusterFS volumes. Recently I added discard support to the GlusterFS block driver in QEMU, and this support should be available from QEMU 1.6 onwards. This work involved using the glfs_discard() API from the QEMU GlusterFS driver to send discard requests to the GlusterFS server.

With this, the entire KVM virtualization stack using the GlusterFS back-end can take advantage of the SCSI UNMAP command.

Typical usage

This section describes a typical use case with QEMU-GlusterFS where a file deleted from inside the VM results in discard requests for the host storage.

Step 1: Prepare a block device that supports UNMAP

Since I don’t have a real SCSI device that supports UNMAP, I am going to use a loop device to host my GlusterFS volume. Loop devices support discard.

[root@llmvm02 bharata]# dd if=/dev/zero of=discard-loop count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.63255 s, 1.7 GB/s

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 2097152    IO Block: 4096   regular file
Device: fc13h/64531d    Inode: 15740290    Links: 1

[root@llmvm02 bharata]# losetup /dev/loop0 discard-loop

[root@llmvm02 bharata]# cat /sys/block/loop0/queue/discard_max_bytes
4294966784

Step 2: Prepare the brick directory for GlusterFS volume

[root@llmvm02 bharata]# mkfs.ext4 /dev/loop0
[root@llmvm02 bharata]# mount -o discard /dev/loop0 /discard-mnt/
[root@llmvm02 bharata]# mount  | grep discard
/dev/loop0 on /discard-mnt type ext4 (rw,discard)

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 3: Create a GlusterFS volume

[root@llmvm02 bharata]# gluster volume create discard llmvm02:/discard-mnt/ force
volume create: discard: success: please start the volume to access data

[root@llmvm02 bharata]# gluster volume start discard
volume start: discard: success

[root@llmvm02 bharata]# gluster volume info discard
Volume Name: discard
Type: Distribute
Volume ID: ed7a6f8a-9cb8-463d-a948-61974cb64c99
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: llmvm02:/discard-mnt

Step 4: Create a sparse file in the GlusterFS volume

[root@llmvm02 bharata]# glusterfs -s llmvm02 --volfile-id=discard /mnt
[root@llmvm02 bharata]# touch /mnt/file
[root@llmvm02 bharata]# truncate -s 450M /mnt/file

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 0          IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 5: Use this sparse file on GlusterFS volume as a disk drive with QEMU

[root@llmvm02 bharata]# qemu-system-x86_64 --enable-kvm -nographic -m 8192 -smp 2 -device virtio-scsi-pci -drive if=none,cache=none,id=F17,file=gluster://llmvm02/test/F17 -device scsi-hd,drive=F17 -drive if=none,cache=none,id=gluster,discard=on,file=gluster://llmvm02/discard/file -device scsi-hd,drive=gluster -kernel /home/bharata/linux-2.6-vm/kernel1 -initrd /home/bharata/linux-2.6-vm/initrd1 -append "root=UUID=d29b972f-3568-4db6-bf96-d2702ec83ab6 ro rd.md=0 rd.lvm=0 rd.dm=0 SYSFONT=True KEYTABLE=us rd.luks=0 LANG=en_US.UTF-8 console=tty0 console=ttyS0 selinux=0"

The file appears as a SCSI disk in the guest.

[root@F17-kvm ~]# dmesg | grep -i sdb
sd 0:0:1:0: [sdb] 921600 512-byte logical blocks: (471 MB/450 MiB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: 63 00 00 08
sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Step 6: Generate discard requests from the guest

Format a file system on the SCSI disk, mount it, and create a file on it:

[root@F17-kvm ~]# mkfs.ext4 /dev/sdb
[root@F17-kvm ~]# mount -o discard /dev/sdb /guest-mnt/
[root@F17-kvm ~]# dd if=/dev/zero of=/guest-mnt/file bs=1M count=400
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 1.07204 s, 391 MB/s

Now we can see the block count growing in the host:

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 846880     IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 917288     IO Block: 4096   regular file

Removing the file generates discard requests, which get passed down to the host and eventually result in the discarded blocks being released from the underlying loop device in the host.

[root@F17-kvm ~]# rm -f /guest-mnt/file

In the host,

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 46600      IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 112960     IO Block: 4096   regular file

Thus we saw discard requests from the guest eventually resulting in blocks being released in the host block device. I was using a loop device in the host, but this should work for any SCSI device that supports UNMAP. A SCSI device with UNMAP support is not strictly necessary to get the benefit of space saving: if the underlying device is a thin logical volume (dm-thin) coming from a thin pool dm device, the space saving can be realized at the thin pool level itself.

Concerns with discard mount option

There have been concerns about the cost of the UNMAP operation on the storage and its detrimental effect on IO throughput, so it is unclear whether everyone will want to turn on the discard mount option by default. I wish I had access to an UNMAP-capable storage array to really test the effect of UNMAP on IO performance.
