
QEMU-GlusterFS native integration

Gluster
2012-10-28

GlusterFS is a distributed file system implemented in user space. Strictly speaking, it is not a native file system in itself but an aggregator of different file systems. GlusterFS can aggregate individual file system mount points or directories (called bricks in gluster terminology) to provide a single unified file system namespace. In addition to NFS and CIFS, the most common way to access the GlusterFS namespace is via the FUSE-based Gluster native client.

More information on creating and mounting GlusterFS volumes can be obtained from the GlusterFS website.
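
For quick reference, here is a minimal sketch of creating, starting and mounting a simple two-brick replicated volume. The server names and brick paths are purely illustrative:

# Create and start a two-brick replicated volume (run on one of the servers)
gluster volume create testvol replica 2 server1:/export/brick1 server2:/export/brick1
gluster volume start testvol

# Mount the volume on a client via the FUSE-based Gluster native client
mount -t glusterfs server1:/testvol /mnt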

GlusterFS for virtualization

Until recently, using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has now changed with two specific enhancements:

– A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support will be available from the GlusterFS-3.4 release (a short sketch of these APIs follows this list).
– QEMU (starting from QEMU-1.3) has a GlusterFS block driver that uses libgfapi, so there is no longer any FUSE overhead when QEMU works with VM images on gluster volumes.
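
To give a feel for what these POSIX-like C APIs look like, here is a minimal, hedged libgfapi sketch that connects to a volume and writes a file. It assumes a GlusterFS-3.4 style installation providing glusterfs/api/glfs.h and the glusterfs-api pkg-config file mentioned later in this post; the volume name, server address and file path are illustrative.

/* Minimal libgfapi sketch: connect to volume "testvol" on 1.2.3.4 and write a file.
 * Build (illustrative): gcc glfs-demo.c $(pkg-config --cflags --libs glusterfs-api)
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("testvol");                      /* gluster volume name */
    if (!fs)
        return 1;

    glfs_set_volfile_server(fs, "tcp", "1.2.3.4", 24007);  /* glusterd host and port */
    if (glfs_init(fs) != 0) {                              /* fetch the volfile and build the graph */
        glfs_fini(fs);
        return 1;
    }

    glfs_fd_t *fd = glfs_creat(fs, "/dir/hello.txt", O_WRONLY, 0644);
    if (fd) {
        const char msg[] = "hello from libgfapi\n";
        glfs_write(fd, msg, strlen(msg), 0);               /* POSIX-like write on the volume */
        glfs_close(fd);
    }

    glfs_fini(fs);
    return 0;
}

QEMU's GlusterFS block driver talks to the volume through this same library (using its asynchronous I/O variants), which is how the FUSE hop is avoided.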

GlusterFS, with its pluggable translator model, can serve as a flexible storage backend for QEMU. QEMU just has to talk to GlusterFS, and GlusterFS will hide the different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available to QEMU. Efforts are also on to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file- and block-based storage types.

QEMU with native GlusterFS block driver

GlusterFS specification in QEMU

A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:

gluster[+transport]://[server[:port]]/volname/image[?socket=…]

gluster is the protocol.

transport specifies the transport type used to connect to the gluster management daemon (glusterd). Valid transport types are tcp, unix and rdma. If a transport type isn't specified, then tcp is assumed.

server specifies the server where the volume file specification for the given volume resides. This can be a hostname, an IPv4 address or an IPv6 address. An IPv6 address needs to be enclosed within square brackets [ ]. If the transport type is unix, then the server field should not be specified; instead, the socket field needs to be populated with the path to the unix domain socket.

port is the port number on which glusterd is listening. This is optional; if not specified, QEMU sends 0, which makes gluster use the default port. If the transport type is unix, then port should not be specified.

volname is the name of the gluster volume which contains the VM image.

image is the path to the actual VM image that resides on the gluster volume.

Examples:

gluster://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
gluster+tcp://server.domain.com:24007/testvol/dir/a.img
gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
gluster+rdma://1.2.3.4:24007/testvol/a.img

(GlusterFS URI description and above examples are taken from QEMU documentation)

Configuring QEMU with GlusterFS backend

While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-uuid and --enable-glusterfs options are specified explicitly with the ./configure script. (Update Feb 2013: A fix in the QEMU-1.3 time frame makes the use of --enable-uuid unnecessary for GlusterFS support in QEMU.)

Update Aug 2013: Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from source, the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path, and you will have to add that path explicitly by executing this command before running the QEMU configure script:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/

Without this, the GlusterFS driver will not be compiled into QEMU even when GlusterFS is present on the system.
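
Putting these pieces together, a typical from-source build might look like the following sketch. The target list is illustrative, the need for --enable-uuid depends on your QEMU version (see the Feb 2013 update above), and the pkg-config path is only needed when GlusterFS itself was installed from source under /usr/local:

# Only needed when GlusterFS was installed from source under /usr/local
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/

./configure --target-list=x86_64-softmmu --enable-glusterfs
make
make install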

Creating a VM image on GlusterFS backend

The qemu-img command can be used to create VM images on the gluster backend. The general syntax for image creation looks like this:

qemu-img create gluster://server/volname/path/to/image size

Examples:

To create a raw image:
qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G

To create a qcow2 image:
qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
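
Since the gluster URI is handled by QEMU's generic block layer, other qemu-img subcommands should work with it as well. For instance, an existing local image could be copied onto the gluster volume and then inspected like this (the local file name is illustrative):

qemu-img convert -f qcow2 -O qcow2 /var/lib/images/local.qcow2 gluster://1.2.3.4/testvol/dir/a.img
qemu-img info gluster://1.2.3.4/testvol/dir/a.img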

Booting VM image from GlusterFS backend

A VM image a.img residing on the gluster volume testvol can be booted using QEMU like this:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio

In addition to VM images, gluster drives can also be used as data drives:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio

Here, a-data.img from the datavol gluster volume appears as the second drive for the guest.

Performance numbers

The following numbers from the FIO benchmark show the performance advantage of using QEMU's GlusterFS block driver instead of the usual FUSE mount while accessing the VM image.

Test setup

Host: Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)
Guest: Fedora 17 image, 4-way SMP, 2GB RAM, using virtio and cache=none QEMU options

QEMU options

FUSE mount: qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/mnt/F17,if=virtio,cache=none (/mnt is the GlusterFS FUSE mount point)
GlusterFS block driver in QEMU (FUSE bypass): qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=gluster://bharata/test/F17,if=virtio,cache=none
Base (VM image accessed directly from the brick): qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/test/F17,if=virtio,cache=none (/test is the brick directory)

FIO load files

Sequential read, direct IO:

; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16

Sequential write, direct IO:

; Write 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=write
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16


FIO READ numbers

Configuration                                 aggrb (KB/s)  minb (KB/s)  maxb (KB/s)
FUSE mount                                           15219         3804         5792
QEMU's GlusterFS block driver (FUSE bypass)          39357         9839        12946
Base                                                 43802        10950        12918

FIO WRITE numbers

Configuration                                 aggrb (KB/s)  minb (KB/s)  maxb (KB/s)
FUSE mount                                           24579         6144         8423
QEMU's GlusterFS block driver (FUSE bypass)          42707        10676        17262
Base                                                 42393        10598        15646

Updated numbers

Here are the more recent FIO numbers, averaged over 5 runs, using the latest QEMU (git commit: 03a36f17d77) and GlusterFS (git commit: cee1b62d01). The test environment remains the same as above with the following two changes:

  • The GlusterFS volume has write-behind translator turned off
  • The host kernel is upgraded to 3.6.7-4.fc17.x86_64

FIO READ numbers

Configuration                                 aggrb (KB/s)  % Change from Base
Base                                                 44464                   0
FUSE mount                                           21637                 -51
QEMU's GlusterFS block driver (FUSE bypass)          38847               -12.6

FIO WRITE numbers

Configuration                                 aggrb (KB/s)  % Change from Base
Base                                                 45824                   0
FUSE mount                                           40919               -10.7
QEMU's GlusterFS block driver (FUSE bypass)          45627               -0.43

GlusterFS support in oVirt

While I have described above how to use GlusterFS as a storage backend for QEMU manually, there have been efforts to enable QEMU-GlusterFS native support from libvirt, VDSM and oVirt as well. We now have GlusterFS enabled completely from oVirt, which allows users to use oVirt's self-help portal to create a GlusterFS volume and use it as a storage backend to host VM images. The GlusterFS storage domain work in VDSM, and its enablement from oVirt, allows oVirt to exploit the QEMU-GlusterFS native integration rather than using FUSE to access GlusterFS volumes.
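
At the libvirt level, this native integration surfaces as a network disk using the gluster protocol. As a hedged sketch, assuming a libvirt version with gluster network-disk support and reusing the illustrative volume and server from earlier, the disk element of a domain XML might look like this:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='gluster' name='testvol/a.img'>
    <host name='1.2.3.4' port='24007' transport='tcp'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>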

Deepak C Shetty has created a nice video demo of how to use oVirt to create a GlusterFS storage domain and boot VMs off it.

UNMAP/Discard support in QEMU-GlusterFS

UNMAP support in QEMU-GlusterFS is explained here.
