Introducing shard translator

Gluster
2015-12-23

GlusterFS-3.7.0 saw the release of the sharding feature, among several others. The feature was tagged “experimental” as it was still in the early stages of development back then. Here is an introduction to the feature:

Why shard translator?

GlusterFS’ answer to very large files (those which can grow beyond a single brick) has never been clear. There is a stripe translator which allows you to do that, but it comes at the cost of flexibility: you can add servers only in multiples of stripe-count x replica-count, and mixing striped and unstriped files is not possible in an “elegant” way. This also happens to be a big limiting factor for the big data/Hadoop use case, where super large files are the norm (and where you want to split a file even if it could fit within a single server). The proposed solution is to replace the current stripe translator with a new “shard” translator.

What?

Unlike stripe, shard is not a cluster translator. It is placed on top of DHT. Initially, all files will be created as normal files, up to a certain configurable size. The first block (default 4MB) will be stored like a normal file under its parent directory. Further blocks, however, will be stored as separate files in a hidden namespace, named after the GFID of the original file and the block index (like /.shard/GFID1.1, /.shard/GFID1.2 … /.shard/GFID1.N). File IO happening at a particular offset will write to the appropriate “piece file”, creating it if necessary. The aggregated file size and block count will be stored in the xattrs of the original file.
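To get a feel for the offset-to-block mapping (a quick sketch, not actual translator code): with the default 4MB block size, a write at offset 70MB lands in block 70MB / 4MB = 17, i.e. in /.shard/&lt;GFID&gt;.17, while anything below 4MB goes to the base file itself (block 0):

# echo $(( (70 * 1024 * 1024) / (4 * 1024 * 1024) ))
17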

Usage:

Here I have a 2×2 distributed-replicated volume.

# gluster volume info
Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 96001645-a020-467b-8153-2589e3a0dee3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/1
Brick2: server2:/bricks/2
Brick3: server3:/bricks/3
Brick4: server4:/bricks/4
Options Reconfigured:
performance.readdir-ahead: on

To enable sharding on it, this is what I do:

# gluster volume set dis-rep features.shard on
volume set: success

Now, to configure the shard block size to 16MB, this is what I do:

# gluster volume set dis-rep features.shard-block-size 16MB
volume set: success
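If you want to confirm that both options took effect, they should now appear under “Options Reconfigured” in the volume info output (shown trimmed and for illustration only; the exact ordering may differ on your setup):

# gluster volume info dis-rep | grep -A 3 'Options Reconfigured'
Options Reconfigured:
features.shard-block-size: 16MB
features.shard: on
performance.readdir-ahead: on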

How files are sharded:

Now I write 84MB of data into a file named ‘testfile’.

# dd if=/dev/urandom of=/mnt/glusterfs/testfile bs=1M count=84
84+0 records in
84+0 records out
88080384 bytes (88 MB) copied, 13.2243 s, 6.7 MB/s

Let’s check the backend to see how the file was sharded to pieces and how these pieces got distributed across the bricks:

# ls /bricks/* -lh
/bricks/1:
total 0

/bricks/2:
total 0

/bricks/3:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

/bricks/4:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

So the original file hashed to the second replica set (brick3 and brick4, which form a replica pair) and is 16M in size. Where did the remaining 68MB worth of data go? To find out, let’s check the contents of the hidden directory .shard on all the bricks:

# ls /bricks/*/.shard -lh
/bricks/1/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/2/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/3/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

/bricks/4/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

So, the file was split into 6 pieces in all: 5 of them residing in the hidden directory “/.shard”, distributed across the replica sets based on disk space availability and the file-name hash, and the first block residing in its native parent directory. Notice how shards 1 through 4 are all 16M in size while the last shard (block 5) is 4M.
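If you want to confirm that these piece files really belong to ‘testfile’, compare its GFID, stored in the trusted.gfid xattr on the brick, against the file-name prefix under /.shard (output trimmed; shown for illustration):

# getfattr -n trusted.gfid -e hex /bricks/3/testfile
# file: bricks/3/testfile
trusted.gfid=0xbc19873d77724803898cbf14ee1ff2bd

The hex value is the same GFID, minus the dashes, that prefixes every piece file in the listings above.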

Now let’s do some math to see how ‘testfile’ was “sharded”:

The total size of the write was 84MB, and the configured block size in this case is 16MB. So 84MB divided by 16MB gives 5 full blocks, with a remainder of 4MB.

So the file was broken into 6 pieces in all, with the last piece holding 4MB of data and the rest of them 16MB each.
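As a sanity check, five full 16MB pieces plus the 4MB tail add up to exactly the byte count dd reported earlier:

# echo $(( 5 * 16 * 1024 * 1024 + 4 * 1024 * 1024 ))
88080384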

Now, when we view the file from the mount point, it appears as one single file:

# ls -lh /mnt/glusterfs/
total 85M
-rw-r--r--. 1 root root 84M Dec 24 12:36 testfile

Notice how the file is shown as 84MB in size at the mount point. Similarly, when an application reads the file, the different pieces or ‘shards’ are stitched together and presented to the application as if no chunking had been done at all.
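That aggregated size comes from the metadata the shard translator keeps in xattrs on the base file. If you are curious, you can dump them on the brick backend with getfattr; among the entries you should see the configured block size and the aggregated file size (the xattr names shown here are an assumption based on the 3.7 series and may change, so treat this as a sketch):

# getfattr -d -m . -e hex /bricks/3/testfile
...
trusted.glusterfs.shard.block-size=0x...
trusted.glusterfs.shard.file-size=0x...
...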

Advantages of sharding:

The advantages of sharding a file over striping it across a finite set of bricks are:

  • Data blocks are distributed by DHT in a “normal way”.
  • Adding servers can happen in any number (even one at a time) and DHT’s rebalance will spread out the “piece files” evenly.
  • Sharding provides better utilization of disk space. It is no longer necessary to have at least one brick of size X in order to accommodate a file of size X, where X is really large. Consider this example: a distribute volume is made up of 3 bricks of size 10GB, 20GB and 30GB. With this configuration, it is impossible to store a file greater than 30GB in size on the volume. With sharding, a file of up to 60GB in size can be stored on it.
  • Self-heal of a large file is now distributed across smaller shards on more servers, leading to better heal performance and less CPU usage, which is particularly a pain point for large-file workloads.
  • The piece-file naming scheme is immune to renames and hard links.
  • When geo-replicating a large file to a remote volume, only the shards that changed can be synced to the slave, considerably reducing the sync time.
  • When sharding is used in conjunction with tiering, only the shards that change would be promoted/demoted. This reduces the amount of data that needs to be migrated between the hot and cold tiers.
  • When sharding is used in conjunction with the bit-rot detection feature of GlusterFS, the checksum is computed on the smaller shards as opposed to one large file.
