<p dir="ltr">Dang. I always think I get all the detail and inevitably leave out something important. :-/</p>
<p dir="ltr">I'm mobile and don't have the exact version in front of me, but this is a recent, if not the latest, RHGS on RHEL 7.2.<br>
</p>
<div class="gmail_extra"><br><div class="gmail_quote">On Oct 18, 2016 7:04 PM, "Dan Lambright" <<a href="mailto:dlambrig@redhat.com">dlambrig@redhat.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dustin,<br>
<br>
What code level? I often run smallfile on upstream code with tiered volumes and have not seen this.<br>
<br>
Sure, one of us will get back to you.<br>
<br>
Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), and it overwhelms the boost in transfer speed you get for small files. A presentation at the Berlin gluster summit evaluated this. The expectation is that md-cache will go a long way toward helping with that, before too long.<br>
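For reference, the usual md-cache knobs look something like the following. The option names are from upstream gluster documentation; the volume name is taken from Dustin's setup below, and the values are purely illustrative, not recommendations:<br>

```shell
# Illustrative md-cache tuning to reduce LOOKUP traffic on a volume.
# Option names per upstream gluster docs; values are examples only.
gluster volume set 1nvme-distrep3x2 features.cache-invalidation on
gluster volume set 1nvme-distrep3x2 features.cache-invalidation-timeout 600
gluster volume set 1nvme-distrep3x2 performance.stat-prefetch on
gluster volume set 1nvme-distrep3x2 performance.md-cache-timeout 600
gluster volume set 1nvme-distrep3x2 network.inode-lru-limit 50000
```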
<br>
Dan<br>
<br>
<br>
<br>
----- Original Message -----<br>
> From: "Dustin Black" <<a href="mailto:dblack@redhat.com">dblack@redhat.com</a>><br>
> To: <a href="mailto:gluster-devel@gluster.org">gluster-devel@gluster.org</a><br>
> Cc: "Annette Clewett" <<a href="mailto:aclewett@redhat.com">aclewett@redhat.com</a>><br>
> Sent: Tuesday, October 18, 2016 4:30:04 PM<br>
> Subject: [Gluster-devel] Possible race condition bug with tiered volume<br>
><br>
> I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6 drives.<br>
><br>
> # gluster vol info 1nvme-distrep3x2<br>
> Volume Name: 1nvme-distrep3x2<br>
> Type: Tier<br>
> Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607<br>
> Status: Started<br>
> Number of Bricks: 12<br>
> Transport-type: tcp<br>
> Hot Tier :<br>
> Hot Tier Type : Distributed-Replicate<br>
> Number of Bricks: 3 x 2 = 6<br>
> Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot<br>
> Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot<br>
> Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot<br>
> Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot<br>
> Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot<br>
> Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot<br>
> Cold Tier:<br>
> Cold Tier Type : Distributed-Replicate<br>
> Number of Bricks: 3 x 2 = 6<br>
> Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2<br>
> Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2<br>
> Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2<br>
> Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2<br>
> Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2<br>
> Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2<br>
> Options Reconfigured:<br>
> cluster.tier-mode: cache<br>
> features.ctr-enabled: on<br>
> performance.readdir-ahead: on<br>
><br>
><br>
> I am attempting to run the 'smallfile' benchmark tool on this volume. The<br>
> 'smallfile' tool creates a starting gate directory and files in a shared<br>
> filesystem location. The first run (write) works as expected.<br>
><br>
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top<br>
> /rhgs/client/1nvme-distrep3x2 --host-set<br>
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y<br>
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create<br>
><br>
> For the second run (read), I believe that smallfile first attempts to 'rm<br>
> -rf' the "network-sync-dir" path, which fails with ENOTEMPTY and aborts<br>
> the run.<br>
><br>
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top<br>
> /rhgs/client/1nvme-distrep3x2 --host-set<br>
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y<br>
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create<br>
> ...<br>
> Traceback (most recent call last):<br>
>   File "/root/bin/smallfile_cli.py", line 280, in &lt;module&gt;<br>
>     run_workload()<br>
>   File "/root/bin/smallfile_cli.py", line 270, in run_workload<br>
>     return run_multi_host_workload(params)<br>
>   File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload<br>
>     sync_files.create_top_dirs(master_invoke, True)<br>
>   File "/root/bin/sync_files.py", line 27, in create_top_dirs<br>
>     shutil.rmtree(master_invoke.network_dir)<br>
>   File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree<br>
>     onerror(os.rmdir, path, sys.exc_info())<br>
>   File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree<br>
>     os.rmdir(path)<br>
> OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'<br>
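For what it's worth, if the stray entries clear up on their own, a client-side retry around the rmtree works around the transient ENOTEMPTY. A hypothetical sketch only — `rmtree_retry` is not part of smallfile:<br>

```python
import errno
import os
import shutil
import tempfile
import time

def rmtree_retry(path, attempts=5, delay=1.0):
    """Retry shutil.rmtree: the tiered volume may briefly report a
    directory as non-empty even though the client lists no entries."""
    for i in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError as e:
            # Re-raise anything other than ENOTEMPTY, or a final failure.
            if e.errno != errno.ENOTEMPTY or i == attempts - 1:
                raise
            time.sleep(delay)

# Usage sketch on a local scratch directory:
d = tempfile.mkdtemp()
os.mkdir(os.path.join(d, "sub"))
rmtree_retry(d)
assert not os.path.exists(d)
```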
><br>
><br>
> From the client perspective, the directory is clearly empty.<br>
><br>
> # ls -a /rhgs/client/1nvme-distrep3x2/smf1/<br>
> . ..<br>
><br>
><br>
> And a quick search on the bricks shows that the hot tier on the last replica<br>
> pair is the offender.<br>
><br>
> # for i in {0..5}; do ssh n$i "hostname; ls<br>
> /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls<br>
> /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done<br>
> rhosd0<br>
> 0<br>
> 0<br>
> rhosd1<br>
> 0<br>
> 0<br>
> rhosd2<br>
> 0<br>
> 0<br>
> rhosd3<br>
> 0<br>
> 0<br>
> rhosd4<br>
> 0<br>
> 1<br>
> rhosd5<br>
> 0<br>
> 1<br>
><br>
><br>
> (For the record, multiple runs of this reproducer show that it is<br>
> consistently the hot tier that is to blame, but it is not always the same<br>
> replica pair.)<br>
><br>
><br>
> Can someone try recreating this scenario to see if the problem is consistent?<br>
> Please reach out if you need me to provide any further details.<br>
><br>
><br>
> Dustin Black, RHCA<br>
> Senior Architect, Software-Defined Storage<br>
> Red Hat, Inc.<br>
> (o) +1.212.510.4138 (m) +1.215.821.7423<br>
> <a href="mailto:dustin@redhat.com">dustin@redhat.com</a><br>
><br>
> _______________________________________________<br>
> Gluster-devel mailing list<br>
> <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
> <a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/<wbr>mailman/listinfo/gluster-devel</a><br>
</blockquote></div></div>