<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Shard Production Report – Week Two :)<br>
<br>
Hi Y’all, I meant to do a Week One report but had various dramas
with failing hard disks – bad timing really, but good for testing.<br>
<br>
First week things went pretty well – I had 12 VMs on 3 nodes running
off a replica 3 sharded (64MB shard size) Gluster volume. It coped
well – good performance; I rebooted nodes and/or killed gluster
processes and I/O continued without a hitch, no users noticed, and
heal times were more than satisfactory.<br>
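<br>
For reference, a volume like that is created along these lines – just
a sketch, with the volume name "vmstore", hostnames and brick paths
made up:<br>
<pre>
# replica 3 volume across the three nodes
gluster volume create vmstore replica 3 \
    node1:/tank/brick node2:/tank/brick node3:/tank/brick
# turn on sharding with a 64MB shard size *before* writing any data
gluster volume set vmstore features.shard on
gluster volume set vmstore features.shard-block-size 64MB
gluster volume start vmstore
</pre>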
<br>
There was a drama mid-week when one gluster brick froze. Bad feeling,
until I realised the Ceph OSDs on the same node had also frozen. It
was a hard disk failure that locked up the underlying ZFS pool
(eventually two disks in the same mirror, dammit). The system kept
going until the next day, when I could replace the disks and reboot
the node.<br>
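<br>
(If you hit something similar, ZFS's own health check is the quickest
way to confirm it's the pool that's wedged – assuming a pool named
"tank":)<br>
<pre>
# -x only prints pools with problems; a faulted mirror shows up here
zpool status -x
# full detail on the pool, including which disks have faulted
zpool status -v tank
</pre>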
<br>
On the weekend I downed the volume and ran md5sum across the
shards on all three nodes. Unfortunately I found 8 mismatching
shards, which was concerning. However:<br>
<br>
- All the mismatching shards were from the same VM image file, which
was running on the node with the disk failures.<br>
<br>
- Of the 3 copies of each shard, two always matched, meaning I could
easily select one for healing.<br>
<br>
- 7 of the mismatching shards were from the bad disk. I put this
down to failed ZFS recovery after the fact.<br>
<br>
- Repair was easy – I just deleted the mismatched shard copy and
issued a full heal (sketched below).<br>
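<br>
Roughly what the check and repair looked like – a sketch, with the
brick path and volume name as placeholders:<br>
<pre>
# on each node, checksum every shard on the brick (shards live under .shard)
find /tank/brick/.shard -type f -exec md5sum {} + | sort -k2 > /tmp/$(hostname).md5
# compare the lists between nodes to find the odd one out
diff /tmp/node1.md5 /tmp/node2.md5
# remove the bad copy from the offending brick (don't forget its
# .glusterfs gfid hardlink), then trigger a full heal
gluster volume heal vmstore full
</pre>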
<br>
<br>
Second week – I gradually migrated the remaining VMs to gluster and
retired the Ceph pools. So far no one has noticed :) Performance has
greatly improved and iowaits have greatly reduced; overall the VMs
seem much less vulnerable to peaks in I/O, with a smoother
experience all round. A rolling upgrade and reboot of the servers
went very smoothly; I had to wait about 15 min between boots for
heals to finish.<br>
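<br>
For anyone doing the same, heal progress is easy to poll between
reboots – wait until every brick reports zero entries (volume name
is a placeholder):<br>
<pre>
# lists files still pending heal on each brick
gluster volume heal vmstore info
</pre>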
<br>
Friday night I downed the volume again and re-checksummed all the
shards (2TB of data per brick) – everything matched down to the last
byte.<br>
<br>
It was instructive bringing it back up – just for laughs I started
all 30 Windows VMs simultaneously, and it actually coped. iowait
went through the roof (50% on one node) but they all started
without a hitch and were accessible within a few minutes. After
about an hour the cluster had settled down to under 5%.<br>
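<br>
(iowait like that is easy to watch live – e.g. with iostat from
sysstat:)<br>
<pre>
# CPU %iowait plus per-disk utilisation, refreshed every 5 seconds
iostat -x 5
</pre>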
<br>
I know in the overall scheme of things our setup is pretty small –
30+ VMs across 3 nodes – but it’s pretty important for our small
business. Very pleased with the outcome so far, and we will be
continuing with it. All the issues which bothered us when we first
looked at gluster (3.6) have been addressed.<br>
<br>
Cheers all, and a big thanks to the devs, testers and documentation writers.<br>
<br>
<br>
<pre class="moz-signature" cols="72">--
Lindsay Mathieson</pre>
</body>
</html>