<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Hi.<br>

    <br>

    I've been trying to find out what's going on for several days now,

    but can't find anything myself, so I'm asking the GlusterFS experts

    for some help ;-)<br>

    <br>

    I'm running 3 replicated gluster volumes between 2 nodes (each node

    hosting 3 bricks: one per volume). Components involved:<br>

    <br>

    - CentOS 7.0 x86_64 / 3.10.0-123.20.1<br>

    - GlusterFS 3.5.3<br>

    <br>

    (yes, I should upgrade, I know).<br>

    <br>

    This is used to host qemu-kvm VMs (1 GlusterFS volume for VM images,

    1 for libvirt locks, 1 for VM states, e.g. a VM saved with virsh save vm1

    can be restored on the other node). The VMs are hosted on the GlusterFS

    servers themselves (each node fuse-mounts the storage volume on

    /var/lib/libvirt/images), so both nodes are GlusterFS server and

    client at the same time. VMs run only on the first node (but can be live

    migrated to the second one in case of a problem).<br>

    <br>

    The 3 volumes (vmstore, save and locks) have the same configuration:<br>

    <br>

    [root@master1 ~]# gluster vol info vmstore<br>

     <br>

    Volume Name: vmstore<br>

    Type: Replicate<br>

    Volume ID: 7ed967f1-3b33-46d7-8908-0bb78c6e9199<br>

    Status: Started<br>

    Number of Bricks: 1 x 2 = 2<br>

    Transport-type: tcp<br>

    Bricks:<br>

    Brick1: master1:/mnt/bricks/vmstore<br>

    Brick2: master2:/mnt/bricks/vmstore<br>

    Options Reconfigured:<br>

    diagnostics.client-log-level: DEBUG<br>

    diagnostics.brick-log-level: INFO<br>

    cluster.eager-lock: on<br>

    network.frame-timeout: 300<br>

    network.ping-timeout: 20<br>

    nfs.disable: on<br>

    <br>

    <br>

    This setup worked well for more than a year, but had a big failure 3

    months ago: all my VMs had a kernel panic because they couldn't

    access their storage anymore. Looking at my logs, I saw that the gluster

    fuse client lost its connection to both bricks because they had not

    responded for more than 5 sec (which was the network.ping-timeout at

    that time). I don't really understand how this could happen, as the

    network was OK, and anyway, one of the bricks is running on

    127.0.0.1, so it's definitely not a network issue. I've increased

    network.ping-timeout to 20 sec, which allowed all my VMs to be

    started again without the connection to the bricks being lost.<br>
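    <br>

    For reference, a minimal sketch of how this timeout change can be applied

    to the three volumes. The GLUSTER variable and the set_ping_timeout helper

    are names made up for this example, and by default the script only prints

    the commands (dry run) instead of running them:<br>

<pre>
```shell
#!/bin/sh
# Dry run by default: GLUSTER is a hypothetical override variable used only
# in this sketch. Export GLUSTER=gluster to really apply the change.
GLUSTER="${GLUSTER:-echo gluster}"

# set_ping_timeout VOLUME SECONDS (helper name invented for this example)
set_ping_timeout() {
    $GLUSTER volume set "$1" network.ping-timeout "$2"
}

# The three volumes described above share the same configuration
for vol in vmstore save locks; do
    set_ping_timeout "$vol" 20
done
```
</pre>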

    <br>

    Now things are working, but since that day I have had random IO

    hangs from time to time. When the problem occurs, all IO in all

    the VMs hangs and the load on the hypervisor (which is also the

    GlusterFS client and hosts one of the bricks) goes crazy (I've seen up to

    ~120). The load goes so high that I can't do anything on the hypervisor:

    I lose my SSH access, which doesn't respond anymore. The problem

    lasts for 5 or 10 minutes, then everything starts working again (some

    VMs don't like being stuck for that long and need to be restarted).<br>

    <br>

    The problem is very random: it can happen every 2 days, or everything

    can work without a single issue for more than 3 weeks. It

    doesn't depend on the load, nor on the access pattern.<br>

    <br>

    I suspect something in Gluster to be the culprit, but I can't find

    anything. I've enabled DEBUG logging on the client (but not on the

    brick, as it's just too verbose), and will see if I can get more info

    next time the issue happens.<br>

    <br>

    At first I noticed that the problem always happened when I ran a

    monitoring script (which runs several gluster commands and

    parses their output to check the status of the different volumes; the script

    is available here [1] if anyone is interested), but I've now completely

    disabled monitoring, and I still have this random issue.<br>

    <br>

    A strange thing I've noticed is that the main volume (the one

    storing the VM images) continuously shows files being healed if I

    look at:<br>

    <br>

    gluster vol heal vmstore info healed<br>

    <br>

    Every 10 (exactly 10) minutes I see a few VM images being healed,

    but nothing in the client logs or in the system load indicates that a

    heal is taking place.<br>
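    <br>

    To correlate these heals with the hangs, something like the following

    could record the load and the healed list with a timestamp. This is only a

    sketch: the snapshot helper and the GLUSTER override are names invented for

    this example, and by default it only prints the gluster command (dry run)

    instead of running it:<br>

<pre>
```shell
#!/bin/sh
# Dry run by default: GLUSTER is a hypothetical override variable used only
# in this sketch. Export GLUSTER=gluster to query the real volume.
GLUSTER="${GLUSTER:-echo gluster}"

# snapshot VOLUME: print a timestamp, the load average (when /proc/loadavg
# is readable) and the list of healed files for the volume.
snapshot() {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') volume=$1"
    if [ -r /proc/loadavg ]; then
        echo "loadavg: $(cat /proc/loadavg)"
    fi
    $GLUSTER volume heal "$1" info healed
}

snapshot vmstore
```
</pre>

    Run every minute (from cron, for instance) with the output appended to a

    log file, this would show whether the 10-minute heal entries line up with

    the load spikes.<br>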

    <br>

    I'm lost and don't know where to look. I'd really appreciate some

    help :-)<br>

    <br>

    (we're ready to hire a GlusterFS expert to help us sort this out

    if necessary; this is a critical installation for us)<br>

    <br>

    [1]:

<a class="moz-txt-link-freetext" href="https://gitweb.firewall-services.com/?p=zabbix-agent-addons;a=blob_plain;f=zabbix_scripts/check_gluster_sudo;hb=HEAD">https://gitweb.firewall-services.com/?p=zabbix-agent-addons;a=blob_plain;f=zabbix_scripts/check_gluster_sudo;hb=HEAD</a><br>

    <div class="moz-signature">-- <br>


      <table>

        <tbody>

          <tr>

            <td>

              <p> <img

                  src="cid:part1.00040904.05070302@firewall-services.com"

                  alt="Logo FWS" width="275"> </p>

            </td>

            <td> <font face="Verdana, Geneva, sans-serif" size="2"> <strong>Daniel

                  Berteaud</strong><br>

                <br>

                FIREWALL-SERVICES SAS.<br>

                Société de Services en Logiciels Libres<br>

                Tel : 05 56 64 15 32<br>

                Visio : <a class="moz-txt-link-freetext" href="http://vroom.im/dani">http://vroom.im/dani</a><br>

                <em><a class="moz-txt-link-abbreviated" href="http://www.firewall-services.com">www.firewall-services.com</a></em> </font> </td>

          </tr>

        </tbody>

      </table>

    </div>

  </body>

</html>