<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi.<br>
<br>
I've been trying to find out what's going on for several days now,
but can't find anything myself, so I'm asking the GlusterFS experts
for some help ;-)<br>
<br>
I'm running 3 replicated gluster volumes between 2 nodes (each node
hosting 3 bricks: one per volume). Components involved:<br>
<br>
- CentOS 7.0 x86_64 / 3.10.0-123.20.1<br>
- GlusterFS 3.5.3<br>
<br>
(yes, I should upgrade, I know).<br>
<br>
This is used to host qemu-kvm VMs (1 GlusterFS volume for the VM
images, 1 for the libvirt locks, and 1 for the VM states, so that
e.g. a guest saved with virsh save vm1 can be restored on the other
node). The VMs are hosted on the GlusterFS servers themselves: each
node fuse-mounts the storage volume on /var/lib/libvirt/images, so
both nodes act as GlusterFS server and client at the same time. VMs
run only on the first node (but can be live-migrated to the second
one in case of problems).<br>
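<br>
Each node mounts the volume through fuse with an fstab entry roughly
like this (quoting from memory, the exact mount options may differ,
and here I'm pointing at the local node as the mount server):<br>
<br>
localhost:/vmstore /var/lib/libvirt/images glusterfs defaults,_netdev 0 0<br>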
<br>
The 3 volumes (vmstore, save and locks) have the same configuration:<br>
<br>
[root@master1 ~]# gluster vol info vmstore<br>
<br>
Volume Name: vmstore<br>
Type: Replicate<br>
Volume ID: 7ed967f1-3b33-46d7-8908-0bb78c6e9199<br>
Status: Started<br>
Number of Bricks: 1 x 2 = 2<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: master1:/mnt/bricks/vmstore<br>
Brick2: master2:/mnt/bricks/vmstore<br>
Options Reconfigured:<br>
diagnostics.client-log-level: DEBUG<br>
diagnostics.brick-log-level: INFO<br>
cluster.eager-lock: on<br>
network.frame-timeout: 300<br>
network.ping-timeout: 20<br>
nfs.disable: on<br>
<br>
<br>
This setup worked well for more than a year, but had a big failure 3
months ago: all my VMs had kernel panics because they couldn't
access their storage anymore. Looking at my logs, I saw that the
gluster fuse client had lost its connection to both bricks because
they hadn't responded for more than 5 seconds (which was the
network.ping-timeout at the time). I don't really understand how
this could happen, as the network was fine; besides, one of the
bricks is reached through 127.0.0.1, so it's definitely not a
network issue. I've increased network.ping-timeout to 20 seconds,
which allowed all my VMs to be started again without the brick
connections being dropped.<br>
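<br>
For the record, the timeout was raised with the standard volume set
command, applied to each of the 3 volumes:<br>
<br>
gluster volume set vmstore network.ping-timeout 20<br>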
<br>
Now things are working, but since that day I've had random IO hangs
from time to time. When the problem occurs, all IO in all the VMs
hangs, and the load on the hypervisor (which is also the GlusterFS
client and one of the bricks) goes crazy (I've seen up to ~120). The
load goes so high that I can't do anything on the hypervisor; even
my SSH session stops responding. The problem lasts for 5 or 10
minutes, then everything starts working again (some VMs don't like
being stuck that long and need to be restarted).<br>
<br>
The problem is very random: it can happen every 2 days, or
everything can work without a single issue for more than 3 weeks. It
doesn't seem to depend on the load, nor on the access pattern.<br>
<br>
I suspect something in Gluster is the culprit, but I can't find
anything. I've enabled DEBUG logging on the client (but not on the
bricks, as it's just too verbose), and will see if I can get more
info next time the issue happens.<br>
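<br>
The log levels were changed with the usual volume set commands (same
idea on the other volumes):<br>
<br>
gluster volume set vmstore diagnostics.client-log-level DEBUG<br>
gluster volume set vmstore diagnostics.brick-log-level INFO<br>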
<br>
I first noticed that the problem always seemed to happen when I
executed a monitoring script (which runs several gluster commands
and parses their output to check the status of the different
volumes; the script is available here [1] if anyone is interested),
but I've now completely disabled monitoring, and I still have this
random issue.<br>
<br>
A strange thing I've noticed is that the main volume (the one
storing the VM images) continuously shows files being healed when I
look at:<br>
<br>
gluster vol heal vmstore info healed<br>
<br>
Every 10 minutes (exactly 10), I see a few VM images listed as
healed. But nothing in the client logs nor in the system load
indicates a heal actually taking place.<br>
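<br>
(In case it matters, I'm checking this with a trivial loop roughly
like the following, nothing fancy:)<br>
<br>
while sleep 60; do date; gluster vol heal vmstore info healed; done<br>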
<br>
I'm lost and don't know where to look; I'd really appreciate some
help :-)<br>
<br>
(we're ready to hire a GlusterFS expert to help us sort this out if
necessary; this is a critical installation for us)<br>
<br>
[1]:
<a class="moz-txt-link-freetext" href="https://gitweb.firewall-services.com/?p=zabbix-agent-addons;a=blob_plain;f=zabbix_scripts/check_gluster_sudo;hb=HEAD">https://gitweb.firewall-services.com/?p=zabbix-agent-addons;a=blob_plain;f=zabbix_scripts/check_gluster_sudo;hb=HEAD</a><br>
<div class="moz-signature">-- <br>
<table>
<tbody>
<tr>
<td>
<p> <img
src="cid:part1.00040904.05070302@firewall-services.com"
alt="Logo FWS" width="275"> </p>
</td>
<td> <font face="Verdana, Geneva, sans-serif" size="2"> <strong>Daniel
Berteaud</strong><br>
<br>
FIREWALL-SERVICES SAS.<br>
Société de Services en Logiciels Libres<br>
Tel : 05 56 64 15 32<br>
Visio : <a class="moz-txt-link-freetext" href="http://vroom.im/dani">http://vroom.im/dani</a><br>
<em><a class="moz-txt-link-abbreviated" href="http://www.firewall-services.com">www.firewall-services.com</a></em> </font> </td>
</tr>
</tbody>
</table>
</div>
</body>
</html>