<div dir="ltr">Aren't we talking about this patch?<div><a href="https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/gluster-backupserver.patch;h=ad241ee1154ebbd536d7c2c7987d86a02255aba2;hb=HEAD">https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/gluster-backupserver.patch;h=ad241ee1154ebbd536d7c2c7987d86a02255aba2;hb=HEAD</a><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2015-10-26 22:56 GMT+02:00 Niels de Vos <span dir="ltr"><<a href="mailto:ndevos@redhat.com" target="_blank">ndevos@redhat.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Thu, Oct 22, 2015 at 08:45:04PM +0200, André Bauer wrote:<br>
> Hi,<br>
><br>
> I have a 4-node GlusterFS 3.5.6 cluster.<br>
><br>
> My VM images are in a distributed replicated volume, which is accessed<br>
> from kvm/qemu via libgfapi.<br>
><br>
> Mount is against storage.domain.local, which has the IPs of all 4<br>
> Gluster nodes set in DNS.<br>
><br>
> When one of the Gluster nodes goes down (accidental reboot), a lot of<br>
> the VMs get a read-only filesystem, even after the node comes back up.<br>
><br>
> How can I prevent this?<br>
> I expect the VM to just use the replica on the other node, without the<br>
> filesystem going read-only.<br>
><br>
> Any hints?<br>
<br>
</span>There are at least two timeouts that are involved in this problem:<br>
<br>
1. The filesystem in a VM can go read-only when the virtual disk where<br>
the filesystem is located does not respond for a while.<br>
<br>
2. When a storage server that holds a replica of the virtual disk<br>
becomes unreachable, the Gluster client (qemu+libgfapi) waits for<br>
max. network.ping-timeout seconds before it resumes I/O.<br>
<br>
Once a filesystem in a VM goes read-only, you might be able to fsck and<br>
re-mount it read-write again. This is not something the VM will do by<br>
itself.<br>
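Recovering such a guest by hand could look roughly like this (a sketch;
/dev/vda1 and the mount point are placeholders for your setup, and an
fsck of a mounted filesystem normally requires a rescue environment or
reboot):<br>
<br>
```shell
# Inside the affected VM, once the storage is reachable again,
# try to remount the filesystem read-write in place:
mount -o remount,rw /

# If the remount fails because of recorded journal errors, run
# fsck from a rescue environment instead (device is a placeholder):
# fsck.ext4 -f /dev/vda1
```
<br>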
<br>
<br>
The timeouts for (1) are set in sysfs:<br>
<br>
$ cat /sys/block/sda/device/timeout<br>
30<br>
<br>
30 seconds is the default for SCSI (sd) devices, and for testing you<br>
can change it with an echo:<br>
<br>
# echo 300 > /sys/block/sda/device/timeout<br>
<br>
This is not a persistent change; to apply it at bootup you can create a<br>
udev rule.<br>
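Such a udev rule might look like this (a sketch; the file name and the
300-second value are just examples, matching the echo above):<br>
<br>
```shell
# /etc/udev/rules.d/99-disk-timeout.rules  (example file name)
# Raise the SCSI command timeout for all sd devices when they appear:
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
    ATTR{device/timeout}="300"
```
<br>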
<br>
Some filesystems offer a mount option that changes the behaviour after<br>
a disk error is detected. "man mount" shows the "errors" option for<br>
ext*. Changing this to "continue" is not recommended; "remount-ro" or<br>
"panic" are the safest choices for your data.<br>
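For an ext4 filesystem, for example, the error behaviour can be set at
mount time or persistently in /etc/fstab (a sketch; the device name is a
placeholder):<br>
<br>
```shell
# Remount so that ext4 goes read-only on the first detected error,
# instead of silently continuing:
mount -o remount,errors=remount-ro /

# Or persistently, as an /etc/fstab entry (device is a placeholder):
# /dev/vda1  /  ext4  defaults,errors=remount-ro  0  1
```
<br>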
<br>
<br>
The timeout mentioned in (2) is for the Gluster volume, and checked by<br>
the client. When a client writes to a replicated volume, the write<br>
needs to be acknowledged by both/all replicas. The client (libgfapi)<br>
delays the reply to the application (qemu) until both/all replies from<br>
the replicas have been received. This delay is bounded by the volume<br>
option network.ping-timeout (42 seconds by default).<br>
<br>
<br>
Now, if the VM reports block errors after 30 seconds, but the client<br>
waits up to 42 seconds for recovery, there is an issue. So, your<br>
solution could be to increase the disk error-detection timeout inside<br>
the VMs, and/or decrease network.ping-timeout.<br>
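Lowering the ping timeout is a one-line volume option change (a sketch;
"myvol" is a placeholder volume name, and 20 seconds is just an example
value below the 30-second default disk timeout):<br>
<br>
```shell
# Lower the timeout so the Gluster client gives up on an unreachable
# brick before the guest's disk timeout (30s by default) expires:
gluster volume set myvol network.ping-timeout 20

# "gluster volume info myvol" then lists the reconfigured option.
```
<br>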
<br>
It would be interesting to know if adapting these values prevents the<br>
read-only occurrences in your environment. If you do any testing with<br>
this, please keep me informed about the results.<br>
<span class="HOEnZb"><font color="#888888"><br>
Niels<br>
</font></span><br>_______________________________________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
<a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a><br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Best regards,<br>Roman.</div>
</div>