Could you attach the glusterfs client and shd logs?

-Krutika

On Mon, May 2, 2016 at 2:35 PM, Kevin Lemonnier <lemonnierk@ulrar.net> wrote:

Hi,

So after some testing, 3.7.11 is a lot better, but I do still have some problems.
When I reboot a server there sometimes seems to be some strange behaviour, but I need to test
that more thoroughly.
Removing a server from the network, waiting for a while, then adding it back and letting it heal
works perfectly: completely invisible for the user, and that's perfect!

However, when I add a brick, changing the replica count from 2 to 3, a heal starts
and some VMs switch to read-only. I have to power them off and then on again to fix it.
Clearly that's better than 3.7.6, which froze the VMs until the heal was complete,
but I would still like to understand why some of the VMs are switching to read-only.
It seems to happen every time I add a brick to increase the replica count; the add-brick
command I'm running is sketched below. I would like to test adding a whole replica set
at once, but I just don't have the hardware for that.

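Specifically, something along these lines (a sketch from memory, reusing the volume name
and brick path from the volume info quoted further down, so the exact invocation may differ):

    gluster volume add-brick gluster replica 3 ipvr50.client_name:/mnt/storage/gluster
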
Rebooting a node seems to make some VMs go read-only too, but I need to test
that more thoroughly. For some reason, rebooting a brick or adding a brick causes
I/O errors on some VM disks and not others, and I have to power those VMs off and then on to fix it.
I can't just reboot them; I guess the file actually has to be re-opened to trigger a heal?

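(For what it's worth, this is how I've been checking heal status and kicking heals off by
hand; standard gluster CLI commands as far as I know:

    gluster volume heal gluster info
    gluster volume heal gluster

The first lists the entries still pending heal, the second triggers a heal of everything
in the index.)
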
Any idea how to prevent that? It's a lot better than 3.7.6 because it can be fixed in a minute,
but it's still not great to have to explain to the clients.

Thanks

On Mon, Apr 25, 2016 at 02:01:09PM +0200, Kevin Lemonnier wrote:
> Hi,
>
> So I'm trying that now.
> I installed 3.7.11 on two nodes and put a few VMs on it, same config
> as before but with 64MB shards and the heal algorithm set to full. As expected,
> if I power off one of the nodes, everything is dead, which is fine.
>
> Now I'm adding a third node. A big heal of everything (7000+ shards) was started
> after the add-brick, and for now everything seems to be working
> fine on the VMs. Last time I tried adding a brick, all those VMs died for
> the duration of the heal, so that's already pretty good.
>
> I'm going to let it finish copying everything to the new node, then I'll try
> to simulate nodes going down to see if my original problem of freezing and
> slow heal times is solved with this config.
> For reference, here is the volume info, in case someone sees something I should change:
>
> Volume Name: gluster
> Type: Replicate
> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: ipvr2.client_name:/mnt/storage/gluster
> Brick2: ipvr3.client_name:/mnt/storage/gluster
> Brick3: ipvr50.client_name:/mnt/storage/gluster
> Options Reconfigured:
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> features.shard: on
> features.shard-block-size: 64MB
> cluster.data-self-heal-algorithm: full
> performance.readdir-ahead: on
>
>
> The numbering starts at 2 and jumps to 50 because the first server is doing something else
> for now, and I'm using 50 as the temporary third node. If everything goes well, I'll migrate
> the production onto the cluster, re-install the first server and do a replace-brick, which I
> hope will work just as well as the add-brick I'm doing now. The last replace-brick also
> brought everything down, but I guess that was the joy of 3.7.6 :).
>
> Thanks!
>
>
> On Mon, Apr 18, 2016 at 08:17:05PM +0530, Krutika Dhananjay wrote:
> > On Mon, Apr 18, 2016 at 8:02 PM, Kevin Lemonnier <lemonnierk@ulrar.net>
> > wrote:
> >
> > > I will try migrating to 3.7.10, is it considered stable yet?
> > >
> >
> > Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
> >
> >
> > > Should I change the self-heal algorithm even if I move to 3.7.10, or is
> > > that not necessary?
> > > Not sure what that change might do.
> > >
> >
> > So the other algorithm is 'diff', which computes a rolling checksum on chunks
> > of the src(es) and sink(s), compares them, and heals upon mismatch. This is
> > known to consume a lot of CPU. The 'full' algo, on the other hand, simply copies
> > the src into the sink in chunks. With sharding, it shouldn't be all that bad
> > copying a 256MB file (in your case) from src to sink. We've used double the
> > block size and had no issues reported.
> >
> > So you could change the self-heal algo to full even in the upgraded cluster.
> >
> > -Krutika
> >
> >
> > >
> > > Anyway, I'll try to create a 3.7.10 cluster over the weekend and slowly move
> > > the VMs onto it then.
> > > Thanks a lot for your help,
> > >
> > > Regards
> > >
> > >
> > > On Mon, Apr 18, 2016 at 07:58:44PM +0530, Krutika Dhananjay wrote:
> > > > Hi,
> > > >
> > > > Yeah, so the fuse mount log didn't convey much information.
> > > >
> > > > So one of the reasons heal may have taken so long (and also consumed
> > > > resources) is a bug in self-heal where it would heal from
> > > > both source bricks in 3-way replication. With such a bug, heal would take
> > > > twice the amount of time and consume twice the resources.
> > > >
> > > > This issue is fixed at http://review.gluster.org/#/c/14008/ and will be
> > > > available in 3.7.12.
> > > >
> > > > The other thing you could do is set cluster.data-self-heal-algorithm to
> > > > 'full', for better heal performance and more regulated resource consumption:
> > > > # gluster volume set <VOL> cluster.data-self-heal-algorithm full
> > > >
> > > > As far as sharding is concerned, some critical caching issues were fixed in
> > > > 3.7.7 and 3.7.8.
> > > > And my guess is that the VM crash/unbootable state could be because of this
> > > > issue, which exists in 3.7.6.
> > > >
> > > > 3.7.10 saw the introduction of throttled client-side heals, which also moves
> > > > such heals to the background; this is all the more helpful for preventing
> > > > starvation of VMs during client heal.
> > > >
> > > > Considering these factors, I think it would be better if you upgraded your
> > > > machines to 3.7.10.
> > > >
> > > > Do let me know if migrating to 3.7.10 solves your issues.
> > > >
> > > > -Krutika
> > > >
> > > > On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier <lemonnierk@ulrar.net>
> > > > wrote:
> > > >
> > > > > Yes, but as I was saying, I don't believe KVM is using a mount point; I
> > > > > think it uses the API
> > > > > (http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt).
> > > > > I might be mistaken of course. Proxmox does have a mount point for
> > > > > convenience; I'll attach those logs, hoping they contain the information
> > > > > you need. They do seem to contain a lot of errors for the 15th.
> > > > > For reference, there was a disconnect of the first brick (10.10.0.1) in
> > > > > the morning and then a successful heal that caused about 40 minutes of
> > > > > downtime for the VMs. Right after that heal finished (around noon, if my
> > > > > memory is correct) the second brick (10.10.0.2) rebooted, and that's the
> > > > > one I disconnected to prevent the heal from causing another downtime.
> > > > > I reconnected it once at the end of the afternoon, hoping the heal would
> > > > > go well, but everything went down like in the morning, so I disconnected
> > > > > it again and waited until 11pm (23:00) to reconnect it and let it finish.
> > > > >
> > > > > Thanks for your help,
> > > > >
> > > > >
> > > > > On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika Dhananjay wrote:
> > > > > > Sorry, I was referring to the glusterfs client logs.
> > > > > >
> > > > > > Assuming you are using a FUSE mount, your log file will be in
> > > > > > /var/log/glusterfs/<hyphenated-mount-point-path>.log
> > > > > >
> > > > > > -Krutika
> > > > > >
> > > > > > On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier <lemonnierk@ulrar.net>
> > > > > > wrote:
> > > > > >
> > > > > > > I believe Proxmox is just an interface to KVM that uses the lib, so if
> > > > > > > I'm not mistaken there aren't client logs?
> > > > > > >
> > > > > > > It's not the first time I've had the issue; it happens on every heal on
> > > > > > > the 2 clusters I have.
> > > > > > >
> > > > > > > I did let the heal finish that night and the VMs are working now, but it
> > > > > > > is pretty scary for future crashes or brick replacement.
> > > > > > > Should I maybe lower the shard size? It won't solve the fact that 2
> > > > > > > bricks out of 3 aren't keeping the filesystem usable, but it might make
> > > > > > > the healing quicker, right?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On 17 April 2016 at 17:56:37 GMT+02:00, Krutika Dhananjay
> > > > > > > <kdhananj@redhat.com> wrote:
> > > > > > > >Could you share the client logs and information about the approximate
> > > > > > > >time/day when you saw this issue?
> > > > > > > >
> > > > > > > >-Krutika
> > > > > > > >
> > > > > > > >On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier <lemonnierk@ulrar.net>
> > > > > > > >wrote:
> > > > > > > >
> > > > > > > >> Hi,
> > > > > > > >>
> > > > > > > >> We have a small GlusterFS 3.7.6 cluster with 3 nodes running Proxmox
> > > > > > > >> VMs on it. I did set the different recommended options, like the virt
> > > > > > > >> group, but by hand since it's on Debian. The shards are 256MB, if
> > > > > > > >> that matters.
> > > > > > > >>
> > > > > > > >> This morning the second node crashed, and as it came back up it
> > > > > > > >> started a heal, but that basically froze all the VMs running on that
> > > > > > > >> volume. Since we really, really can't have 40 minutes of downtime in
> > > > > > > >> the middle of the day, I just removed the node from the network and
> > > > > > > >> that stopped the heal, allowing the VMs to access their disks again.
> > > > > > > >> The plan was to re-connect the node in a couple of hours to let it
> > > > > > > >> heal at night.
> > > > > > > >> But a VM just crashed, and it can't boot up again: it seems to freeze
> > > > > > > >> trying to access the disks.
> > > > > > > >>
> > > > > > > >> Looking at the heal info for the volume, it has gone way up since
> > > > > > > >> this morning; it looks like the VMs aren't writing to both nodes,
> > > > > > > >> just the one they are on.
> > > > > > > >> It seems pretty bad: we have 2 nodes out of 3 up, so I would expect
> > > > > > > >> the volume to work just fine since it has quorum. What am I missing?
> > > > > > > >>
> > > > > > > >> It is still too early to start the heal, so is there a way to start
> > > > > > > >> the VM anyway right now? I mean, it was running a moment ago, so the
> > > > > > > >> data is there; the VM just needs to be able to access it.
> > > > > > > >>
> > > > > > > >> Volume Name: vm-storage
> > > > > > > >> Type: Replicate
> > > > > > > >> Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
> > > > > > > >> Status: Started
> > > > > > > >> Number of Bricks: 1 x 3 = 3
> > > > > > > >> Transport-type: tcp
> > > > > > > >> Bricks:
> > > > > > > >> Brick1: first_node:/mnt/vg1-storage
> > > > > > > >> Brick2: second_node:/mnt/vg1-storage
> > > > > > > >> Brick3: third_node:/mnt/vg1-storage
> > > > > > > >> Options Reconfigured:
> > > > > > > >> cluster.quorum-type: auto
> > > > > > > >> cluster.server-quorum-type: server
> > > > > > > >> network.remote-dio: enable
> > > > > > > >> cluster.eager-lock: enable
> > > > > > > >> performance.readdir-ahead: on
> > > > > > > >> performance.quick-read: off
> > > > > > > >> performance.read-ahead: off
> > > > > > > >> performance.io-cache: off
> > > > > > > >> performance.stat-prefetch: off
> > > > > > > >> features.shard: on
> > > > > > > >> features.shard-block-size: 256MB
> > > > > > > >> cluster.server-quorum-ratio: 51%
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thanks for your help
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Kevin Lemonnier
> > > > > > > >> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > > > > >>
> > > > > > >
> > > > > > > --
> > > > > > > Sent from my Android device with K-9 Mail. Please excuse my brevity.
> > > > >
> > > > > --
> > > > > Kevin Lemonnier
> > > > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > >
> > >
> > > --
> > > Kevin Lemonnier
> > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > >
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111

--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users