<div dir="ltr">Hi!<div><br></div><div>It's VMs based on KVM/qemu managed by libvirtd. I figured I could see the heal status by comparing the bricks: nothing was replicated, but new files were (after a long delay of about 5 mins). So I wanted to see if existing files (VM images) will be healed if I would stop a VM (close any open handle on the file), which turned out not to be the case.</div><div><br></div><div>I ended up shutting down all VMs and restarting the server. Afterwards healing worked as expected....</div><div><br></div><div>- Andreas</div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Oct 5, 2015 at 1:01 PM, Anuradha Talur <span dir="ltr"><<a href="mailto:atalur@redhat.com" target="_blank">atalur@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
- Andreas

On Mon, Oct 5, 2015 at 1:01 PM, Anuradha Talur <atalur@redhat.com> wrote:
>
> ----- Original Message -----
> > From: "Andreas Mather" <andreas@allaboutapps.at>
> > To: "Anuradha Talur" <atalur@redhat.com>
> > Cc: "Gluster-users@gluster.org List" <gluster-users@gluster.org>
> > Sent: Thursday, September 24, 2015 6:59:38 PM
> > Subject: Re: [Gluster-users] gluster 3.7.3 - volume heal info hangs - unknown heal status
> >
> > Hi Anuradha!
> >
> > Thanks for your reply! Attached you can find the dump files. As I'm not sure they will make it through as attachments, here are links to them as well:
> >
> > brick1 - http://pastebin.com/3ivkhuRH
> > brick2 - http://pastebin.com/77sT1mut
> Hi,
>
> I see some blocked locks in the statedump.
> Could you let me know what kind of workload you had when you observed the hang?
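>
> (In case you want to dig in yourself in the meantime: blocked locks show up in the lock sections of those brick statedumps, so something along these lines should list them. The glob is just an assumption about where your dump files ended up:)
>
>     grep -B3 'BLOCKED' /var/run/gluster/*.dump.*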
<div class="HOEnZb"><div class="h5">><br>
> - Andreas<br>
><br>
><br>
><br>
><br>
> On Thu, Sep 24, 2015 at 3:18 PM, Anuradha Talur <<a href="mailto:atalur@redhat.com">atalur@redhat.com</a>> wrote:<br>
><br>
> ><br>
> ><br>
> > ----- Original Message -----<br>
> > > From: "Andreas Mather" <<a href="mailto:andreas@allaboutapps.at">andreas@allaboutapps.at</a>><br>
> > > To: "<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a> List" <<a href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a>><br>
> > > Sent: Thursday, September 24, 2015 1:24:12 PM<br>
> > > Subject: [Gluster-users] gluster 3.7.3 - volume heal info hangs -<br>
> > unknown heal status<br>
> > > >
> > > > Hi!
> > > >
> > > > Our provider had network maintenance last night, so 2 of our 4 servers got disconnected and reconnected. Since we knew this was coming, we shifted all workload off the affected servers. This morning, most of the cluster seems fine, but for one volume no heal info can be retrieved, so we basically don't know the healing state of that volume. The volume is a replica 2 volume between vhost4-int/brick1 and vhost3-int/brick2.
> > > >
> > > > The volume is accessible, but since I don't get any heal info, I don't know if it is properly replicated. Any help to resolve this situation is highly appreciated.
> > > >
> > > > hangs forever:
> > > > [root@vhost4 ~]# gluster volume heal vol4 info
> > > >
> > > > glfsheal-vol4.log:
> > > > [2015-09-24 07:47:59.284723] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> > > > [2015-09-24 07:47:59.293735] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
> > > > [2015-09-24 07:47:59.294061] I [MSGID: 104045] [glfs-master.c:95:notify] 0-gfapi: New graph 76686f73-7434-2e61-6c6c-61626f757461 (0) coming up
> > > > [2015-09-24 07:47:59.294081] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-1: parent translators are ready, attempting connect on transport
> > > > [2015-09-24 07:47:59.309470] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-2: parent translators are ready, attempting connect on transport
> > > > [2015-09-24 07:47:59.310525] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-1: changing port to 49155 (from 0)
> > > > [2015-09-24 07:47:59.315958] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > > [2015-09-24 07:47:59.316481] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-1: Connected to vol4-client-1, attached to remote volume '/storage/brick2/brick2'.
> > > > [2015-09-24 07:47:59.316495] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-1: Server and Client lk-version numbers are not same, reopening the fds
> > > > [2015-09-24 07:47:59.316538] I [MSGID: 108005] [afr-common.c:3960:afr_notify] 0-vol4-replicate-0: Subvolume 'vol4-client-1' came back up; going online.
> > > > [2015-09-24 07:47:59.317150] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-1: Server lk version = 1
> > > > [2015-09-24 07:47:59.320898] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-2: changing port to 49154 (from 0)
> > > > [2015-09-24 07:47:59.325633] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > > [2015-09-24 07:47:59.325780] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-2: Connected to vol4-client-2, attached to remote volume '/storage/brick1/brick1'.
> > > > [2015-09-24 07:47:59.325791] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-2: Server and Client lk-version numbers are not same, reopening the fds
> > > > [2015-09-24 07:47:59.333346] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-2: Server lk version = 1
> > > > [2015-09-24 07:47:59.334545] I [MSGID: 108031] [afr-common.c:1745:afr_local_discovery_cbk] 0-vol4-replicate-0: selecting local read_child vol4-client-2
> > > > [2015-09-24 07:47:59.335833] I [MSGID: 104041] [glfs-resolve.c:862:__glfs_active_subvol] 0-vol4: switched to graph 76686f73-7434-2e61-6c6c-61626f757461 (0)
> > > >
> > > > Questions about this output:
> > > > -) Why does it report "Using Program GlusterFS 3.3, Num (1298437), Version (330)"? We run 3.7.3?!
> > > > -) gluster logs timestamps in UTC, not taking the server timezone into account. Is there a way to fix this?
> > > >
> > > > etc-glusterfs-glusterd.vol.log:
> > > > no new log entries after the volume heal info command
> > > >
> > > > storage-brick1-brick1.log:
> > > > [2015-09-24 07:47:59.325720] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 67ef1559-d3a1-403a-b8e7-fb145c3acf4e
> > > > [2015-09-24 07:47:59.325743] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-vol4-server: accepted client from vhost4.allaboutapps.at-14900-2015/09/24-07:47:59:282313-vol4-client-2-0-0 (version: 3.7.3)
> > > >
> > > > storage-brick2-brick2.log:
> > > > no new log entries after the volume heal info command
> > > >
> > > Hi Andreas,
> > >
> > > Could you please provide the following information so that we can understand why the command is hanging?
> > > When the command is hung, run the following from one of the servers:
> > > `gluster volume statedump <volname>`
> > > This command generates statedumps of the glusterfsd processes on the servers. You can find them in /var/run/gluster. A typical statedump for a brick has "<brick-path>.<pid-of-brick>.dump.<timestamp>" as its name. Could you please attach them and respond?
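> > >
> > > For example, on vhost4 this would look roughly like the following (the pid and timestamp shown are made up, yours will differ):
> > >
> > >     # gluster volume statedump vol4
> > >     # ls /var/run/gluster
> > >     storage-brick1-brick1.14900.dump.1443080879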
> > >
> > > > Thanks,
> > > >
> > > > - Andreas
> > > >
> > > > _______________________________________________
> > > > Gluster-users mailing list
> > > > Gluster-users@gluster.org
> > > > http://www.gluster.org/mailman/listinfo/gluster-users
> > >
> > > --
> > > Thanks,
> > > Anuradha.
> >
>
> --
> Thanks,
> Anuradha.