<html><head></head><body>No, the self-heal daemon is a glusterfs (client) process started with the glustershd vol file. <br>

<br>

glusterfsd is the brick server. <br>

<br>
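As a rough sketch of that distinction, the role of a gluster process can be guessed from its command line. This assumes bricks show up as the glusterfsd executable and the self-heal daemon as a glusterfs process whose arguments mention glustershd; classify_gluster_process is a hypothetical helper, not part of gluster. <br>
<pre>
```python
import os
import shlex


def classify_gluster_process(cmdline):
    """Guess the role of a gluster process from its command line.

    Assumption: bricks run as the 'glusterfsd' executable, and the
    self-heal daemon is a 'glusterfs' process whose arguments mention
    the glustershd vol file.
    """
    args = shlex.split(cmdline)
    if not args:
        return "unknown"
    exe = os.path.basename(args[0])
    if exe == "glusterfsd":
        return "brick server"
    if exe == "glusterfs":
        # Any glusterfs process that is not the self-heal daemon is
        # lumped together here (FUSE mounts and so on).
        if any("glustershd" in arg for arg in args[1:]):
            return "self-heal daemon"
        return "client-side process"
    if exe == "glusterd":
        return "management daemon"
    return "unknown"
```
</pre>
Feeding it the command column of your process list, one line at a time, should show which processes are bricks and which is the self-heal daemon. <br>
<br>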

Normally the network would stay up through the final process kill as part of shutdown. That kill gracefully shuts down the brick process(es), allowing the clients to continue without waiting for the TCP connection to time out.<br>

<br>
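To check the ordering on a given box, something like the following sketch may help. It assumes SysV-style rc directories where entries named K, two digits, then the service name are run in lexical order at shutdown; stop_order and stopped_before are hypothetical helpers, and the script names in the usage note are illustrative only. <br>
<pre>
```python
import re


def stop_order(rc_entries):
    """Return service names in the order their K scripts would run.

    Assumption: SysV-style rc directories, where entries named
    K + two digits + service name are run in lexical order at shutdown.
    """
    pattern = re.compile(r"^K(\d\d)(.+)$")
    kills = []
    for entry in sorted(rc_entries):
        match = pattern.match(entry)
        if match:
            kills.append(match.group(2))
    return kills


def stopped_before(rc_entries, first, second):
    """True if `first` is stopped earlier than `second`."""
    order = stop_order(rc_entries)
    # `first` must appear somewhere ahead of `second` in the kill order.
    return second in order and first in order[:order.index(second)]
```
</pre>
Given the os.listdir() output of the shutdown runlevel directory, stopped_before(entries, "glusterfsd", "network") should be true if the bricks are killed while the network is still up, for example with entries like K80glusterfsd and K90network. This only models ordering; as noted further down the thread, the K script also needs its lock file under /var/lock/subsys before it does anything. <br>
<br>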

Apparently your init shutdown process disconnects the network before that kill happens. This is uncommon and may be considered a bug in whatever K script is doing it. <br><br><div class="gmail_quote">On April 28, 2015 12:28:40 AM PDT, Corey Kovacs &lt;corey.kovacs@gmail.com&gt; wrote:<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div dir="ltr">Someone correct me if I am wrong, but glusterfsd is for self-healing as I recall. It&#39;s launched when it&#39;s needed.</div><div class="gmail_extra"><br /><div class="gmail_quote">On Mon, Apr 27, 2015 at 1:59 PM, CJ Baar <span dir="ltr">&lt;<a href="mailto:gsml@ffisys.com" target="_blank">gsml@ffisys.com</a>&gt;</span> wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>FYI, I’ve tried with both glusterfs and NFS mounts, and the reaction is the same. The value of ping-timeout seems to have no effect at all.</div><div><br /></div><div>I did discover one thing that makes a difference on reboot. There is a second service descriptor for “glusterfsd”, which is not enabled by default, but is started by something else (glusterd, I assume?). However, whatever it is that starts the process does not shut it down cleanly during a reboot… and it appears to be the loss of

that process without de-registration in the peer group that causes the other nodes to hang. If I enable the service (chkconfig glusterfsd on), it does nothing by default because the config is commented out (/etc/sysconfig/glusterfsd). But, having those K scripts in place in rc.d, I can manually touch /var/lock/subsys/glusterfsd, and then I can successfully reboot one node without the others hanging. This at least helps when I need to take a node down for maintenance; it obviously still does nothing for a true node failure.</div><div><br /></div><div>I guess my next step is to figure out how to modify the init scripts for glusterd to touch the other lock file on startup as well. It does not seem a very elegant solution, but having the lock file in place and the init scripts enabled seems to solve at least half of the issue.</div><span class="HOEnZb"><font color="#888888"></font><div><font color="#888888"><br /></font></div><div>—CJ</div></span><div><div class="h5"><div><br

/></div><div><br /></div><br /><div><blockquote type="cite"><div>On Apr 25, 2015, at 11:34 AM, Corey Kovacs &lt;<a href="mailto:corey.kovacs@gmail.com" target="_blank">corey.kovacs@gmail.com</a>&gt; wrote:</div><br /><div><p dir="ltr">That&#39;s not cool. You certainly have a quorum. Are you using the FUSE client or regular old NFS?</p><p dir="ltr">C</p>

<div class="gmail_quote">On Apr 24, 2015 4:50 PM, &quot;CJ Baar&quot; &lt;<a href="mailto:gsml@ffisys.com" target="_blank">gsml@ffisys.com</a>&gt; wrote:<br type="attribution" /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>Corey—</div><div>I was able to get a third node set up. I recreated the volume as “replica 3”. The hang still happens (on two nodes, now) when I reboot a single node, even though two are still surviving, which should constitute a quorum.</div><div>—CJ</div><div><br /></div><br /><div><blockquote type="cite"><div>On Apr 17, 2015, at 6:18 AM, Corey Kovacs &lt;<a href="mailto:corey.kovacs@gmail.com" target="_blank">corey.kovacs@gmail.com</a>&gt; wrote:</div><br /><div><p dir="ltr">Typically you need to meet a quorum requirement to run just about any cluster. By definition, two nodes don&#39;t make a good cluster. A third node would let you start with just two

since that would allow you to meet quorum. Can you add a third node to at least test?</p><p dir="ltr">Corey</p>

<div class="gmail_quote">On Apr 16, 2015 6:52 PM, &quot;CJ Baar&quot; &lt;<a href="mailto:gsml@ffisys.com" target="_blank">gsml@ffisys.com</a>&gt; wrote:<br type="attribution" /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>I appreciate the info. I have tried adjusting the ping-timeout setting, and it seems to have no effect. The whole system hangs for 45+ seconds, which is about what it takes the second node to reboot, no matter what the value of ping-timeout is.  The output of the mnt-log is below.  It shows the adjusted value I am currently testing (30s), but the system still hangs for longer than that.</div><div><br /></div><div>Also, I have realized that the problem is deeper than I originally thought.  It’s not just the mount that is hanging when a node reboots… it appears to be the entire system.  I cannot use my SSH connection, no matter where I am in the system, and

services such as httpd become unresponsive.  I can ping the “surviving” system, but other than that it appears pretty unusable.  This is a major drawback to using gluster.  I can’t afford to lose two entire systems if one dies.</div><div><br /></div><div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.281365] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-common-client-0: server <a href="http://172.31.64.200:49152/" target="_blank">172.31.64.200:49152</a> has not responded in the last 30 seconds, disconnecting.</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.281560] E [rpc-clnt.c:362:saved_frames_unwind] (--&gt; /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--&gt; /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--&gt;

/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 0-common-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-04-16 22:58:45.830962 (xid=0x6d)</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.281588] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.281788] E [rpc-clnt.c:362:saved_frames_unwind] (--&gt; /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--&gt;

/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--&gt; /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 0-common-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-04-16 22:58:51.277528 (xid=0x6e)</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.281806] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk] 0-common-client-0: socket disconnected</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.281816] I [client.c:2215:client_rpc_notify] 0-common-client-0: disconnected from common-client-0. Client process will keep trying to connect to glusterd until brick&#39;s port is available</div><div

style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.283637] I [socket.c:3292:socket_submit_request] 0-common-client-0: not connected (priv-&gt;connected = 0)</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.283663] W [rpc-clnt.c:1562:rpc_clnt_submit] 0-common-client-0: failed to submit rpc-request (XID: 0x6f Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (common-client-0)</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:21.283674] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. Path: /src (63fc077b-869d-4928-8819-a79cc5c5ffa6)</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16

22:59:21.284219] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. Path: (null) (00000000-0000-0000-0000-000000000000)</div><div style="margin:0px;font-size:9px;font-family:Monaco;color:rgb(255,251,0);background-color:rgb(0,0,0)">[2015-04-16 22:59:52.322952] E [client-handshake.c:1496:client_query_portmap_cbk] 0-common-client-0: failed to get the port number for [root@cfm-c glusterfs]#</div></div><div><br /></div><div><br /></div><div>—CJ</div><div><br /></div><div><br /></div><br /><div><blockquote type="cite"><div>On Apr 7, 2015, at 10:26 PM, Ravishankar N &lt;<a href="mailto:ravishankar@redhat.com" target="_blank">ravishankar@redhat.com</a>&gt; wrote:</div><br /><div><br /><br />On 04/07/2015 10:11 PM, CJ Baar wrote:<br /><blockquote type="cite">Then, I issue “init 0” on node2, and the mount on node1 becomes unresponsive. This is the log from node1<br />[2015-04-07 16:36:04.250693] W

[glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed<br />[2015-04-07 16:36:04.251102] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume test1<br />The message &quot;I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd.&quot; repeated 39 times between [2015-04-07 16:34:40.609878] and [2015-04-07 16:36:37.752489]<br />[2015-04-07 16:36:40.755989] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd.<br /></blockquote>This is the glusterd log. Could you also share the mount log of the healthy node in the non-responsive --&gt;responsive time interval?<br />If this is indeed the ping timer issue, you should see something like:

&quot;server xxx has not responded in the last 42 seconds, disconnecting.&quot;<br />Have you, for testing&#39;s sake, tried reducing the network.ping-timeout value to something lower and checked that the hang happens only for that time?<br /><blockquote type="cite"><br />This does not seem like desired behaviour. I was trying to create this cluster because I was under the impression it would be more resilient than a single-point-of-failure NFS server. However, if the mount halts when one node in the cluster dies, then I’m no better off.<br /><br />I also can’t seem to figure out how to bring a volume online if only one node in the cluster is running; again, not really functioning as HA. The gluster service runs and the volume “starts”, but it is not “online” or mountable until both nodes are running. In a situation where a node fails and we need storage online before we can troubleshoot the cause of the node failure, how do I get a volume to go online?<br /></blockquote>This

is expected behavior. In a two node cluster, if only one is powered on, glusterd will not start other gluster processes (brick, nfs, shd ) until the glusterd of the other node is also up (i.e. quorum is met). If you want to override this behavior, do a `gluster vol start &lt;volname&gt; force` on the node that is up.<br /><br />-Ravi<br /><blockquote type="cite"><br />Thanks.<br /></blockquote><br /></div></blockquote></div><br /></div><br />_______________________________________________<br />

Gluster-users mailing list<br />

<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br />

<a href="http://www.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-users</a><br /></blockquote></div>

</div></blockquote></div><br /></div></blockquote></div>

</div></blockquote></div><br /></div></div></div></blockquote></div><br /></div>

</blockquote></div><br>

</body></html>