<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">I appreciate the info. I have tried adjusting the ping-timeout setting, and it seems to have no effect. The whole system hangs for 45+ seconds, which is about how long it takes the second node to reboot, no matter what the value of ping-timeout is. &nbsp;The output of the mnt-log is below. &nbsp;It shows the adjusted value I am currently testing (30s), but the system still hangs for longer than that.</div><div class=""><br class=""></div><div class="">Also, I have realized that the problem is deeper than I originally thought. &nbsp;It’s not just the mount that is hanging when a node reboots… it appears to be the entire system. &nbsp;I cannot use my SSH connection, no matter where I am in the system, and services such as httpd become unresponsive. &nbsp;I can ping the “surviving” system, but other than that it appears pretty unusable. &nbsp;This is a major drawback to using gluster. 
&nbsp;I can’t afford to lose two entire systems if one dies.</div><div class=""><br class=""></div><div class=""><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.281365] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-common-client-0: server 172.31.64.200:49152 has not responded in the last 30 seconds, disconnecting.</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.281560] E [rpc-clnt.c:362:saved_frames_unwind] (--&gt; /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--&gt; /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--&gt; /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 0-common-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-04-16 22:58:45.830962 (xid=0x6d)</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.281588] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. 
Path: / (00000000-0000-0000-0000-000000000001)</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.281788] E [rpc-clnt.c:362:saved_frames_unwind] (--&gt; /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--&gt; /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--&gt; /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] (--&gt; /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 0-common-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-04-16 22:58:51.277528 (xid=0x6e)</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.281806] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk] 0-common-client-0: socket disconnected</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.281816] I [client.c:2215:client_rpc_notify] 0-common-client-0: disconnected from common-client-0. 
Client process will keep trying to connect to glusterd until brick's port is available</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.283637] I [socket.c:3292:socket_submit_request] 0-common-client-0: not connected (priv-&gt;connected = 0)</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.283663] W [rpc-clnt.c:1562:rpc_clnt_submit] 0-common-client-0: failed to submit rpc-request (XID: 0x6f Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (common-client-0)</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.283674] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. Path: /src (63fc077b-869d-4928-8819-a79cc5c5ffa6)</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:21.284219] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. 
Path: (null) (00000000-0000-0000-0000-000000000000)</div><div style="margin: 0px; font-size: 9px; font-family: Monaco; color: rgb(255, 251, 0); background-color: rgb(0, 0, 0);" class="">[2015-04-16 22:59:52.322952] E [client-handshake.c:1496:client_query_portmap_cbk] 0-common-client-0: failed to get the port number for [root@cfm-c glusterfs]#</div></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">—CJ</div><div class=""><br class=""></div><div class=""><br class=""></div><br class=""><div><blockquote type="cite" class=""><div class="">On Apr 7, 2015, at 10:26 PM, Ravishankar N &lt;<a href="mailto:ravishankar@redhat.com" class="">ravishankar@redhat.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><br class=""><br class="">On 04/07/2015 10:11 PM, CJ Baar wrote:<br class=""><blockquote type="cite" class="">Then, I issue “init 0” on node2, and the mount on node1 becomes unresponsive. This is the log from node1<br class="">[2015-04-07 16:36:04.250693] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed<br class="">[2015-04-07 16:36:04.251102] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume test1<br class="">The message "I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd." repeated 39 times between [2015-04-07 16:34:40.609878] and [2015-04-07 16:36:37.752489]<br class="">[2015-04-07 16:36:40.755989] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd.<br class=""></blockquote>This is the glusterd log. 
Could you also share the mount log of the healthy node in the non-responsive --&gt;responsive time interval?<br class="">If this is indeed the ping timer issue, you should see something like: "server xxx has not responded in the last 42 seconds, disconnecting."<br class="">Have you, for testing sake, tried reducing the network.ping-timeout value to something lower and checked that the hang happens only for that time?<br class=""><blockquote type="cite" class=""><br class="">This does not seem like desired behaviour. I was trying to create this cluster because I was under the impression it would be more resilient than a single-point-of-failure NFS server. However, if the mount halts when one node in the cluster dies, then I’m no better off.<br class=""><br class="">I also can’t seem to figure out how to bring a volume online if only one node in the cluster is running; again, not really functioning as HA. The gluster service runs and the volume “starts”, but it is not “online” or mountable until both nodes are running. In a situation where a node fails and we need storage online before we can troubleshoot the cause of the node failure, how do I get a volume to go online?<br class=""></blockquote>This is expected behavior. In a two node cluster, if only one is powered on, glusterd will not start other gluster processes (brick, nfs, shd ) until the glusterd of the other node is also up (i.e. quorum is met). If you want to override this behavior, do a `gluster vol start &lt;volname&gt; force` on the node that is up.<br class=""><br class="">-Ravi<br class=""><blockquote type="cite" class=""><br class="">Thanks.<br class=""></blockquote><br class=""></div></blockquote></div><br class=""></body></html>