<div dir="ltr">We&#39;re running a fairly large 2-replica volume across two servers.  The volume is approximately 20TB of small 1K-4MB files.  The volume is exported via NFS, and mounted remotely by two clients.<div><br></div><div>For the past few weeks the Gluster brick processes have been randomly restarting.  Luckily they&#39;ve been doing so at non-peak times, so we didn&#39;t notice until our monitoring checks happened to pick up on zombied &#39;glusterfs&#39; process.</div><div><br></div><div>From the logs it looks like something is blocking communication to the brick processes, and Gluster automatically restarts everything to compensate.  I&#39;ve so far not been able to figure out the underlying cause.</div><div><br></div><div>I&#39;ve included log snippets from &#39;glustershd.log&#39; and &#39;etc-glusterfs-glusterd.vol.log&#39; here.  If anyone can provide some insight into the issue it would be greatly appreciated.  I&#39;ll also be happy to provide any further details as needed.</div><div><br></div><div><br></div><div><div>[2016-01-29 05:03:47.039886] I [MSGID: 106144] [glusterd-pmap.c:274:pmap_registry_remove] 0-pmap: removing brick /export/data/brick02 on port 49155</div><div>[2016-01-29 05:03:47.075521] W [socket.c:588:__socket_rwv] 0-management: readv on /var/run/gluster/53a233b05f5d4be45dc94391bc3ebfe5.socket failed (No data available)</div><div>[2016-01-29 05:03:47.078282] I [MSGID: 106005] [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management: Brick gluster-stor02:/export/data/brick02 has disconnected from glusterd.</div><div>[2016-01-29 05:03:47.149161] W [glusterfsd.c:1236:cleanup_and_exit] (--&gt;/lib64/libpthread.so.0() [0x3e47a079d1] --&gt;/usr/sbin/glusterd(glusterfs_sigwaiter+0xcd) [0x405e6d] --&gt;/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: recei</div><div>ved signum (15), shutting down</div><div>[2016-01-29 05:03:54.067012] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)</div><div>[2016-01-29 05:03:54.071901] I [MSGID: 106478] [glusterd.c:1350:init] 0-management: Maximum allowed open file descriptors set to 65536</div><div>[2016-01-29 05:03:54.071935] I [MSGID: 106479] [glusterd.c:1399:init] 0-management: Using /var/lib/glusterd as working directory</div><div>[2016-01-29 05:03:54.075655] E [rpc-transport.c:292:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.7.6/rpc-transport/rdma.so: cannot open shared object file: No such file or directory</div><div>[2016-01-29 05:03:54.075672] W [rpc-transport.c:296:rpc_transport_load] 0-rpc-transport: volume &#39;rdma.management&#39;: transport-type &#39;rdma&#39; is not valid or not found on this machine</div><div>[2016-01-29 05:03:54.075680] W [rpcsvc.c:1597:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed</div><div>[2016-01-29 05:03:54.075687] E [MSGID: 106243] [glusterd.c:1623:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport</div><div>[2016-01-29 05:03:55.869717] I [MSGID: 106513] [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30702</div><div>[2016-01-29 05:03:55.995747] I [MSGID: 106498] [glusterd-handler.c:3579:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0</div><div>[2016-01-29 05:03:55.995866] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600</div><div>[2016-01-29 05:03:56.000937] I [MSGID: 106544] [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID: 9b103ea8-d248-44fc-8f80-3e87f7c4971c</div><div>Final graph:</div><div>+------------------------------------------------------------------------------+</div><div>  1: volume management</div><div>  2:     type mgmt/glusterd</div><div>  3:     option rpc-auth.auth-glusterfs on</div><div>  4:     option rpc-auth.auth-unix on</div><div>  5:     option rpc-auth.auth-null on</div><div>  6:     option rpc-auth-allow-insecure on</div><div>  7:     option transport.socket.listen-backlog 128</div><div>  8:     option ping-timeout 30</div><div>  9:     option transport.socket.read-fail-log off</div><div> 10:     option transport.socket.keepalive-interval 2</div><div> 11:     option transport.socket.keepalive-time 10</div><div> 12:     option transport-type rdma</div><div> 13:     option working-directory /var/lib/glusterd</div><div> 14: end-volume</div><div> 15:  </div><div>+------------------------------------------------------------------------------+</div><div>[2016-01-29 05:03:56.002570] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2</div><div>[2016-01-29 05:03:56.003098] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1</div><div>[2016-01-29 05:03:56.003158] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2</div><div>[2016-01-29 05:03:56.855628] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 388a8bb4-c530-44ff-838b-8f7b9e4c95db, host: 10.1.1.10, port: 0</div><div>[2016-01-29 05:03:56.856787] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600</div><div>[2016-01-29 05:03:57.859093] I [MSGID: 106540] [glusterd-utils.c:4191:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully</div><div>[2016-01-29 05:03:57.860228] I [MSGID: 106540] [glusterd-utils.c:4200:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully</div><div>[2016-01-29 05:03:57.861329] I [MSGID: 106540] [glusterd-utils.c:4209:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully</div><div>[2016-01-29 05:03:57.862421] I [MSGID: 106540] [glusterd-utils.c:4218:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully</div><div>[2016-01-29 05:03:57.863510] I [MSGID: 106540] [glusterd-utils.c:4227:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully</div><div>[2016-01-29 05:03:57.864600] I [MSGID: 106540] [glusterd-utils.c:4236:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully</div><div>[2016-01-29 05:03:57.870948] W [socket.c:3009:socket_connect] 0-nfs: Ignore failed connection attempt on , (No such file or directory) </div></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><div>[2016-01-29 05:03:47.075614] W [socket.c:588:__socket_rwv] 0-data02-client-1: readv on <a href="http://10.1.1.10:49155">10.1.1.10:49155</a> failed (No data available)</div><div>[2016-01-29 05:03:47.076871] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from data02-client-1. Client process will keep trying to connect to glusterd until brick&#39;s port is available</div><div>[2016-01-29 05:03:47.170284] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on <a href="http://127.0.0.1:24007">127.0.0.1:24007</a> failed (No data available)</div><div>[2016-01-29 05:03:47.639163] W [socket.c:588:__socket_rwv] 0-data02-client-0: readv on <a href="http://10.1.1.11:49153">10.1.1.11:49153</a> failed (No data available)</div><div>[2016-01-29 05:03:47.639206] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from data02-client-0. Client process will keep trying to connect to glusterd until brick&#39;s port is available</div><div>[2016-01-29 05:03:47.640222] E [MSGID: 108006] [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.</div><div>[2016-01-29 05:03:57.872983] W [glusterfsd.c:1236:cleanup_and_exit] (--&gt;/lib64/libpthread.so.0() [0x3e47a079d1] --&gt;/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e6d] --&gt;/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059d5] ) 0-: rec</div><div>eived signum (15), shutting down</div><div>[2016-01-29 05:03:58.881541] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.6 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/</div><div>glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/8d72de580ccac07d2ecfc2491a9b1648.socket --xlator-option *replicate*.node-uuid=9b103ea8-d248-44fc-8f80-3e87f7c4971c)</div><div>[2016-01-29 05:03:58.890833] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1</div><div>[2016-01-29 05:03:59.340030] I [graph.c:269:gf_add_cmdline_options] 0-data02-replicate-0: adding option &#39;node-uuid&#39; for volume &#39;data02-replicate-0&#39; with value &#39;9b103ea8-d248-44fc-8f80-3e87f7c4971c&#39;</div><div>[2016-01-29 05:03:59.342682] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2</div><div>[2016-01-29 05:03:59.342742] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3</div><div>[2016-01-29 05:03:59.342827] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4</div><div>[2016-01-29 05:03:59.342892] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 5</div><div>[2016-01-29 05:03:59.342917] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 6</div><div>[2016-01-29 05:03:59.343563] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 8</div><div>[2016-01-29 05:03:59.343569] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 7</div><div>[2016-01-29 05:03:59.343657] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 9</div><div>[2016-01-29 05:03:59.343705] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 11</div><div>[2016-01-29 05:03:59.343710] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 10</div><div>[2016-01-29 05:03:59.344278] I [MSGID: 114020] [client.c:2118:notify] 0-data02-client-0: parent translators are ready, attempting connect on transport</div><div>[2016-01-29 05:03:59.346553] I [MSGID: 114020] [client.c:2118:notify] 0-data02-client-1: parent translators are ready, attempting connect on transport</div><div>Final graph:</div><div>+------------------------------------------------------------------------------+</div><div>  1: volume data02-client-0</div><div>  2:     type protocol/client</div><div>  3:     option ping-timeout 42</div><div>  4:     option remote-host gluster-stor01</div><div>  5:     option remote-subvolume /export/data/brick02</div><div>  6:     option transport-type socket</div><div>  7:     option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452</div><div>  8:     option password 66b85782-5833-4f2d-ad0e-8de75247b094F</div><div>  9:     option event-threads 11</div><div> 10: end-volume</div><div> 11:  </div><div> 12: volume data02-client-1</div><div> 13:     type protocol/client</div><div> 14:     option ping-timeout 42</div><div> 15:     option remote-host gluster-stor02</div><div> 16:     option remote-subvolume /export/data/brick02</div><div> 17:     option transport-type socket</div><div> 18:     option username 5cc4f5d1-bcc8-4e06-ac74-520b20e2b452</div><div> 19:     option password 66b85782-5833-4f2d-ad0e-8de75247b094</div><div> 20:     option event-threads 11</div><div> 21: end-volume</div><div> 22:  </div><div> 23: volume data02-replicate-0</div><div> 24:     type cluster/replicate</div><div> 25:     option node-uuid 9b103ea8-d248-44fc-8f80-3e87f7c4971c</div><div> 26:     option background-self-heal-count 0</div><div> 27:     option metadata-self-heal on</div><div> 28:     option data-self-heal on</div><div> 29:     option entry-self-heal on</div><div> 30:     option self-heal-daemon enable</div><div> 31:     option iam-self-heal-daemon yes</div><div> 32:     subvolumes data02-client-0 data02-client-1</div><div> 33: end-volume</div><div> 34:  </div><div> 35: volume glustershd</div><div> 36:     type debug/io-stats</div><div> 37:     subvolumes data02-replicate-0</div><div> 38: end-volume</div><div> 39:  </div><div>+------------------------------------------------------------------------------+</div><div>[2016-01-29 05:03:59.348913] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-1: failed to get the port number for remote subvolume. Please run &#39;gluster volume status&#39; on server to see if brick process </div><div>is running.</div><div>[2016-01-29 05:03:59.348960] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-1: disconnected from data02-client-1. Client process will keep trying to connect to glusterd until brick&#39;s port is available</div><div>[2016-01-29 05:03:59.436909] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-data02-client-0: failed to get the port number for remote subvolume. Please run &#39;gluster volume status&#39; on server to see if brick process </div><div>is running.</div><div>[2016-01-29 05:03:59.436974] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-data02-client-0: disconnected from data02-client-0. Client process will keep trying to connect to glusterd until brick&#39;s port is available</div><div>[2016-01-29 05:03:59.436991] E [MSGID: 108006] [afr-common.c:3880:afr_notify] 0-data02-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.</div><div>[2016-01-29 05:04:02.886317] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-data02-client-0: changing port to 49153 (from 0)</div><div>[2016-01-29 05:04:02.888761] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-data02-client-1: changing port to 49155 (from 0)</div><div>[2016-01-29 05:04:02.891105] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-data02-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)</div><div>[2016-01-29 05:04:02.891360] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-0: Connected to data02-client-0, attached to remote volume &#39;/export/data/brick02&#39;.</div><div>[2016-01-29 05:04:02.891373] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-0: Server and Client lk-version numbers are not same, reopening the fds</div><div>[2016-01-29 05:04:02.891403] I [MSGID: 108005] [afr-common.c:3841:afr_notify] 0-data02-replicate-0: Subvolume &#39;data02-client-0&#39; came back up; going online.</div><div>[2016-01-29 05:04:02.891518] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-0: Server lk version = 1</div><div>[2016-01-29 05:04:02.893074] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-data02-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)</div><div>[2016-01-29 05:04:02.893251] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-data02-client-1: Connected to data02-client-1, attached to remote volume &#39;/export/data/brick02&#39;.</div><div>[2016-01-29 05:04:02.893276] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-data02-client-1: Server and Client lk-version numbers are not same, reopening the fds</div><div>[2016-01-29 05:04:02.893401] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-data02-client-1: Server lk version = 1</div></div><div><br></div><div><br></div><div><br></div><div><br></div></div>