<div dir="ltr"><div>Hi guys. Ive been using gluster for a while now and despite a few hiccups, i find its a great system to use. One of my more persistent hiccups is an issue with one brick going offline. </div><div><br></div><div>My setup is a 2 brick 2 node setup. my main brick is gfs1 which has not given me any problem. gfs2 however keeps going offline. Following <a href="http://www.gluster.org/pipermail/gluster-users/2014-June/017583.html">http://www.gluster.org/pipermail/gluster-users/2014-June/017583.html</a> temporarily fixed the error but the brick goes offline within the hour. </div><div><br></div><div><div>This is what i get from my volume status command :</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">sudo gluster volume status<br><blockquote>Status of volume: gfsvolume<br>Gluster process<span class="" style="white-space:pre">                                                </span>Port<span class="" style="white-space:pre">        </span>Online<span class="" style="white-space:pre">        </span>Pid<br>------------------------------------------------------------------------------<br>Brick gfs1:/export/sda/brick<span class="" style="white-space:pre">                                </span>49153<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>9760<br>Brick gfs2:/export/sda/brick<span class="" style="white-space:pre">                                </span>N/A<span class="" style="white-space:pre">        </span>N<span class="" style="white-space:pre">        </span>13461<br>NFS Server on localhost<span class="" style="white-space:pre">                                        </span>2049<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>13473<br>Self-heal Daemon on localhost<span class="" style="white-space:pre">                                </span>N/A<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>13480<br>NFS Server on gfs1<span class="" style="white-space:pre">                                        </span>2049<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>16166<br>Self-heal Daemon on gfs1<span class="" style="white-space:pre">                                </span>N/A<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>16173<br> <br>Task Status of Volume gfsvolume<br>------------------------------------------------------------------------------<br>There are no active volume tasks</blockquote></blockquote><div><div><br></div><div>doing sudo gluster volume start gfsvolume force gives me this: </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">sudo gluster volume status<br><blockquote>Status of volume: gfsvolume<br>Gluster process<span class="" style="white-space:pre">                                                </span>Port<span class="" style="white-space:pre">        </span>Online<span class="" style="white-space:pre">        </span>Pid<br>------------------------------------------------------------------------------<br>Brick gfs1:/export/sda/brick<span class="" style="white-space:pre">                                
</span>49153<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>9760<br>Brick gfs2:/export/sda/brick<span class="" style="white-space:pre">                                </span><span class="">49153</span><span class="" style="white-space:pre">        </span><span class="">Y</span><span class="" style="white-space:pre">        </span>13461<br>NFS Server on localhost<span class="" style="white-space:pre">                                        </span>2049<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>13473<br>Self-heal Daemon on localhost<span class="" style="white-space:pre">                                </span>N/A<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>13480<br>NFS Server on gfs1<span class="" style="white-space:pre">                                        </span>2049<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>16166<br>Self-heal Daemon on gfs1<span class="" style="white-space:pre">                                </span>N/A<span class="" style="white-space:pre">        </span>Y<span class="" style="white-space:pre">        </span>16173<br> <br>Task Status of Volume gfsvolume<br>------------------------------------------------------------------------------<br>There are no active volume tasks</blockquote></blockquote><div>half an hour later and my brick goes down again. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><blockquote> </blockquote></blockquote></div></div><div>This is my glustershd.log. 
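In case anyone wants to reproduce the timing: a crude bash loop like the one below is enough to catch the brick dropping. This is just a sketch; the 60-second interval is arbitrary and the awk field number is inferred from the column layout of the status output above, nothing official.

    #!/bin/bash
    # Poll the volume status once a minute and report whenever the gfs2
    # brick's "Online" column stops reading "Y". The field number ($4) is
    # a guess based on the whitespace-separated status output shown above.
    while true; do
        state=$(sudo gluster volume status gfsvolume | awk '/^Brick gfs2/ {print $4}')
        if [ "$state" != "Y" ]; then
            echo "$(date): gfs2 brick offline (Online=${state:-unknown})"
        fi
        sleep 60
    done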
This is my glustershd.log. I snipped it because the rest of the log is a repeat of the same error:

    [2015-03-13 02:09:41.951556] I [glusterfsd.c:1959:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.5.0 (/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/deac2f873d0ac5b6c3e84b23c4790172.socket --xlator-option *replicate*.node-uuid=adbb7505-3342-4c6d-be3d-75938633612c)
    [2015-03-13 02:09:41.954173] I [socket.c:3561:socket_init] 0-socket.glusterfsd: SSL support is NOT enabled
    [2015-03-13 02:09:41.954236] I [socket.c:3576:socket_init] 0-socket.glusterfsd: using system polling thread
    [2015-03-13 02:09:41.954421] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
    [2015-03-13 02:09:41.954443] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
    [2015-03-13 02:09:41.956731] I [graph.c:254:gf_add_cmdline_options] 0-gfsvolume-replicate-0: adding option 'node-uuid' for volume 'gfsvolume-replicate-0' with value 'adbb7505-3342-4c6d-be3d-75938633612c'
    [2015-03-13 02:09:41.960210] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-gfsvolume-client-1: setting frame-timeout to 90
    [2015-03-13 02:09:41.960288] I [socket.c:3561:socket_init] 0-gfsvolume-client-1: SSL support is NOT enabled
    [2015-03-13 02:09:41.960301] I [socket.c:3576:socket_init] 0-gfsvolume-client-1: using system polling thread
    [2015-03-13 02:09:41.961095] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-gfsvolume-client-0: setting frame-timeout to 90
    [2015-03-13 02:09:41.961134] I [socket.c:3561:socket_init] 0-gfsvolume-client-0: SSL support is NOT enabled
    [2015-03-13 02:09:41.961145] I [socket.c:3576:socket_init] 0-gfsvolume-client-0: using system polling thread
    [2015-03-13 02:09:41.961173] I [client.c:2273:notify] 0-gfsvolume-client-0: parent translators are ready, attempting connect on transport
    [2015-03-13 02:09:41.961412] I [client.c:2273:notify] 0-gfsvolume-client-1: parent translators are ready, attempting connect on transport
    Final graph:
    +------------------------------------------------------------------------------+
      1: volume gfsvolume-client-0
      2:     type protocol/client
      3:     option remote-host gfs1
      4:     option remote-subvolume /export/sda/brick
      5:     option transport-type socket
      6:     option frame-timeout 90
      7:     option ping-timeout 30
      8: end-volume
      9:
     10: volume gfsvolume-client-1
     11:     type protocol/client
     12:     option remote-host gfs2
     13:     option remote-subvolume /export/sda/brick
     14:     option transport-type socket
     15:     option frame-timeout 90
     16:     option ping-timeout 30
     17: end-volume
     18:
     19: volume gfsvolume-replicate-0
     20:     type cluster/replicate
     21:     option node-uuid adbb7505-3342-4c6d-be3d-75938633612c
     22:     option background-self-heal-count 0
     23:     option metadata-self-heal on
     24:     option data-self-heal on
     25:     option entry-self-heal on
     26:     option self-heal-daemon on
     27:     option data-self-heal-algorithm diff
     28:     option quorum-type fixed
     29:     option quorum-count 1
     30:     option iam-self-heal-daemon yes
     31:     subvolumes gfsvolume-client-0 gfsvolume-client-1
     32: end-volume
     33:
     34: volume glustershd
     35:     type debug/io-stats
     36:     subvolumes gfsvolume-replicate-0
     37: end-volume
    +------------------------------------------------------------------------------+
    [2015-03-13 02:09:41.961871] I [rpc-clnt.c:1685:rpc_clnt_reconfig] 0-gfsvolume-client-1: changing port to 49153 (from 0)
    [2015-03-13 02:09:41.962129] I [client-handshake.c:1659:select_server_supported_programs] 0-gfsvolume-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
    [2015-03-13 02:09:41.962344] I [client-handshake.c:1456:client_setvolume_cbk] 0-gfsvolume-client-1: Connected to 172.20.20.22:49153, attached to remote volume '/export/sda/brick'.
    [2015-03-13 02:09:41.962363] I [client-handshake.c:1468:client_setvolume_cbk] 0-gfsvolume-client-1: Server and Client lk-version numbers are not same, reopening the fds
    [2015-03-13 02:09:41.962416] I [afr-common.c:3922:afr_notify] 0-gfsvolume-replicate-0: Subvolume 'gfsvolume-client-1' came back up; going online.
    [2015-03-13 02:09:41.962487] I [client-handshake.c:450:client_set_lk_version_cbk] 0-gfsvolume-client-1: Server lk version = 1
    [2015-03-13 02:09:41.963109] E [afr-self-heald.c:1479:afr_find_child_position] 0-gfsvolume-replicate-0: getxattr failed on gfsvolume-client-0 - (Transport endpoint is not connected)
    [2015-03-13 02:09:41.963502] I [afr-self-heald.c:1687:afr_dir_exclusive_crawl] 0-gfsvolume-replicate-0: Another crawl is in progress for gfsvolume-client-1
    [2015-03-13 02:09:41.967478] E [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk] 0-gfsvolume-replicate-0: Non Blocking entrylks failed for <gfid:66af7dc1-a2e6-4919-9ea1-ad75fe2d40b9>.
    [2015-03-13 02:09:41.968550] E [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk] 0-gfsvolume-replicate-0: Non Blocking entrylks failed for <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>.
    [2015-03-13 02:09:41.969663] E [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk] 0-gfsvolume-replicate-0: Non Blocking entrylks failed for <gfid:3762920e-9631-4a52-9a9f-4f04d09e8d84>.
    [2015-03-13 02:09:41.974345] E [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk] 0-gfsvolume-replicate-0: Non Blocking entrylks failed for <gfid:66af7dc1-a2e6-4919-9ea1-ad75fe2d40b9>.
    [2015-03-13 02:09:41.975657] E [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk] 0-gfsvolume-replicate-0: Non Blocking entrylks failed for <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>.
    [2015-03-13 02:09:41.977020] E [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk] 0-gfsvolume-replicate-0: Non Blocking entrylks failed for <gfid:3762920e-9631-4a52-9a9f-4f04d09e8d84>.
    [2015-03-13 02:09:44.307219] I [rpc-clnt.c:1685:rpc_clnt_reconfig] 0-gfsvolume-client-0: changing port to 49153 (from 0)
    [2015-03-13 02:09:44.307748] I [client-handshake.c:1659:select_server_supported_programs] 0-gfsvolume-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
    [2015-03-13 02:09:44.448377] I [client-handshake.c:1456:client_setvolume_cbk] 0-gfsvolume-client-0: Connected to 172.20.20.21:49153, attached to remote volume '/export/sda/brick'.
    [2015-03-13 02:09:44.448418] I [client-handshake.c:1468:client_setvolume_cbk] 0-gfsvolume-client-0: Server and Client lk-version numbers are not same, reopening the fds
    [2015-03-13 02:09:44.448713] I [client-handshake.c:450:client_set_lk_version_cbk] 0-gfsvolume-client-0: Server lk version = 1
    [2015-03-13 02:09:44.515112] I [afr-self-heal-common.c:2859:afr_log_self_heal_completion_status] 0-gfsvolume-replicate-0: foreground data self heal is successfully completed, data self heal from gfsvolume-client-0 to sinks gfsvolume-client-1, with 892928 bytes on gfsvolume-client-0, 892928 bytes on gfsvolume-client-1, data - Pending matrix: [ [ 0 155762 ] [ 0 0 ] ] on <gfid:123536cc-c34b-43d7-b0c6-cf80eefa8322>
    [2015-03-13 02:09:44.809988] I [afr-self-heal-common.c:2859:afr_log_self_heal_completion_status] 0-gfsvolume-replicate-0: foreground data self heal is successfully completed, data self heal from gfsvolume-client-0 to sinks gfsvolume-client-1, with 15998976 bytes on gfsvolume-client-0, 15998976 bytes on gfsvolume-client-1, data - Pending matrix: [ [ 0 36506 ] [ 0 0 ] ] on <gfid:b6dc0e74-31bf-469a-b629-ee51ab4cf729>
    [2015-03-13 02:09:44.946050] W [client-rpc-fops.c:574:client3_3_readlink_cbk] 0-gfsvolume-client-0: remote operation failed: Stale NFS file handle
    [2015-03-13 02:09:44.946097] I [afr-self-heal-entry.c:1538:afr_sh_entry_impunge_readlink_sink_cbk] 0-gfsvolume-replicate-0: readlink of <gfid:66af7dc1-a2e6-4919-9ea1-ad75fe2d40b9>/PB2_corrected.fastq on gfsvolume-client-1 failed (Stale NFS file handle)
    [2015-03-13 02:09:44.951370] I [afr-self-heal-entry.c:2321:afr_sh_entry_fix] 0-gfsvolume-replicate-0: <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>: Performing conservative merge
    [2015-03-13 02:09:45.149995] W [client-rpc-fops.c:574:client3_3_readlink_cbk] 0-gfsvolume-client-0: remote operation failed: Stale NFS file handle
    [2015-03-13 02:09:45.150036] I [afr-self-heal-entry.c:1538:afr_sh_entry_impunge_readlink_sink_cbk] 0-gfsvolume-replicate-0: readlink of <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>/Rscript on gfsvolume-client-1 failed (Stale NFS file handle)
    [2015-03-13 02:09:45.214253] W [client-rpc-fops.c:574:client3_3_readlink_cbk] 0-gfsvolume-client-0: remote operation failed: Stale NFS file handle
    [2015-03-13 02:09:45.214295] I [afr-self-heal-entry.c:1538:afr_sh_entry_impunge_readlink_sink_cbk] 0-gfsvolume-replicate-0: readlink of <gfid:3762920e-9631-4a52-9a9f-4f04d09e8d84>/ananas_d_tmp on gfsvolume-client-1 failed (Stale NFS file handle)
    [2015-03-13 02:13:27.324856] W [socket.c:522:__socket_rwv] 0-gfsvolume-client-1: readv on 172.20.20.22:49153 failed (No data available)
    [2015-03-13 02:13:27.324961] I [client.c:2208:client_rpc_notify] 0-gfsvolume-client-1: disconnected from 172.20.20.22:49153. Client process will keep trying to connect to glusterd until brick's port is available
    [2015-03-13 02:13:37.981531] I [rpc-clnt.c:1685:rpc_clnt_reconfig] 0-gfsvolume-client-1: changing port to 49153 (from 0)
    [2015-03-13 02:13:37.981781] E [socket.c:2161:socket_connect_finish] 0-gfsvolume-client-1: connection to 172.20.20.22:49153 failed (Connection refused)
    [2015-03-13 02:13:41.982125] I [rpc-clnt.c:1685:rpc_clnt_reconfig] 0-gfsvolume-client-1: changing port to 49153 (from 0)
    [2015-03-13 02:13:41.982353] E [socket.c:2161:socket_connect_finish] 0-gfsvolume-client-1: connection to 172.20.20.22:49153 failed (Connection refused)
    [2015-03-13 02:13:45.982693] I [rpc-clnt.c:1685:rpc_clnt_reconfig] 0-gfsvolume-client-1: changing port to 49153 (from 0)
    [2015-03-13 02:13:45.982926] E [socket.c:2161:socket_connect_finish] 0-gfsvolume-client-1: connection to 172.20.20.22:49153 failed (Connection refused)
    [2015-03-13 02:13:49.983309] I [rpc-clnt.c:1685:rpc_clnt_reconfig] 0-gfsvolume-client-1: changing port to 49153 (from 0)
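In case it's useful, the snipped remainder really is just that reconnect pattern over and over; a quick filter like the one below shows it (the log path is taken from the glusterfs command line at the top of the log above):

    # Pull the repeating reconnect/refused entries out of the self-heal
    # daemon log and show the most recent ones.
    grep -E 'changing port to 49153|Connection refused' /var/log/glusterfs/glustershd.log | tail -n 20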
Any help would be greatly appreciated.

Thank You Kindly,
Kaamesh