<div dir="ltr"><div><div><div><div>Hi Atin,<br><br></div>Thanks.<br><br></div>Have more doubts here.<br><br>Brick and glusterd connected by unix domain socket.It is just a local socket then why it is disconnect in below logs:<br><br> 1667 [2016-04-0<span tabindex="0" class=""><span class="">3 10:12:</span></span>32.984331] I [MSGID: 106005]<br>
[glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:<br>
Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from<br>
glusterd.<br>
1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]<br>
[glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd: Setting<br>
brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped<br>
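<br>On the board we can also check whether that socket and the brick process still exist at the moment this happens. A minimal sketch (the socket and pid file names are taken from the brick start arguments quoted further below, so they may differ on another setup):<br>
<br>
# is the brick's management socket still present, and is a process attached to it?<br>
ls -l /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket<br>
ss -xp | grep 697c0e4a16ebc734cd06fd9150723005<br>
# is the brick process itself still alive?<br>
ps -ef | grep '[g]lusterfsd' | grep opt-lvmdir-c2-brick<br>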
<br><br></div>Regards,<br></div>Abhishek<br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Apr 15, 2016 at 9:14 AM, Atin Mukherjee <span dir="ltr"><<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
<br>
On 04/14/2016 04:07 PM, ABHISHEK PALIWAL wrote:<br>
><br>
><br>
</span><span class="">> On Thu, Apr 14, 2016 at 2:33 PM, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a><br>
</span><span class="">> <mailto:<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>>> wrote:<br>
><br>
><br>
><br>
> On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:<br>
> ><br>
> ><br>
> > On Tue, Apr 5, 2016 at 2:22 PM, Atin Mukherjee<br>
</span><div><div class="h5">> > <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
> ><br>
> ><br>
> ><br>
> > On 04/05/2016 01:04 PM, ABHISHEK PALIWAL wrote:<br>
> > > Hi Team,<br>
> > ><br>
> > > We are using Gluster 3.7.6 and facing a problem in which a brick does<br>
> > > not come online after the board is restarted.<br>
> > ><br>
> > > To understand our setup, please look at the following steps:<br>
> > > 1. We have two boards, A and B, on which a Gluster volume runs in<br>
> > > replicated mode with one brick on each board.<br>
> > > 2. The Gluster mount point is present on Board A and is shared by a<br>
> > > number of processes.<br>
> > > 3. Up to this point the volume is in sync and everything works fine.<br>
> > > 4. Now we have a test case in which we stop glusterd, reboot Board B,<br>
> > > and, once the board comes up, start glusterd on it again.<br>
> > > 5. We repeat Step 4 multiple times to check the reliability of the system.<br>
> > > 6. After Step 4, sometimes the system comes back in a working state<br>
> > > (i.e. in sync), but sometimes we find that the brick of Board B is listed<br>
> > > in “gluster volume status” yet does not come online, even after waiting<br>
> > > for more than a minute.<br>
> > As I mentioned in another email thread, until and unless the log shows<br>
> > evidence that there was a reboot, nothing can be concluded. The last<br>
> > log you shared with us a few days back didn't give any indication<br>
> > that the brick process wasn't running.<br>
> ><br>
> > How can we identify from the brick logs that the brick process is running?<br>
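One way to verify this outside the logs is to check the pid file glusterd maintains for the brick. A minimal sketch, using the pid file path visible in the brick start arguments further down in this mail (adjust it for your layout):<br>
<br>
# kill -0 only tests that the process exists; it does not signal it<br>
PIDFILE=/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid<br>
kill -0 "$(cat "$PIDFILE")" 2>/dev/null && echo "brick process running" || echo "brick process not running"<br>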
> ><br>
> > > 7. While Step 4 is executing, some processes on Board A start<br>
> > > accessing files from the Gluster mount point at the same time.<br>
> > ><br>
> > > As a solution to bring this brick online, we found some existing<br>
> > > issues on the gluster mailing list suggesting the use of “gluster<br>
> > > volume start <vol_name> force” to move the brick from 'offline' to<br>
> > > 'online'.<br>
> > ><br>
> > > If the “gluster volume start <vol_name> force” command kills the<br>
> > > existing brick process and starts a new one, what will happen to other<br>
> > > processes that are accessing the same volume at the moment the brick<br>
> > > process is killed internally by this command? Will it cause any<br>
> > > failures in those processes?<br>
> > This is not true; volume start force will start the brick processes<br>
> > only if they are not running. Running brick processes will not be<br>
> > interrupted.<br>
> ><br>
> > We have tried this and checked the pid of the brick process before and<br>
> > after the force start; the pid had changed after the force start.<br>
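A small sketch of that before/after check, for the next time this is tried (the volume name and pid file path are taken from the logs in this thread, so treat them as assumptions about the current layout):<br>
<br>
PIDFILE=/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid<br>
before=$(cat "$PIDFILE")<br>
gluster volume start c_glusterfs force<br>
sleep 2   # give glusterd a moment to rewrite the pid file if it did restart the brick<br>
after=$(cat "$PIDFILE")<br>
# if the brick was already running, the two pids should match<br>
echo "pid before=$before  pid after=$after"<br>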
> ><br>
> > Please find the logs at the time of failure attached once again with<br>
> > log-level=debug.<br>
> ><br>
> > If you can find the exact line in the brick log file showing that the<br>
> > brick process is running, please give me the line number in that file.<br>
><br>
> Here is the sequence in which glusterd and the respective brick process<br>
> are restarted.<br>
><br>
> 1. glusterd restart trigger - line number 1014 in glusterd.log file:<br>
><br>
> [2016-04-03 10:12:29.051735] I [MSGID: 100030] [glusterfsd.c:2318:main]<br>
> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd<br>
> version 3.7.6 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid<br>
> --log-level DEBUG)<br>
><br>
> 2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log<br>
><br>
> [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]<br>
> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd<br>
> version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id<br>
> c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p<br>
> /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid<br>
> -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket<br>
> --brick-name /opt/lvmdir/c2/brick -l<br>
> /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option<br>
> *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256<br>
> --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)<br>
><br>
> 3. The following log indicates that the brick is up and has now started.<br>
> Refer to line 16123 in glusterd.log<br>
><br>
> [2016-04-03 10:14:25.336855] D [MSGID: 0]<br>
> [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:<br>
> Connected to 10.32.1.144:/opt/lvmdir/c2/brick<br>
><br>
> This clearly indicates that the brick is up and running, as after this I<br>
> do not see any disconnect event being processed by glusterd for the brick<br>
> process.<br>
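These markers can be pulled straight out of the attached logs (a simple sketch, run against the files under the names used in this thread):<br>
<br>
# line-numbered "Started running" events in the glusterd and brick logs<br>
grep -n 'Started running' glusterd.log opt-lvmdir-c2-brick.log<br>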
><br>
><br>
> Thanks for the detailed reply, but please also clear up some more doubts:<br>
><br>
> 1. At this moment, 10:14:25, the brick is available because we had<br>
> removed the brick and added it again to bring it online.<br>
> The following are the logs from the cmd-history.log file of 000300:<br>
><br>
> [2016-04-03 10:14:21.446570] : volume status : SUCCESS<br>
> [2016-04-03 10:14:21.665889] : volume remove-brick c_glusterfs replica<br>
> 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS<br>
> [2016-04-03 10:14:21.764270] : peer detach 10.32.1.144 : SUCCESS<br>
> [2016-04-03 10:14:23.060442] : peer probe 10.32.1.144 : SUCCESS<br>
> [2016-04-03 10:14:25.649525] : volume add-brick c_glusterfs replica 2<br>
> 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS<br>
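For readability, those cmd-history entries correspond to the following CLI commands (cmd-history.log records the commands without the leading "gluster"):<br>
<br>
gluster volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force<br>
gluster peer detach 10.32.1.144<br>
gluster peer probe 10.32.1.144<br>
gluster volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force<br>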
><br>
> Also, 10:12:29 was the last reboot time before this failure, so I fully<br>
> agree with what you said earlier.<br>
><br>
> 2. As you said, glusterd restarted at 10:12:29, so why do we not see any<br>
> 'brick start trigger' logs like the one below between the 10:12:29 and<br>
> 10:14:25 timestamps, which is roughly a two-minute interval?<br>
</div></div>So here is the culprit:<br>
<br>
1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]<br>
[glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:<br>
Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from<br>
glusterd.<br>
1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]<br>
[glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd: Setting<br>
brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped<br>
<br>
<br>
GlusterD received a disconnect event for this brick process and marked it<br>
as stopped. This could happen for two reasons: 1. the brick process goes<br>
down, or 2. a network issue. In this case I believe it is the latter, since<br>
the brick process was running at that time. I'd request you to check this<br>
from the network side.<br>
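A rough way to cross-check this on Board B when the disconnect is seen (the brick path and port 49329 are taken from the start arguments quoted below; substitute whatever the brick is currently using):<br>
<br>
# is the brick process alive, and is it listening on its advertised port?<br>
ps -ef | grep '[g]lusterfsd' | grep opt-lvmdir-c2-brick<br>
ss -tlnp | grep 49329<br>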
<div><div class="h5"><br>
<br>
><br>
> [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]<br>
> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd<br>
> version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id<br>
> c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p<br>
> /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid<br>
> -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket<br>
> --brick-name /opt/lvmdir/c2/brick -l<br>
> /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option<br>
> *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256<br>
> --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)<br>
><br>
> 3. We are continuously checking the brick status during the above time<br>
> window using "gluster volume status"; refer to the cmd-history.log file<br>
> from 000300.<br>
><br>
> In the glusterd.log file we are also getting the logs below:<br>
><br>
> [2016-04-03 10:12:31.771051] D [MSGID: 0]<br>
> [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:<br>
> Connected to 10.32.1.144:/opt/lvmdir/c2/brick<br>
><br>
> [2016-04-03 10:12:32.981152] D [MSGID: 0]<br>
> [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:<br>
> Connected to 10.32.1.144:/opt/lvmdir/c2/brick<br>
><br>
> twice between 10:12:29 and 10:14:25, and as you said these logs "clearly<br>
> indicate that the brick is up and running", so why is the brick not<br>
> online in the "gluster volume status" output?<br>
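If it helps to script that check, the status of this one brick can be queried directly. A sketch, assuming the volume and brick names above and that the brick-specific form of "volume status" is available in this 3.7.6 build:<br>
<br>
# detailed status of just this brick, including whether it is online<br>
gluster volume status c_glusterfs 10.32.1.144:/opt/lvmdir/c2/brick detail<br>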
><br>
> [2016-04-03 10:12:33.990487] : volume status : SUCCESS<br>
> [2016-04-03 10:12:34.007469] : volume status : SUCCESS<br>
> [2016-04-03 10:12:35.095918] : volume status : SUCCESS<br>
> [2016-04-03 10:12:35.126369] : volume status : SUCCESS<br>
> [2016-04-03 10:12:36.224018] : volume status : SUCCESS<br>
> [2016-04-03 10:12:36.251032] : volume status : SUCCESS<br>
> [2016-04-03 10:12:37.352377] : volume status : SUCCESS<br>
> [2016-04-03 10:12:37.374028] : volume status : SUCCESS<br>
> [2016-04-03 10:12:38.446148] : volume status : SUCCESS<br>
> [2016-04-03 10:12:38.468860] : volume status : SUCCESS<br>
> [2016-04-03 10:12:39.534017] : volume status : SUCCESS<br>
> [2016-04-03 10:12:39.553711] : volume status : SUCCESS<br>
> [2016-04-03 10:12:40.616610] : volume status : SUCCESS<br>
> [2016-04-03 10:12:40.636354] : volume status : SUCCESS<br>
> ......<br>
> ......<br>
> ......<br>
> [2016-04-03 10:14:21.446570] : volume status : SUCCESS<br>
> [2016-04-03 10:14:21.665889] : volume remove-brick c_glusterfs replica<br>
> 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS<br>
> [2016-04-03 10:14:21.764270] : peer detach 10.32.1.144 : SUCCESS<br>
> [2016-04-03 10:14:23.060442] : peer probe 10.32.1.144 : SUCCESS<br>
> [2016-04-03 10:14:25.649525] : volume add-brick c_glusterfs replica 2<br>
> 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS<br>
><br>
> In the logs above we are continuously checking the brick status, but when<br>
> we do not find the brick 'online' even after ~2 minutes, we remove it and<br>
> add it again to bring it online.<br>
><br>
> [2016-04-03 10:14:21.665889] : volume remove-brick c_glusterfs replica<br>
> 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS<br>
> [2016-04-03 10:14:21.764270] : peer detach 10.32.1.144 : SUCCESS<br>
> [2016-04-03 10:14:23.060442] : peer probe 10.32.1.144 : SUCCESS<br>
> [2016-04-03 10:14:25.649525] : volume add-brick c_glusterfs replica 2<br>
> 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS<br>
><br>
> That is why the logs show the "brick start trigger" entry at timestamp<br>
> 10:14:25:<br>
><br>
> [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]<br>
> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd<br>
> version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id<br>
> c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p<br>
> /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid<br>
> -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket<br>
> --brick-name /opt/lvmdir/c2/brick -l<br>
> /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option<br>
> *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256<br>
> --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)<br>
><br>
><br>
> Regards,<br>
> Abhishek<br>
><br>
><br>
> Please note that all the logs referred to and pasted here are from 002500.<br>
><br>
> ~Atin<br>
> ><br>
> > 002500 - Board B, the one whose brick is offline<br>
> > 000300 - Board A logs<br>
> ><br>
> > ><br>
> > > *Question: What could be causing the brick to stay offline?*<br>
> > ><br>
> > ><br>
> > > --<br>
> > ><br>
> > > Regards<br>
> > > Abhishek Paliwal<br>
> > ><br>
> > ><br>
> > > _______________________________________________<br>
> > > Gluster-devel mailing list<br>
> > > <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a> <mailto:<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a>><br>
</div></div>> <mailto:<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a> <mailto:<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a>>><br>
<div class="HOEnZb"><div class="h5">> > > <a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a><br>
> > ><br>
> ><br>
> ><br>
> ><br>
> ><br>
><br>
><br>
><br>
><br>
> --<br>
><br>
><br>
><br>
></div></div></blockquote></div><br></div></div>