Thanks a lot, Vijay, for the insights. I will test it out and post a patch.

On Tuesday 18 October 2016, Vijay Bellur <vbellur@redhat.com> wrote:

On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <amukherj@redhat.com> wrote:
> Final reminder before I take out the test case from the test file.
>
> On Thursday 13 October 2016, Atin Mukherjee <amukherj@redhat.com> wrote:
>>
>> On Wednesday 12 October 2016, Atin Mukherjee <amukherj@redhat.com> wrote:
>>>
>>> So the test fails (intermittently) in check_fs, which does a df on the
>>> mount point of a volume carved out of three bricks across 3 nodes while
>>> one node is completely down. A quick look at the mount log reveals the
>>> following:
>>>
>>> [2016-10-10 13:58:59.279446]:++++++++++
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
>>> /mnt/glusterfs/0 ++++++++++
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)
>>> resolution failed
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
>>> resolution fail
>>>
>>> DHT team - are these anomalies expected here? I also see opendir and
>>> statfs failing here.
>>
>> Any luck with this? I don't see the relevance of having a check_fs test
>> w.r.t. the bug this test case is tagged to. If I don't hear back on this
>> in a few days, I'll go ahead and remove this check from the test to
>> avoid the spurious failure.
>>
Looks like dht was not aware of the subvolume being down. In dht we pick
the first_up_subvolume for winding the lookup on the root gfid, and in
this case the chosen subvolume referred to the brick that was brought
down, hence the failure.

The test has this snippet:

<snippet>
# Kill one pseudo-node, make sure the others survive and volume stays up.
TEST kill_node 3;
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
EXPECT 0 check_fs $M0;
</snippet>

Maybe we should change the EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
percolate to dht?
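
Something like the following, as a minimal untested sketch ($PROBE_TIMEOUT
is reused from the surrounding test purely for illustration; any suitable
timeout constant would do):

<snippet>
# Poll check_fs until dht has processed CHILD_DOWN for the killed
# brick, instead of asserting immediately after the node goes down.
EXPECT_WITHIN $PROBE_TIMEOUT 0 check_fs $M0;
</snippet>

EXPECT_WITHIN retries the check until the timeout expires, so the
assertion stays intact while the graph gets a window to propagate the
event.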

Logs indicate that dht was not aware of the subvolume being down for at
least 1 second after protocol/client sensed the disconnection:

[2016-10-10 13:58:58.235700] I [MSGID: 114018]
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
patchy-client-2. Client process will keep trying to connect to
glusterd until brick's port is available
[2016-10-10 13:58:58.245060]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
online_brick_count ++++++++++
[2016-10-10 13:58:59.279446]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
/mnt/glusterfs/0 ++++++++++
[2016-10-10 13:58:59.287973] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288326] I [MSGID: 109063]
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2016-10-10 13:58:59.288352] W [MSGID: 109005]
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
[2016-10-10 13:58:59.288643] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288927] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.288949] W
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
(00000000-0000-0000-0000-000000000001) resolution failed
[2016-10-10 13:58:59.289505] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
resolution fail
Regards,
Vijay

--
--Atin