Thanks a lot Vijay for the insights, will test it out and post a patch.<br><br>On Tuesday 18 October 2016, Vijay Bellur <<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
> Final reminder before I take out the test case from the test file.<br>
><br>
><br>
> On Thursday 13 October 2016, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
>><br>
>><br>
>><br>
>> On Wednesday 12 October 2016, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
>>><br>
>>> So the test fails (intermittently) in check_fs, which does a df on the<br>
>>> mount point of a volume carved out of three bricks from 3 nodes while<br>
>>> one node is completely down. A quick look at the mount log reveals<br>
>>> the following:<br>
>>><br>
>>> [2016-10-10 13:58:59.279446]:++++++++++<br>
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
>>> /mnt/glusterfs/0 ++++++++++<br>
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /<br>
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]<br>
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)<br>
>>> resolution failed<br>
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
>>> resolution fail<br>
>>><br>
>>> DHT team - are these anomalies expected here? I also see opendir and<br>
>>> statfs failing here.<br>
>><br>
>><br>
>> Any luck with this? I don't see the relevance of having a check_fs test<br>
>> w.r.t. the bug this test case is tagged to. If I don't hear back on this in<br>
>> a few days, I'll go ahead and remove this check from the test to avoid the<br>
>> spurious failure.<br>
>><br>
<br>
<br>
Looks like dht was not yet aware of the subvolume being down. dht winds<br>
the lookup on the root gfid to its first_up_subvolume, and in this case<br>
it picked the subvolume referring to the brick that had just been brought<br>
down, hence the failure.<br>
<br>
The test has this snippet:<br>
<br>
<snippet><br>
# Kill one pseudo-node, make sure the others survive and volume stays up.<br>
TEST kill_node 3;<br>
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;<br>
EXPECT 0 check_fs $M0;<br>
</snippet><br>
<br>
Maybe we should change the EXPECT to an EXPECT_WITHIN to let CHILD_DOWN<br>
percolate to dht?<br>
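To illustrate the idea, here is a minimal sketch of an EXPECT_WITHIN-style<br>
retry loop (a hypothetical standalone helper, not the actual implementation<br>
in the test harness):<br>
<br>
```shell
#!/bin/sh
# expect_within TIMEOUT WANT CMD...: re-run CMD every second until its
# output equals WANT or TIMEOUT seconds have elapsed.  A one-shot EXPECT
# samples CMD exactly once, racing against the CHILD_DOWN notification.
expect_within() {
    _timeout=$1; _want=$2; shift 2
    _end=$(( $(date +%s) + _timeout ))
    while [ "$(date +%s)" -le "$_end" ]; do
        _got=$("$@")
        if [ "$_got" = "$_want" ]; then
            return 0        # command produced the expected output
        fi
        sleep 1             # give the event time to propagate, then retry
    done
    return 1                # timed out without seeing the expected output
}
```
<br>
With polling like this, the check above could become something along the<br>
lines of EXPECT_WITHIN $CHILD_UP_TIMEOUT 0 check_fs $M0 (assuming a<br>
suitable timeout variable), giving dht time to process CHILD_DOWN before<br>
the assertion fires.<br>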
<br>
Logs indicate that dht was not aware of the subvolume being down for<br>
at least 1 second after protocol/client sensed the disconnection.<br>
<br>
[2016-10-10 13:58:58.235700] I [MSGID: 114018]<br>
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from<br>
patchy-client-2. Client process will keep trying to connect to<br>
glusterd until brick's port is available<br>
[2016-10-10 13:58:58.245060]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3<br>
online_brick_count ++++++++++<br>
[2016-10-10 13:58:59.279446]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
/mnt/glusterfs/0 ++++++++++<br>
[2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies<br>
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
[2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
[2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288927] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.288949] W<br>
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR<br>
(00000000-0000-0000-0000-000000000001) resolution failed<br>
[2016-10-10 13:58:59.289505] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
resolution fail<br>
<br>
Regards,<br>
Vijay<br>
</blockquote><br><br>-- <br>--Atin<br>