<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Oct 18, 2016 at 11:34 PM, Atin Mukherjee <span dir="ltr"><<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks a lot, Vijay, for the insights; I will test it out and post a patch.</blockquote><div><br></div><div>Unfortunately, this didn't work. Even replacing EXPECT with EXPECT_WITHIN fails spuriously.<br><br></div><div>@Nigel - I'd like to see how often this test fails and, based on that, take a call on temporarily removing this check. Could you share the last two weekly reports of the regression failures to help me figure it out?<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><span></span><br><br>On Tuesday 18 October 2016, Vijay Bellur <<a href="mailto:vbellur@redhat.com" target="_blank">vbellur@redhat.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <<a>amukherj@redhat.com</a>> wrote:<br>
> Final reminder before I take out the test case from the test file.<br>
><br>
><br>
> On Thursday 13 October 2016, Atin Mukherjee <<a>amukherj@redhat.com</a>> wrote:<br>
>><br>
>><br>
>><br>
>> On Wednesday 12 October 2016, Atin Mukherjee <<a>amukherj@redhat.com</a>> wrote:<br>
>>><br>
>>> So the test fails (intermittently) in check_fs, which does a df on<br>
>>> the mount point of a volume carved out of three bricks from three<br>
>>> nodes while one node is completely down. A quick look at the mount<br>
>>> log reveals the following:<br>
>>><br>
>>> [2016-10-10 13:58:59.279446]:++++++++++<br>
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
>>> /mnt/glusterfs/0 ++++++++++<br>
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /<br>
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]<br>
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)<br>
>>> resolution failed<br>
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
>>> resolution fail<br>
>>><br>
>>> DHT team - are these anomalies expected here? I also see opendir and<br>
>>> statfs failing.<br>
>><br>
>><br>
>> Any luck with this? I don't see the relevance of having a check_fs test<br>
>> w.r.t. the bug this test case is tagged to. If I don't hear back on this<br>
>> in a few days, I'll go ahead and remove this check from the test to<br>
>> avoid the spurious failure.<br>
>><br>
<br>
<br>
Looks like dht was not aware of the subvolume being down. dht picks the<br>
first_up_subvolume for winding the lookup on the root gfid, and in this<br>
case the chosen subvolume referred to the brick that was brought down,<br>
hence the failure.<br>
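For illustration only, that selection can be sketched as a tiny shell helper (the names and the subvolume list below are made up; the real logic lives in dht's C code):<br>

```shell
# Pick the first subvolume that dht believes is up. dht's view can lag
# protocol/client, so a brick that is already dead may still be marked
# "up" here and receive the root lookup.
first_up_subvolume() {
    for s in "$@"; do                 # each argument is "name:state"
        if [ "${s#*:}" = "up" ]; then
            echo "${s%%:*}"           # print the subvolume name
            return 0
        fi
    done
    return 1                          # no subvolume is up
}

# dht has not yet processed CHILD_DOWN for client-2, so the stale entry
# wins and the lookup is wound to a dead brick.
first_up_subvolume patchy-client-2:up patchy-client-0:up patchy-client-1:up
```

Once the CHILD_DOWN event is processed, the first entry would be marked down and the lookup would go to the next up subvolume instead.<br>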
<br>
The test has this snippet:<br>
<br>
<snippet><br>
# Kill one pseudo-node, make sure the others survive and volume stays up.<br>
TEST kill_node 3;<br>
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;<br>
EXPECT 0 check_fs $M0;<br>
</snippet><br>
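For context, a check_fs-style helper boils down to running df against the mount and echoing its exit status; this is a sketch with an assumed shape, not the actual helper from the test framework:<br>

```shell
# Assumed shape of a check_fs-style helper: df on a FUSE mount fails
# once the client cannot resolve the root gfid, so echoing df's exit
# status lets the test assert "0" for a healthy mount.
check_fs() {
    df "$1" > /dev/null 2>&1
    echo $?
}

check_fs /                      # prints 0 for a healthy filesystem
check_fs /no/such/mountpoint    # prints a non-zero status
```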
<br>
Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN<br>
percolate to dht?<br>
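The difference matters because EXPECT asserts exactly once, while EXPECT_WITHIN keeps retrying until the value matches or a timeout expires, giving an asynchronous event like CHILD_DOWN time to reach dht. A minimal sketch of that retry idea (illustrative only, not the real tests/include.rc helper):<br>

```shell
# Retry "$@" until its output equals $expected or $timeout seconds
# pass, mimicking what an EXPECT_WITHIN-style helper does instead of
# a single EXPECT-style assertion.
expect_within() {
    timeout=$1 expected=$2; shift 2
    deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -le "$deadline" ]; do
        if [ "$("$@")" = "$expected" ]; then
            echo PASS
            return 0
        fi
        sleep 1
    done
    echo FAIL
    return 1
}

# Example: a check that only settles to 0 after two seconds, like a
# client graph that takes a moment to process CHILD_DOWN.
t0=$(date +%s)
flaky_check() {
    if [ $(( $(date +%s) - t0 )) -ge 2 ]; then echo 0; else echo 1; fi
}

expect_within 5 0 flaky_check   # prints PASS once the check settles
```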
<br>
Logs indicate that dht was not aware of the subvolume being down for<br>
at least 1 second after protocol/client sensed the disconnection.<br>
<br>
[2016-10-10 13:58:58.235700] I [MSGID: 114018]<br>
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from<br>
patchy-client-2. Client process will keep trying to connect to<br>
glusterd until brick's port is available<br>
[2016-10-10 13:58:58.245060]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3<br>
online_brick_count ++++++++++<br>
[2016-10-10 13:58:59.279446]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
/mnt/glusterfs/0 ++++++++++<br>
[2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies<br>
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
[2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
[2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288927] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.288949] W<br>
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR<br>
(00000000-0000-0000-0000-000000000001) resolution failed<br>
[2016-10-10 13:58:59.289505] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
resolution fail<br>
<br>
Regards,<br>
Vijay<br>
</blockquote><br><br></div></div><span class="HOEnZb"><font color="#888888">-- <br>--Atin<br>
</font></span></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><br></div><div>~ Atin (atinm)<br></div></div></div></div>
</div></div>