<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Oct 18, 2016 at 11:34 PM, Atin Mukherjee <span dir="ltr"><<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks a lot, Vijay, for the insights; I will test it out and post a patch.</blockquote><div><br></div><div>Unfortunately, this didn't work. Even replacing EXPECT with EXPECT_WITHIN fails spuriously.<br><br></div><div>@Nigel - I'd like to see how often this test fails and, based on that, take a call on temporarily removing this check. Could you share the last two weekly reports of the regression failures to help me figure it out?<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><span></span><br><br>On Tuesday 18 October 2016, Vijay Bellur <<a href="mailto:vbellur@redhat.com" target="_blank">vbellur@redhat.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <<a>amukherj@redhat.com</a>> wrote:<br>
> Final reminder before I take out the test case from the test file.<br>
><br>
><br>
> On Thursday 13 October 2016, Atin Mukherjee <<a>amukherj@redhat.com</a>> wrote:<br>
>><br>
>><br>
>><br>
>> On Wednesday 12 October 2016, Atin Mukherjee <<a>amukherj@redhat.com</a>> wrote:<br>
>>><br>
>>> So the test fails (intermittently) in check_fs, which does a df on<br>
>>> the mount point of a volume carved out of three bricks from three<br>
>>> nodes while one node is completely down. A quick look at the mount<br>
>>> log reveals the following:<br>
>>><br>
>>> [2016-10-10 13:58:59.279446]:++++++++++<br>
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
>>> /mnt/glusterfs/0 ++++++++++<br>
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /<br>
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]<br>
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)<br>
>>> resolution failed<br>
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
>>> resolution fail<br>
>>><br>
>>> DHT team - are these anomalies expected here? I also see opendir and<br>
>>> statfs failing.<br>
>><br>
>><br>
>> Any luck with this? I don't see the relevance of having a check_fs test<br>
>> w.r.t. the bug this test case is tagged to. If I don't hear back on this<br>
>> in a few days, I'll go ahead and remove this check from the test to<br>
>> avoid the spurious failure.<br>
>><br>
<br>
<br>
Looks like dht was not aware of the subvolume being down. dht picks the<br>
first_up_subvolume for winding the lookup on the root gfid, and in this<br>
case the chosen subvolume referred to the brick that was brought down,<br>
hence the failure.<br>
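For illustration only, that selection can be sketched as a tiny shell helper (the names and the subvolume list below are made up; the real logic lives in dht's C code):<br>

```shell
# Pick the first subvolume that dht believes is up. dht's view can lag
# protocol/client, so a brick that is already dead may still be marked
# "up" here and receive the root lookup.
first_up_subvolume() {
    for s in "$@"; do                 # each argument is "name:state"
        if [ "${s#*:}" = "up" ]; then
            echo "${s%%:*}"           # print the subvolume name
            return 0
        fi
    done
    return 1                          # no subvolume is up
}

# dht has not yet processed CHILD_DOWN for client-2, so the stale entry
# wins and the lookup is wound to a dead brick.
first_up_subvolume patchy-client-2:up patchy-client-0:up patchy-client-1:up
```

Once the CHILD_DOWN event is processed, the first entry would be marked down and the lookup would go to the next up subvolume instead.<br>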
<br>
The test has this snippet:<br>
<br>
<snippet><br>
# Kill one pseudo-node, make sure the others survive and volume stays up.<br>
TEST kill_node 3;<br>
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;<br>
EXPECT 0 check_fs $M0;<br>
</snippet><br>
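For context, a check_fs-style helper boils down to running df against the mount and echoing its exit status; this is a sketch with an assumed shape, not the actual helper from the test framework:<br>

```shell
# Assumed shape of a check_fs-style helper: df on a FUSE mount fails
# once the client cannot resolve the root gfid, so echoing df's exit
# status lets the test assert "0" for a healthy mount.
check_fs() {
    df "$1" > /dev/null 2>&1
    echo $?
}

check_fs /                      # prints 0 for a healthy filesystem
check_fs /no/such/mountpoint    # prints a non-zero status
```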
<br>
Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN<br>
percolate to dht?<br>
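The difference matters because EXPECT asserts exactly once, while EXPECT_WITHIN keeps retrying until the value matches or a timeout expires, giving an asynchronous event like CHILD_DOWN time to reach dht. A minimal sketch of that retry idea (illustrative only, not the real tests/include.rc helper):<br>

```shell
# Retry "$@" until its output equals $expected or $timeout seconds
# pass, mimicking what an EXPECT_WITHIN-style helper does instead of
# a single EXPECT-style assertion.
expect_within() {
    timeout=$1 expected=$2; shift 2
    deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -le "$deadline" ]; do
        if [ "$("$@")" = "$expected" ]; then
            echo PASS
            return 0
        fi
        sleep 1
    done
    echo FAIL
    return 1
}

# Example: a check that only settles to 0 after two seconds, like a
# client graph that takes a moment to process CHILD_DOWN.
t0=$(date +%s)
flaky_check() {
    if [ $(( $(date +%s) - t0 )) -ge 2 ]; then echo 0; else echo 1; fi
}

expect_within 5 0 flaky_check   # prints PASS once the check settles
```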
<br>
Logs indicate that dht was not aware of the subvolume being down for<br>
at least 1 second after protocol/client sensed the disconnection.<br>
<br>
[2016-10-10 13:58:58.235700] I [MSGID: 114018]<br>
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from<br>
patchy-client-2. Client process will keep trying to connect to<br>
glusterd until brick's port is available<br>
[2016-10-10 13:58:58.245060]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3<br>
online_brick_count ++++++++++<br>
[2016-10-10 13:58:59.279446]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
/mnt/glusterfs/0 ++++++++++<br>
[2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies<br>
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
[2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
[2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288927] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.288949] W<br>
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR<br>
(00000000-0000-0000-0000-000000000001) resolution failed<br>
[2016-10-10 13:58:59.289505] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
resolution fail<br>
<br>
Regards,<br>
Vijay<br>
</blockquote><br><br></div></div><span class="HOEnZb"><font color="#888888">-- <br>--Atin<br>
</font></span></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><br></div><div>~ Atin (atinm)<br></div></div></div></div>
</div></div>