Thanks a lot Vijay for the insights, will test it out and post a patch.<br><br>On Tuesday 18 October 2016, Vijay Bellur <<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
> Final reminder before I take out the test case from the test file.<br>
><br>
><br>
> On Thursday 13 October 2016, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
>><br>
>><br>
>><br>
>> On Wednesday 12 October 2016, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
>>><br>
>>> So the test fails (intermittently) in check_fs, which does a df on the<br>
>>> mount point of a volume carved out of three bricks from 3 nodes while<br>
>>> one node is completely down. A quick look at the mount log reveals<br>
>>> the following:<br>
>>><br>
>>> [2016-10-10 13:58:59.279446]:++++++++++<br>
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
>>> /mnt/glusterfs/0 ++++++++++<br>
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /<br>
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote<br>
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport<br>
>>> endpoint is not connected]<br>
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]<br>
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)<br>
>>> resolution failed<br>
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]<br>
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve<br>
>>> (Stale file handle)<br>
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
>>> resolution fail<br>
>>><br>
>>> DHT team - are these anomalies expected here? I also see opendir and<br>
>>> statfs failing here.<br>
>><br>
>><br>
>> Any luck with this? I don't see the relevance of having a check_fs test<br>
>> w.r.t. the bug this test case is tagged to. If I don't hear back on this in<br>
>> a few days, I'll go ahead and remove this check from the test to avoid the<br>
>> spurious failure.<br>
>><br>
<br>
<br>
Looks like dht was not yet aware of the subvolume being down. dht winds<br>
the lookup on the root gfid to its first_up_subvolume, and in this case<br>
it picked the subvolume referring to the brick that had just been brought<br>
down, hence the failure.<br>
<br>
The test has this snippet:<br>
<br>
<snippet><br>
# Kill one pseudo-node, make sure the others survive and volume stays up.<br>
TEST kill_node 3;<br>
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;<br>
EXPECT 0 check_fs $M0;<br>
</snippet><br>
<br>
Maybe we should change the EXPECT to an EXPECT_WITHIN to let CHILD_DOWN<br>
percolate to dht?<br>
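To illustrate the idea, here is a minimal sketch of an EXPECT_WITHIN-style<br>
retry loop (a hypothetical standalone helper, not the actual implementation<br>
in the test harness):<br>
<br>
```shell
#!/bin/sh
# expect_within TIMEOUT WANT CMD...: re-run CMD every second until its
# output equals WANT or TIMEOUT seconds have elapsed.  A one-shot EXPECT
# samples CMD exactly once, racing against the CHILD_DOWN notification.
expect_within() {
    _timeout=$1; _want=$2; shift 2
    _end=$(( $(date +%s) + _timeout ))
    while [ "$(date +%s)" -le "$_end" ]; do
        _got=$("$@")
        if [ "$_got" = "$_want" ]; then
            return 0        # command produced the expected output
        fi
        sleep 1             # give the event time to propagate, then retry
    done
    return 1                # timed out without seeing the expected output
}
```
<br>
With polling like this, the check above could become something along the<br>
lines of EXPECT_WITHIN $CHILD_UP_TIMEOUT 0 check_fs $M0 (assuming a<br>
suitable timeout variable), giving dht time to process CHILD_DOWN before<br>
the assertion fires.<br>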
<br>
Logs indicate that dht was not aware of the subvolume being down for<br>
at least 1 second after protocol/client sensed the disconnection.<br>
<br>
[2016-10-10 13:58:58.235700] I [MSGID: 114018]<br>
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from<br>
patchy-client-2. Client process will keep trying to connect to<br>
glusterd until brick's port is available<br>
[2016-10-10 13:58:58.245060]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3<br>
online_brick_count ++++++++++<br>
[2016-10-10 13:58:59.279446]:++++++++++<br>
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs<br>
/mnt/glusterfs/0 ++++++++++<br>
[2016-10-10 13:58:59.287973] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288326] I [MSGID: 109063]<br>
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies<br>
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0<br>
[2016-10-10 13:58:59.288352] W [MSGID: 109005]<br>
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory<br>
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =<br>
[2016-10-10 13:58:59.288643] W [MSGID: 114031]<br>
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:<br>
remote operation failed. Path: /<br>
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not<br>
connected]<br>
[2016-10-10 13:58:59.288927] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.288949] W<br>
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR<br>
(00000000-0000-0000-0000-000000000001) resolution failed<br>
[2016-10-10 13:58:59.289505] W<br>
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:<br>
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file<br>
handle)<br>
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]<br>
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)<br>
resolution fail<br>
<br>
Regards,<br>
Vijay<br>
</blockquote><br><br>-- <br>--Atin<br>