On Tue, Oct 18, 2016 at 11:34 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
Thanks a lot Vijay for the insights, will test it out and post a patch.
Unfortunately this didn't work. Even after replacing EXPECT with EXPECT_WITHIN, the test still fails spuriously.
@Nigel - I'd like to see how often this test fails and, based on that, take a call on temporarily removing this check. Could you share the last two weekly regression failure reports to help me figure it out?
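For context, the check in question is just check_fs on the fuse mount. If I remember the helper in tests/volume.rc correctly, it is nothing more than a df wrapped to echo the exit status - roughly like this (paraphrased from memory, so treat it as approximate rather than verbatim):
<snippet>
# Approximate shape of the check_fs helper (paraphrased, not verbatim):
function check_fs {
        df $1 &> /dev/null;    # statfs on the mount point
        echo $?;               # 0 => mount is usable
}
</snippet>
So all the test asserts is that a statfs on the mount still succeeds after one node is killed.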
--
On Tuesday 18 October 2016, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
> Final reminder before I take out the test case from the test file.
>
>
> On Thursday 13 October 2016, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>
>>
>>
>> On Wednesday 12 October 2016, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>>
>>> So the test fails (intermittently) in check_fs, which does a df on the
>>> mount point of a volume carved out of three bricks from 3 nodes while one
>>> node is completely down. A quick look at the mount log reveals the
>>> following:
>>>
>>> [2016-10-10 13:58:59.279446]:++++++++++
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
>>> /mnt/glusterfs/0 ++++++++++
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)
>>> resolution failed
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
>>> resolution fail
>>>
>>> DHT team - are these anomalies expected here? I also see opendir and
>>> statfs failing here.
>>
>>
>> Any luck with this? I don't see any relevance of the check_fs check
>> w.r.t. the bug this test case is tagged to. If I don't hear back on this
>> in a few days, I'll go ahead and remove this check from the test to avoid
>> the spurious failure.
>>
Looks like dht was not aware of a subvolume being down. In dht we pick
the first_up_subvolume for winding the lookup on the root gfid, and in
this case the subvolume picked refers to the brick that was brought
down, hence the failure.
The test has this snippet:
<snippet>
# Kill one pseudo-node, make sure the others survive and volume stays up.
TEST kill_node 3;
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
EXPECT 0 check_fs $M0;
</snippet>
Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
percolate to dht?
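Something along these lines - the timeout constant is only a suggestion,
any of the existing timeouts defined in tests/include.rc (I am using
$PROCESS_UP_TIMEOUT as a placeholder here) should be large enough to
absorb the propagation delay:
<snippet>
# Kill one pseudo-node, make sure the others survive and volume stays up.
TEST kill_node 3;
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
# Retry until dht has processed CHILD_DOWN for the killed brick, instead
# of asserting on the very first statfs.
EXPECT_WITHIN $PROCESS_UP_TIMEOUT 0 check_fs $M0;
</snippet>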
Logs indicate that dht was not aware of the subvolume being down for
at least 1 second after protocol/client sensed the disconnection.
[2016-10-10 13:58:58.235700] I [MSGID: 114018]
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
patchy-client-2. Client process will keep trying to connect to
glusterd until brick's port is available
[2016-10-10 13:58:58.245060]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
online_brick_count ++++++++++
[2016-10-10 13:58:59.279446]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
/mnt/glusterfs/0 ++++++++++
[2016-10-10 13:58:59.287973] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288326] I [MSGID: 109063]
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2016-10-10 13:58:59.288352] W [MSGID: 109005]
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
[2016-10-10 13:58:59.288643] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288927] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.288949] W
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
(00000000-0000-0000-0000-000000000001) resolution failed
[2016-10-10 13:58:59.289505] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
resolution fail
Regards,
Vijay
--Atin
--
~ Atin (atinm)
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel