On Tue, Oct 18, 2016 at 11:34 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
Thanks a lot Vijay for the insights, will test it out and post a patch.
Unfortunately this didn't work. Even after replacing EXPECT with EXPECT_WITHIN, the test still fails spuriously.
@Nigel - I'd like to see how often this test fails and, based on that, take a call on temporarily removing this check. Could you share the last two weekly regression failure reports to help me figure it out?
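For context, the check in question is just check_fs on the fuse mount. If I remember the helper in tests/volume.rc correctly, it is nothing more than a df wrapped to echo the exit status - roughly like this (paraphrased from memory, so treat it as approximate rather than verbatim):
<snippet>
# Approximate shape of the check_fs helper (paraphrased, not verbatim):
function check_fs {
        df $1 &> /dev/null;    # statfs on the mount point
        echo $?;               # 0 => mount is usable
}
</snippet>
So all the test asserts is that a statfs on the mount still succeeds after one node is killed.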
--
On Tuesday 18 October 2016, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
> Final reminder before I take out the test case from the test file.
>
>
> On Thursday 13 October 2016, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>
>>
>>
>> On Wednesday 12 October 2016, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>>
>>> So the test fails (intermittently) in check_fs, which does a df on the
>>> mount point of a volume carved out of three bricks from 3 nodes while one
>>> node is completely down. A quick look at the mount log reveals the
>>> following:
>>>
>>> [2016-10-10 13:58:59.279446]:++++++++++
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
>>> /mnt/glusterfs/0 ++++++++++
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)
>>> resolution failed
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
>>> resolution fail
>>>
>>> DHT team - are these anomalies expected here? I also see opendir and
>>> statfs failing here.
>>
>>
>> Any luck with this? I don't see any relevance of the check_fs check
>> w.r.t. the bug this test case is tagged to. If I don't hear back on this
>> in a few days, I'll go ahead and remove this check from the test to avoid
>> the spurious failure.
>>
Looks like dht was not aware of a subvolume being down. In dht we pick
the first_up_subvolume for winding the lookup on the root gfid, and in
this case the subvolume picked refers to the brick that was brought
down, hence the failure.
The test has this snippet:
<snippet>
# Kill one pseudo-node, make sure the others survive and volume stays up.
TEST kill_node 3;
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
EXPECT 0 check_fs $M0;
</snippet>
Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
percolate to dht?
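Something along these lines - the timeout constant is only a suggestion,
any of the existing timeouts defined in tests/include.rc (I am using
$PROCESS_UP_TIMEOUT as a placeholder here) should be large enough to
absorb the propagation delay:
<snippet>
# Kill one pseudo-node, make sure the others survive and volume stays up.
TEST kill_node 3;
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
# Retry until dht has processed CHILD_DOWN for the killed brick, instead
# of asserting on the very first statfs.
EXPECT_WITHIN $PROCESS_UP_TIMEOUT 0 check_fs $M0;
</snippet>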
Logs indicate that dht was not aware of the subvolume being down for
at least 1 second after protocol/client sensed the disconnection.
[2016-10-10 13:58:58.235700] I [MSGID: 114018]
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
patchy-client-2. Client process will keep trying to connect to
glusterd until brick's port is available
[2016-10-10 13:58:58.245060]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
online_brick_count ++++++++++
[2016-10-10 13:58:59.279446]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
/mnt/glusterfs/0 ++++++++++
[2016-10-10 13:58:59.287973] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288326] I [MSGID: 109063]
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2016-10-10 13:58:59.288352] W [MSGID: 109005]
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
[2016-10-10 13:58:59.288643] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288927] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.288949] W
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
(00000000-0000-0000-0000-000000000001) resolution failed
[2016-10-10 13:58:59.289505] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
resolution fail
Regards,
Vijay
--Atin
--
~ Atin (atinm)
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel