I'm not convinced this is solved. Just had what I believe is a similar failure:
A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified
(of course, it could always be my patch!)
https://build.gluster.org/job/centos7-regression/5384/console
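
In case it helps whoever looks at the builder, the usual checks and workaround would be something like this (assuming the stock nfs-utils/systemd setup on the builder; the mount target below is just a placeholder):

    # see why the dependency failed and whether statd is running
    journalctl -xe -u rpc-statd.service
    systemctl status rpc-statd.service
    # either start statd ...
    systemctl start rpc-statd.service
    # ... or, as the error suggests, keep locks local
    mount -t nfs -o nolock <server>:/<export> /mnt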
On Thu, Apr 4, 2019 at 6:56 PM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
Thanks misc. I have always seen a pattern that on a reattempt (recheck centos) the same builder is picked up many times, even though it is supposed to pick builders in a round-robin manner.

On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer <mscherer@xxxxxxxxxx> wrote:
On Thursday, April 4, 2019 at 15:19 +0200, Michael Scherer wrote:
> On Thursday, April 4, 2019 at 13:53 +0200, Michael Scherer wrote:
> > On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
> > > Based on what I have seen, any multi-node test case will fail, and
> > > the above one is picked first from that group, so if I am correct,
> > > none of the code fixes will get through regression until this is
> > > fixed. I suspect it to be an infra issue again. If we look at
> > > https://review.gluster.org/#/c/glusterfs/+/22501/ &
> > > https://build.gluster.org/job/centos7-regression/5382/, peer
> > > handshaking is stuck because 127.1.1.1 never receives a response
> > > back. Did we end up with the firewall or other network settings
> > > screwed up? The test never fails locally.
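> > >
> > > (A quick sanity check on the builder would be something along
> > > these lines; 24007 being the default glusterd port, and the
> > > 127.1.1.x addresses coming from the test's simulated cluster:)
> > >
> > >     # is the loopback alias reachable at all?
> > >     ping -c 3 127.1.1.1
> > >     # does glusterd on that address accept TCP connections?
> > >     timeout 2 bash -c 'cat < /dev/null > /dev/tcp/127.1.1.1/24007' && echo open || echo closed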
> >
> > The firewall didn't change, and it has had the line
> > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
> > loopback interface is accepted. (I am not even sure that netfilter
> > does anything meaningful on the loopback interface, but maybe I am
> > wrong, and I am not keen on digging through kernel code to find out.)
> >
> > Ping seems to work fine as well, so we can exclude a routing issue.
> >
> > Maybe we should look at the socket: does it listen on a specific
> > address or not?
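> >
> > (Something like this should show it, assuming the usual iproute2 or
> > net-tools packages are on the builder:)
> >
> >     # which addresses are glusterd/rpcbind actually bound to?
> >     ss -tlnp | grep -E 'glusterd|rpcbind'
> >     # or, with net-tools:
> >     netstat -tlnp | grep -E 'glusterd|rpcbind'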
>
> So, I looked at the first 20 failures, removed everything not related
> to rebal-all-nodes-migrate.t, and saw that all of them ran on
> builder203, which was freshly reinstalled. As Deepshika noticed today,
> that builder had an issue with IPv6, the second issue we were tracking.
>
> Summary: the rpcbind.socket systemd unit listens on IPv6 despite IPv6
> being disabled, and the fix is to reload systemd. We have so far no
> idea why it happens, but suspect it might be related to the network
> issue we did identify, since it only happens after a reboot, which in
> turn only happens when a build is cancelled/crashed/aborted.
>
> I applied the workaround on builder203, so if the culprit is that
> specific issue, I guess that's fixed.
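>
> (For the record, the workaround amounts to roughly the following; the
> exact unit names assume the stock rpcbind packaging:)
>
>     # make systemd re-read its configuration so rpcbind.socket picks
>     # up the disabled-IPv6 state
>     systemctl daemon-reload
>     # recreate the listeners
>     systemctl restart rpcbind.socket rpcbind.service
>     # verify nothing is left listening on IPv6 (no tcp6/udp6 entries)
>     ss -lnp | grep rpcbind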
>
> I started a test to see how it goes:
> https://build.gluster.org/job/centos7-regression/5383/
The test just passed, so I would assume the problem was local to
builder203. I am not sure why it was always selected, except that it
was the only one failing, so it was always free to pick up new jobs.
Maybe we should increase the number of builders so this doesn't happen,
as I guess the other builders were busy at that time?
--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel