Re: [Gluster-infra] rebal-all-nodes-migrate.t always fails now

On Thursday, 4 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
> I'm not convinced this is solved. Just had what I believe is a
> similar failure:
> 
> *00:12:02.532* A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
> *00:12:02.532* mount.nfs: rpc.statd is not running but is required for remote locking.
> *00:12:02.532* mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> *00:12:02.532* mount.nfs: an incorrect mount option was specified
> 
> (of course, it can always be my patch!)
> 
> https://build.gluster.org/job/centos7-regression/5384/console

Same issue, different builder (206). I will check them all, as the
issue is more widespread than I expected (or it popped up since the
last time I checked).
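
For the rpc.statd error quoted above, a minimal sketch of how to check
the statd/rpcbind state on a builder (assuming the stock CentOS 7
units; the exact steps used on the builders may differ):

  # check why rpc-statd's dependency failed and whether rpcbind is up
  systemctl status rpc-statd rpcbind.service rpcbind.socket
  journalctl -u rpc-statd -u rpcbind.socket --since "1 hour ago"

  # rpcbind should answer on the portmapper port (111)
  rpcinfo -p localhost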


> 
> On Thu, Apr 4, 2019 at 6:56 PM Atin Mukherjee <amukherj@xxxxxxxxxx>
> wrote:
> 
> > Thanks misc. I have always seen a pattern where, on a reattempt
> > (recheck centos), the same builder is picked many times, even though
> > builders are supposed to be picked in a round-robin manner.
> > 
> > On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer <mscherer@xxxxxxxxxx>
> > wrote:
> > 
> > > On Thursday, 4 April 2019 at 15:19 +0200, Michael Scherer wrote:
> > > > On Thursday, 4 April 2019 at 13:53 +0200, Michael Scherer wrote:
> > > > > On Thursday, 4 April 2019 at 16:13 +0530, Atin Mukherjee wrote:
> > > > > > Based on what I have seen, any multi-node test case will
> > > > > > fail, and the above one is picked first from that group. If
> > > > > > I am correct, none of the code fixes will go through
> > > > > > regression until this is fixed. I suspect it to be an infra
> > > > > > issue again. If we look at
> > > > > > https://review.gluster.org/#/c/glusterfs/+/22501/ &
> > > > > > https://build.gluster.org/job/centos7-regression/5382/, peer
> > > > > > handshaking is stuck as 127.1.1.1 is unable to receive a
> > > > > > response back; did we end up having the firewall and other
> > > > > > n/w settings screwed up? The test never fails locally.
> > > > > 
> > > > > The firewall didn't change, and it has had the line
> > > > > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on
> > > > > the localhost interface is accepted. (I am not even sure that
> > > > > netfilter does anything meaningful on the loopback interface,
> > > > > but maybe I am wrong, and I am not keen on digging through
> > > > > kernel code to check.)
> > > > > 
> > > > > 
> > > > > Ping seems to work fine as well, so we can exclude a routing
> > > > > issue.
> > > > > 
> > > > > Maybe we should look at the socket: does it listen on a
> > > > > specific address or not?
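
A minimal sketch of such a check, assuming iptables and iproute2 (ss)
are available on the builder, and taking 111 (portmapper) and 24007
(glusterd) as the ports of interest:

  # confirm the loopback accept rule is still in place
  iptables -S INPUT | grep -- '-i lo'

  # see which addresses the portmapper and glusterd sockets are bound to
  ss -tlnp | grep -E ':(111|24007) '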
> > > > 
> > > > So, I looked at the first 20 failures, removed all not related
> > > > to rebal-all-nodes-migrate.t, and saw that all were run on
> > > > builder203, which was freshly reinstalled. As Deepshika noticed
> > > > today, this one had an issue with IPv6, the 2nd issue we were
> > > > tracking.
> > > > 
> > > > Summary: the rpcbind.socket systemd unit listens on IPv6 despite
> > > > IPv6 being disabled, and the fix is to reload systemd. We so far
> > > > have no idea why it happens, but we suspect it might be related
> > > > to the network issue we did identify, since that happens only
> > > > after a reboot, which in turn happens only if a build is
> > > > cancelled/crashed/aborted.
> > > > 
> > > > I applied the workaround on builder203, so if the culprit is
> > > > that specific issue, I guess that's fixed.
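
A minimal sketch of that kind of workaround, assuming the stock rpcbind
units on CentOS 7 (the exact commands applied on builder203 may differ):

  # pick up the regenerated unit/socket configuration and restart rpcbind
  systemctl daemon-reload
  systemctl restart rpcbind.socket rpcbind.service

  # verify the portmapper no longer listens on an IPv6 address
  ss -tln | grep ':111'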
> > > > 
> > > > I started a test to see how it goes:
> > > > https://build.gluster.org/job/centos7-regression/5383/
> > > 
> > > The test just passed, so I would assume the problem was local to
> > > builder203. Not sure why it was always selected, except that,
> > > since it was the only one that failed, it was always free to pick
> > > up new jobs.
> > > 
> > > Maybe we should increase the number of builders so this doesn't
> > > happen, as I guess the other builders were busy at that time?
> > > 
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > 
> > > 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



