Re: [Gluster-infra] rebal-all-nodes-migrate.t always fails now

I recently added three builders (builder208, builder209, builder210) to the regression pool. The network on these new builders did not come up after a reboot because it was looking for a non-existent ethernet card, eth0. I'll reconnect them and update here once I fix the issue today.

Sorry for the inconvenience.


On Tue, Jun 4, 2019 at 7:07 PM Yaniv Kaul <ykaul@xxxxxxxxxx> wrote:
What was the result of this investigation? I suspect I'm seeing the same issue on builder209[1].
Y.


On Fri, Apr 5, 2019 at 5:40 PM Michael Scherer <mscherer@xxxxxxxxxx> wrote:
On Friday, 5 April 2019 at 16:55 +0530, Nithya Balachandran wrote:
> On Fri, 5 Apr 2019 at 12:16, Michael Scherer <mscherer@xxxxxxxxxx>
> wrote:
>
> > On Thursday, 4 April 2019 at 18:24 +0200, Michael Scherer wrote:
> > > On Thursday, 4 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
> > > > I'm not convinced this is solved. Just had what I believe is a
> > > > similar
> > > > failure:
> > > >
> > > > 00:12:02.532 A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
> > > > 00:12:02.532 mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > 00:12:02.532 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > 00:12:02.532 mount.nfs: an incorrect mount option was specified
> > > >
> > > > (of course, it can always be my patch!)
> > > >
> > > > https://build.gluster.org/job/centos7-regression/5384/console
> > >
> > > Same issue, different builder (206). I will check them all, as
> > > the issue is more widespread than I expected (or it has popped
> > > up since the last time I checked).
> >
> > Deepshika noticed that the issue came back on one server
> > (builder202) after a reboot, so the rpcbind issue is not related to
> > the network initscript one, and the RCA continues.
> >
> > We are looking for another workaround involving fiddling with the
> > socket (until we find out why it uses ipv6 at boot, but not
> > afterwards, when ipv6 is disabled).
> >
>
> Could this be relevant?
> https://access.redhat.com/solutions/2798411

Good catch.

So, we already do that; Nigel took care of it (after 2 days of
research). But I didn't know the exact symptoms, and decided to double
check just in case.

And... there is no sysctl.conf in the initrd. Running "dracut -v -f" does
not change anything.

Running "dracut -v -f -H" takes care of that (and this fixes the problem),
but:
- our ansible script already runs that
- -H is hostonly, which is already the default on EL7 according to the
doc.

However, if dracut-config-generic is installed, dracut doesn't build a
hostonly initrd, and so does not include the sysctl.conf file (which breaks
rpcbind, which breaks the test suite).

And for some reason, it is installed in the image in ec2 (likely by
default), but not by default on the builders.

So what happens is that after a kernel upgrade, dracut rebuilds a generic
initrd instead of a hostonly one, which breaks things. The kernel was
likely upgraded recently (upgrades happen nightly, for some value of
"night"), so we didn't see that earlier, nor with a fresh system.
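
A quick way to tell whether a given builder is affected is to look for
sysctl.conf in the initrd listing. The following is only a minimal sketch
under stated assumptions, not what the ansible scripts do: it assumes
lsinitrd (shipped with dracut) is available and that each entry of its
listing ends with the file path. The helper names are hypothetical.

```python
import subprocess


def initrd_has_sysctl_conf(listing: str) -> bool:
    """Return True if an lsinitrd-style listing contains etc/sysctl.conf."""
    for line in listing.splitlines():
        fields = line.split()
        # Assumption: the path is the last whitespace-separated field.
        if fields and fields[-1].lstrip("/") == "etc/sysctl.conf":
            return True
    return False


def list_initrd(image=None) -> str:
    """Run lsinitrd; with no argument it inspects the initramfs of the
    running kernel. Usually needs root to read the image."""
    cmd = ["lsinitrd"] + ([image] if image else [])
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

On a builder, `initrd_has_sysctl_conf(list_initrd())` returning False would
match the symptom described above (no sysctl.conf in the initrd).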


So now, we have several solutions:
- be explicit about using hostonly in dracut, so this doesn't happen again
(or not for this reason)

- disable ipv6 in rpcbind in a cleaner way (to be tested)

- get the test suite to work with ipv6
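
For the first option, the idea would be a drop-in under /etc/dracut.conf.d/
that sets hostonly="yes" and sorts after whatever dracut-config-generic
ships. The sketch below only models that ordering and is a simplification
of how dracut merges its configuration. The file names are assumptions:
02-generic-image.conf is what the generic package is believed to install,
and 99-force-hostonly.conf is a made-up name for the override.

```python
import re

# Matches lines such as: hostonly="no"   or   hostonly=yes
_HOSTONLY = re.compile(r'^\s*hostonly\s*=\s*"?(\w+)"?\s*$')


def effective_hostonly(conf_files: dict, default: str = "yes") -> str:
    """Return the hostonly value after reading the drop-ins in name order.

    conf_files maps a file name to its content. Files are read in sorted
    order, so a later file overrides an earlier one. The default of "yes"
    reflects the EL7 behaviour mentioned above.
    """
    value = default
    for name in sorted(conf_files):
        for line in conf_files[name].splitlines():
            match = _HOSTONLY.match(line)
            if match:
                value = match.group(1)
    return value


# Assumed content of the generic drop-in, plus a hypothetical override.
generic_only = {"02-generic-image.conf": 'hostonly="no"\n'}
with_override = dict(generic_only)
with_override["99-force-hostonly.conf"] = 'hostonly="yes"\n'
```

With only the generic drop-in present the result is "no" (a generic
initrd); adding the override brings it back to "yes", independent of
whether dracut-config-generic is installed.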

In the long term, I also want to monitor the processes, but for that, I
need a VPN between the nagios server and ec2, and that project got
blocked by several issues (like EC2 not supporting ecdsa keys, which we use
for ansible, so we have to go back to RSA for fully automated
deployment; and openvpn requires certificates, so I need a newer
python openssl for doing what I want, and RHEL 7 is too old, etc, etc).

As the weekend approaches for me, I just rebuilt the initrd for the time
being. I guess forcing hostonly is the safest fix for now, but this
will be for Monday.
--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS


_______________________________________________

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
