On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > is_nfs_export_available is just a wrapper around the "showmount"
> > command, AFAIR. I saw the following messages in the console output:
> >
> > mount.nfs: rpc.statd is not running but is required for remote
> > locking.
> > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or
> > start statd.
> > 05:06:55 mount.nfs: an incorrect mount option was specified
> >
> > To me it looks like rpcbind may not be running on the machine.
> > rpcbind usually starts automatically, so I don't know whether it can
> > fail to start.
>
> That's precisely the question: why are we suddenly seeing this happen
> so frequently? Today I have seen at least 4 or 5 such failures already.
>
> Deepshika - Can you please help in inspecting this?

We think (though we are not sure) that the issue is a bit complex.

What we were investigating was a nightly-run failure on AWS. When the build crashes, the builder is restarted, since that is the easiest way to clean everything up (even with a perfect test suite that cleaned up after itself, we could always end up with the system in a corrupt state with respect to mounts, filesystems, etc.).

In turn, this seems to cause trouble on AWS, since cloud-init or something renames the eth0 interface to ens5 without updating the network configuration. So the network init script fails (because the image says "start eth0" and that interface is not present), but it fails in a weird way: the network is initialised and working (we can connect), yet the dhclient process is not in the right cgroup and network.service is in a failed state. Restarting the network didn't help.

In turn, this means that rpc-statd refuses to start (due to systemd dependencies), which seems to impact various NFS tests.

We have also seen rpcbind pick up an IPv6 autoconfigured address on some builders, but we can't reproduce that, and there is no IPv6 configured anywhere.
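The failure chain described above (rpcbind.socket down, so rpc-statd refuses to start, so mount.nfs fails without '-o nolock') can be sketched as a small shell helper. This is a hypothetical illustration, not code from the test suite or the builders; the function name and messages are invented, but the dependency logic follows the units named in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the test suite) illustrating the chain:
# rpcbind.socket must be active before rpc-statd can start, and rpc.statd
# must be running before mount.nfs can take remote locks.
check_nfs_prereqs() {
    rpcbind_state="$1"   # e.g. output of: systemctl is-active rpcbind.socket
    statd_state="$2"     # e.g. output of: systemctl is-active rpc-statd.service

    if [ "$rpcbind_state" != "active" ]; then
        echo "rpcbind.socket is not active: rpc-statd will refuse to start"
        return 1
    fi
    if [ "$statd_state" != "active" ]; then
        echo "rpc.statd is not running: mount.nfs needs '-o nolock' or statd"
        return 1
    fi
    echo "NFS locking prerequisites look satisfied"
}
```

On a real builder the two arguments would come from `systemctl is-active rpcbind.socket` and `systemctl is-active rpc-statd.service`, which is roughly what the console errors quoted above boil down to.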
I suspect the network.service failure is somehow involved, but I fail to see how. In turn, rpcbind.socket not starting could cause the NFS test troubles.

Our current stop-gap fix was to repair the builders one by one: remove the stale config, kill the rogue dhclient, and restart the network service. However, we can't be sure this fixes the problem long term, since it only manifests after a crash of the test suite, and that doesn't happen very often. (Plus, it was working until some day in the past when something made it start failing, and I don't know whether that was a system upgrade, a test change, or both.)

So we are still looking into it to get a complete understanding of the issue, but so far we have hacked our way to make it work (or so I think). Deepshika is working on a long-term fix, addressing the eth0/ens5 issue with a new base image.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel