On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > is_nfs_export_available is just a wrapper around the "showmount"
> > command, AFAIR. I saw the following messages in the console output:
> >
> > mount.nfs: rpc.statd is not running but is required for remote
> > locking.
> > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or
> > start statd.
> > 05:06:55 mount.nfs: an incorrect mount option was specified
> >
> > To me it looks like rpcbind may not be running on the machine.
> > rpcbind usually starts automatically, so I don't know whether it can
> > fail to start.
>
> That's precisely the question: why are we suddenly seeing this happen
> so frequently? Today I have seen at least 4 or 5 such failures already.
>
> Deepshika - Can you please help in inspecting this?

We think (though we are not sure) that the issue is a bit complex.

What we were investigating was a nightly-run failure on AWS. When the build crashes, the builder is restarted, since that is the easiest way to clean everything up (even with a perfect test suite that cleaned up after itself, we could always end up with the system in a corrupt state with respect to mounts, filesystems, etc.).

In turn, this seems to cause trouble on AWS, since cloud-init or something renames the eth0 interface to ens5 without updating the network configuration. So the network init script fails (because the image says "start eth0" and that interface is not present), but it fails in a weird way: the network is initialised and working (we can connect), yet the dhclient process is not in the right cgroup and network.service is in a failed state. Restarting the network didn't help.

In turn, this means that rpc-statd refuses to start (due to systemd dependencies), which seems to impact various NFS tests.

We have also seen rpcbind pick up an IPv6 autoconfigured address on some builders, but we can't reproduce that, and there is no IPv6 configured anywhere.
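The failure chain described above (rpcbind.socket down, so rpc-statd refuses to start, so mount.nfs fails without '-o nolock') can be sketched as a small shell helper. This is a hypothetical illustration, not code from the test suite or the builders; the function name and messages are invented, but the dependency logic follows the units named in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the test suite) illustrating the chain:
# rpcbind.socket must be active before rpc-statd can start, and rpc.statd
# must be running before mount.nfs can take remote locks.
check_nfs_prereqs() {
    rpcbind_state="$1"   # e.g. output of: systemctl is-active rpcbind.socket
    statd_state="$2"     # e.g. output of: systemctl is-active rpc-statd.service

    if [ "$rpcbind_state" != "active" ]; then
        echo "rpcbind.socket is not active: rpc-statd will refuse to start"
        return 1
    fi
    if [ "$statd_state" != "active" ]; then
        echo "rpc.statd is not running: mount.nfs needs '-o nolock' or statd"
        return 1
    fi
    echo "NFS locking prerequisites look satisfied"
}
```

On a real builder the two arguments would come from `systemctl is-active rpcbind.socket` and `systemctl is-active rpc-statd.service`, which is roughly what the console errors quoted above boil down to.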
I suspect the network.service failure is somehow involved, but I fail to see how. In turn, rpcbind.socket not starting could cause the NFS test troubles.

Our current stop-gap fix was to repair the builders one by one: remove the stale config, kill the rogue dhclient, and restart the network service. However, we can't be sure this fixes the problem long term, since it only manifests after a crash of the test suite, and that doesn't happen very often. (Plus, it was working until some day in the past when something made it start failing, and I don't know whether that was a system upgrade, a test change, or both.)

So we are still looking into it to get a complete understanding of the issue, but so far we have hacked our way to make it work (or so I think). Deepshika is working on a long-term fix, addressing the eth0/ens5 issue with a new base image.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel