On Tuesday, 7 May 2019 at 20:04 +0530, Sanju Rakonde wrote:
> Looks like is_nfs_export_available started failing again in recent
> centos-regressions.
>
> Michael, can you please check?

I will try, but I am leaving for vacation tonight, so if I find nothing
before I leave, I guess Deepshika will have to look.

> On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <ykaul@xxxxxxxxxx> wrote:
> > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer <mscherer@xxxxxxxxxx> wrote:
> > > On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> > > > Is this back again? The recent patches are failing regression :-\ .
> > >
> > > So, on builder206, it took me a while to find that the issue is
> > > that nfs (the service) was running.
> > >
> > > ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
> > > failed with a rather cryptic message:
> > >
> > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
> > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to failed: Address already in use
> > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
> > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
> > >
> > > I found where this came from, but a few things surprised me:
> > >
> > > - the order of the printed messages is different from the order in
> > > the code
> >
> > Indeed strange...
> >
> > > - the "started listening" message didn't take into account the fact
> > > that the bind failed on:
> >
> > Shouldn't it bail out if it failed to bind?
> > Some missing 'goto out' around line 975/976?
> > Y.
> > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
> > >
> > > The message about port 38465 also threw me off the track. The real
> > > issue is that the service nfs was already running, and I couldn't
> > > find anything listening on port 38465.
> > >
> > > Once I did "service nfs stop", it no longer failed.
> > >
> > > So far, I do not know why nfs.service was activated.
> > >
> > > But at least 206 should be fixed, and we know a bit more about what
> > > could be causing some of the failures.
> > >
> > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <mscherer@xxxxxxxxxx> wrote:
> > > > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan@xxxxxxxxxx> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > is_nfs_export_available is just a wrapper around the
> > > > > > > "showmount" command, AFAIR. I saw the following messages in
> > > > > > > the console output:
> > > > > > >
> > > > > > > mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > > > > > >
> > > > > > > To me it looks like rpcbind may not be running on the
> > > > > > > machine. Usually rpcbind starts automatically on machines;
> > > > > > > I don't know whether this can happen or not.
> > > > > >
> > > > > > That's precisely the question: why are we suddenly seeing
> > > > > > this happen so frequently? Today I saw at least 4 or 5 such
> > > > > > failures already.
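[Editor's note] The failure mode discussed above (a "started listening" message printed even though the bind failed, and no early bail-out) can be sketched in a standalone way. This is illustrative only, not the real socket.c code: `try_bind` is a made-up helper, and the log strings merely echo the ones in the quoted log.

```c
/* Minimal sketch of the pattern being proposed: log success only
 * after bind()/listen() succeed, and bail out via a single cleanup
 * label (the 'goto out' idea) on failure. Illustrative names only. */
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int try_bind(uint16_t port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        /* The "Address already in use" branch: report and bail out,
         * instead of falling through to the success message. */
        fprintf(stderr, "binding to port %u failed: %s\n",
                (unsigned)port, strerror(errno));
        goto err;
    }
    if (listen(sock, 16) != 0)
        goto err;

    /* Only reached when bind and listen both succeeded. */
    printf("process started listening on port (%u)\n", (unsigned)port);
    return sock;

err:
    close(sock); /* single cleanup point, as a 'goto out' provides */
    return -1;
}
```

With this shape, a second caller on the same port hits the EADDRINUSE branch and never prints the misleading "started listening" line, which is what the missing bail-out in the quoted log would have prevented.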
> > > > > > Deepshika - Can you please help in inspecting this?
> > > > >
> > > > > So we think (we are not sure) that the issue is a bit complex.
> > > > >
> > > > > What we were investigating was a nightly run failure on AWS.
> > > > > When the build crashes, the builder is restarted, since that's
> > > > > the easiest way to clean everything (even with a perfect test
> > > > > suite that cleaned up after itself, we could always end up with
> > > > > the system in a corrupt state WRT mounts, filesystems, etc.).
> > > > >
> > > > > In turn, this seems to cause trouble on AWS, since cloud-init
> > > > > or something renames the eth0 interface to ens5 without cleaning
> > > > > up the network configuration.
> > > > >
> > > > > So the network init script fails (because the image says "start
> > > > > eth0" and that interface is not present), but it fails in a
> > > > > weird way. The network is initialised and working (we can
> > > > > connect), but the dhclient process is not in the right cgroup,
> > > > > and network.service is in a failed state. Restarting the network
> > > > > didn't work. In turn, this means that rpc-statd refuses to start
> > > > > (due to systemd dependencies), which seems to impact various
> > > > > NFS tests.
> > > > >
> > > > > We have also seen that on some builders rpcbind picks up some
> > > > > IPv6 autoconfiguration, but we can't reproduce that, and there
> > > > > is no IPv6 set up anywhere. I suspect the network.service
> > > > > failure is somehow involved, but I fail to see how. In turn,
> > > > > rpcbind.socket not starting could cause NFS test troubles.
> > > > >
> > > > > Our current stopgap fix was to fix all the builders one by one:
> > > > > remove the config, kill the rogue dhclient, restart the network
> > > > > service.
> > > > > However, we can't be sure this is going to fix the problem
> > > > > long term, since it only manifests after a crash of the test
> > > > > suite, and that doesn't happen so often. (Plus, it was working
> > > > > until some day in the past when something made it start
> > > > > failing, and I do not know whether that was a system upgrade,
> > > > > a test change, or both.)
> > > > >
> > > > > So we are still looking at it to get a complete understanding
> > > > > of the issue, but so far we hacked our way to make it work (or
> > > > > so I think).
> > > > >
> > > > > Deepshika is working to fix it long term, by fixing the
> > > > > eth0/ens5 issue with a new base image.
> > > > > --
> > > > > Michael Scherer
> > > > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > >
> > > > --
> > > > - Atin (atinm)
> > >
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure

--
Michael Scherer
Sysadmin, Community Infrastructure
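[Editor's note] On the "rpc-statd refuses to start (due to systemd dependencies)" point above: systemd refuses to start a unit whose hard `Requires=` dependency has failed, so a failed network.service can transitively block the NFS helper daemons. The fragment below is purely illustrative of that mechanism; it is not the actual CentOS unit file for rpc-statd, whose exact dependencies were not quoted in this thread.

```ini
# Hypothetical unit fragment, for illustration only.
# If a unit listed in Requires= fails to start, systemd will refuse
# to start this unit too -- matching the observed symptom where a
# failed network.service left rpc-statd unable to start.
[Unit]
Description=Example status daemon (illustrative, not the real unit)
Requires=rpcbind.socket
After=network.target rpcbind.socket
```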
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel