Re: [Gluster-infra] is_nfs_export_available from nfs.rc failing too often?

Atin Mukherjee <amukherj@xxxxxxxxxx> · Thu, 9 May 2019 10:01:47 +0530

On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
builder204 needs to be fixed: there are too many failures, and almost none of the patches are passing regression.

And with that builder201 joins the pool, https://build.gluster.org/job/centos7-regression/5943/consoleFull

On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:

On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde <srakonde@xxxxxxxxxx> wrote:
Deepshikha,
I see the failure here[1] which ran on builder206. So, we are good.

Not really. https://build.gluster.org/job/centos7-regression/5909/consoleFull failed on builder204, for similar reasons I believe?

I am a bit more worried about this issue resurfacing more often these days. What can we do to fix this permanently?

[1] https://build.gluster.org/job/centos7-regression/5901/consoleFull

On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal <dkhandel@xxxxxxxxxx> wrote:
Sanju, can you please give us more info about the failures?

I see the failures occurring on just one of the builders (builder206). I'm taking it back offline for now.

On Tue, May 7, 2019 at 9:42 PM Michael Scherer <mscherer@xxxxxxxxxx> wrote:
On Tuesday, 7 May 2019 at 20:04 +0530, Sanju Rakonde wrote:

> Looks like is_nfs_export_available started failing again in recent

> centos-regressions.

> 

> Michael, can you please check?

I will try, but I am leaving for vacation tonight, so if I find nothing
before I leave, I guess Deepshikha will have to look.

> On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <ykaul@xxxxxxxxxx> wrote:

> 

> > 

> > 

> > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer <mscherer@xxxxxxxxxx> wrote:

> > 

> > > On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:

> > > > Is this back again? The recent patches are failing regression

> > > > :-\ .

> > > 

> > > So, on builder206, it took me a while to find that the issue is that
> > > nfs (the service) was running.

> > > 

> > > ./tests/basic/afr/tarissue.t failed, because the nfs

> > > initialisation

> > > failed with a rather cryptic message:

> > > 

> > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
> > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to  failed: Address already in use
> > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
> > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14

> > > 

> > > I found where this came from, but a few things surprised me:

> > > 

> > > - the order of the printed messages is different from the order in the code

> > > 

> > 

> > Indeed strange...

> > 

> > > - the "started listening" message didn't take into account the fact
> > > that bind failed on:

> > > 

> > 

> > Shouldn't it bail out if it failed to bind?

> > Some missing 'goto out' around line 975/976?

> > Y.
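
A minimal sketch of the control flow Yaniv is suggesting, written in Python rather than the C of socket.c. The function name `bind_and_listen` and the `log` callable are hypothetical and only illustrate the idea: when bind fails, return immediately (the equivalent of a 'goto out') so that the "started listening" message can never follow a failed bind. It does not reproduce what socket.c actually does.

```python
import errno
import socket


def bind_and_listen(host, port, log):
    """Bind a TCP socket and return it only if bind() succeeded.

    `log` is any callable taking a message string (hypothetical; it
    stands in for the logging calls in the real code).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        # Bail out right here: close the socket and never reach the
        # "started listening" message below.
        if exc.errno == errno.EADDRINUSE:
            reason = "Port is already in use"
        else:
            reason = exc.strerror
        log(f"binding to {host}:{port} failed: {reason}")
        sock.close()
        return None
    sock.listen()
    log(f"process started listening on port ({sock.getsockname()[1]})")
    return sock
```

With this shape, a caller sees either one "failed" message and `None`, or one "started listening" message and a usable socket, never both.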

> > 

> > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967

> > > 

> > > The message about port 38465 also threw me off track. The real
> > > issue is that the nfs service was already running, and I couldn't
> > > find anything listening on port 38465.

> > > 

> > > Once I ran "service nfs stop", it no longer failed.
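
One way to check a builder for this kind of conflict before a run is to probe the port directly. The sketch below is an assumption about how such a probe could look, not something taken from the test suite: `port_is_free` is a hypothetical helper that tries a TCP bind and treats "Address already in use" as "taken". It only covers IPv4 TCP, and since nothing was found listening on 38465 in the case above, the port to probe is itself a guess.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP bind to (host, port) succeeds right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            # Anything else (for example a permission error on a
            # privileged port) is not a port conflict, so surface it.
            raise
        return True
```

For example, `port_is_free(38465)` probes the port named in the log. The probe socket is closed on return, so the check does not hold the port itself.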

> > > 

> > > So far, I do not know why nfs.service was activated.

> > > 

> > > But at least 206 should be fixed, and we know a bit more about what
> > > could be causing some of the failures.

> > > 

> > > 

> > > 

> > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <mscherer@xxxxxxxxxx> wrote:

> > > > 

> > > > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan@xxxxxxxxxx> wrote:

> > > > > > 

> > > > > > > Hi,

> > > > > > > 

> > > > > > > is_nfs_export_available is just a wrapper around the "showmount"
> > > > > > > command AFAIR.
> > > > > > > I saw the following messages in the console output:
> > > > > > >  mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified

> > > > > > > 

> > > > > > > To me it looks like rpcbind may not be running on the machine.
> > > > > > > Usually rpcbind starts automatically on machines, so I don't know
> > > > > > > whether that can happen or not.
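
To make the "wrapper around showmount" description concrete, here is a rough Python analogue. It is a sketch under assumptions, not the shell function from nfs.rc: the names `export_listed` and `is_export_available` are hypothetical, and the parsing assumes the usual `showmount -e` layout of one header line followed by one export per line.

```python
import subprocess


def export_listed(showmount_output, export_path):
    """Return True if `export_path` appears in `showmount -e` style output."""
    # The first line is the "Export list for <host>:" header, so skip it.
    for line in showmount_output.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0] == export_path:
            return True
    return False


def is_export_available(export_path, host="localhost", timeout=10):
    """Hypothetical Python analogue of is_nfs_export_available."""
    try:
        result = subprocess.run(
            ["showmount", "-e", host],
            capture_output=True, text=True, timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # showmount is missing, or the query to the NFS server hung.
        return False
    return result.returncode == 0 and export_listed(result.stdout, export_path)
```

A check like this can only report that the export is not visible. It cannot tell a missing export apart from rpcbind or rpc.statd not running, which is why those services have to be inspected separately.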

> > > > > > > 

> > > > > > 

> > > > > > That's precisely the question: why are we suddenly seeing this
> > > > > > happen so frequently? Today I have already seen at least 4 to 5
> > > > > > such failures.

> > > > > > 

> > > > > > Deepshikha - Can you please help in inspecting this?

> > > > > 

> > > > > So we think (we are not sure) that the issue is a bit

> > > > > complex.

> > > > > 

> > > > > What we were investigating was the nightly run failing on AWS. When
> > > > > the build crashes, the builder is restarted, since that's the
> > > > > easiest way to clean everything (even with a perfect test suite
> > > > > that cleans up after itself, we could always end up in a corrupt
> > > > > state on the system, WRT mounts, filesystems, etc.).

> > > > > 

> > > > > In turn, this seems to cause trouble on AWS, since cloud-init or
> > > > > something else renames the eth0 interface to ens5 without cleaning
> > > > > up the network configuration.
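
A stale configuration of this kind can be spotted by comparing the interfaces that have a configuration file with the interfaces the kernel actually exposes. The sketch below assumes the CentOS 7 layout, with `ifcfg-<name>` files under /etc/sysconfig/network-scripts and one entry per device under /sys/class/net. The helper name `stale_ifcfg_devices` is hypothetical, and it goes by file name only, ignoring any DEVICE= setting inside the files.

```python
import os


def stale_ifcfg_devices(ifcfg_dir="/etc/sysconfig/network-scripts",
                        sysfs_net="/sys/class/net"):
    """List interfaces that have an ifcfg-<name> file but no kernel device."""
    prefix = "ifcfg-"
    # Interfaces the kernel currently knows about (eth0, ens5, lo, ...).
    present = set(os.listdir(sysfs_net))
    # Interfaces the network scripts are configured to bring up.
    configured = {
        name[len(prefix):]
        for name in os.listdir(ifcfg_dir)
        if name.startswith(prefix)
    }
    return sorted(configured - present)
```

On a builder where the image still says "start eth0" but the device is now ens5, this would return `["eth0"]`.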

> > > > > 

> > > > > So the network init script fails (because the image says "start
> > > > > eth0" and that's not present), but it fails in a weird way. The
> > > > > network is initialised and working (we can connect), but the
> > > > > dhclient process is not in the right cgroup, and network.service
> > > > > is in a failed state. Restarting the network didn't work. In turn,
> > > > > this means that rpc-statd refuses to start (due to systemd
> > > > > dependencies), which seems to impact various NFS tests.
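
The dependency chain described above suggests a quick pre-flight check on a builder: ask systemd for the state of each unit involved and flag anything that is not active. This is a sketch only. The unit names are the ones mentioned in this thread (with rpc-statd assumed to be `rpc-statd.service`), `unit_state` and `unhealthy_units` are hypothetical helpers, and the `run` parameter exists only so the command runner can be swapped out.

```python
import subprocess

# Units named in this thread; rpc-statd.service is an assumed unit name.
UNITS = ("network.service", "rpcbind.socket", "rpc-statd.service")


def unit_state(unit, run=subprocess.run):
    """Return the state word printed by `systemctl is-active <unit>`."""
    result = run(["systemctl", "is-active", unit],
                 capture_output=True, text=True)
    return result.stdout.strip() or "unknown"


def unhealthy_units(units=UNITS, run=subprocess.run):
    """Map each unit that is not 'active' to the state systemd reports."""
    states = {unit: unit_state(unit, run) for unit in units}
    return {unit: state for unit, state in states.items() if state != "active"}
```

An empty result means all three units report active. A builder in the state described here would show network.service as failed.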

> > > > > 

> > > > > We have also seen that on some builders, rpcbind picks up some
> > > > > IPv6 autoconfiguration, but we can't reproduce that, and there is
> > > > > no IPv6 set up anywhere. I suspect the network.service failure is
> > > > > somehow involved, but I fail to see how. In turn, rpcbind.socket
> > > > > not starting could cause NFS test troubles.

> > > > > 

> > > > > Our current stopgap fix was to fix all the builders one by one:
> > > > > remove the config, kill the rogue dhclient, restart the network
> > > > > service.

> > > > > 

> > > > > However, we can't be sure this is going to fix the problem long
> > > > > term, since this only manifests after a crash of the test suite,
> > > > > and that doesn't happen so often. (Plus, it was working at some
> > > > > point in the past, before something made this fail, and I do not
> > > > > know if that's a system upgrade, a test change, or both.)

> > > > > 

> > > > > So we are still looking at it to get a complete understanding of
> > > > > the issue, but so far, we have hacked our way to making it work
> > > > > (or so I think).

> > > > > 

> > > > > Deepshikha is working to fix it long term, by fixing the
> > > > > eth0/ens5 issue with a new base image.

> > > > > --

> > > > > Michael Scherer

> > > > > Sysadmin, Community Infrastructure and Platform, OSAS

> > > > > 

> > > > > 

> > > > > --

> > > > 

> > > > - Atin (atinm)

> > > 

> > > --

> > > Michael Scherer

> > > Sysadmin, Community Infrastructure

> > > 

> > > 

> > > 

> > > _______________________________________________

> > > Gluster-devel mailing list

> > > Gluster-devel@xxxxxxxxxxx

> > > https://lists.gluster.org/mailman/listinfo/gluster-devel

> > 


> 

> 

> 

-- 

Michael Scherer

Sysadmin, Community Infrastructure


-- 
Thanks,
Sanju

_______________________________________________

Community Meeting Calendar:

APAC Schedule -

Every 2nd and 4th Tuesday at 11:30 AM IST

Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -

Every 1st and 3rd Tuesday at 01:00 PM EDT

Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list

Gluster-devel@xxxxxxxxxxx

https://lists.gluster.org/mailman/listinfo/gluster-devel
