On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
builder204 needs to be fixed: there are too many failures, and almost none of the patches are passing regression.
And with that builder201 joins the pool, https://build.gluster.org/job/centos7-regression/5943/consoleFull
On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde <srakonde@xxxxxxxxxx> wrote:
Deepshikha,
I see the failure here[1] which ran on builder206. So, we are good.
Not really, https://build.gluster.org/job/centos7-regression/5909/consoleFull failed on builder204 for similar reasons, I believe?
I am a bit more worried about this issue resurfacing more often these days. What can we do to fix this permanently?
On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal <dkhandel@xxxxxxxxxx> wrote:
Sanju, can you please give us more info about the failures?
I see the failures occurring on just one of the builders (builder206). I'm taking it back offline for now.
On Tue, May 7, 2019 at 9:42 PM Michael Scherer <mscherer@xxxxxxxxxx> wrote:
On Tuesday, 07 May 2019 at 20:04 +0530, Sanju Rakonde wrote:
> Looks like is_nfs_export_available started failing again in recent
> centos-regressions.
>
> Michael, can you please check?
I will try, but I am leaving for vacation tonight, so if I find nothing
before I leave, I guess Deepshikha will have to look.
> On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <ykaul@xxxxxxxxxx> wrote:
>
> > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer <mscherer@xxxxxxxxxx> wrote:
> >
> > > On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> > > > Is this back again? The recent patches are failing regression
> > > > :-\ .
> > >
> > > So, on builder206, it took me a while to find that the issue is that
> > > nfs (the service) was running.
> > >
> > > ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
> > > failed with a rather cryptic message:
> > >
> > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
> > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to failed: Address already in use
> > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
> > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
> > >
> > > I found where this came from, but a few things surprised me:
> > >
> > > - the order of the log messages is different from the order in the code
> > >
> >
> > Indeed strange...
> >
> > > - the "started listening" message didn't take into account the fact
> > > that bind failed on:
> > >
> >
> > Shouldn't it bail out if it failed to bind?
> > Some missing 'goto out' around line 975/976?
> > Y.
> >
> > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
> > >
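To make the bail-out question above concrete, here is a minimal Python sketch (not the glusterfs code, which is C) in which "started listening" is only reported once bind() has actually succeeded. The helper names try_listen and describe, and the use of the loopback address, are hypothetical and chosen purely for illustration.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Bind and listen on (host, port).

    Returns (sock, None) on success or (None, errno_code) on failure, so a
    caller can only report "listening" when a socket actually came back.
    try_listen is a hypothetical helper, not part of glusterfs.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError as exc:
        sock.close()  # bail out: never fall through to the success path
        return None, exc.errno
    return sock, None


def describe(port, host="127.0.0.1"):
    """Return a one-line status message for an attempt to listen on port."""
    sock, err = try_listen(port, host)
    if sock is None:
        name = errno.errorcode.get(err, err)
        return "bind to port %d failed: %s" % (port, name)
    sock.close()
    return "started listening on port %d" % port
```

Calling describe() for a port that another socket is already listening on should yield the failure message only, never both messages as in the log excerpt above.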
> > > The message about port 38465 also threw me off the track. The real
> > > issue is that the nfs service was already running, and I couldn't
> > > find anything listening on port 38465.
> > >
> > > Once I ran "service nfs stop", it no longer failed.
> > >
> > > So far, I do not know why nfs.service was activated.
> > >
> > > But at least builder206 should be fixed, and we know a bit more
> > > about what could be causing some of the failures.
> > >
> > >
> > >
> > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <mscherer@xxxxxxxxxx> wrote:
> > > >
> > > > > On Wednesday, 03 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > is_nfs_export_available is just a wrapper around the "showmount"
> > > > > > > command AFAIR.
> > > > > > > I saw the following messages in the console output:
> > > > > > > mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > > > > > >
> > > > > > > To me it looks like rpcbind may not be running on the machine.
> > > > > > > Usually rpcbind starts automatically on machines; I don't know
> > > > > > > whether this can happen or not.
> > > > > > >
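For readers unfamiliar with the helper, here is a rough Python sketch of what a showmount-based availability check could look like. This is not the actual is_nfs_export_available from the test suite: the function names, the assumption that the first line of "showmount -e" output is a header, and the injectable run parameter are all illustrative.

```python
import subprocess


def export_listed(showmount_output, export_path):
    """Return True if export_path appears in `showmount -e` style output.

    Assumes the first line is the "Export list for <host>:" header and that
    each following line starts with the exported path.
    """
    for line in showmount_output.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0] == export_path:
            return True
    return False


def is_export_available(host, export_path, run=subprocess.run):
    """Hypothetical wrapper: ask `showmount -e host` whether a path is exported.

    A non-zero exit status (for example when showmount cannot reach the
    server's RPC services) is treated as "not available".
    """
    result = run(["showmount", "-e", host], capture_output=True, text=True)
    if result.returncode != 0:
        return False
    return export_listed(result.stdout, export_path)
```

With a check of this shape, a missing rpcbind or rpc.statd on the server side would surface as a plain "not available" result, which may explain why the test failure itself says little about the underlying cause.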
> > > > > >
> > > > > > That's precisely the question. Why are we suddenly seeing this
> > > > > > happen so frequently? Today I have already seen at least 4 to 5
> > > > > > such failures.
> > > > > >
> > > > > > Deepshikha - Can you please help in inspecting this?
> > > > >
> > > > > So we think (we are not sure) that the issue is a bit complex.
> > > > >
> > > > > What we were investigating was the nightly run failing on aws.
> > > > > When the build crashes, the builder is restarted, since that's the
> > > > > easiest way to clean everything (even with a perfect test suite
> > > > > that cleans up after itself, we could always end up in a corrupt
> > > > > state on the system, WRT mount, fs, etc).
> > > > >
> > > > > In turn, this seems to cause trouble on aws, since cloud-init or
> > > > > something renames the eth0 interface to ens5, without cleaning up
> > > > > the network configuration.
> > > > >
> > > > > So the network init script fails (because the image says "start
> > > > > eth0" and that's not present), but it fails in a weird way. The
> > > > > network is initialised and working (we can connect), but the
> > > > > dhclient process is not in the right cgroup, and network.service
> > > > > is in a failed state. Restarting the network didn't work. In turn,
> > > > > this means that rpc-statd refuses to start (due to systemd
> > > > > dependencies), which seems to impact various NFS tests.
> > > > >
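As a rough illustration of how the dependency chain described above could be checked on a builder, here is a small Python sketch that asks systemd for the state of each unit mentioned in this thread. The function names are hypothetical, and the ".service" suffix on rpc-statd is an assumption about how the unit is named on the builders.

```python
import subprocess

# Unit names taken from the thread; the ".service" suffix on rpc-statd is an
# assumption about how the unit is named on the builders.
UNITS = ("network.service", "rpc-statd.service", "rpcbind.socket")


def unit_state(unit, run=subprocess.run):
    """Return the state printed by `systemctl is-active UNIT`.

    Typical values are "active", "inactive" or "failed"; "unknown" is
    returned if systemctl printed nothing.
    """
    result = run(["systemctl", "is-active", unit],
                 capture_output=True, text=True)
    return result.stdout.strip() or "unknown"


def unhealthy_units(units=UNITS, run=subprocess.run):
    """Return {unit: state} for every unit whose state is not "active"."""
    states = {unit: unit_state(unit, run=run) for unit in units}
    return {unit: state for unit, state in states.items() if state != "active"}
```

On a builder in the state described above, one would expect network.service to show up as failed in the result of unhealthy_units(), which could serve as a quick pre-flight check before a regression run.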
> > > > > We have also seen that on some builders, rpcbind picks up some
> > > > > IPv6 autoconfiguration, but we can't reproduce that, and there is
> > > > > no IPv6 set up anywhere. I suspect the network.service failure is
> > > > > somehow involved, but fail to see how. In turn, rpcbind.socket not
> > > > > starting could cause NFS test troubles.
> > > > >
> > > > > Our current stop-gap fix was to fix all the builders one by one:
> > > > > remove the config, kill the rogue dhclient, restart the network
> > > > > service.
> > > > >
> > > > > However, we can't be sure this is going to fix the problem long
> > > > > term, since this only manifests after a crash of the test suite,
> > > > > and that doesn't happen so often. (Plus, it was working at some
> > > > > point in the past, before something made this fail, and I do not
> > > > > know if that's a system upgrade, or a test change, or both.)
> > > > >
> > > > > So we are still looking at it to have a complete understanding of
> > > > > the issue, but so far, we hacked our way to make it work (or so I
> > > > > think).
> > > > >
> > > > > Deepshikha is working to fix it long term, by fixing the issue
> > > > > regarding eth0/ens5 with a new base image.
> > > > > --
> > > > > Michael Scherer
> > > > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > > >
> > > > >
> > > > > --
> > > >
> > > > - Atin (atinm)
> > >
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure
> > >
> > >
> > >
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel@xxxxxxxxxxx
> > > https://lists.gluster.org/mailman/listinfo/gluster-devel
> >
>
>
>
--
Michael Scherer
Sysadmin, Community Infrastructure
Thanks,
Sanju
Community Meeting Calendar:
APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017
NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655