On Sun, 2017-12-31 at 13:35 -0500, Chuck Lever wrote:
> > On Dec 30, 2017, at 1:14 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> > 
> > > On Dec 30, 2017, at 1:05 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> > > 
> > > On Wed, Dec 27, 2017 at 03:40:58PM -0500, Chuck Lever wrote:
> > > > Last week I updated my test server from v4.14 to v4.15-rc4, and began to
> > > > observe intermittent failures in the git regression suite on NFSv4.1.
> > > 
> > > I haven't run that before. Should I just
> > > 
> > >     mount -overs=4.1 server:/fs /mnt/
> > >     cd /mnt/
> > >     git clone git://git.kernel.org/pub/scm/git/git.git
> > >     cd git
> > >     make test
> > > 
> > > ?
> > 
> > You'll need to install SVN and CVS on your client as well.
> > The failures seem to occur only in the SVN/CVS related tests.
> > 
> > > > I was able to reproduce these failures with NFSv4.1 on both TCP and RDMA,
> > > > yet there has not been a reproduction with NFSv3 or NFSv4.0.
> > > > 
> > > > The server hardware is a single-socket 4-core system with 32GB of RAM.
> > > > The export is a tmpfs. Networking is 56Gb InfiniBand (or IPoIB).
> > > > 
> > > > The git regression suite reports individual test failures in the SVN
> > > > and CVS tests. On occasion, the client mount point freezes, requiring
> > > > that the client be rebooted in order to unstick the mount.
> > > > 
> > > > Just before Christmas, I bisected the problem to:
> > > 
> > > Thanks for the report! I'll make some time for this next week. What's
> > > your client?
> 
> Oops, I didn't answer this question. The client is v4.15-rc4.
> 
> > > I guess one start might be to see if the reproducer can be
> > > simplified e.g. by running just one of the tests from the suite.
> > 
> > The failures are intermittent, and occur in a different test
> > each time. You have to wait for the 9000-series scripts, which
> > test SVN/CVS repo operations. To speed up time-to-failure, use
> > "make -jN test" where N is more than a few.
> > 
> > My client and server both have multiple real cores. I'm
> > thinking it's the server that matters here (possibly a race
> > condition is introduced by the below commit?).
> > 
> > > --b.
> > > 
> > > > commit 659aefb68eca28ba9aa482a9fc64de107332e256
> > > > Author: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > Date: Fri Nov 3 08:00:13 2017 -0400
> > > > 
> > > >     nfsd: Ensure we don't recognise lock stateids after freeing them
> > > > 
> > > >     In order to deal with lookup races, nfsd4_free_lock_stateid() needs
> > > >     to be able to signal to other stateful functions that the lock stateid
> > > >     is no longer valid. Right now, nfsd_lock() will check whether or not an
> > > >     existing stateid is still hashed, but only in the "new lock" path.
> > > > 
> > > >     To ensure the stateid invalidation is also recognised by the "existing lock"
> > > >     path, and also by a second call to nfsd4_free_lock_stateid() itself, we can
> > > >     change the type to NFS4_CLOSED_STID under the stp->st_mutex.
> > > > 
> > > >     Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
> > > >     Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>

So, I'm thinking that release_open_stateid_locks() and
nfsd4_release_lockowner() should probably be setting NFS4_CLOSED_STID
when they call unhash_lock_stateid() (sorry for missing that).
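For illustration only, an untested sketch of that idea. The loop body is
paraphrased from memory rather than copied from fs/nfsd/nfs4state.c, so the
real context (locking annotations, wakeups, etc.) may differ; the only point
is where the NFS4_CLOSED_STID assignment would go relative to the existing
unhash_lock_stateid() call:

static void release_open_stateid_locks(struct nfs4_ol_stateid *open_stp,
                                       struct list_head *reaplist)
{
        struct nfs4_ol_stateid *stp;

        while (!list_empty(&open_stp->st_locks)) {
                stp = list_entry(open_stp->st_locks.next,
                                 struct nfs4_ol_stateid, st_locks);
                /* Mark the lock stateid as closed before unhashing it, so
                 * that a lookup racing with this teardown sees an invalid
                 * stateid rather than one that still looks live. */
                stp->st_stid.sc_type = NFS4_CLOSED_STID;
                unhash_lock_stateid(stp);
                put_ol_stateid_locked(stp, reaplist);
        }
}

with the equivalent one-line change next to the unhash_lock_stateid() call
in nfsd4_release_lockowner().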
-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx