On Sun, 2017-12-31 at 13:35 -0500, Chuck Lever wrote:
> > On Dec 30, 2017, at 1:14 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> > 
> > > On Dec 30, 2017, at 1:05 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> > > 
> > > On Wed, Dec 27, 2017 at 03:40:58PM -0500, Chuck Lever wrote:
> > > > Last week I updated my test server from v4.14 to v4.15-rc4, and began to
> > > > observe intermittent failures in the git regression suite on NFSv4.1.
> > > 
> > > I haven't run that before. Should I just
> > > 
> > >     mount -overs=4.1 server:/fs /mnt/
> > >     cd /mnt/
> > >     git clone git://git.kernel.org/pub/scm/git/git.git
> > >     cd git
> > >     make test
> > > 
> > > ?
> > 
> > You'll need to install SVN and CVS on your client as well.
> > The failures seem to occur only in the SVN/CVS related tests.
> > 
> > > > I was able to reproduce these failures with NFSv4.1 on both TCP and RDMA,
> > > > yet there has not been a reproduction with NFSv3 or NFSv4.0.
> > > > 
> > > > The server hardware is a single-socket 4-core system with 32GB of RAM.
> > > > The export is a tmpfs. Networking is 56Gb InfiniBand (or IPoIB).
> > > > 
> > > > The git regression suite reports individual test failures in the SVN
> > > > and CVS tests. On occasion, the client mount point freezes, requiring
> > > > that the client be rebooted in order to unstick the mount.
> > > > 
> > > > Just before Christmas, I bisected the problem to:
> > > 
> > > Thanks for the report! I'll make some time for this next week. What's
> > > your client?
> 
> Oops, I didn't answer this question. The client is v4.15-rc4.
> 
> > > I guess one start might be to see if the reproducer can be
> > > simplified e.g. by running just one of the tests from the suite.
> > 
> > The failures are intermittent, and occur in a different test
> > each time. You have to wait for the 9000-series scripts, which
> > test SVN/CVS repo operations. To speed up time-to-failure, use
> > "make -jN test" where N is more than a few.
> > 
> > My client and server both have multiple real cores. I'm
> > thinking it's the server that matters here (possibly a race
> > condition is introduced by the below commit?).
> > 
> > > --b.
> > > 
> > > > commit 659aefb68eca28ba9aa482a9fc64de107332e256
> > > > Author: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > Date: Fri Nov 3 08:00:13 2017 -0400
> > > > 
> > > >     nfsd: Ensure we don't recognise lock stateids after freeing them
> > > > 
> > > >     In order to deal with lookup races, nfsd4_free_lock_stateid() needs
> > > >     to be able to signal to other stateful functions that the lock stateid
> > > >     is no longer valid. Right now, nfsd_lock() will check whether or not an
> > > >     existing stateid is still hashed, but only in the "new lock" path.
> > > > 
> > > >     To ensure the stateid invalidation is also recognised by the "existing lock"
> > > >     path, and also by a second call to nfsd4_free_lock_stateid() itself, we can
> > > >     change the type to NFS4_CLOSED_STID under the stp->st_mutex.
> > > > 
> > > >     Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
> > > >     Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>

So, I'm thinking that release_open_stateid_locks() and
nfsd4_release_lockowner() should probably be setting NFS4_CLOSED_STID
when they call unhash_lock_stateid() (sorry for missing that).
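For illustration only, an untested sketch of that idea. The loop body is
paraphrased from memory rather than copied from fs/nfsd/nfs4state.c, so the
real context (locking annotations, wakeups, etc.) may differ; the only point
is where the NFS4_CLOSED_STID assignment would go relative to the existing
unhash_lock_stateid() call:

static void release_open_stateid_locks(struct nfs4_ol_stateid *open_stp,
                                       struct list_head *reaplist)
{
        struct nfs4_ol_stateid *stp;

        while (!list_empty(&open_stp->st_locks)) {
                stp = list_entry(open_stp->st_locks.next,
                                 struct nfs4_ol_stateid, st_locks);
                /* Mark the lock stateid as closed before unhashing it, so
                 * that a lookup racing with this teardown sees an invalid
                 * stateid rather than one that still looks live. */
                stp->st_stid.sc_type = NFS4_CLOSED_STID;
                unhash_lock_stateid(stp);
                put_ol_stateid_locked(stp, reaplist);
        }
}

with the equivalent one-line change next to the unhash_lock_stateid() call
in nfsd4_release_lockowner().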
-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx