> On Jan 9, 2018, at 3:26 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Sun, 2017-12-31 at 13:35 -0500, Chuck Lever wrote:
>>> On Dec 30, 2017, at 1:14 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>
>>>> On Dec 30, 2017, at 1:05 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>
>>>> On Wed, Dec 27, 2017 at 03:40:58PM -0500, Chuck Lever wrote:
>>>>> Last week I updated my test server from v4.14 to v4.15-rc4, and began to
>>>>> observe intermittent failures in the git regression suite on NFSv4.1.
>>>>
>>>> I haven't run that before. Should I just
>>>>
>>>> mount -overs=4.1 server:/fs /mnt/
>>>> cd /mnt/
>>>> git clone git://git.kernel.org/pub/scm/git/git.git
>>>> cd git
>>>> make test
>>>>
>>>> ?
>>>
>>> You'll need to install SVN and CVS on your client as well.
>>> The failures seem to occur only in the SVN/CVS related
>>> tests.
>>>
>>>>> I was able to reproduce these failures with NFSv4.1 on both TCP and RDMA,
>>>>> yet there has not been a reproduction with NFSv3 or NFSv4.0.
>>>>>
>>>>> The server hardware is a single-socket 4-core system with 32GB of RAM.
>>>>> The export is a tmpfs. Networking is 56Gb InfiniBand (or IPoIB).
>>>>>
>>>>> The git regression suite reports individual test failures in the SVN
>>>>> and CVS tests. On occasion, the client mount point freezes, requiring
>>>>> that the client be rebooted in order to unstick the mount.
>>>>>
>>>>> Just before Christmas, I bisected the problem to:
>>>>
>>>> Thanks for the report! I'll make some time for this next week. What's
>>>> your client?
>>
>> Oops, I didn't answer this question. The client is v4.15-rc4.
>>
>>>> I guess one start might be to see if the reproducer can be
>>>> simplified e.g. by running just one of the tests from the suite.
>>>
>>> The failures are intermittent, and occur in a different test
>>> each time. You have to wait for the 9000-series scripts, which
>>> test SVN/CVS repo operations. To speed up time-to-failure, use
>>> "make -jN test" where N is more than a few.
>>>
>>> My client and server both have multiple real cores. I'm
>>> thinking it's the server that matters here (possibly a race
>>> condition is introduced by the below commit?).
>>>
>>>> --b.
>>>>
>>>>> commit 659aefb68eca28ba9aa482a9fc64de107332e256
>>>>> Author: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>> Date:   Fri Nov 3 08:00:13 2017 -0400
>>>>>
>>>>>     nfsd: Ensure we don't recognise lock stateids after freeing them
>>>>>
>>>>>     In order to deal with lookup races, nfsd4_free_lock_stateid() needs
>>>>>     to be able to signal to other stateful functions that the lock stateid
>>>>>     is no longer valid. Right now, nfsd_lock() will check whether or not an
>>>>>     existing stateid is still hashed, but only in the "new lock" path.
>>>>>
>>>>>     To ensure the stateid invalidation is also recognised by the "existing lock"
>>>>>     path, and also by a second call to nfsd4_free_lock_stateid() itself, we can
>>>>>     change the type to NFS4_CLOSED_STID under the stp->st_mutex.
>>>>>
>>>>>     Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>>>>>     Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>
>
> So, I'm thinking that release_open_stateid_locks() and
> nfsd4_release_lockowner() should probably be setting NFS4_CLOSED_STID
> when they call unhash_lock_stateid() (sorry for missing that).

Send me a patch and I can test it.
--
Chuck Lever
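For anyone following the thread, a minimal sketch of the change Trond describes might look like the following. It assumes the nfsd-internal names quoted above (struct nfs4_ol_stateid, st_mutex, NFS4_CLOSED_STID, all from fs/nfsd/); the helper name is invented for illustration, and this is not the actual patch. The real call sites, release_open_stateid_locks() and nfsd4_release_lockowner(), have their own locking and reap-list handling, so where the st_mutex can actually be taken is part of what a real patch has to sort out.

/*
 * Illustrative sketch only, not Trond's patch. Following the bisected
 * commit 659aefb68eca, the idea is that every path that unhashes a lock
 * stateid should also mark it NFS4_CLOSED_STID under st_mutex, so that
 * a racing lookup that still holds a reference sees it as invalid.
 * Types and constants are the nfsd-internal ones from fs/nfsd/state.h;
 * the helper name is hypothetical.
 */
static void mark_lock_stateid_closed(struct nfs4_ol_stateid *stp)
{
	/* The commit message requires the type change under st_mutex. */
	mutex_lock(&stp->st_mutex);
	stp->st_stid.sc_type = NFS4_CLOSED_STID;
	mutex_unlock(&stp->st_mutex);
}

The point is the same one the bisected commit makes for nfsd4_free_lock_stateid(): unhashing alone is not enough, because another thread may already hold a reference to the stateid, so it is the type change to NFS4_CLOSED_STID that signals the stateid is no longer valid.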