On 3 Sep 2020, at 23:04, Murphy Zhou wrote:

> Hi Benjamin,
>
> On Thu, Sep 03, 2020 at 01:54:26PM -0400, Benjamin Coddington wrote:
>>
>> On 11 Oct 2019, at 10:14, Trond Myklebust wrote:
>>> On Fri, 2019-10-11 at 16:49 +0800, Murphy Zhou wrote:
>>>> On Thu, Oct 10, 2019 at 02:46:40PM +0000, Trond Myklebust wrote:
>>>>> On Thu, 2019-10-10 at 15:40 +0800, Murphy Zhou wrote:
>> ...
>>>>>> @@ -3367,14 +3368,16 @@ static bool nfs4_refresh_open_old_stateid(nfs4_stateid *dst,
>>>>>>  			break;
>>>>>>  		}
>>>>>>  		seqid_open = state->open_stateid.seqid;
>>>>>> -		if (read_seqretry(&state->seqlock, seq))
>>>>>> -			continue;
>>>>>>
>>>>>>  		dst_seqid = be32_to_cpu(dst->seqid);
>>>>>> -		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) >= 0)
>>>>>> +		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) > 0)
>>>>>>  			dst->seqid = cpu_to_be32(dst_seqid + 1);
>>>>>
>>>>> This negates the whole intention of the patch you reference in the
>>>>> 'Fixes:', which was to allow us to CLOSE files even if seqid bumps
>>>>> have been lost due to interrupted RPC calls, e.g. when using 'soft'
>>>>> or 'softerr' mounts.
>>>>> With the above change, the check could just be tossed out
>>>>> altogether, because dst_seqid will never become larger than
>>>>> seqid_open.
>>>>
>>>> Hmm.. I got it wrong. Thanks for the explanation.
>>>
>>> So to be clear: I'm not saying that what you describe is not a
>>> problem. I'm just saying that the fix you propose is really no better
>>> than reverting the entire patch. I'd prefer not to do that, and would
>>> rather see us look for ways to fix both problems, but if we can't
>>> find such a fix then that would be the better solution.
>>
>> Hi Trond and Murphy Zhou,
>>
>> Sorry to resurrect this old thread, but I'm wondering if any progress
>> was made on this front.
>
> This failure stopped showing up in my records as of the v5.6-rc1
> release cycle. Can you reproduce this on the latest upstream kernel?

I'm seeing it on generic/168 with a v5.8 client against a v5.3 knfsd
server. When I test against a v5.8 server, the test takes longer to
complete and I have yet to reproduce the livelock.

- on the v5.3 server it takes ~50 iterations to reproduce, and each test
  completes in ~40 seconds
- on the v5.8 server my test has run ~750 iterations without hitting the
  livelock, and each test takes ~60 seconds

I suspect recent changes to the server have changed the timing of open
replies such that the problem isn't reproduced on the client.

Ben
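
For anyone following along, here is a minimal userspace sketch of the
wraparound-safe seqid comparison being discussed above. This is not the
kernel code itself; the helper name and test values are illustrative
only, and it just demonstrates why the signed 32-bit difference keeps
ordering correct across counter wraparound, and why "=> 0" vs "> 0"
matters when the two seqids are equal.

#include <stdint.h>
#include <stdio.h>

/*
 * Returns nonzero when dst_seqid is at or ahead of open_seqid in
 * serial-number order: the signed 32-bit difference stays correct
 * even when the counter wraps past 0xffffffff.
 */
static int seqid_at_or_ahead(uint32_t dst_seqid, uint32_t open_seqid)
{
	return (int32_t)(dst_seqid - open_seqid) >= 0;
}

int main(void)
{
	/* Equal seqids: the ">= 0" form bumps dst, a "> 0" form would not. */
	printf("%d\n", seqid_at_or_ahead(5, 5));            /* prints 1 */

	/* Client behind the server: no bump in either form. */
	printf("%d\n", seqid_at_or_ahead(4, 5));            /* prints 0 */

	/* Wraparound: 0x00000001 is still "ahead of" 0xfffffffe. */
	printf("%d\n", seqid_at_or_ahead(1, 0xfffffffeU));  /* prints 1 */

	return 0;
}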