Re: [PATCH] NFSv4: fix stateid refreshing when CLOSE racing with OPEN

"Benjamin Coddington" <bcodding@xxxxxxxxxx> · Thu, 03 Sep 2020 13:54:26 -0400

On 11 Oct 2019, at 10:14, Trond Myklebust wrote:
> On Fri, 2019-10-11 at 16:49 +0800, Murphy Zhou wrote:
>> On Thu, Oct 10, 2019 at 02:46:40PM +0000, Trond Myklebust wrote:
>>> On Thu, 2019-10-10 at 15:40 +0800, Murphy Zhou wrote:
...
>>>> @@ -3367,14 +3368,16 @@ static bool
>>>> nfs4_refresh_open_old_stateid(nfs4_stateid *dst,
>>>>  			break;
>>>>  		}
>>>>  		seqid_open = state->open_stateid.seqid;
>>>> -		if (read_seqretry(&state->seqlock, seq))
>>>> -			continue;
>>>>
>>>>  		dst_seqid = be32_to_cpu(dst->seqid);
>>>> -		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) >= 0)
>>>> +		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) > 0)
>>>>  			dst->seqid = cpu_to_be32(dst_seqid + 1);
>>>
>>> This negates the whole intention of the patch you reference in the
>>> 'Fixes:', which was to allow us to CLOSE files even if seqid bumps
>>> have
>>> been lost due to interrupted RPC calls e.g. when using 'soft' or
>>> 'softerr' mounts.
>>> With the above change, the check could just be tossed out
>>> altogether,
>>> because dst_seqid will never become larger than seqid_open.
>>
>> Hmm.. I got it wrong. Thanks for the explanation.
>
> So to be clear: I'm not saying that what you describe is not a problem.
> I'm just saying that the fix you propose is really no better than
> reverting the entire patch. I'd prefer not to do that, and would rather
> see us look for ways to fix both problems, but if we can't find such as
> fix then that would be the better solution.

Hi Trond and Murphy Zhou,

Sorry to resurrect this old thread, but I'm wondering if any progress was
made on this front.

I'm seeing this race manifest when process is never able to escape from the
loop in nfs_set_open_stateid_locked() if CLOSE comes through first and
clears out the state.  We can play bit-fiddling games to fix that, but I
feel like we'll just end up breaking it again later with another fix.

Either we should revert 0e0cb35b417f, or talk about how to fix it.  Seems
like we should be able to put the CLOSE on the nfs4_state->waitq as well,
and see if we can't just take that approach anytime our operations get out
of sequence.  Do you see any problems with this approach?

Ben