Re: [PATCH] NFSv4: fix stateid refreshing when CLOSE racing with OPEN

Murphy Zhou <jencce.kernel@xxxxxxxxx> · Fri, 4 Sep 2020 11:04:11 +0800

Hi Benjamin,

On Thu, Sep 03, 2020 at 01:54:26PM -0400, Benjamin Coddington wrote:
> 
> On 11 Oct 2019, at 10:14, Trond Myklebust wrote:
> > On Fri, 2019-10-11 at 16:49 +0800, Murphy Zhou wrote:
> >> On Thu, Oct 10, 2019 at 02:46:40PM +0000, Trond Myklebust wrote:
> >>> On Thu, 2019-10-10 at 15:40 +0800, Murphy Zhou wrote:
> ...
> >>>> @@ -3367,14 +3368,16 @@ static bool
> >>>> nfs4_refresh_open_old_stateid(nfs4_stateid *dst,
> >>>>  			break;
> >>>>  		}
> >>>>  		seqid_open = state->open_stateid.seqid;
> >>>> -		if (read_seqretry(&state->seqlock, seq))
> >>>> -			continue;
> >>>>
> >>>>  		dst_seqid = be32_to_cpu(dst->seqid);
> >>>> -		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) >= 0)
> >>>> +		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) > 0)
> >>>>  			dst->seqid = cpu_to_be32(dst_seqid + 1);
> >>>
> >>> This negates the whole intention of the patch you reference in the
> >>> 'Fixes:', which was to allow us to CLOSE files even if seqid bumps
> >>> have
> >>> been lost due to interrupted RPC calls e.g. when using 'soft' or
> >>> 'softerr' mounts.
> >>> With the above change, the check could just be tossed out
> >>> altogether,
> >>> because dst_seqid will never become larger than seqid_open.
> >>
> >> Hmm.. I got it wrong. Thanks for the explanation.
> >
> > So to be clear: I'm not saying that what you describe is not a problem.
> > I'm just saying that the fix you propose is really no better than
> > reverting the entire patch. I'd prefer not to do that, and would rather
> > see us look for ways to fix both problems, but if we can't find such as
> > fix then that would be the better solution.
> 
> Hi Trond and Murphy Zhou,
> 
> Sorry to resurrect this old thread, but I'm wondering if any progress was
> made on this front.

This failure stoped showing up since v5.6-rc1 release cycle
in my records. Can you reproduce this on latest upstream kernel?

Thanks!

> 
> I'm seeing this race manifest when process is never able to escape from the
> loop in nfs_set_open_stateid_locked() if CLOSE comes through first and
> clears out the state.  We can play bit-fiddling games to fix that, but I
> feel like we'll just end up breaking it again later with another fix.
> 
> Either we should revert 0e0cb35b417f, or talk about how to fix it.  Seems
> like we should be able to put the CLOSE on the nfs4_state->waitq as well,
> and see if we can't just take that approach anytime our operations get out
> of sequence.  Do you see any problems with this approach?
> 
> Ben
> 

-- 
Murphy