Re: [PATCH] NFSv4: fix stateid refreshing when CLOSE racing with OPEN

Olga Kornievskaia <aglo@xxxxxxxxx> · Fri, 4 Sep 2020 12:13:25 -0400

On Thu, Sep 3, 2020 at 1:55 PM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>
>
> On 11 Oct 2019, at 10:14, Trond Myklebust wrote:
> > On Fri, 2019-10-11 at 16:49 +0800, Murphy Zhou wrote:
> >> On Thu, Oct 10, 2019 at 02:46:40PM +0000, Trond Myklebust wrote:
> >>> On Thu, 2019-10-10 at 15:40 +0800, Murphy Zhou wrote:
> ...
> >>>> @@ -3367,14 +3368,16 @@ static bool
> >>>> nfs4_refresh_open_old_stateid(nfs4_stateid *dst,
> >>>>                    break;
> >>>>            }
> >>>>            seqid_open = state->open_stateid.seqid;
> >>>> -          if (read_seqretry(&state->seqlock, seq))
> >>>> -                  continue;
> >>>>
> >>>>            dst_seqid = be32_to_cpu(dst->seqid);
> >>>> -          if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) >= 0)
> >>>> +          if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) > 0)
> >>>>                    dst->seqid = cpu_to_be32(dst_seqid + 1);
> >>>
> >>> This negates the whole intention of the patch you reference in the
> >>> 'Fixes:', which was to allow us to CLOSE files even if seqid bumps
> >>> have
> >>> been lost due to interrupted RPC calls e.g. when using 'soft' or
> >>> 'softerr' mounts.
> >>> With the above change, the check could just be tossed out
> >>> altogether,
> >>> because dst_seqid will never become larger than seqid_open.
> >>
> >> Hmm.. I got it wrong. Thanks for the explanation.
> >
> > So to be clear: I'm not saying that what you describe is not a problem.
> > I'm just saying that the fix you propose is really no better than
> > reverting the entire patch. I'd prefer not to do that, and would rather
> > see us look for ways to fix both problems, but if we can't find such as
> > fix then that would be the better solution.
>
> Hi Trond and Murphy Zhou,
>
> Sorry to resurrect this old thread, but I'm wondering if any progress was
> made on this front.
>
> I'm seeing this race manifest when process is never able to escape from the
> loop in nfs_set_open_stateid_locked() if CLOSE comes through first and
> clears out the state.  We can play bit-fiddling games to fix that, but I
> feel like we'll just end up breaking it again later with another fix.
>
> Either we should revert 0e0cb35b417f, or talk about how to fix it.  Seems
> like we should be able to put the CLOSE on the nfs4_state->waitq as well,
> and see if we can't just take that approach anytime our operations get out
> of sequence.  Do you see any problems with this approach?
>

I'm not sure reverting the patch is the solution? Because I'm running
into the infinite ERR_OLD_STATEID loop on CLOSE on SLE15SP2 machines
which don't have this patch at all.

> Ben
>