On Thu, Sep 3, 2020 at 1:55 PM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: > > > On 11 Oct 2019, at 10:14, Trond Myklebust wrote: > > On Fri, 2019-10-11 at 16:49 +0800, Murphy Zhou wrote: > >> On Thu, Oct 10, 2019 at 02:46:40PM +0000, Trond Myklebust wrote: > >>> On Thu, 2019-10-10 at 15:40 +0800, Murphy Zhou wrote: > ... > >>>> @@ -3367,14 +3368,16 @@ static bool > >>>> nfs4_refresh_open_old_stateid(nfs4_stateid *dst, > >>>> break; > >>>> } > >>>> seqid_open = state->open_stateid.seqid; > >>>> - if (read_seqretry(&state->seqlock, seq)) > >>>> - continue; > >>>> > >>>> dst_seqid = be32_to_cpu(dst->seqid); > >>>> - if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) >= 0) > >>>> + if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) > 0) > >>>> dst->seqid = cpu_to_be32(dst_seqid + 1); > >>> > >>> This negates the whole intention of the patch you reference in the > >>> 'Fixes:', which was to allow us to CLOSE files even if seqid bumps > >>> have > >>> been lost due to interrupted RPC calls e.g. when using 'soft' or > >>> 'softerr' mounts. > >>> With the above change, the check could just be tossed out > >>> altogether, > >>> because dst_seqid will never become larger than seqid_open. > >> > >> Hmm.. I got it wrong. Thanks for the explanation. > > > > So to be clear: I'm not saying that what you describe is not a problem. > > I'm just saying that the fix you propose is really no better than > > reverting the entire patch. I'd prefer not to do that, and would rather > > see us look for ways to fix both problems, but if we can't find such as > > fix then that would be the better solution. > > Hi Trond and Murphy Zhou, > > Sorry to resurrect this old thread, but I'm wondering if any progress was > made on this front. > > I'm seeing this race manifest when process is never able to escape from the > loop in nfs_set_open_stateid_locked() if CLOSE comes through first and > clears out the state. We can play bit-fiddling games to fix that, but I > feel like we'll just end up breaking it again later with another fix. > > Either we should revert 0e0cb35b417f, or talk about how to fix it. Seems > like we should be able to put the CLOSE on the nfs4_state->waitq as well, > and see if we can't just take that approach anytime our operations get out > of sequence. Do you see any problems with this approach? > I'm not sure reverting the patch is the solution? Because I'm running into the infinite ERR_OLD_STATEID loop on CLOSE on SLE15SP2 machines which don't have this patch at all. > Ben >