> On Sep 4, 2020, at 6:55 AM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>
> On 3 Sep 2020, at 23:04, Murphy Zhou wrote:
>
>> Hi Benjamin,
>>
>> On Thu, Sep 03, 2020 at 01:54:26PM -0400, Benjamin Coddington wrote:
>>>
>>> On 11 Oct 2019, at 10:14, Trond Myklebust wrote:
>>>> On Fri, 2019-10-11 at 16:49 +0800, Murphy Zhou wrote:
>>>>> On Thu, Oct 10, 2019 at 02:46:40PM +0000, Trond Myklebust wrote:
>>>>>> On Thu, 2019-10-10 at 15:40 +0800, Murphy Zhou wrote:
>>> ...
>>>>>>> @@ -3367,14 +3368,16 @@ static bool nfs4_refresh_open_old_stateid(nfs4_stateid *dst,
>>>>>>> 		break;
>>>>>>> 	}
>>>>>>> 	seqid_open = state->open_stateid.seqid;
>>>>>>> -	if (read_seqretry(&state->seqlock, seq))
>>>>>>> -		continue;
>>>>>>>
>>>>>>> 	dst_seqid = be32_to_cpu(dst->seqid);
>>>>>>> -	if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) >= 0)
>>>>>>> +	if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) > 0)
>>>>>>> 		dst->seqid = cpu_to_be32(dst_seqid + 1);
>>>>>>
>>>>>> This negates the whole intention of the patch you reference in the
>>>>>> 'Fixes:', which was to allow us to CLOSE files even if seqid bumps
>>>>>> have been lost due to interrupted RPC calls, e.g. when using 'soft'
>>>>>> or 'softerr' mounts. With the above change, the check could just be
>>>>>> tossed out altogether, because dst_seqid will never become larger
>>>>>> than seqid_open.
>>>>>
>>>>> Hmm.. I got it wrong. Thanks for the explanation.
>>>>
>>>> So to be clear: I'm not saying that what you describe is not a problem.
>>>> I'm just saying that the fix you propose is really no better than
>>>> reverting the entire patch. I'd prefer not to do that, and would rather
>>>> see us look for ways to fix both problems, but if we can't find such a
>>>> fix then that would be the better solution.
>>>
>>> Hi Trond and Murphy Zhou,
>>>
>>> Sorry to resurrect this old thread, but I'm wondering if any progress
>>> was made on this front.
>>
>> This failure stopped showing up since the v5.6-rc1 release cycle
>> in my records. Can you reproduce this on the latest upstream kernel?
>
> I'm seeing it on generic/168 on a v5.8 client against a v5.3 knfsd server.
> When I test against a v5.8 server, the test takes longer to complete and I
> have yet to reproduce the livelock.
>
> - on the v5.3 server it takes ~50 iterations to reproduce, and each test
>   completes in ~40 seconds
> - on the v5.8 server my test has run ~750 iterations without hitting the
>   livelock, and each test takes ~60 seconds
>
> I suspect recent changes to the server have changed the timing of open
> replies such that the problem isn't reproduced on the client.

The Linux NFS server in v5.4 does behave differently from earlier kernels
with NFSv4.0, and the difference is performance-related. The filecache went
into v5.4, and that seems to change how frequently the server offers
delegations.

I'm looking into it, and learning a bunch.

--
Chuck Lever
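
A brief aside on the check discussed above: the client compares seqids with
serial-number-style arithmetic, where the signed 32-bit difference decides
ordering even after the counter wraps. The sketch below is only an
illustration of that comparison under those assumptions, not the kernel
code itself: the helper name seqid_at_or_after is made up here, and the
real code operates on big-endian wire values via be32_to_cpu(), which this
sketch sidesteps by using host-order integers.

#include <stdio.h>
#include <stdint.h>

/* Nonzero when 'a' is at or ahead of 'b' in serial-number order. */
static int seqid_at_or_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

int main(void)
{
	/* Equal seqids: the ">= 0" form fires, so the client still bumps its
	 * cached seqid past a possibly lost update; a "> 0" form never would. */
	printf("%d\n", seqid_at_or_after(5, 5));           /* prints 1 */

	/* Client copy behind the server's current seqid: no bump either way. */
	printf("%d\n", seqid_at_or_after(4, 5));           /* prints 0 */

	/* Wraparound: 0x00000002 counts as "after" 0xfffffffe. */
	printf("%d\n", seqid_at_or_after(2, 0xfffffffeU)); /* prints 1 */

	return 0;
}

The equal-seqids case is the crux of Trond's objection: the client's cached
seqid normally trails or matches the one in state->open_stateid, so with a
strict "> 0" test the bump would never happen, and seqid updates lost to
interrupted RPCs could no longer be skipped over at CLOSE time.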