Hi Ted-

It's not clear from your report whether the kernel range applies to the
client's kernel or the server's kernel (in the non-loopback case).

Since a scratch device is involved, I suspect the livelock might be due to
a problem with the NFSD filecache code introduced on or about v5.10. There
are patches pending in the NFSD for-next branch that should address this
issue. Is there a way that your tester can try these out to confirm?

> On Jul 21, 2022, at 10:50 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> FYI, modern kernels (anything newer than 5.10 LTS, up to and excluding
> bleeding-edge mainline kernels) are looping forever in a livelock or
> deadlock when running generic/476 on NFS, both in a loopback and
> external export configuration. This *may* be an ENOSPC-related issue.
>
> See the referenced discussion on fstests@xxxxxxxxxxxxxxx for more
> details.
>
> - Ted
>
>
> From: "Theodore Ts'o" <tytso@xxxxxxx>
> Subject: Re: [PATCH v1] generic/476: requires 27GB scratch size
> Date: July 21, 2022 at 10:03:45 AM EDT
> To: Boyang Xue <bxue@xxxxxxxxxx>
> Cc: "Darrick J. Wong" <djwong@xxxxxxxxxx>, fstests@xxxxxxxxxxxxxxx
>
>
> Following up, using NFS loopback with a 5GB scratch device on a Google
> Compute Engine VM, generic/476 passes using a 4.14 LTS, 4.19 LTS, and
> 5.4 LTS kernel. So this looks like it's a regression which is in 5.10
> LTS and newer kernels, and so instead of patching it out of the test,
> I think the right thing to do is to add it to a kernel
> version-specific exclude file and then file a bug with the NFS folks.
>
> KERNEL:    kernel 4.14.284-xfstests #8 SMP Tue Jul 5 08:21:37 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
>
> nfs/loopback: 1 tests, 597 seconds
>   generic/476  Pass     595s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 595s
>
> ---
> KERNEL:    kernel 4.19.248-xfstests #4 SMP Sat Jun 25 10:43:45 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
>
> nfs/loopback: 1 tests, 407 seconds
>   generic/476  Pass     407s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 407s
>
> ----
> KERNEL:    kernel 5.4.199-xfstests #21 SMP Sun Jul 3 12:15:15 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
>
> nfs/loopback: 1 tests, 404 seconds
>   generic/476  Pass     404s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 404s
>
>
> See below for what I'm checking into xfstests-bld for
> {kvm,gce}-xfstests. I don't believe we should be changing xfstests's
> generic/476, since it *does* pass with a smaller scratch device on
> older kernels, and presumably, RHEL customers would be cranky if this
> issue resulted in their production systems locking up, and so it
> should be considered a kernel bug as opposed to a test bug.
>
> - Ted
>
>
> commit 4a33b6721d5db9c07f295a10a8ad65d2a0021406
> Author: Theodore Ts'o <tytso@xxxxxxx>
> Date:   Thu Jul 21 09:54:50 2022 -0400
>
>     test-appliance: add an nfs test exclusion for kernels newer than 5.4
>
>     This is apparently an NFS bug which is visible in 5.10 LTS and newer
>     kernels, and likely appeared sometime after 5.4. Since it causes the
>     test VM to spin forever (or at least for days), let's exclude it for
>     now.
>
>     Link: https://lore.kernel.org/all/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@xxxxxxxxxxxxxx/
>     Signed-off-by: Theodore Ts'o <tytso@xxxxxxx>
>
> diff --git a/test-appliance/files/root/fs/nfs/exclude b/test-appliance/files/root/fs/nfs/exclude
> index 184750fb..ef4b19bc 100644
> --- a/test-appliance/files/root/fs/nfs/exclude
> +++ b/test-appliance/files/root/fs/nfs/exclude
> @@ -10,3 +10,14 @@ generic/477
>  // failing in the expected output of the linux-nfs Wiki page. So we'll
>  // suppress this failure for now.
>  generic/294
> +
> +#if LINUX_VERSION_CODE > KERNEL_VERSION(5,4,0)
> +// There appears to be a regression that shows up sometime after 5.4.
> +// LTS kernels for 4.14, 4.19, and 5.4 will terminate successfully,
> +// but newer kernels will spin forever in some kind of deadlock or livelock
> +// This apparently does not happen if the scratch device is > 27GB, so it
> +// may be some kind of ENOSPC-related bug.
> +// For more information see the e-mail thread starting at:
> +// https://lore.kernel.org/r/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@xxxxxxxxxxxxxx/
> +generic/476
> +#endif

--
Chuck Lever
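
A minimal sketch of how a tester might pick up the pending NFSD patches and
re-run the failing case. The git URL and branch name below are assumptions
(the message above says only "the NFSD for-next branch"); the test invocation
repeats the CMDLINE from the reports above.

    # Assumed location of the NFSD for-next branch; confirm with the NFSD
    # maintainers before relying on it.
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
    cd linux
    git checkout for-next

    # Build and boot this kernel on the NFS server (or on the single VM in
    # the loopback configuration), then re-run the failing test:
    kvm-xfstests -c nfs/default generic/476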