Hi Ted-

It's not clear from your report whether the kernel range applies to the
client's kernel or the server's kernel (in the non-loopback case).

Since a scratch device is involved, I suspect the livelock might be due to
a problem with the NFSD filecache code introduced on or about v5.10. There
are patches pending in the NFSD for-next branch that should address this
issue. Is there a way that your tester can try these out to confirm?

> On Jul 21, 2022, at 10:50 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> FYI, modern kernels (anything newer than 5.10 LTS, up to and excluding
> bleeding-edge mainline kernels) are looping forever in a livelock or
> deadlock when running generic/476 on NFS, both in a loopback and
> external export configuration. This *may* be an ENOSPC-related issue.
>
> See the referenced discussion on fstests@xxxxxxxxxxxxxxx for more
> details.
>
> - Ted
>
>
> From: "Theodore Ts'o" <tytso@xxxxxxx>
> Subject: Re: [PATCH v1] generic/476: requires 27GB scratch size
> Date: July 21, 2022 at 10:03:45 AM EDT
> To: Boyang Xue <bxue@xxxxxxxxxx>
> Cc: "Darrick J. Wong" <djwong@xxxxxxxxxx>, fstests@xxxxxxxxxxxxxxx
>
>
> Following up, using NFS loopback with a 5GB scratch device on a Google
> Compute Engine VM, generic/476 passes using a 4.14 LTS, 4.19 LTS, and
> 5.4 LTS kernel. So this looks like it's a regression which is in 5.10
> LTS and newer kernels, and so instead of patching it out of the test,
> I think the right thing to do is to add it to a kernel
> version-specific exclude file and then file a bug with the NFS folks.
>
> KERNEL:    kernel 4.14.284-xfstests #8 SMP Tue Jul 5 08:21:37 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
>
> nfs/loopback: 1 tests, 597 seconds
>   generic/476  Pass     595s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 595s
>
> ---
> KERNEL:    kernel 4.19.248-xfstests #4 SMP Sat Jun 25 10:43:45 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
>
> nfs/loopback: 1 tests, 407 seconds
>   generic/476  Pass     407s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 407s
>
> ----
> KERNEL:    kernel 5.4.199-xfstests #21 SMP Sun Jul 3 12:15:15 EDT 2022 x86_64
> CMDLINE:   -c nfs/default generic/476
> CPUS:      2
> MEM:       7680
>
> nfs/loopback: 1 tests, 404 seconds
>   generic/476  Pass     404s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 404s
>
>
> See below for what I'm checking into xfstests-bld for
> {kvm,gce}-xfstests. I don't believe we should be changing xfstests's
> generic/476, since it *does* pass with a smaller scratch device on
> older kernels, and presumably, RHEL customers would be cranky if this
> issue resulted in their production systems locking up, and so it
> should be considered a kernel bug as opposed to a test bug.
>
> - Ted
>
>
> commit 4a33b6721d5db9c07f295a10a8ad65d2a0021406
> Author: Theodore Ts'o <tytso@xxxxxxx>
> Date:   Thu Jul 21 09:54:50 2022 -0400
>
>     test-appliance: add an nfs test exclusion for kernels newer than 5.4
>
>     This is apparently an NFS bug which is visible in 5.10 LTS and newer
>     kernels, and likely appeared sometime after 5.4. Since it causes the
>     test VM to spin forever (or at least for days), let's exclude it for
>     now.
>
>     Link: https://lore.kernel.org/all/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@xxxxxxxxxxxxxx/
>     Signed-off-by: Theodore Ts'o <tytso@xxxxxxx>
>
> diff --git a/test-appliance/files/root/fs/nfs/exclude b/test-appliance/files/root/fs/nfs/exclude
> index 184750fb..ef4b19bc 100644
> --- a/test-appliance/files/root/fs/nfs/exclude
> +++ b/test-appliance/files/root/fs/nfs/exclude
> @@ -10,3 +10,14 @@ generic/477
>  // failing in the expected output of the linux-nfs Wiki page. So we'll
>  // suppress this failure for now.
>  generic/294
> +
> +#if LINUX_VERSION_CODE > KERNEL_VERSION(5,4,0)
> +// There appears to be a regression that shows up sometime after 5.4.
> +// LTS kernels for 4.14, 4.19, and 5.4 will terminate successfully,
> +// but newer kernels will spin forever in some kind of deadlock or livelock
> +// This apparently does not happen if the scratch device is > 27GB, so it
> +// may be some kind of ENOSPC-related bug.
> +// For more information see the e-mail thread starting at:
> +// https://lore.kernel.org/r/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@xxxxxxxxxxxxxx/
> +generic/476
> +#endif

--
Chuck Lever
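
A minimal sketch of how a tester might pick up the pending NFSD patches and
re-run the failing case. The git URL and branch name below are assumptions
(the message above says only "the NFSD for-next branch"); the test invocation
repeats the CMDLINE from the reports above.

    # Assumed location of the NFSD for-next branch; confirm with the NFSD
    # maintainers before relying on it.
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
    cd linux
    git checkout for-next

    # Build and boot this kernel on the NFS server (or on the single VM in
    # the loopback configuration), then re-run the failing test:
    kvm-xfstests -c nfs/default generic/476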