Re: [PATCH v1] generic/476: requires 27GB scratch size




Thank you everyone for the review and suggestions! Yes it's a real bug
rather than a test flaw, so we should fix the kernel code instead.

Jeff has posted a patchset to linux-nfs@, and I have tested against
it. The results are all good: generic/476 now typically completes
within 30 minutes in both the localhost and multi-host
configurations. The patchset is here:

https://www.spinics.net/lists/linux-nfs/msg92009.html

Thanks,
Boyang

On Fri, Jul 22, 2022 at 10:19 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Thu, Jul 21, 2022 at 06:13:48PM +0000, Chuck Lever III wrote:
> >
> > I agree that Q/A and dev testing needs are distinct, so a dev might
> > have a simpler series of tests to run and fewer resources to run them
> > in.
> >
> > That said, I've had it on my to-do list for some time to find an easy
> > way to run automated multi-host tests, and I've been told it should be
> > straight-forward, but I've been swamped lately.
>
> Yeah, in a cloud or VM environment it shouldn't be *that* hard.
> Especially for a cloud setup, it's just a matter of launching another
> cloud VM with a metadata flag that says, "please provide an NFS server
> using the specified file system as the backing store", noting the IP
> address, and then passing it to the client VM.  The only slightly
> tricky part is monitoring and saving the server's serial console as
> a test artifact in case the server oopses or triggers a BUG or
> WARN_ON.
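
[The "passing it to the client VM" step above can be sketched as a
small script that turns the server's IP into an fstests local.config.
This is only an illustration: the /export/* paths, /mnt/* mount
points, and the example address are assumptions, not anything from
this thread; fstests does read TEST_DEV/SCRATCH_DEV etc. from
local.config.]

```shell
#!/bin/sh
# Hedged sketch: given the NFS server's IP address (e.g. as noted from
# the cloud VM launch), emit an fstests local.config pointing the test
# and scratch devices at that server.  The /export/* export paths and
# /mnt/* mount points are illustrative assumptions.
gen_nfs_config() {
    server=$1
    cat <<EOF
export FSTYP=nfs
export TEST_DEV=$server:/export/test
export TEST_DIR=/mnt/test
export SCRATCH_DEV=$server:/export/scratch
export SCRATCH_MNT=/mnt/scratch
EOF
}

# Example with a documentation-range address (RFC 5737):
gen_nfs_config 192.0.2.1 > local.config
```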
>
> Unfortunately, like you, I've been swamped lately.  :-/
>
> > Many of us in the NFS community actually don't run the tests that
> > require a scratch dev, because many of them don't seem relevant to
> > NFS, or they take a long time to run. Someday we should sort through
> > all that :-)
>
> It doesn't take *that* long.  In loopback mode, as well as using the
> GCE Filestore Basic product as the remote NFS server, it takes between
> 2.5 and 3 hours to run the auto group with the test and scratch device
> sized to 5GB:
>
> nfs/filestore: 785 tests, 3 failures, 323 skipped, 9922 seconds
>   Failures: generic/258 generic/444 generic/551
> nfs/loopback: 814 tests, 2 failures, 342 skipped, 9364 seconds
>   Failures: generic/426 generic/551
>
> That's the same order of magnitude as for ext4 or xfs running -g
> auto, and at least for me "gce-xfstests -c nfs/default -g auto" is a
> fire-and-forget kind of thing.  2-3 hours later, the results show up in my
> inbox.  It's actually *analyzing* the test failures which takes time
> and NFS expertise, both of which I don't have a lot of at the moment.
>
> > For the particular issue with generic/476, I would like to see if
> > there's a reason that test takes a long time and fails with a small
> > scratch dev before agreeing that excluding it is the proper response.
>
> At the moment, my test runner setup assumes that if a single test
> takes more than an hour, the system under test is hung and should be
> killed.  So since generic/476 takes ~400 seconds on pre-5.10 LTS
> kernels, and over 24 hours on 5.10+ kernels when the watchdog safety
> timer isn't in use, I need to exclude it in my test runner, at least
> for now.
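
[A per-test version of that watchdog can be approximated with
coreutils timeout(1), which kills the child and exits with status 124
on expiry. This is only a sketch; real runners like gce-xfstests
enforce the limit at a different layer, and the ./check invocation in
the comment is illustrative.]

```shell
#!/bin/sh
# Hedged sketch of a per-test watchdog: kill the run if a single test
# exceeds the deadline.  coreutils `timeout` exits with status 124
# when it had to kill the command for running too long.
WATCHDOG_SECS=3600          # the one-hour-per-test threshold

run_with_watchdog() {
    timeout "$WATCHDOG_SECS" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "watchdog: '$*' exceeded ${WATCHDOG_SECS}s, assuming hang" >&2
    fi
    return "$status"
}

# Illustrative use (not run here):
#   run_with_watchdog ./check generic/476
```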
>
> Once it's fixed, I can use a kernel-version check (a "versioned
> #ifdef") to exclude the test only when the fix is not present.
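
[The version gate described above can be sketched as a dotted-version
comparison feeding an exclude file. The "fixed in 6.0" cutoff and the
exclude-file name are made-up placeholders, since at the time of this
thread no fix version exists; sort -V does the actual comparison.]

```shell
#!/bin/sh
# Hedged sketch of a kernel-version-gated exclude for generic/476.
# The 6.0 "fix landed" cutoff and the exclude-file name are
# placeholder assumptions.
version_ge() {
    # succeed if dotted version $1 >= $2; sort -V puts the lower first
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

FIX_VERSION=6.0
KVER=$(uname -r | cut -d- -f1)   # strip any local -suffix

if version_ge "$KVER" "$FIX_VERSION"; then
    : # fix assumed present, run generic/476 normally
else
    echo generic/476 >> exclude.list   # fstests: ./check -E exclude.list
fi
```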
>
> (Also on my todo wishlist is to have some way to automatically exclude
> a test if a specified fix commit isn't present on the tested kernel,
> but to run it automatically once the fix commit is present.
> Unfortunately, I don't have the time or the business case to put
> someone on it as a work project...)
>
>                                         - Ted
>
