On Thu, Jul 21, 2022 at 06:13:48PM +0000, Chuck Lever III wrote:
>
> I agree that Q/A and dev testing needs are distinct, so a dev might
> have a simpler series of tests to run and fewer resources to run them
> in.
>
> That said, I've had it on my to-do list for some time to find an easy
> way to run automated multi-host tests, and I've been told it should be
> straight-forward, but I've been swamped lately.

Yeah, in a cloud or VM environment it shouldn't be *that* hard.
Especially for a cloud setup, it's just a matter of launching another
cloud VM with a metadata flag that says, "please provide an NFS server
using the specified file system as the backing store", noting its IP
address, and then passing it to the client VM.  The only slightly
tricky part is monitoring and saving the serial console of the server
as a test artifact in case it oopses or triggers a BUG or WARN_ON.
Unfortunately, like you, I've been swamped lately.  :-/

> Many of us in the NFS community actually don't run the tests that
> require a scratch dev, because many of them don't seem relevant to
> NFS, or they take a long time to run. Someday we should sort through
> all that :-)

It doesn't take *that* long.  In loopback mode, as well as using the
GCE Filestore Basic product as the remote NFS server, it takes between
2.5 and 3 hours to run the auto group with the test and scratch
devices sized to 5GB:

nfs/filestore: 785 tests, 3 failures, 323 skipped, 9922 seconds
  Failures: generic/258 generic/444 generic/551
nfs/loopback: 814 tests, 2 failures, 342 skipped, 9364 seconds
  Failures: generic/426 generic/551

That's the same order of magnitude as ext4 or xfs running -g auto, and
at least for me "gce-xfstests -c nfs/default -g auto" is a
fire-and-forget kind of thing.  2-3 hours later, the results show up
in my inbox.  It's actually *analyzing* the test failures which takes
time and NFS expertise, neither of which I have a lot of at the
moment.

> For the particular issue with generic/476, I would like to see if
> there's a reason that test takes a long time and fails with a small
> scratch dev before agreeing that excluding it is the proper response.

At the moment, my test runner setup assumes that if a single test
takes more than an hour, the system under test is hung and should be
killed.  So if generic/476 is taking ~400 seconds for pre-5.10 LTS
kernels, and over 24 hours if the watchdog safety timer isn't in use
for 5.10+ kernels, I need to exclude it in my test runner, at least
for now.  Once it's fixed, I can use a Linux-version #ifdef to exclude
the test only when the fix is not present.

(Also on my todo wishlist is to have some way to automatically exclude
a test if a specified fix commit isn't present in the tested kernel,
but to run it automatically once the fix commit is present.
Unfortunately, I don't have the time or the business case to put
someone on it as a work project...)

- Ted
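
P.S.  To make the "launch an NFS server VM" idea a bit more concrete,
here's roughly what I have in mind.  This is only a sketch: the
instance name, the metadata key, and the startup script are all
placeholders, not anything gce-xfstests implements today.

    # Hypothetical: bring up a server VM whose startup script reads the
    # metadata key and exports an NFS share backed by the requested fs
    gcloud compute instances create nfs-server-vm --zone=us-central1-c \
        --metadata=nfs-backing-fs=ext4 \
        --metadata-from-file=startup-script=setup-nfs-server.sh

    # Grab the server's internal IP so it can be handed to the client VM
    SERVER_IP=$(gcloud compute instances describe nfs-server-vm \
        --zone=us-central1-c \
        --format='get(networkInterfaces[0].networkIP)')

    # ... launch the client VM and run the tests against $SERVER_IP ...

    # Afterwards, save the server's console log as a test artifact in
    # case it oopsed or hit a BUG/WARN_ON during the run
    gcloud compute instances get-serial-port-output nfs-server-vm \
        --zone=us-central1-c > results/nfs-server-console.txt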
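
For reference, the per-test watchdog assumption above is basically the
moral equivalent of the following; the real logic lives inside the
test appliance, not in a wrapper script like this one.

    # Treat a single test running past an hour as a hung appliance
    timeout --signal=KILL 3600 ./check generic/476 || \
        echo "generic/476 timed out or failed"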
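
And for the "only run the test once the fix is present" wishlist item,
the check would have roughly the shape below, assuming the runner has
access to the kernel tree being tested.  FIX_COMMIT and the
exclude-file mechanism are placeholders; this isn't something
xfstests-bld does today.

    FIX_COMMIT=xxxxxxxxxxxx    # placeholder: the generic/476 fix commit
    if git -C "$KERNEL_TREE" merge-base --is-ancestor \
            "$FIX_COMMIT" "$TESTED_KERNEL_COMMIT"; then
        :   # fix is present, let generic/476 run
    else
        echo generic/476 >> "$EXCLUDE_FILE"
    fi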