On Thu, Jul 21, 2022 at 06:13:48PM +0000, Chuck Lever III wrote:
>
> I agree that Q/A and dev testing needs are distinct, so a dev might
> have a simpler series of tests to run and fewer resources to run them
> in.
>
> That said, I've had it on my to-do list for some time to find an easy
> way to run automated multi-host tests, and I've been told it should be
> straight-forward, but I've been swamped lately.

Yeah, in a cloud or VM environment it shouldn't be *that* hard.
Especially for a cloud setup, it's just a matter of launching another
cloud VM with a metadata flag that says, "please provide an NFS server
using the specified file system as the backing store", noting its IP
address, and then passing it to the client VM.  The only slightly
tricky part is monitoring and saving the serial console of the server
as a test artifact in case it oopses or triggers a BUG or WARN_ON.
Unfortunately, like you, I've been swamped lately.  :-/

> Many of us in the NFS community actually don't run the tests that
> require a scratch dev, because many of them don't seem relevant to
> NFS, or they take a long time to run. Someday we should sort through
> all that :-)

It doesn't take *that* long.  In loopback mode, as well as using the
GCE Filestore Basic product as the remote NFS server, it takes between
2.5 and 3 hours to run the auto group with the test and scratch
devices sized to 5GB:

nfs/filestore: 785 tests, 3 failures, 323 skipped, 9922 seconds
  Failures: generic/258 generic/444 generic/551
nfs/loopback: 814 tests, 2 failures, 342 skipped, 9364 seconds
  Failures: generic/426 generic/551

That's the same order of magnitude as ext4 or xfs running -g auto, and
at least for me "gce-xfstests -c nfs/default -g auto" is a
fire-and-forget kind of thing.  2-3 hours later, the results show up
in my inbox.  It's actually *analyzing* the test failures which takes
time and NFS expertise, neither of which I have a lot of at the
moment.

> For the particular issue with generic/476, I would like to see if
> there's a reason that test takes a long time and fails with a small
> scratch dev before agreeing that excluding it is the proper response.

At the moment, my test runner setup assumes that if a single test
takes more than an hour, the system under test is hung and should be
killed.  So if generic/476 is taking ~400 seconds for pre-5.10 LTS
kernels, and over 24 hours if the watchdog safety timer isn't in use
for 5.10+ kernels, I need to exclude it in my test runner, at least
for now.  Once it's fixed, I can use a Linux-version #ifdef to exclude
the test only when the fix is not present.

(Also on my todo wishlist is to have some way to automatically exclude
a test if a specified fix commit isn't present in the tested kernel,
but to run it automatically once the fix commit is present.
Unfortunately, I don't have the time or the business case to put
someone on it as a work project...)

- Ted
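
P.S.  To make the "launch an NFS server VM" idea a bit more concrete,
here's roughly what I have in mind.  This is only a sketch: the
instance name, the metadata key, and the startup script are all
placeholders, not anything gce-xfstests implements today.

    # Hypothetical: bring up a server VM whose startup script reads the
    # metadata key and exports an NFS share backed by the requested fs
    gcloud compute instances create nfs-server-vm --zone=us-central1-c \
        --metadata=nfs-backing-fs=ext4 \
        --metadata-from-file=startup-script=setup-nfs-server.sh

    # Grab the server's internal IP so it can be handed to the client VM
    SERVER_IP=$(gcloud compute instances describe nfs-server-vm \
        --zone=us-central1-c \
        --format='get(networkInterfaces[0].networkIP)')

    # ... launch the client VM and run the tests against $SERVER_IP ...

    # Afterwards, save the server's console log as a test artifact in
    # case it oopsed or hit a BUG/WARN_ON during the run
    gcloud compute instances get-serial-port-output nfs-server-vm \
        --zone=us-central1-c > results/nfs-server-console.txt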
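
For reference, the per-test watchdog assumption above is basically the
moral equivalent of the following; the real logic lives inside the
test appliance, not in a wrapper script like this one.

    # Treat a single test running past an hour as a hung appliance
    timeout --signal=KILL 3600 ./check generic/476 || \
        echo "generic/476 timed out or failed"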
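
And for the "only run the test once the fix is present" wishlist item,
the check would have roughly the shape below, assuming the runner has
access to the kernel tree being tested.  FIX_COMMIT and the
exclude-file mechanism are placeholders; this isn't something
xfstests-bld does today.

    FIX_COMMIT=xxxxxxxxxxxx    # placeholder: the generic/476 fix commit
    if git -C "$KERNEL_TREE" merge-base --is-ancestor \
            "$FIX_COMMIT" "$TESTED_KERNEL_COMMIT"; then
        :   # fix is present, let generic/476 run
    else
        echo generic/476 >> "$EXCLUDE_FILE"
    fi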