Thank you everyone for the review and suggestions!

Yes, it's a real bug rather than a test flaw, so we should fix the kernel
code instead. Jeff has posted a patchset to linux-nfs@, and I have tested
against it. The results are all good: generic/476 now typically completes
within 30 minutes in both localhost and multi-host configurations.

The patchset is here:
https://www.spinics.net/lists/linux-nfs/msg92009.html

Thanks,
Boyang

On Fri, Jul 22, 2022 at 10:19 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Thu, Jul 21, 2022 at 06:13:48PM +0000, Chuck Lever III wrote:
> >
> > I agree that Q/A and dev testing needs are distinct, so a dev might
> > have a simpler series of tests to run and fewer resources to run
> > them in.
> >
> > That said, I've had it on my to-do list for some time to find an
> > easy way to run automated multi-host tests, and I've been told it
> > should be straightforward, but I've been swamped lately.
>
> Yeah, in a cloud or VM environment it shouldn't be *that* hard.
> Especially for a cloud setup, it's just a matter of launching another
> cloud VM with a metadata flag that says, "please provide an NFS server
> using the specified file system as the backing store", noting the IP
> address, and then passing it to the client VM. The only slightly
> tricky part is monitoring and saving the server's serial console as a
> test artifact in case it oopses or triggers a BUG or WARN_ON.
>
> Unfortunately, like you, I've been swamped lately. :-/
>
> > Many of us in the NFS community actually don't run the tests that
> > require a scratch dev, because many of them don't seem relevant to
> > NFS, or they take a long time to run. Someday we should sort through
> > all that :-)
>
> It doesn't take *that* long. In loopback mode, as well as using the
> GCE Filestore Basic product as the remote NFS server, it takes between
> 2.5 and 3 hours to run the auto group with the test and scratch
> devices sized to 5GB:
>
> nfs/filestore: 785 tests, 3 failures, 323 skipped, 9922 seconds
>   Failures: generic/258 generic/444 generic/551
> nfs/loopback:  814 tests, 2 failures, 342 skipped, 9364 seconds
>   Failures: generic/426 generic/551
>
> That's the same order of magnitude as ext4 or xfs running -g auto, and
> at least for me "gce-xfstests -c nfs/default -g auto" is a
> fire-and-forget kind of thing. 2-3 hours later, the results show up in
> my inbox. It's actually *analyzing* the test failures which takes time
> and NFS expertise, both of which I don't have a lot of at the moment.
>
> > For the particular issue with generic/476, I would like to see if
> > there's a reason that test takes a long time and fails with a small
> > scratch dev before agreeing that excluding it is the proper
> > response.
>
> At the moment, my test runner setup assumes that if a single test
> takes more than an hour, the system under test is hung and should be
> killed. So if generic/476 takes ~400 seconds on pre-5.10 LTS kernels,
> and over 24 hours on 5.10+ kernels when the watchdog safety timer
> isn't in use, I need to exclude it in my test runner, at least for
> now.
>
> Once it's fixed, I can use a Linux-versioned #ifdef to exclude the
> test only if the fix is not present.
>
> (Also on my todo wishlist is some way to automatically exclude a test
> if a specified fix commit isn't present on the tested kernel, but to
> run it automatically once the fix commit is present.
> Unfortunately, I don't have the time or the business case to put
> someone on it as a work project...)
>
> - Ted
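
To make the two-VM setup Ted describes above concrete, here is a rough
sketch in Python against the gcloud CLI. The instance names, zone, and
metadata keys are illustrative assumptions, not gce-xfstests' actual
interface:

#!/usr/bin/env python3
# Rough sketch of the two-VM flow described above: launch a separate GCE
# instance to act as the NFS server, note its internal IP, hand that to
# the client VM, and save the server's serial console as a test artifact.
# Names, zone, and metadata keys below are made up for illustration.
import subprocess

ZONE = "us-central1-a"             # assumed zone
SERVER = "xfstests-nfs-server"     # hypothetical server VM name
CLIENT = "xfstests-nfs-client"     # hypothetical client VM name

def gcloud(*args):
    """Run a 'gcloud compute ...' command and return its stdout."""
    out = subprocess.run(["gcloud", "compute", *args, "--zone", ZONE],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()

# 1. Launch the server VM with a metadata flag asking it to export an
#    NFS share backed by the file system under test (key is made up).
gcloud("instances", "create", SERVER,
       "--metadata", "nfs-server-export=ext4")

# 2. Note the server's internal IP address...
server_ip = gcloud("instances", "describe", SERVER,
                   "--format=get(networkInterfaces[0].networkIP)")

# 3. ...and pass it to the client VM, again via metadata (key made up),
#    so the test appliance can point TEST_DEV at <server_ip>:/export.
gcloud("instances", "create", CLIENT,
       "--metadata", f"nfs-test-server={server_ip}:/export")

# 4. The slightly tricky part: capture the server's serial console so an
#    oops/BUG/WARN_ON on the server side isn't lost.
with open("nfs-server-console.log", "w") as f:
    f.write(gcloud("instances", "get-serial-port-output", SERVER))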
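
And for the wishlist item at the end, one possible shape for commit-gated
exclusion, assuming the source tree the tested kernel was built from is
available; the commit ID and exclude-file path are placeholders, since the
actual fix commit was not known at the time of this thread:

#!/usr/bin/env python3
# Sketch of commit-gated exclusion: keep generic/476 on the exclude list
# only while a named fix commit is absent from the tested kernel's tree.
# KERNEL_TREE, FIX_COMMIT, and the exclude-file name are placeholders.
import subprocess

KERNEL_TREE = "/path/to/linux"      # assumed: tree the tested kernel was built from
FIX_COMMIT = "<fix-commit-sha>"     # placeholder for the eventual fix
TEST = "generic/476"

def has_fix(tree, commit):
    """True if `commit` is an ancestor of HEAD in `tree`."""
    rc = subprocess.run(["git", "-C", tree, "merge-base",
                         "--is-ancestor", commit, "HEAD"]).returncode
    return rc == 0

if not has_fix(KERNEL_TREE, FIX_COMMIT):
    # Skip the test until the fix lands; once it is present, nothing is
    # appended and the test runs automatically again.
    with open("nfs.exclude", "a") as f:
        f.write(TEST + "\n")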