On Wed, Jan 22, 2025 at 05:01:47PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 11:08:39PM -0500, Theodore Ts'o wrote:
> > On Wed, Jan 22, 2025 at 09:15:48AM +1100, Dave Chinner wrote:
> > > check-parallel on my 64p machine runs the full auto group test in
> > > under 10 minutes.
> > >
> > > i.e. if you have a typical modern server (64-128p, 256GB RAM and a
> > > couple of NVMe SSDs), then check-parallel allows a full test run in
> > > the same time that './check -g smoketest' will run....
> >
> > Interesting. I would have thought that even with NVMe SSDs, you'd be
> > I/O speed constrained, especially given that some of the tests
> > (especially the ENOSPC hitters) can take quite a lot of time to fill
> > the storage device, even if they are using fallocate.
>
> You haven't looked at how check-parallel works, have you? :/
>
> > How do you have your test and scratch devices configured?
>
> Please go and read the check-parallel script. It does all the
> per-runner process test and scratch device configuration itself
> using loop devices.
>
> > > Yes, and I've previously made the point about how check-parallel
> > > changes the way we should be looking at dev-test cycles. We no
> > > longer have to care that auto group testing takes 4 hours to run
> > > and have to work around that with things like smoketest groups. If
> > > you can run the whole auto test group in 10-15 minutes, then we
> > > don't need "quick", "smoketest", etc. to reduce dev-test cycle
> > > time anymore...
> >
> > Well, yes, if the only consideration is test run time latency.
>
> Sure.
>
> > I can think of two offsetting considerations. The first is if you
> > care about cost.
>
> Which I really don't care about.
>
> That's something for a QE organisation to worry about, and it's up
> to them to make the best use of the tools they have within the
> budget they have.
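[For readers following along: the per-runner loop-device arrangement Dave refers to can be sketched roughly as below. This is an illustrative reconstruction, not code from the check-parallel script itself; the runner id, image sizes, and paths are made up, and attaching loop devices requires root, so that step is guarded.]

```shell
#!/bin/sh
# Sketch of per-runner test/scratch device setup in the style of
# check-parallel. Runner id, sizes, and paths are illustrative only.
runner=3
base=${TMPDIR:-/tmp}/fstests-demo
mkdir -p "$base/test-$runner" "$base/scratch-$runner"

# Back each device with a sparse file, so N runners don't need 2N
# real disks and idle runners consume almost no space.
truncate -s 8G "$base/test-$runner.img"
truncate -s 8G "$base/scratch-$runner.img"

# losetup needs root; degrade gracefully when run unprivileged.
if [ "$(id -u)" -eq 0 ]; then
    TEST_DEV=$(losetup -f --show "$base/test-$runner.img")
    SCRATCH_DEV=$(losetup -f --show "$base/scratch-$runner.img")
    mkfs.xfs -f "$TEST_DEV" >/dev/null
    export TEST_DEV SCRATCH_DEV \
           TEST_DIR="$base/test-$runner" SCRATCH_MNT="$base/scratch-$runner"
    echo "runner $runner: TEST_DEV=$TEST_DEV SCRATCH_DEV=$SCRATCH_DEV"
fi
```

Each runner gets its own independent test/scratch device pair this way, which is why parallel runs aren't serialised on a single shared block device.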
>
> > The second concern is that for a certain class of failures (UBSAN,
> > KCSAN, Lockdep, RCU soft lockups, WARN_ON, BUG_ON, and other
> > panics/OOPSes), if you are running 64 tests in parallel it might
> > not be obvious which test caused the failure.
>
> Then multiple tests will fail with the same dmesg error, but it's
> generally pretty clear which of the tests caused it. Yes, it's a bit
> more work to isolate the specific test, but it's not hard or any
> different to how a test failure is debugged now.
>
> If you want to automate such failures, then my process is to grep
> the log files for all the tests that failed with a dmesg error, then
> run them again using check instead of check-parallel. Then I get
> exactly which test generated the dmesg output without having to put
> time into trying to work out which test triggered the failure.
>
> > Today, even if the test VM crashes or hangs, I can have the test
> > manager (which runs on an e2-small VM costing $0.021913 USD/hour
> > and can manage dozens of test VMs at the same time) restart the
> > test VM, and we know which test is at fault, and we mark that
> > particular test with the JUnit XML status of "error" (as distinct
> > from "success" or "failure"). If there are 64 test runs in parallel
> > and I want automated recovery when the test appliance hangs or
> > crashes, life gets a lot more complicated.....
>
> Not really. Both dmesg and the results files will have tracked all
> the tests in flight when the system crashes, so it's just an extra
> step to extract all those tests and run them again using check
> and/or check-parallel to further isolate which test caused the
> failure....

That reminds me to go see if ./check actually fsyncs the state and
report files and whatnot between tests, so that we have a better
chance of figuring out where exactly fstests blew up the machine.

(Luckily xfs is stable enough I haven't had a machine explode in
quite some time, good job everyone! :))

--D

> I'm sure this could be automated eventually, but that's way down my
> priority list right now.
>
> > I suppose we could have the human (or test automation) try to run
> > each individual test that had been running at the time of the
> > crash, but that's a lot more complicated, and what if the tests
> > pass when run one at a time? I guess we should be happy that
> > check-parallel found a bug that plain check didn't find, but the
> > human being still has to root cause the failure.
>
> Yes. This is no different to a test that is flaky or completely
> fails when run serially by check multiple times. You still need a
> human to find the root cause of the failure.
>
> Nobody is being forced to change their tooling or processes to use
> check-parallel if they don't want or need to. It is an alternative
> method for running the tests within the fstests suite - if using
> check meets your needs, there is no reason to use check-parallel or
> even care that it exists...
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
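[Editor's note: Dave's "grep the logs, then re-run the offenders serially" triage step could look something like the sketch below. The results layout (`results/<test>.dmesg`) and the splat patterns are assumptions about a typical fstests results tree, not check-parallel's exact output; the sketch fabricates a tiny tree so it is self-contained.]

```shell
#!/bin/sh
# Sketch: collect tests whose dmesg logs contain a kernel splat, then
# re-run just those serially with ./check to isolate the culprit.
RESULT_BASE=${RESULT_BASE:-${TMPDIR:-/tmp}/fstests-results-demo}

# Fabricate a tiny results tree so this sketch runs standalone.
# (A real run would point RESULT_BASE at the check-parallel results.)
mkdir -p "$RESULT_BASE/generic"
printf 'WARNING: CPU: 3 PID: 42 at fs/xfs/xfs_icache.c\n' \
    > "$RESULT_BASE/generic/475.dmesg"
printf 'all quiet\n' > "$RESULT_BASE/generic/001.dmesg"

# Every test whose dmesg shows a WARNING/BUG/lockdep splat.
failed=$(grep -rlsE 'WARNING:|BUG:|possible circular locking' \
             "$RESULT_BASE" \
         | sed -e "s|^$RESULT_BASE/||" -e 's|\.dmesg$||' | sort -u)

echo "failed: $failed"      # prints: failed: generic/475
# On a real fstests setup, re-run the offenders serially:
#   ./check $failed
```

If several parallel runners logged the same splat, the serial re-run is what finally attributes the dmesg output to a single test, as Dave describes above.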