Re: [PATCH 13/23] generic/650: revert SOAK_DURATION changes

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 22 Jan 2025 17:01:47 +1100

On Tue, Jan 21, 2025 at 11:08:39PM -0500, Theodore Ts'o wrote:
> On Wed, Jan 22, 2025 at 09:15:48AM +1100, Dave Chinner wrote:
> > check-parallel on my 64p machine runs the full auto group test in
> > under 10 minutes.
> > 
> > i.e. if you have a typical modern server (64-128p, 256GB RAM and a
> > couple of NVMe SSDs), then check-parallel allows a full test run in
> > the same time that './check -g smoketest' will run....
> 
> Interesting.  I would have thought that even with NVMe SSDs, you'd be
> I/O speed constrained, especially given that some of the tests
> (especially the ENOSPC hitters) can take quite a lot of time to fill
> the storage device, even if they are using fallocate.

You haven't looked at how check-parallel works, have you? :/

> How do you have your test and scratch devices configured?

Please go and read the check-parallel script. It does all the
per-runner process test and scratch device configuration itself
using loop devices.

> > Yes, and I've previously made the point about how check-parallel
> > changes the way we should be looking at dev-test cycles. We no
> > longer have to care that auto group testing takes 4 hours to run and
> > have to work around that with things like smoketest groups. If you
> > can run the whole auto test group in 10-15 minutes, then we don't
> > need "quick", "smoketest", etc to reduce dev-test cycle time
> > anymore...
> 
> Well, yes, if the only consideration is test run time latency.

Sure.

> I can think of two off-setting considerations.  The first is if you
> care about cost.

Which I really don't care about.

That's something for a QE organisation to worry about, and it's up
to them to make the best use of the tools they have within the
budget they have.

> The second concern is that for a certain class of failures (UBSAN,
> KCSAN, Lockdep, RCU soft lockups, WARN_ON, BUG_ON, and other
> panics/OOPS), if you are running 64 tests in parallel it might not be
> obvious which test caused the failure.

Then multiple tests will fail with the same dmesg error, but it's
generally pretty clear which of the tests caused it. Yes, it's a bit
more work to isolate the specific test, but it's not hard or any
different to how a test failure is debugged now.

If you want to automate triage of such failures, then my process is
to grep the log files for all the tests that failed with a dmesg
error, then run them again using check instead of check-parallel.
Then I get exactly which test generated the dmesg output without
having to put time into trying to work out which test triggered the
failure.

> Today, even if the test VM
> crashes or hangs, my test manager (which runs on an e2-small VM
> costing $0.021913 USD/hour and can manage dozens of test VMs all at the
> same time) can restart the test VM, and we know which test is at
> fault, and we mark that particular test with the Junit XML status of
> "error" (as distinct from "success" or "failure").  If there are 64
> test runs in parallel, if I wanted to have automated recovery if the
> test appliance hangs or crashes, life gets a lot more complicated.....

Not really. Both dmesg and the results files will have tracked all
the tests inflight when the system crashes, so it's just an extra
step to extract all those tests and run them again using check
and/or check-parallel to further isolate which test caused the
failure....
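
As a rough sketch of that extraction step: this assumes the kernel
log carries a "run fstests <test> at <timestamp>" line for each test
start, and that the caller already has a list of tests that finished
before the crash (where that list comes from is left open here). The
function names are hypothetical.

```python
import re

# Assumed format of the per-test start marker in the kernel log.
RUN_MARKER = re.compile(r"run fstests (\S+) at ")


def started_tests(dmesg_text):
    """Return test names in the order their start markers appear, no repeats."""
    seen = []
    for line in dmesg_text.splitlines():
        match = RUN_MARKER.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def inflight_candidates(dmesg_text, finished):
    """Tests that started according to dmesg but are not known to have finished."""
    done = set(finished)
    return [test for test in started_tests(dmesg_text) if test not in done]
```

The returned list is only a set of candidates to rerun; it narrows
the search but does not by itself identify the test at fault.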

I'm sure this could be automated eventually, but that's way down my
priority list right now.

> I suppose we could have the human (or test automation) try to run each
> individual test that had been running at the time of the crash but
> that's a lot more complicated, and what if the tests pass when run
> one at a time?  I guess we should be happy that check-parallel found a
> bug that plain check didn't find, but the human being still has to
> root cause the failure.

Yes. This is no different to a test that is flaky or completely
fails when run serially by check multiple times. You still need a
human to find the root cause of the failure.

Nobody is being forced to change their tooling or processes to use
check-parallel if they don't want or need to. It is an alternative
method for running the tests within the fstests suite - if using
check meets your needs, there is no reason to use check-parallel or
even care that it exists...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx