On Tue, Feb 04, 2025 at 11:58:46AM -0800, Boris Burkov wrote:
> On Tue, Feb 04, 2025 at 11:57:09AM +1100, Dave Chinner wrote:
> > > > - were able to reproduce the bugs with a predictable concoction of "run
> > > > a workload and some known nasty btrfs operations in parallel". The most
> > > > common form of this was running 'fsstress' and 'btrfs balance', but it
> > > > wasn't quite universal. Sometimes we needed reflink threads, or
> > > > drop_caches, or memory pressure, etc. to trigger a bug.
> >
> > That's pretty much what check-parallel does to a system. Loads of
> > tests run things like drop_caches, memory compaction, CPU hotplug,
> > etc. check-parallel essentially exposes every test to these sorts
> > of background perturbations rather than just the one test that is
> > running that perturbation. IOWs, even the most basic correctness
> > test now gets exercised while cpu hotplug and memory compaction are
> > going on in the background....
> >
> > Eventually, I plan to implement these background perturbations as
> > separate control tasks for check-parallel so we don't need specific
> > tests that run a background perturbation whilst the rest of the
> > system is under test.
>
> I think that a framework for introducing background perturbations
> while running tests is definitely what I'm getting at. If
> check-parallel is a good version of that, then that sounds great to
> me. I am particularly excited about your point that it will smash
> together *every* stimulus with *every* test. I do have some questions
> in my head about how that would work in practice.
>
> My main questions/concerns are:
>
> How much do you randomize the interleaving of tests? Does
> check-parallel run them in a random order?

Same as check - the "-r" option will randomise the test run order.

The run order is also somewhat randomised by default, in that the
tests are sorted by their runtime in the previous test run. Hence the
run order is not static - it generally runs long running tests before
short running tests, but the exact order is not fixed.

> Similarly, their durations are not at all tuned to maximize
> interesting interactions. If test X and test Y would collide on some
> faulty interaction, but test X runs once in 1 second, then you would
> likely never see test X interfere with some interesting moment during
> test Y. Are you considering feeding the tests back into the run-queue
> as they finish for these stress style runs?

Not yet - the infrastructure to directly manage and run tests from
check-parallel is not yet in place. It currently generates a test list
for each runner thread, then executes that list via a check instance
per runner thread.

I plan to have check-parallel execute tests individually itself by
factoring the run loop out of check (similar to how I'm doing the test
list parsing). Once there is direct control of the test execution,
stuff like dynamic test queues - where runners just pull the next test
off the queue and keep going until the queue is empty - will be
possible.
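To illustrate the queue model, roughly something like this (an
untested sketch, not the actual check-parallel code - the queue file,
test list and NR_RUNNERS variable are all invented for the example):

	#!/bin/bash
	# Illustrative sketch only. Real runners also need their own
	# test/scratch devices and results directories; the queue just
	# changes where a runner gets its next test from.

	queue=/tmp/fstests.queue
	lock=/tmp/fstests.queue.lock

	# atomically pop the first test off the shared queue
	next_test()
	{
		flock "$lock" sh -c "head -n 1 $queue; sed -i 1d $queue"
	}

	runner()
	{
		local t
		while t=$(next_test) && [ -n "$t" ]; do
			./check "$t"
		done
	}

	# one test name per line, e.g. "generic/001", in random order
	sort -R all-tests.list > "$queue"
	touch "$lock"

	for ((i = 0; i < NR_RUNNERS; i++)); do
		runner &
	done
	wait

Runners simply exit when the queue runs dry, so feeding tests back
into the queue for stress-style runs is just a matter of appending to
the file instead of letting it drain.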
Especially for the more "unit > test" style fstests that carefully use sync to check specific conditions > during a run. That's why I currently have a "unreliable_in_parallel" test group definition and check-parallel excludes that test group. There's about 20 tests I've classified this way, most of them xfs specific tests that are reliant on exact fragmentation patterns being created. This tests are perturbed by things like sync(1) calls from other tests which results in a different fragmentation pattern than the test expects to see. In each case, there is a comment in the test explaining the condition that makes the test unreliable in parallel, and so we have some idea of what needs fixing to be able to remove it from the unreliable_in_parallel group. Essentially, I'm using this as a marker and note for future improvements once all the (more important) infrastructure work is done and solid. > This variant also feels like it would be at the extreme of difficulty > for attempting to distill a failure into a reproducer. It's pretty obvious when a test is doing something that is influenced by an outside event. The biggest problem for debugging them comes when the test failures appear to be real bugs (e.g. all the weird and whacky off-by-one quota failures that check-parallel triggers on XFS) but they cannot be reproduced when the tests are run serially. ..... > > > > And of course, I would love to discuss anything else of interest to > > > > people who like stress testing filesystems! > > > > Filesystem stress testing by itself isn't really interesting to me. > > Using filesystem correctness tests to create massively stressful > > workloads, OTOH, attacks the problem from multiple angles and > > exercises the system well outside the bounds of just filesystem > > code. > > From what I see, today we have a handful of tests which race fsx or > fsstress with 0-2 operations under test, and you are proposing using > check-parallel to hammer the computer with the entirety of all 1000 > tests in parallel (awesome). It's currently running one test per CPU in parallel, not all at once. Many tests run lots of stuff in parallel themselves, too, and some of them hammer large CPU count machines really hard just by themselves, let alone when there's another 63 tests running concurrently.... > I think I am proposing something in between > where we run fsx AND fsstress AND ~10 known scary operations. Write a set of tests that do this for btrfs and put them in the auto/stress/soak groups. Then run 'check-parallel -g soak,stress ....' -Dave. -- Dave Chinner david@xxxxxxxxxxxxx