On Mon, Feb 03, 2025 at 08:12:59PM +0100, Amir Goldstein wrote:
> CC fstests
>
> On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@xxxxxx> wrote:
> >
> > At Meta, we currently primarily rely on fstests 'auto' runs for
> > validating Btrfs as a general purpose filesystem for all of our
> > root drives. While this has obviously proven to be a very useful
> > test suite with rich collaboration across teams and filesystems,
> > we have observed a recent trend in our production filesystem
> > issues that makes us question whether it is sufficient.
> >
> > Over the last few years, we have had a number of issues (primarily
> > in Btrfs, but at least one notable one in XFS) that were detected
> > in production, then reproduced with an unreliable, non-specific
> > stressor that takes hours or even days to trigger the issue.
> > Examples:
> > - Btrfs relocation bugs
> >   https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@xxxxxxxxxxxxxx/
> >   https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@xxxxxxxxxxxxxx/
> > - Btrfs extent map merging corruption
> >   https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@xxxxxx/
> > - Btrfs dio data corruptions from bio splitting
> >   (mostly our internal errors trying to make minimal backports of
> >   https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@xxxxxx/
> >   and Christoph's related series)
> > - XFS large folios
> >   https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@xxxxxxxx/
> >
> > In my view, the common threads between these are that:
> > - We used fstests to validate these systems, in some cases even
> >   with specific regression tests for highly related bugs, but
> >   still missed the bugs until they hit us during our production
> >   release process. In all cases, we had passing 'fstests -g auto'
> >   runs.

Have you considered the 'soak' test group with a long SOAK_DURATION,
and then increasing the load using LOAD_FACTOR? There is also a
'stress' group that TIME_FACTOR acts on.

For XFS, there's also a bunch of fuzzing tests (in the
dangerous_fuzzers group) that use the same SOAK_DURATION
infrastructure via common/fuzzy.

> > - We were able to reproduce the bugs with a predictable concoction
> >   of "run a workload and some known nasty btrfs operations in
> >   parallel". The most common form of this was running 'fsstress'
> >   and 'btrfs balance', but it wasn't quite universal. Sometimes we
> >   needed reflink threads, or drop_caches, or memory pressure, etc.
> >   to trigger a bug.

That's pretty much what check-parallel does to a system. Loads of
tests run things like drop_caches, memory compaction, CPU hotplug,
etc. check-parallel essentially exposes every test to these sorts of
background perturbations rather than just the one test that is
running that perturbation. IOWs, even the most basic correctness
test now gets exercised while CPU hotplug and memory compaction are
going on in the background....

Eventually, I plan to implement these background perturbations as
separate control tasks for check-parallel so we don't need specific
tests that run a background perturbation whilst the rest of the
system is under test.

> > - The relatively generic stressing reproducers took hours or days
> >   to produce an issue; the investigating engineer could then tweak
> >   and tune them by trial and error to bring that time down for a
> >   particular bug.
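For concreteness, the simplest form of that "fsstress + balance"
concoction is easy to script outside fstests, and it's the obvious
starting point for the tweak-and-tune loop described above. An
untested sketch - the device, mount point, worker/op counts and
validation step are all placeholders, and real reproducers add
reflink/drop_caches/memory-pressure threads as needed:

  #!/bin/bash
  # Race an fsstress workload against btrfs balance, in the spirit
  # of the reproducers described above. Assumes fsstress (built by
  # fstests as ltp/fsstress) is in $PATH.
  DEV=/dev/vdb        # placeholder scratch device
  MNT=/mnt/scratch    # placeholder scratch mount

  mkfs.btrfs -f $DEV
  mount $DEV $MNT
  mkdir -p $MNT/stress

  # Background workload: 16 workers, 200000 ops each.
  fsstress -d $MNT/stress -p 16 -n 200000 &
  stress_pid=$!

  # Relocate every block group underneath the workload, repeatedly,
  # until it finishes.
  while kill -0 $stress_pid 2>/dev/null; do
          btrfs balance start --full-balance $MNT
  done
  wait

  # Validate: scan dmesg for WARN/BUG/read-only transitions, then
  # check the fs offline.
  umount $MNT
  btrfs check $DEV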
> >
> > This leads me to the conclusion that there is some room for
> > improvement in stress testing filesystems (at least Btrfs).
> >
> > I attempted to study the prior art on this and so far have found:
> > - fsstress/fsx and the attendant tests in fstests/. There are
> >   ~150-200 tests using fsstress and fsx in fstests/. Most of them
> >   are xfs and btrfs tests following the aforementioned pattern of
> >   racing fsstress with some scary operations. Most of them tend to
> >   run for 30s, though some are longer (and are of course subject
> >   to TIME_FACTOR configuration).

As per above, SOAK_DURATION.

> > - Similar-duration error injection tests in fstests (e.g.
> >   generic/475)
> > - The NFSv4 Test Project
> >   https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
> >   A choice quote regarding stress testing:
> >   "One year after we started using FSSTRESS (in April 2005) Linux
> >   NFSv4 was able to sustain the concurrent load of 10 processes
> >   during 24 hours, without any problem. Three months later, NFSv4
> >   reached 72 hours of stress under FSSTRESS, without any bugs.
> >   From this date, NFSv4 filesystem tree manipulation is considered
> >   to be stable."
> >
> > I would like to discuss:
> > - Am I missing other strategies people are employing? Apologies if
> >   there are obvious ones, but I tried to hunt around for a few
> >   days :)

check-parallel.

> > - What is the universe of interesting stressors (e.g., reflink,
> >   scrub, online repair, balance, etc.)?

Memory compaction, CPU hotplug, random reflinks of the underlying
loop device image files to simulate dynamic VM image file snapshots,
etc.

> > - What is the universe of interesting validation conditions (e.g.,
> >   kernel panic, read-only fs, fsck failure, data integrity error,
> >   etc.)?

All of them. That's the point of check-parallel - it uses simple,
existing filesystem correctness tests to generate a massively
stressful load on the system...

> > - Is there any interest in automating longer running fsstress
> >   runs? Are people already doing this with varying TIME_FACTOR
> >   configurations in fstests?

At least for XFS, Darrick is already doing that, and I think Carlos
may be as well.

> > - There is relatively less testing with fsx than fsstress in
> >   fstests. I believe this creates gaps for data corruption bugs
> >   rather than the "feature logic" issues that the fsstress feature
> >   set tends to hit.
> > - Can we standardize on some modular "stressors" and stress
> >   durations to run to validate filesystems?

I think we already have that with the "soak" and "stress" groups...

> > In the short term, I have been working on these ideas in a
> > separate barebones stress testing framework which I am happy to
> > share, but which isn't particularly interesting in and of itself.
> > It is basically just a skeleton for running some concurrent
> > "stressors" and then validating the fs with some generic
> > "validators". I plan to run it internally just to see if I can get
> > some useful results on our next few major kernel releases.

check-parallel is effectively a massive concurrent stress workload
for the system. It does this by running many individual correctness
tests concurrently. Run it on a 64p system or larger, and it will
hammer both the test filesystems and the base filesystem that all
the loop device image files are laid out on. I'm seeing it generate
5-6GB/s of IO load, consume 40-50GB of memory, consistently use >90%
of the CPU in the system, and stress the scheduler at over half a
million context switches/s.
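The loop image reflink perturbation mentioned above is equally easy
to run as a standalone background task while the fs inside the image
is under load. A rough sketch - the image path, interval and
snapshot window are made up, and the filesystem holding the image
must support reflink:

  #!/bin/bash
  # Periodically reflink the loop device image file backing a test
  # fs, simulating snapshots of a live VM image.
  IMG=/images/test.img     # placeholder loop image path
  SNAPDIR=/images/snaps    # placeholder snapshot directory
  mkdir -p $SNAPDIR

  i=0
  while sleep $((RANDOM % 30 + 1)); do
          cp --reflink=always $IMG $SNAPDIR/snap.$i
          # keep a rolling window of 8 snapshots
          [ $i -ge 8 ] && rm -f $SNAPDIR/snap.$((i - 8))
          i=$((i + 1))
  done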
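And for reference on the soak/stress knobs mentioned earlier: they
are plain fstests configuration, so "standardizing" is mostly a
matter of agreeing on values. A hypothetical local.config fragment -
devices and durations are placeholders:

  # local.config fragment - all values illustrative
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch

  export SOAK_DURATION=$((4 * 3600))  # seconds per soak test
  export LOAD_FACTOR=4                # scale up concurrent load
  export TIME_FACTOR=10               # lengthen time-scaled tests

Then run the groups directly:

  $ ./check -g soak
  $ ./check -g stress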
> > And of course, I would love to discuss anything else of interest
> > to people who like stress testing filesystems!

Filesystem stress testing by itself isn't really interesting to me.
Using filesystem correctness tests to create massively stressful
workloads, OTOH, attacks the problem from multiple angles and
exercises the system well outside the bounds of just filesystem
code.

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx