At Meta, we currently rely primarily on fstests 'auto' runs for validating Btrfs as a general purpose filesystem for all of our root drives. While this has obviously proven to be a very useful test suite with rich collaboration across teams and filesystems, we have observed a recent trend in our production filesystem issues that makes us question whether it is sufficient. Over the last few years, we have had a number of issues (primarily in Btrfs, but at least one notable one in XFS) that were detected in production and then reproduced with an unreliable, non-specific stressor that takes hours or even days to trigger the issue.

Examples:

- Btrfs relocation bugs
  https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@xxxxxxxxxxxxxx/
  https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@xxxxxxxxxxxxxx/
- Btrfs extent map merging corruption
  https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@xxxxxx/
- Btrfs dio data corruptions from bio splitting (mostly our internal errors trying to make minimal backports of https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@xxxxxx/ and Christoph's related series)
- XFS large folios
  https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@xxxxxxxx/

In my view, the common threads between these are that:

- we used fstests to validate these systems, in some cases even with specific regression tests for highly related bugs, but still missed the bugs until they hit us during our production release process. In all cases, we had passing 'fstests -g auto' runs.
- we were able to reproduce the bugs with a predictable concoction of "run a workload and some known nasty btrfs operations in parallel". The most common form of this was running 'fsstress' and 'btrfs balance' (a rough sketch of this pattern follows the list), but it wasn't quite universal. Sometimes we needed reflink threads, or drop_caches, or memory pressure, etc. to trigger a bug.
- the relatively generic stressing reproducers took hours or days to produce an issue; the investigating engineer could then tweak and tune them by trial and error to bring that time down for a particular bug.
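To make that pattern concrete, here is a minimal sketch of the shape these reproducers take. It assumes a dedicated scratch device and an fstests checkout for the fsstress binary; the device, mount point, runtime, and thread/op counts are placeholders for illustration, not our actual scripts:

  #!/bin/bash
  # Rough reproducer sketch: race fsstress against repeated btrfs balance
  # (plus optional extra stressors) for a long time, then check the fs.
  # DEV, MNT, and RUNTIME are placeholders, not our real configuration.
  DEV=/dev/vdb
  MNT=/mnt/scratch
  RUNTIME=$((8 * 60 * 60))          # hours of churn, sometimes days

  mkfs.btrfs -f "$DEV"
  mount "$DEV" "$MNT"
  mkdir -p "$MNT/stress"

  # Background workload: fsstress as built in an fstests checkout.
  ./ltp/fsstress -d "$MNT/stress" -p 16 -n 10000000 &
  fsstress_pid=$!

  end=$((SECONDS + RUNTIME))
  while [ "$SECONDS" -lt "$end" ]; do
      # The "known nasty operation": relocate every chunk, over and over.
      btrfs balance start --full-balance "$MNT"
      # Extra stressors we sometimes needed: reflink threads, memory
      # pressure, drop_caches, etc.
      echo 3 > /proc/sys/vm/drop_caches
  done

  # A real harness would clean up fsstress's child processes more carefully.
  kill "$fsstress_pid" 2>/dev/null
  wait

  # Validation: fs still read-write, dmesg clean, fsck happy.
  findmnt -no OPTIONS "$MNT" | grep -qw ro && echo "FAIL: fs went read-only"
  dmesg | grep -iE "kernel bug|warning|btrfs.*(error|corrupt)" && echo "FAIL: suspicious dmesg"
  umount "$MNT"
  btrfs check "$DEV" || echo "FAIL: btrfs check"

Everything interesting lives in which operations run concurrently, for how long, and which checks run afterwards, which is what the modular "stressors"/"validators" framing further down is trying to capture.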
This leads me to the conclusion that there is some room for improvement in stress testing filesystems (at least Btrfs).

I attempted to study the prior art on this and so far have found:

- fsstress/fsx and the attendant tests in fstests/. There are ~150-200 tests using fsstress and fsx in fstests/. Most of them are XFS and Btrfs tests following the aforementioned pattern of racing fsstress with some scary operations. Most of them tend to run for 30s, though some are longer (and are of course subject to TIME_FACTOR configuration).
- Similar-duration error injection tests in fstests (e.g. generic/475).
- The NFSv4 Test Project: https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
  A choice quote regarding stress testing: "One year after we started using FSSTRESS (in April 2005) Linux NFSv4 was able to sustain the concurrent load of 10 processes during 24 hours, without any problem. Three months later, NFSv4 reached 72 hours of stress under FSSTRESS, without any bugs. From this date, NFSv4 filesystem tree manipulation is considered to be stable."

I would like to discuss:

- Am I missing other strategies people are employing? Apologies if there are obvious ones, but I tried to hunt around for a few days :)
- What is the universe of interesting stressors (e.g., reflink, scrub, online repair, balance, etc.)?
- What is the universe of interesting validation conditions (e.g., kernel panic, read-only fs, fsck failure, data integrity error, etc.)?
- Is there any interest in automating longer running fsstress runs? Are people already doing this with varying TIME_FACTOR configurations in fstests? (See the P.S. below for the kind of configuration I mean.)
- There is relatively less testing with fsx than with fsstress in fstests. I believe this creates gaps for data corruption bugs, as opposed to the "feature logic" issues that the fsstress feature set tends to hit.
- Can we standardize on some modular "stressors" and stress durations to run to validate filesystems?

In the short term, I have been working on these ideas in a separate barebones stress testing framework, which I am happy to share, though it isn't particularly interesting in and of itself. It is basically just a skeleton for running some "stressors" concurrently and then validating the fs with some generic "validators". I plan to run it internally just to see if I can get some useful results on our next few major kernel releases.

And of course, I would love to discuss anything else of interest to people who like stress testing filesystems!

Thanks,
Boris
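P.S. On the longer-running fstests question: my understanding is that stock fstests already gets part of the way there just by scaling the existing knobs, roughly like the following local.config excerpt (devices and paths are illustrative):

  # local.config excerpt (illustrative devices/paths)
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch
  export TIME_FACTOR=60     # scale the ~30s fsstress-style runs up ~60x
  export LOAD_FACTOR=4      # scale load in the tests that honor it

  # then e.g.:
  # ./check -g auto

That covers duration, but not the "which stressors run together" dimension, which is the part I would most like to standardize.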