I attempted to study the prior art on this and so far have found:
- fsstress/fsx and the attendant tests in fstests/. There are ~150-200
tests using fsstress and fsx in fstests/. Most of them are xfs and
btrfs tests following the aforementioned pattern of racing fsstress
with some scary operations. Most of them tend to run for 30s, though
some are longer (and of course subject to TIME_FACTOR configuration).
- Similar-duration error-injection tests in fstests (e.g. generic/475)
- The NFSv4 Test Project
https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
A choice quote regarding stress testing:
"One year after we started using FSSTRESS (in April 2005) Linux NFSv4
was able to sustain the concurrent load of 10 processes during 24
hours, without any problem. Three months later, NFSv4 reached 72 hours
of stress under FSSTRESS, without any bugs. From this date, NFSv4
filesystem tree manipulation is considered to be stable."
I would like to discuss:
- Am I missing other strategies people are employing? Apologies if there
are obvious ones, but I tried to hunt around for a few days :)
- What is the universe of interesting stressors (e.g., reflink, scrub,
online repair, balance)?
It's not a filesystem, but the dm-vdo project has some similarities,
doing deduplication, compression, and thin provisioning. As such, its
developers have a fairly extensive set of tests for dm-vdo, and in
particular they do a fair bit of stress testing.
For them, the universe is reboots, crashes, complete rebuilds, read-only
entry and exit, compression enable/disable, and 512-byte sector mode
enable/disable. They've been running about fifty hours a week of these
tests inside Red Hat. For instance,
https://github.com/dm-vdo/vdo-devel/blob/main/src/perl/vdotest/VDOTest/RebuildStress03.pm
is one of the tests showing the random selection of operations.
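As a rough illustration of that random-selection pattern (not the VDO
code itself, which is Perl), a minimal Python sketch might look like the
following. The operation names come from the list above; run_stress and
the apply_op callback are hypothetical names for this example, and the
fixed seed is my assumption about how one would want to replay a failing
sequence.

```python
import random

# Operation names follow the stressor list above. Nothing here touches a
# real device: the caller supplies a (hypothetical) apply_op callback.
OPERATIONS = [
    "reboot",
    "crash",
    "rebuild",
    "enter_read_only",
    "exit_read_only",
    "enable_compression",
    "disable_compression",
    "enable_512_byte_sectors",
    "disable_512_byte_sectors",
]


def run_stress(iterations, seed, apply_op=None):
    """Pick `iterations` operations at random and apply each in turn.

    Returns the chosen sequence so that a failing run can be replayed
    by passing the same seed again.
    """
    rng = random.Random(seed)  # private RNG, so the seed fully fixes the run
    history = []
    for _ in range(iterations):
        op = rng.choice(OPERATIONS)
        history.append(op)
        if apply_op is not None:
            apply_op(op)
    return history
```

Logging the seed at the start of each run is what makes a crash found
after many hours reproducible: run_stress(n, seed) always returns the
same sequence for the same arguments.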
When these tests were first introduced eight years ago, they did catch
some crash or data corruption bugs which were not covered by the
existing universe of fstests-like tests for dm-vdo. There was also a
filesystem inconsistency uncovered at the time:
https://lore.kernel.org/all/CALoZfD4-uqhRSfEh0Y+v8jjSDY2KkAh-hhwdLnRgZopHEETUXA@xxxxxxxxxxxxxx/
I would suggest Matt Sakai, cc'd, or another of the VDO folks as a
valuable contributor to this discussion, given the VDO folks' long
experience with stress testing.
Sweet Tea