At Meta, we currently rely primarily on fstests 'auto' runs for validating Btrfs as a general purpose filesystem for all of our root drives. While this has obviously proven to be a very useful test suite with rich collaboration across teams and filesystems, we have observed a recent trend in our production filesystem issues that makes us question whether it is sufficient. Over the last few years, we have had a number of issues (primarily in Btrfs, but at least one notable one in XFS) that were detected in production and then reproduced with an unreliable, non-specific stressor that takes hours or even days to trigger the issue.

Examples:

- Btrfs relocation bugs
  https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@xxxxxxxxxxxxxx/
  https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@xxxxxxxxxxxxxx/
- Btrfs extent map merging corruption
  https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@xxxxxx/
- Btrfs dio data corruptions from bio splitting (mostly our internal errors trying to make minimal backports of https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@xxxxxx/ and Christoph's related series)
- XFS large folios
  https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@xxxxxxxx/

In my view, the common threads between these are that:

- we used fstests to validate these systems, in some cases even with specific regression tests for highly related bugs, but still missed the bugs until they hit us during our production release process. In all cases, we had passing 'fstests -g auto' runs.
- we were able to reproduce the bugs with a predictable concoction of "run a workload and some known nasty btrfs operations in parallel". The most common form of this was running 'fsstress' and 'btrfs balance' (a rough sketch of this pattern follows the list), but it wasn't quite universal. Sometimes we needed reflink threads, or drop_caches, or memory pressure, etc. to trigger a bug.
- the relatively generic stressing reproducers took hours or days to produce an issue; the investigating engineer could then tweak and tune them by trial and error to bring that time down for a particular bug.
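To make that pattern concrete, here is a minimal sketch of the shape these reproducers take. It assumes a dedicated scratch device and an fstests checkout for the fsstress binary; the device, mount point, runtime, and thread/op counts are placeholders for illustration, not our actual scripts:

  #!/bin/bash
  # Rough reproducer sketch: race fsstress against repeated btrfs balance
  # (plus optional extra stressors) for a long time, then check the fs.
  # DEV, MNT, and RUNTIME are placeholders, not our real configuration.
  DEV=/dev/vdb
  MNT=/mnt/scratch
  RUNTIME=$((8 * 60 * 60))          # hours of churn, sometimes days

  mkfs.btrfs -f "$DEV"
  mount "$DEV" "$MNT"
  mkdir -p "$MNT/stress"

  # Background workload: fsstress as built in an fstests checkout.
  ./ltp/fsstress -d "$MNT/stress" -p 16 -n 10000000 &
  fsstress_pid=$!

  end=$((SECONDS + RUNTIME))
  while [ "$SECONDS" -lt "$end" ]; do
      # The "known nasty operation": relocate every chunk, over and over.
      btrfs balance start --full-balance "$MNT"
      # Extra stressors we sometimes needed: reflink threads, memory
      # pressure, drop_caches, etc.
      echo 3 > /proc/sys/vm/drop_caches
  done

  # A real harness would clean up fsstress's child processes more carefully.
  kill "$fsstress_pid" 2>/dev/null
  wait

  # Validation: fs still read-write, dmesg clean, fsck happy.
  findmnt -no OPTIONS "$MNT" | grep -qw ro && echo "FAIL: fs went read-only"
  dmesg | grep -iE "kernel bug|warning|btrfs.*(error|corrupt)" && echo "FAIL: suspicious dmesg"
  umount "$MNT"
  btrfs check "$DEV" || echo "FAIL: btrfs check"

Everything interesting lives in which operations run concurrently, for how long, and which checks run afterwards, which is what the modular "stressors"/"validators" framing further down is trying to capture.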
This leads me to the conclusion that there is some room for improvement in stress testing filesystems (at least Btrfs).

I attempted to study the prior art on this and so far have found:

- fsstress/fsx and the attendant tests in fstests/. There are ~150-200 tests using fsstress and fsx in fstests/. Most of them are XFS and Btrfs tests following the aforementioned pattern of racing fsstress with some scary operations. Most of them tend to run for 30s, though some are longer (and are of course subject to TIME_FACTOR configuration).
- Similar-duration error injection tests in fstests (e.g. generic/475).
- The NFSv4 Test Project: https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
  A choice quote regarding stress testing: "One year after we started using FSSTRESS (in April 2005) Linux NFSv4 was able to sustain the concurrent load of 10 processes during 24 hours, without any problem. Three months later, NFSv4 reached 72 hours of stress under FSSTRESS, without any bugs. From this date, NFSv4 filesystem tree manipulation is considered to be stable."

I would like to discuss:

- Am I missing other strategies people are employing? Apologies if there are obvious ones, but I tried to hunt around for a few days :)
- What is the universe of interesting stressors (e.g., reflink, scrub, online repair, balance, etc.)?
- What is the universe of interesting validation conditions (e.g., kernel panic, read-only fs, fsck failure, data integrity error, etc.)?
- Is there any interest in automating longer running fsstress runs? Are people already doing this with varying TIME_FACTOR configurations in fstests? (See the P.S. below for the kind of configuration I mean.)
- There is relatively less testing with fsx than with fsstress in fstests. I believe this creates gaps for data corruption bugs, as opposed to the "feature logic" issues that the fsstress feature set tends to hit.
- Can we standardize on some modular "stressors" and stress durations to run to validate filesystems?

In the short term, I have been working on these ideas in a separate barebones stress testing framework, which I am happy to share, though it isn't particularly interesting in and of itself. It is basically just a skeleton for running some "stressors" concurrently and then validating the fs with some generic "validators". I plan to run it internally just to see if I can get some useful results on our next few major kernel releases.

And of course, I would love to discuss anything else of interest to people who like stress testing filesystems!

Thanks,
Boris
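P.S. On the longer-running fstests question: my understanding is that stock fstests already gets part of the way there just by scaling the existing knobs, roughly like the following local.config excerpt (devices and paths are illustrative):

  # local.config excerpt (illustrative devices/paths)
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch
  export TIME_FACTOR=60     # scale the ~30s fsstress-style runs up ~60x
  export LOAD_FACTOR=4      # scale load in the tests that honor it

  # then e.g.:
  # ./check -g auto

That covers duration, but not the "which stressors run together" dimension, which is the part I would most like to standardize.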