On Sun, Aug 07, 2022 at 11:30:22AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
>
> Start the third chapter of the online fsck design documentation. This
> covers the testing plan to make sure that both online and offline fsck
> can detect arbitrary problems and correct them without making things
> worse.
>
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  .../filesystems/xfs-online-fsck-design.rst | 187 ++++++++++++++++++++
>  1 file changed, 187 insertions(+)
....
> +Stress Testing
> +--------------
> +
> +A unique requirement to online fsck is the ability to operate on a filesystem
> +concurrently with regular workloads.
> +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> +impact on the running system, the online repair code should never introduce
> +inconsistencies into the filesystem metadata, and regular workloads should
> +never notice resource starvation.
> +To verify that these conditions are being met, fstests has been enhanced in
> +the following ways:
> +
> +* For each scrub item type, create a test to exercise checking that item type
> +  while running ``fsstress``.
> +* For each scrub item type, create a test to exercise repairing that item type
> +  while running ``fsstress``.
> +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> +  filesystem doesn't cause problems.
> +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> +  force-repairing the whole filesystem doesn't cause problems.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  freezing and thawing the filesystem.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  remounting the filesystem read-only and read-write.
> +* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?)

I had a thought when reading this that we want to ensure that online
repair handles concurrent grow/shrink operations so that they don't
cause problems, as well as dealing with concurrent attempts to run
independent online repair processes.

Not sure that comes under stress testing, but it was the "test while
freeze/thaw" that triggered me to think of this, so that's where I'm
commenting about it. :)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
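
For illustration only, the "race A against B" pattern discussed above
(including two independent scrub processes racing each other) could be
sketched as a small harness like the one below. This is not how fstests
implements it; the function name `race` and its parameters are
hypothetical, and the harness only records non-zero exit statuses without
interpreting them.

```python
import subprocess
import threading
import time


def race(commands, duration_s):
    """Run every command in its own loop, all concurrently, until the deadline.

    `commands` is a list of argv lists. Each command is run at least once.
    Returns a list of (argv, returncode) for every run that exited non-zero.
    """
    deadline = time.monotonic() + duration_s
    failures = []
    lock = threading.Lock()

    def loop(argv):
        while True:
            # Output is discarded; only the exit status is kept.
            rc = subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            ).returncode
            if rc != 0:
                with lock:
                    failures.append((list(argv), rc))
            # Check the deadline after the run so each command runs once
            # even when duration_s is zero.
            if time.monotonic() >= deadline:
                break

    threads = [threading.Thread(target=loop, args=(argv,)) for argv in commands]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

Under these assumptions, racing two independent check-only scrubs on a
scratch mount (the path is a placeholder) would look like
`race([["xfs_scrub", "-n", "/mnt/scratch"], ["xfs_scrub", "-n", "/mnt/scratch"]], 60)`;
a grow or shrink command could be added to the list as a third racer.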