On Thu, Aug 11, 2022 at 10:09:45AM +1000, Dave Chinner wrote:
> On Sun, Aug 07, 2022 at 11:30:22AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> >
> > Start the third chapter of the online fsck design documentation.  This
> > covers the testing plan to make sure that both online and offline fsck
> > can detect arbitrary problems and correct them without making things
> > worse.
> >
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
> >  1 file changed, 187 insertions(+)
> >
> ....
> > +Stress Testing
> > +--------------
> > +
> > +A unique requirement to online fsck is the ability to operate on a filesystem
> > +concurrently with regular workloads.
> > +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> > +impact on the running system, the online repair code should never introduce
> > +inconsistencies into the filesystem metadata, and regular workloads should
> > +never notice resource starvation.
> > +To verify that these conditions are being met, fstests has been enhanced in
> > +the following ways:
> > +
> > +* For each scrub item type, create a test to exercise checking that item type
> > +  while running ``fsstress``.
> > +* For each scrub item type, create a test to exercise repairing that item type
> > +  while running ``fsstress``.
> > +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> > +  filesystem doesn't cause problems.
> > +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> > +  force-repairing the whole filesystem doesn't cause problems.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  freezing and thawing the filesystem.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  remounting the filesystem read-only and read-write.
> > +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
>
> I had a thought when reading this that we want to ensure that online
> repair handles concurrent grow/shrink operations so that doesn't
> cause problems, as well as dealing with concurrent attempts to run
> independent online repair processes.
>
> Not sure that comes under stress testing, but it was the "test while
> freeze/thaw" that triggered me to think of this, so that's where I'm
> commenting about it. :)

Hmm.  I hadn't really given that much thought.  Let me go add that to
the test suite and see how many daemons come pouring out...

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
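
For readers following the quoted test list: the "race a workload against a
checker" pattern can be sketched roughly as below. This is only an
illustration, not how fstests implements these tests. The helper name
race() is hypothetical, and the __main__ block uses harmless stand-in
commands so the sketch runs anywhere. On a real scratch XFS filesystem one
would presumably pass something like an fsstress invocation as the workload
and ["xfs_scrub", "-n", MOUNTPOINT] as the checker, where MOUNTPOINT is an
assumed scratch mount point.

```python
import subprocess
import sys
import time


def race(workload_cmd, checker_cmd, duration_s):
    """Run checker_cmd repeatedly while workload_cmd runs in the background.

    Hypothetical helper for illustration only. Both commands are argv lists.
    Returns the exit code of every checker_cmd invocation, in order.
    """
    workload = subprocess.Popen(workload_cmd)
    exit_codes = []
    deadline = time.monotonic() + duration_s
    try:
        # Run the checker at least once, then repeat until the deadline.
        while True:
            exit_codes.append(subprocess.run(checker_cmd).returncode)
            if time.monotonic() >= deadline:
                break
    finally:
        # Stop the background workload if it is still running, then reap it.
        if workload.poll() is None:
            workload.terminate()
        workload.wait()
    return exit_codes


if __name__ == "__main__":
    # Stand-in commands (assumptions): a sleeping "workload" and a checker
    # that always exits 0. Substitute real commands on a scratch filesystem.
    workload = [sys.executable, "-c", "import time; time.sleep(30)"]
    checker = [sys.executable, "-c", "raise SystemExit(0)"]
    print(race(workload, checker, duration_s=0.5))
```

A nonzero entry in the returned list means one checker pass reported a
problem while the workload was running. The freeze/thaw, remount, and
grow/shrink variants discussed above would add a third concurrent actor,
which this sketch does not attempt.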