On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
>
> Start the third chapter of the online fsck design documentation. This
> covers the testing plan to make sure that both online and offline fsck
> can detect arbitrary problems and correct them without making things
> worse.
>
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  .../filesystems/xfs-online-fsck-design.rst | 187 ++++++++++++++++++++
>  1 file changed, 187 insertions(+)
>
>
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> index a03a7b9f0250..d630b6bdbe4a 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -563,3 +563,190 @@ functionality.
>  Many of these risks are inherent to software programming.
>  Despite this, it is hoped that this new functionality will prove useful in
>  reducing unexpected downtime.
> +
> +3. Testing Plan
> +===============
> +
> +As stated before, fsck tools have three main goals:
> +
> +1. Detect inconsistencies in the metadata;
> +
> +2. Eliminate those inconsistencies; and
> +
> +3. Minimize further loss of data.
> +
> +Demonstrations of correct operation are necessary to build users' confidence
> +that the software behaves within expectations.
> +Unfortunately, it was not really feasible to perform regular exhaustive testing
> +of every aspect of a fsck tool until the introduction of low-cost virtual
> +machines with high-IOPS storage.
> +With ample hardware availability in mind, the testing strategy for the online
> +fsck project involves differential analysis against the existing fsck tools and
> +systematic testing of every attribute of every type of metadata object.
> +Testing can be split into four major categories, as discussed below.
> +
> +Integrated Testing with fstests
> +-------------------------------
> +
> +The primary goal of any free software QA effort is to make testing as
> +inexpensive and widespread as possible to maximize the scaling advantages of
> +community.
> +In other words, testing should maximize the breadth of filesystem configuration
> +scenarios and hardware setups.
> +This improves code quality by enabling the authors of online fsck to find and
> +fix bugs early, and helps developers of new features to find integration
> +issues earlier in their development effort.
> +
> +The Linux filesystem community shares a common QA testing suite,
> +`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
> +functional and regression testing.
> +Even before development work began on online fsck, fstests (when run on XFS)
> +would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
> +scratch filesystems between each test.
> +This provides a level of assurance that the kernel and the fsck tools stay in
> +alignment about what constitutes consistent metadata.
> +During development of the online checking code, fstests was modified to run
> +``xfs_scrub -n`` between each test to ensure that the new checking code
> +produces the same results as the two existing fsck tools.
> +
> +To start development of online repair, fstests was modified to run
> +``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
> +This ensures that offline repair does not crash, leave a corrupt filesystem
> +after it exits, or trigger complaints from the online check.
> +This also established a baseline for what can and cannot be repaired offline.
> +To complete the first phase of development of online repair, fstests was
> +modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
> +This enables a comparison of the effectiveness of online repair as compared to
> +the existing offline repair tools.
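
The differential analysis described in the section above amounts to checking that every tool reaches the same verdict on the same filesystem. A minimal sketch, assuming the per-tool results have already been collected into a dictionary. The test names, the tool labels used as keys, and the `find_disagreements` helper are hypothetical and are not taken from fstests:

```python
def find_disagreements(verdicts):
    """Return the tests whose tools did not all report the same verdict.

    verdicts maps a test name to {tool label: True if that tool found the
    filesystem clean, False if it reported corruption}.
    """
    return {
        test: by_tool
        for test, by_tool in verdicts.items()
        if len(set(by_tool.values())) > 1
    }


# Hypothetical results: the two checkers agree on test-a but not on test-b.
results = {
    "test-a": {"xfs_repair -n": True, "xfs_scrub -n": True},
    "test-b": {"xfs_repair -n": False, "xfs_scrub -n": True},
}
print(sorted(find_disagreements(results)))
```

Any test that shows up in the result is one where the online and offline checkers disagree about whether the metadata is consistent, and so is worth a closer look.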
> +
> +General Fuzz Testing of Metadata Blocks
> +---------------------------------------
> +
> +XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
> +
> +Before development of online fsck even began, a set of fstests were created
> +to test the rather common fault that entire metadata blocks get corrupted.
> +This required the creation of fstests library code that can create a filesystem
> +containing every possible type of metadata object.
> +Next, individual test cases were created to create a test filesystem, identify
> +a single block of a specific type of metadata object, trash it with the
> +existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
> +particular metadata validation strategy.
> +
> +This earlier test suite enabled XFS developers to test the ability of the
> +in-kernel validation functions and the ability of the offline fsck tool to
> +detect and eliminate the inconsistent metadata.
> +This part of the test suite was extended to cover online fsck in exactly the
> +same manner.
> +
> +In other words, for a given fstests filesystem configuration:
> +
> +* For each metadata object existing on the filesystem:
> +
> +  * Write garbage to it
> +
> +  * Test the reactions of:
> +
> +    1. The kernel verifiers to stop obviously bad metadata
> +    2. Offline repair (``xfs_repair``) to detect and fix
> +    3. Online repair (``xfs_scrub``) to detect and fix
> +
> +Targeted Fuzz Testing of Metadata Records
> +-----------------------------------------
> +
> +A quick conversation with the other XFS developers revealed that the existing
> +test infrastructure could be extended to provide

"The testing plan for ofsck includes extending the existing test
infrastructure to provide..."

Took me a moment to notice we're not talking about history any more....

> a much more powerful
> +facility: targeted fuzz testing of every metadata field of every metadata
> +object in the filesystem.
> +``xfs_db`` can modify every field of every metadata structure in every
> +block in the filesystem to simulate the effects of memory corruption and
> +software bugs.
> +Given that fstests already contains the ability to create a filesystem
> +containing every metadata format known to the filesystem, ``xfs_db`` can be
> +used to perform exhaustive fuzz testing!
> +
> +For a given fstests filesystem configuration:
> +
> +* For each metadata object existing on the filesystem...
> +
> +  * For each record inside that metadata object...
> +
> +    * For each field inside that record...
> +
> +      * For each conceivable type of transformation that can be applied to a bit field...
> +
> +        1. Clear all bits
> +        2. Set all bits
> +        3. Toggle the most significant bit
> +        4. Toggle the middle bit
> +        5. Toggle the least significant bit
> +        6. Add a small quantity
> +        7. Subtract a small quantity
> +        8. Randomize the contents
> +
> +        * ...test the reactions of:
> +
> +          1. The kernel verifiers to stop obviously bad metadata
> +          2. Offline checking (``xfs_repair -n``)
> +          3. Offline repair (``xfs_repair``)
> +          4. Online checking (``xfs_scrub -n``)
> +          5. Online repair (``xfs_scrub``)
> +          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)

I like the indented bullet list format tho

> +
> +This is quite the combinatoric explosion!
> +
> +Fortunately, having this much test coverage makes it easy for XFS developers to
> +check the responses of XFS' fsck tools.
> +Since the introduction of the fuzz testing framework, these tests have been
> +used to discover incorrect repair code and missing functionality for entire
> +classes of metadata objects in ``xfs_repair``.
> +The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
> +confirming that ``xfs_repair`` could detect at least as many corruptions as
> +the older tool.
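
To make the size of that nested enumeration concrete, the loops can be sketched in a few lines of Python. This is only an illustration and not how fstests or ``xfs_db`` implement fuzzing: the function names, the sample object layout, the choice of bit ``width // 2`` as the "middle" bit, and a delta of 1 for the "small quantity" are all assumptions made for this sketch.

```python
import random


def fuzz_transforms(width, delta=1, rng=None):
    """Return {name: function} for the eight transformations of a width-bit field.

    All names and the delta/middle-bit choices are assumptions of this sketch.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so runs are repeatable
    mask = (1 << width) - 1
    return {
        "clear_all": lambda v: 0,
        "set_all": lambda v: mask,
        "toggle_msb": lambda v: v ^ (1 << (width - 1)),
        "toggle_middle": lambda v: v ^ (1 << (width // 2)),
        "toggle_lsb": lambda v: v ^ 1,
        "add_small": lambda v: (v + delta) & mask,  # wraps on overflow
        "sub_small": lambda v: (v - delta) & mask,  # wraps on underflow
        "randomize": lambda v: rng.getrandbits(width),
    }


def enumerate_cases(objects):
    """Yield (object, record index, field, transformation, fuzzed value).

    objects maps an object name to a list of records; each record maps a
    field name to a (current value, width in bits) pair.
    """
    for obj, records in objects.items():
        for idx, record in enumerate(records):
            for field, (value, width) in record.items():
                for name, transform in fuzz_transforms(width).items():
                    yield obj, idx, field, name, transform(value)


# Hypothetical layout: one object, one record, two fields.
sample = {"inode": [{"mode": (0o644, 16), "nlink": (1, 32)}]}
cases = list(enumerate_cases(sample))
print(len(cases))  # 2 fields x 8 transformations = 16
```

Each yielded case would then be run against every one of the six reactions listed above, so the total number of test runs is the number of cases multiplied by six.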
> +
> +These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
> +allow the online fsck developers to compare online fsck against offline fsck,
> +and they enable XFS developers to find deficiencies in the code base.
> +
> +Proposed patchsets include
> +`general fuzzer improvements
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
> +`fuzzing baselines
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
> +and `improvements in fuzz testing comprehensiveness
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
> +
> +Stress Testing
> +--------------
> +
> +A unique requirement to online fsck is the ability to operate on a filesystem
> +concurrently with regular workloads.
> +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> +impact on the running system, the online repair code should never introduce
> +inconsistencies into the filesystem metadata, and regular workloads should
> +never notice resource starvation.
> +To verify that these conditions are being met, fstests has been enhanced in
> +the following ways:
> +
> +* For each scrub item type, create a test to exercise checking that item type
> +  while running ``fsstress``.
> +* For each scrub item type, create a test to exercise repairing that item type
> +  while running ``fsstress``.
> +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> +  filesystem doesn't cause problems.
> +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> +  force-repairing the whole filesystem doesn't cause problems.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  freezing and thawing the filesystem.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  remounting the filesystem read-only and read-write.
> +* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?)
> +
> +Success is defined by the ability to run all of these tests without observing
> +any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
> +check warnings, or any other sort of mischief.

Seems reasonable. Other than the one nit, I think this section reads
pretty well.

Reviewed-by: Allison Henderson <allison.henderson@xxxxxxxxxx>

Allison

> +
> +Proposed patchsets include `general stress testing
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
> +and the `evolution of existing per-function stress testing
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
>