On Wed, Jan 18, 2023 at 12:03:17AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> >
> > Start the third chapter of the online fsck design documentation.  This
> > covers the testing plan to make sure that both online and offline fsck
> > can detect arbitrary problems and correct them without making things
> > worse.
> >
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst |  187 ++++++++++++++++++++
> >  1 file changed, 187 insertions(+)
> >
> >
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index a03a7b9f0250..d630b6bdbe4a 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -563,3 +563,190 @@ functionality.
> >  Many of these risks are inherent to software programming.
> >  Despite this, it is hoped that this new functionality will prove useful in
> >  reducing unexpected downtime.
> > +
> > +3. Testing Plan
> > +===============
> > +
> > +As stated before, fsck tools have three main goals:
> > +
> > +1. Detect inconsistencies in the metadata;
> > +
> > +2. Eliminate those inconsistencies; and
> > +
> > +3. Minimize further loss of data.
> > +
> > +Demonstrations of correct operation are necessary to build users' confidence
> > +that the software behaves within expectations.
> > +Unfortunately, it was not really feasible to perform regular exhaustive testing
> > +of every aspect of a fsck tool until the introduction of low-cost virtual
> > +machines with high-IOPS storage.
> > +With ample hardware availability in mind, the testing strategy for the online
> > +fsck project involves differential analysis against the existing fsck tools and
> > +systematic testing of every attribute of every type of metadata object.
> > +Testing can be split into four major categories, as discussed below.
> > +
> > +Integrated Testing with fstests
> > +-------------------------------
> > +
> > +The primary goal of any free software QA effort is to make testing as
> > +inexpensive and widespread as possible to maximize the scaling advantages of
> > +community.
> > +In other words, testing should maximize the breadth of filesystem configuration
> > +scenarios and hardware setups.
> > +This improves code quality by enabling the authors of online fsck to find and
> > +fix bugs early, and helps developers of new features to find integration
> > +issues earlier in their development effort.
> > +
> > +The Linux filesystem community shares a common QA testing suite,
> > +`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
> > +functional and regression testing.
> > +Even before development work began on online fsck, fstests (when run on XFS)
> > +would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
> > +scratch filesystems between each test.
> > +This provides a level of assurance that the kernel and the fsck tools stay in
> > +alignment about what constitutes consistent metadata.
> > +During development of the online checking code, fstests was modified to run
> > +``xfs_scrub -n`` between each test to ensure that the new checking code
> > +produces the same results as the two existing fsck tools.
> > +
> > +To start development of online repair, fstests was modified to run
> > +``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
> > +This ensures that offline repair does not crash, leave a corrupt filesystem
> > +after it exits, or trigger complaints from the online check.
> > +This also established a baseline for what can and cannot be repaired offline.
> > +To complete the first phase of development of online repair, fstests was
> > +modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
> > +This enables a comparison of the effectiveness of online repair as compared to
> > +the existing offline repair tools.
> > +
> > +General Fuzz Testing of Metadata Blocks
> > +---------------------------------------
> > +
> > +XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
> > +
> > +Before development of online fsck even began, a set of fstests were created
> > +to test the rather common fault that entire metadata blocks get corrupted.
> > +This required the creation of fstests library code that can create a filesystem
> > +containing every possible type of metadata object.
> > +Next, individual test cases were created to create a test filesystem, identify
> > +a single block of a specific type of metadata object, trash it with the
> > +existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
> > +particular metadata validation strategy.
> > +
> > +This earlier test suite enabled XFS developers to test the ability of the
> > +in-kernel validation functions and the ability of the offline fsck tool to
> > +detect and eliminate the inconsistent metadata.
> > +This part of the test suite was extended to cover online fsck in exactly the
> > +same manner.
> > +
> > +In other words, for a given fstests filesystem configuration:
> > +
> > +* For each metadata object existing on the filesystem:
> > +
> > +  * Write garbage to it
> > +
> > +  * Test the reactions of:
> > +
> > +    1. The kernel verifiers to stop obviously bad metadata
> > +    2. Offline repair (``xfs_repair``) to detect and fix
> > +    3. Online repair (``xfs_scrub``) to detect and fix
> > +
> > +Targeted Fuzz Testing of Metadata Records
> > +-----------------------------------------
> > +
> > +A quick conversation with the other XFS developers revealed that the existing
> > +test infrastructure could be extended to provide
>
> "The testing plan for ofsck includes extending the existing test
> infrastructure to provide..."
>
> Took me a moment to notice we're not talking about history any more....

Ah.  Sorry about that.  The sentence now reads:

"The testing plan for online fsck includes extending the existing fs
testing infrastructure to provide a much more powerful facility:
targeted fuzz testing of every metadata field of every metadata object
in the filesystem."

> > a much more powerful
> > +facility: targeted fuzz testing of every metadata field of every metadata
> > +object in the filesystem.
> > +``xfs_db`` can modify every field of every metadata structure in every
> > +block in the filesystem to simulate the effects of memory corruption and
> > +software bugs.
> > +Given that fstests already contains the ability to create a filesystem
> > +containing every metadata format known to the filesystem, ``xfs_db`` can be
> > +used to perform exhaustive fuzz testing!
> > +
> > +For a given fstests filesystem configuration:
> > +
> > +* For each metadata object existing on the filesystem...
> > +
> > +  * For each record inside that metadata object...
> > +
> > +    * For each field inside that record...
> > +
> > +      * For each conceivable type of transformation that can be applied to a bit field...
> > +
> > +        1. Clear all bits
> > +        2. Set all bits
> > +        3. Toggle the most significant bit
> > +        4. Toggle the middle bit
> > +        5. Toggle the least significant bit
> > +        6. Add a small quantity
> > +        7. Subtract a small quantity
> > +        8. Randomize the contents
> > +
> > +        * ...test the reactions of:
> > +
> > +          1. The kernel verifiers to stop obviously bad metadata
> > +          2. Offline checking (``xfs_repair -n``)
> > +          3. Offline repair (``xfs_repair``)
> > +          4. Online checking (``xfs_scrub -n``)
> > +          5. Online repair (``xfs_scrub``)
> > +          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
>
> I like the indented bullet list format tho

Thanks!  I'm pleased that ... whatever renders this stuff ... actually
supports nested lists.

> > +
> > +This is quite the combinatoric explosion!
> > +
> > +Fortunately, having this much test coverage makes it easy for XFS developers to
> > +check the responses of XFS' fsck tools.
> > +Since the introduction of the fuzz testing framework, these tests have been
> > +used to discover incorrect repair code and missing functionality for entire
> > +classes of metadata objects in ``xfs_repair``.
> > +The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
> > +confirming that ``xfs_repair`` could detect at least as many corruptions as
> > +the older tool.
> > +
> > +These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
> > +allow the online fsck developers to compare online fsck against offline fsck,
> > +and they enable XFS developers to find deficiencies in the code base.
> > +
> > +Proposed patchsets include
> > +`general fuzzer improvements
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
> > +`fuzzing baselines
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
> > +and `improvements in fuzz testing comprehensiveness
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
> > +
> > +Stress Testing
> > +--------------
> > +
> > +A unique requirement to online fsck is the ability to operate on a filesystem
> > +concurrently with regular workloads.
> > +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> > +impact on the running system, the online repair code should never introduce
> > +inconsistencies into the filesystem metadata, and regular workloads should
> > +never notice resource starvation.
> > +To verify that these conditions are being met, fstests has been enhanced in
> > +the following ways:
> > +
> > +* For each scrub item type, create a test to exercise checking that item type
> > +  while running ``fsstress``.
> > +* For each scrub item type, create a test to exercise repairing that item type
> > +  while running ``fsstress``.
> > +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> > +  filesystem doesn't cause problems.
> > +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> > +  force-repairing the whole filesystem doesn't cause problems.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  freezing and thawing the filesystem.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  remounting the filesystem read-only and read-write.
> > +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
> > +
> > +Success is defined by the ability to run all of these tests without observing
> > +any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
> > +check warnings, or any other sort of mischief.
>
> Seems reasonable.  Other than the one nit, I think this section reads
> pretty well.
> Reviewed-by: Allison Henderson <allison.henderson@xxxxxxxxxx>

Woo!

--D

> Allison
> > +
> > +Proposed patchsets include `general stress testing
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
> > +and the `evolution of existing per-function stress testing
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
> >
>
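
For readers who want something concrete, the eight bit-field transformations listed under "Targeted Fuzz Testing of Metadata Records" could be sketched roughly as below. This is an illustration only, not how ``xfs_db`` or fstests implement them: the function name ``fuzz_variants``, the dictionary keys, and the default "small quantity" of 3 are all assumptions made for this example, and the field is modelled as a plain unsigned integer of a given bit width.

```python
import random


def fuzz_variants(value, width, small=3, rng=None):
    """Return the eight fuzzed values for one unsigned field.

    value: current field contents as a non-negative int
    width: field width in bits (must be >= 1)
    small: hypothetical "small quantity" to add or subtract
    rng:   optional random.Random, so runs can be made repeatable
    """
    if width < 1:
        raise ValueError("width must be at least 1 bit")
    rng = rng or random.Random()
    mask = (1 << width) - 1  # all bits of the field set
    return {
        "clear_all": 0,
        "set_all": mask,
        "toggle_msb": value ^ (1 << (width - 1)),
        "toggle_middle": value ^ (1 << (width // 2)),
        "toggle_lsb": value ^ 1,
        # Arithmetic wraps around within the field width.
        "add_small": (value + small) & mask,
        "sub_small": (value - small) & mask,
        "randomize": rng.getrandbits(width),
    }


def count_cases(num_fields, num_reactions=6, num_transforms=8):
    """Number of (field, transformation, reaction) combinations to test."""
    return num_fields * num_transforms * num_reactions
```

With eight transformations and six reactions per field, ``count_cases(1000)`` already comes to 48000 individual checks, which is the combinatoric explosion the text mentions.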