Hi everyone,

As I've mentioned several times throughout 2022, I would like to merge
the online fsck feature in time for the 2023 LTS kernel. This is the
second part of that effort.

This deluge contains all of the online repair kernel code, a significant
amount of restructuring of how repairs work in the userspace driver
program, and a ton of fstests updates to provide automated fuzz testing
and stress testing of forced repairs.

Within the kernel section, the major pieces are the use of tmpfs files
to provide pageable kernel memory for staging repair information;
lightweight hooks into the main xfs filesystem for scrub via jump
labels; coordinated inode scans for live index construction; and the
atomic file mapping swap feature.

Changes to the userspace driver program fall into three main categories:
restructuring how repairs are scheduled so that they're tracked by inode
or AG; establishing data dependency chains so that we scan and repair
things in the correct order; and reworking the systemd background
services to be more secure, enable periodic media scans, and provide
some semblance of fs corruption reporting.

The fstests changes are a substantial reworking of the fuzzing code to
fit the testing described in the design documentation; adding stress
testing of online repairs vs. fsstress; and functional tests for all the
new features that ride in with online repair.

For this review, I would like people to focus on the following:

- Are the major subsystems sufficiently documented that you could figure
  out what the code does?

- Do you see any problems that are severe enough to cause long term
  support hassles? (e.g. bad API design, writing weird metadata to disk)

- Can you spot mis-interactions between the subsystems?

- What were my blind spots in devising this feature?

- Are there missing pieces that you'd like to help build?

- Can I just merge all of this?

The one thing that is /not/ in scope for this review is requests for
more refactoring of existing subsystems.
While there are usually valid arguments for performing such cleanups,
those are separate tasks to be prioritized separately. I will get to
them after merging online fsck, because revising existing subsystems
generally involves rebasing work in this patchset, which means the
affected patches need re-reviewing. Unless it's absolutely necessary,
this just creates more work for everybody.

I've been running daily online **repairs** of every computer I own for
the last eight months. All modifications so far have been to optimize
data structures (holes in the xattr structures, excessively large rmap
btrees, and bugs in quota resource counter updates). So far, no damage
has resulted from these operations. All issues observed in that time
have been corrected in this submission.

Fuzz and stress testing of online repairs have been running well for a
year now. As of this writing, online repair can fix slightly more things
than offline repair, and the fsstress+repair long soak test has passed
100 million repairs with zero problems observed. (For comparison, the
long soak fsx test recently passed 92 billion file operations, so online
fsck has a ways to go...)

As a warning, the patches will likely take several days to trickle in.

While everyone else looks at this, I plan to prototype directory tree
reconstruction with Allison's parent pointers v27 patchset. Having a
user of that functionality is (I think) the last major hurdle to
ensuring that parent pointers are a good fit for the problems that need
solving, which in turn is the last requirement for merging that feature.

--D