On Fri, Mar 03, 2023 at 11:51:02PM +0000, Allison Henderson wrote: > On Wed, 2023-03-01 at 16:27 -0800, Darrick J. Wong wrote: > > On Wed, Mar 01, 2023 at 05:36:59AM +0000, Allison Henderson wrote: > > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > > > Add the sixth chapter of the online fsck design documentation, > > > > where > > > > we discuss the details of the data structures and algorithms used > > > > by > > > > the > > > > driver program xfs_scrub. > > > > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > --- > > > > .../filesystems/xfs-online-fsck-design.rst | 313 > > > > ++++++++++++++++++++ > > > > 1 file changed, 313 insertions(+) > > > > > > > > > > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst > > > > b/Documentation/filesystems/xfs-online-fsck-design.rst > > > > index 2e20314f1831..05b9411fac7f 100644 > > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst > > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst > > > > @@ -300,6 +300,9 @@ The seven phases are as follows: > > > > 7. Re-check the summary counters and presents the caller with a > > > > summary of > > > > space usage and file counts. > > > > > > > > +This allocation of responsibilities will be :ref:`revisited > > > > <scrubcheck>` > > > > +later in this document. > > > > + > > > > Steps for Each Scrub Item > > > > ------------------------- > > > > > > > > @@ -4505,3 +4508,313 @@ The proposed patches are in the > > > > `orphanage adoption > > > > < > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > > log/?h=repair-orphanage>`_ > > > > series. > > > > + > > > > +6. Userspace Algorithms and Data Structures > > > > +=========================================== > > > > + > > > > +This section discusses the key algorithms and data structures of > > > > the > > > > userspace > > > > +program, ``xfs_scrub``, that provide the ability to drive > > > > metadata > > > > checks and > > > > +repairs in the kernel, verify file data, and look for other > > > > potential problems. > > > > + > > > > +.. _scrubcheck: > > > > + > > > > +Checking Metadata > > > > +----------------- > > > > + > > > > +Recall the :ref:`phases of fsck work<scrubphases>` outlined > > > > earlier. > > > > +That structure follows naturally from the data dependencies > > > > designed > > > > into the > > > > +filesystem from its beginnings in 1993. > > > > +In XFS, there are several groups of metadata dependencies: > > > > + > > > > +a. Filesystem summary counts depend on consistency within the > > > > inode > > > > indices, > > > > + the allocation group space btrees, and the realtime volume > > > > space > > > > + information. > > > > + > > > > +b. Quota resource counts depend on consistency within the quota > > > > file > > > > data > > > > + forks, inode indices, inode records, and the forks of every > > > > file > > > > on the > > > > + system. > > > > + > > > > +c. The naming hierarchy depends on consistency within the > > > > directory > > > > and > > > > + extended attribute structures. > > > > + This includes file link counts. > > > > + > > > > +d. Directories, extended attributes, and file data depend on > > > > consistency within > > > > + the file forks that map directory and extended attribute data > > > > to > > > > physical > > > > + storage media. > > > > + > > > > +e. 
The file forks depend on consistency within inode records > > > > and > > > > the space > > > > + metadata indices of the allocation groups and the realtime > > > > volume. > > > > + This includes quota and realtime metadata files. > > > > + > > > > +f. Inode records depend on consistency within the inode > > > > metadata > > > > indices. > > > > + > > > > +g. Realtime space metadata depend on the inode records and data > > > > forks of the > > > > + realtime metadata inodes. > > > > + > > > > +h. The allocation group metadata indices (free space, inodes, > > > > reference count, > > > > + and reverse mapping btrees) depend on consistency within the > > > > AG > > > > headers and > > > > + between all the AG metadata btrees. > > > > + > > > > +i. ``xfs_scrub`` depends on the filesystem being mounted and > > > > kernel > > > > support > > > > + for online fsck functionality. > > > > + > > > > +Therefore, a metadata dependency graph is a convenient way to > > > > schedule checking > > > > +operations in the ``xfs_scrub`` program: > > > > + > > > > +- Phase 1 checks that the provided path maps to an XFS > > > > filesystem > > > > and detects > > > > + the kernel's scrubbing abilities, which validates group (i). > > > > + > > > > +- Phase 2 scrubs groups (g) and (h) in parallel using a threaded > > > > workqueue. > > > > + > > > > +- Phase 3 checks groups (f), (e), and (d), in that order. > > > > + These groups are all file metadata, which means that inodes > > > > are > > > > scanned in > > > > + parallel. > > > ...When things are done in order, then they are done in serial > > > right? > > > Things done in parallel are done at the same time. Either the > > > phrase > > > "in that order" needs to go away, or the last line needs to drop > > > > Each inode is processed in parallel, but individual inodes are > > processed > > in f-e-d order. > > > > "Phase 3 scans inodes in parallel. For each inode, groups (f), (e), > > and > > (d) are checked, in that order." > Ohh, ok. Now that I re-read it, it makes sense but let's keep the new > one > > > > + > > > > +- Phase 4 repairs everything in groups (i) through (d) so that > > > > phases 5 and 6 > > > > + may run reliably. > > > > + > > > > +- Phase 5 starts by checking groups (b) and (c) in parallel > > > > before > > > > moving on > > > > + to checking names. > > > > + > > > > +- Phase 6 depends on groups (i) through (b) to find file data > > > > blocks > > > > to verify, > > > > + to read them, and to report which blocks of which files are > > > > affected. > > > > + > > > > +- Phase 7 checks group (a), having validated everything else. > > > > + > > > > +Notice that the data dependencies between groups are enforced by > > > > the > > > > structure > > > > +of the program flow. > > > > + > > > > +Parallel Inode Scans > > > > +-------------------- > > > > + > > > > +An XFS filesystem can easily contain hundreds of millions of > > > > inodes. > > > > +Given that XFS targets installations with large high-performance > > > > storage, > > > > +it is desirable to scrub inodes in parallel to minimize runtime, > > > > particularly > > > > +if the program has been invoked manually from a command line. > > > > +This requires careful scheduling to keep the threads as evenly > > > > loaded as > > > > +possible. > > > > + > > > > +Early iterations of the ``xfs_scrub`` inode scanner naïvely > > > > created > > > > a single > > > > +workqueue and scheduled a single workqueue item per AG. 
> > > > +Each workqueue item walked the inode btree (with > > > > ``XFS_IOC_INUMBERS``) to find > > > > +inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to > > > > gather enough > > > > +information to construct file handles. > > > > +The file handle was then passed to a function to generate scrub > > > > items for each > > > > +metadata object of each inode. > > > > +This simple algorithm leads to thread balancing problems in > > > > phase 3 > > > > if the > > > > +filesystem contains one AG with a few large sparse files and the > > > > rest of the > > > > +AGs contain many smaller files. > > > > +The inode scan dispatch function was not sufficiently granular; > > > > it > > > > should have > > > > +been dispatching at the level of individual inodes, or, to > > > > constrain > > > > memory > > > > +consumption, inode btree records. > > > > + > > > > +Thanks to Dave Chinner, bounded workqueues in userspace enable > > > > ``xfs_scrub`` to > > > > +avoid this problem with ease by adding a second workqueue. > > > > +Just like before, the first workqueue is seeded with one > > > > workqueue > > > > item per AG, > > > > +and it uses INUMBERS to find inode btree chunks. > > > > +The second workqueue, however, is configured with an upper bound > > > > on > > > > the number > > > > +of items that can be waiting to be run. > > > > +Each inode btree chunk found by the first workqueue's workers > > > > is > > > > queued to the > > > > +second workqueue, and it is this second workqueue that queries > > > > BULKSTAT, > > > > +creates a file handle, and passes it to a function to generate > > > > scrub > > > > items for > > > > +each metadata object of each inode. > > > > +If the second workqueue is too full, the workqueue add function > > > > blocks the > > > > +first workqueue's workers until the backlog eases. > > > > +This doesn't completely solve the balancing problem, but reduces > > > > it > > > > enough to > > > > +move on to more pressing issues. > > > > + > > > > +The proposed patchsets are the scrub > > > > +`performance tweaks > > > > +< > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g > > > > it/log/?h=scrub-performance-tweaks>`_ > > > > +and the > > > > +`inode scan rebalance > > > > +< > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g > > > > it/log/?h=scrub-iscan-rebalance>`_ > > > > +series. > > > > + > > > > +.. _scrubrepair: > > > > + > > > > +Scheduling Repairs > > > > +------------------ > > > > + > > > > +During phase 2, corruptions and inconsistencies reported in any > > > > AGI > > > > header or > > > > +inode btree are repaired immediately, because phase 3 relies on > > > > proper > > > > +functioning of the inode indices to find inodes to scan. > > > > +Failed repairs are rescheduled to phase 4. > > > > +Problems reported in any other space metadata are deferred to > > > > phase > > > > 4. > > > > +Optimization opportunities are always deferred to phase 4, no > > > > matter > > > > their > > > > +origin. > > > > + > > > > +During phase 3, corruptions and inconsistencies reported in any > > > > part > > > > of a > > > > +file's metadata are repaired immediately if all space metadata > > > > were > > > > validated > > > > +during phase 2. > > > > +Repairs that fail, or that cannot be attempted immediately, are > > > > scheduled > > > > for phase 4. 
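To make the disposition rules above concrete, the phase 2/3 decision boils down to something like the sketch below. The names here are made up for illustration; the real logic is spread across the phase 2 and phase 3 code in xfsprogs.

/*
 * Rough sketch of the repair disposition rules described above.
 * Function and type names are invented for this example.
 */
#include <stdbool.h>

enum repair_disposition {
	REPAIR_NOW,	/* repair immediately, in this phase */
	REPAIR_PHASE4,	/* defer to the phase 4 repair loop */
};

static enum repair_disposition
phase2_disposition(bool is_optimization, bool is_agi_or_inobt)
{
	/* Optimization opportunities always wait for phase 4. */
	if (is_optimization)
		return REPAIR_PHASE4;

	/*
	 * Phase 3 cannot find inodes to scan without working inode
	 * indices, so AGI and inode btree problems are fixed right away.
	 */
	if (is_agi_or_inobt)
		return REPAIR_NOW;

	/* All other space metadata problems wait for phase 4. */
	return REPAIR_PHASE4;
}
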
> > > > + > > > > +In the original design of ``xfs_scrub``, it was thought that > > > > repairs > > > > would be > > > > +so infrequent that the ``struct xfs_scrub_metadata`` objects > > > > used to > > > > +communicate with the kernel could also be used as the primary > > > > object > > > > to > > > > +schedule repairs. > > > > +With recent increases in the number of optimizations possible > > > > for a > > > > given > > > > +filesystem object, it became much more memory-efficient to track > > > > all > > > > eligible > > > > +repairs for that object with a single repair item. > > > > +Each repair item represents a single lockable object -- AGs, > > > > metadata files, > > > > +individual inodes, or a class of summary information. > > > > + > > > > +Phase 4 is responsible for scheduling a lot of repair work as > > > > quickly as > > > > +is practical. > > > > +The :ref:`data dependencies <scrubcheck>` outlined earlier still > > > > apply, which > > > > +means that ``xfs_scrub`` must try to complete the repair work > > > > scheduled by > > > > +phase 2 before trying repair work scheduled by phase 3. > > > > +The repair process is as follows: > > > > + > > > > +1. Start a round of repair with a workqueue and enough workers > > > > to > > > > keep the CPUs > > > > + as busy as the user desires. > > > > + > > > > + a. For each repair item queued by phase 2, > > > > + > > > > + i. Ask the kernel to repair everything listed in the > > > > repair > > > > item for a > > > > + given filesystem object. > > > > + > > > > + ii. Make a note if the kernel made any progress in > > > > reducing > > > > the number > > > > + of repairs needed for this object. > > > > + > > > > + iii. If the object no longer requires repairs, revalidate > > > > all > > > > metadata > > > > + associated with this object. > > > > + If the revalidation succeeds, drop the repair item. > > > > + If not, requeue the item for more repairs. > > > > + > > > > + b. If any repairs were made, jump back to 1a to retry all the > > > > phase 2 items. > > > > + > > > > + c. For each repair item queued by phase 3, > > > > + > > > > + i. Ask the kernel to repair everything listed in the > > > > repair > > > > item for a > > > > + given filesystem object. > > > > + > > > > + ii. Make a note if the kernel made any progress in > > > > reducing > > > > the number > > > > + of repairs needed for this object. > > > > + > > > > + iii. If the object no longer requires repairs, revalidate > > > > all > > > > metadata > > > > + associated with this object. > > > > + If the revalidation succeeds, drop the repair item. > > > > + If not, requeue the item for more repairs. > > > > + > > > > + d. If any repairs were made, jump back to 1c to retry all the > > > > phase 3 items. > > > > + > > > > +2. If step 1 made any repair progress of any kind, jump back to > > > > step > > > > 1 to start > > > > + another round of repair. > > > > + > > > > +3. If there are items left to repair, run them all serially one > > > > more > > > > time. > > > > + Complain if the repairs were not successful, since this is > > > > the > > > > last chance > > > > + to repair anything. > > > > + > > > > +Corruptions and inconsistencies encountered during phases 5 and > > > > 7 > > > > are repaired > > > > +immediately. > > > > +Corrupt file data blocks reported by phase 6 cannot be recovered > > > > by > > > > the > > > > +filesystem. 
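For anyone who finds the numbered steps above easier to follow as code, here is a toy model of the phase 4 retry loop. repair_one_item() and revalidate() are fake stand-ins for the kernel calls, and the real code runs each round on a workqueue rather than inline.

/*
 * Toy model of the phase 4 retry loop described above.  The helpers
 * are fakes so that the example is self-contained and runnable.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct repair_item {
	struct repair_item	*next;
	unsigned int		repairs_needed;
};

/* Pretend the kernel manages to fix one problem per call. */
static unsigned int
repair_one_item(struct repair_item *ri)
{
	return ri->repairs_needed ? 1 : 0;
}

/* Pretend revalidation succeeds once nothing is left to repair. */
static bool
revalidate(struct repair_item *ri)
{
	return ri->repairs_needed == 0;
}

/* One pass over a list of repair items; returns true if anything improved. */
static bool
repair_round(struct repair_item **list)
{
	struct repair_item	**prev = list;
	struct repair_item	*ri;
	bool			progress = false;

	while ((ri = *prev) != NULL) {
		unsigned int	fixed = repair_one_item(ri);

		if (fixed) {
			ri->repairs_needed -= fixed;
			progress = true;
		}
		if (ri->repairs_needed == 0 && revalidate(ri)) {
			*prev = ri->next;	/* drop the repair item */
			continue;
		}
		prev = &ri->next;		/* requeue for another try */
	}
	return progress;
}

int main(void)
{
	struct repair_item	b = { NULL, 2 }, a = { &b, 1 };
	struct repair_item	*phase2_items = &a, *phase3_items = NULL;
	bool			progress;

	do {					/* step 2: repeat rounds */
		progress = false;
		while (repair_round(&phase2_items))	/* steps 1a-1b */
			progress = true;
		while (repair_round(&phase3_items))	/* steps 1c-1d */
			progress = true;
	} while (progress);

	/* step 3: anything still queued would get one last serial try. */
	printf("unfixed items remain? %s\n",
			phase2_items || phase3_items ? "yes" : "no");
	return 0;
}
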
> > > > + > > > > +The proposed patchsets are the > > > > +`repair warning improvements > > > > +< > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g > > > > it/log/?h=scrub-better-repair-warnings>`_, > > > > +refactoring of the > > > > +`repair data dependency > > > > +< > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g > > > > it/log/?h=scrub-repair-data-deps>`_ > > > > +and > > > > +`object tracking > > > > +< > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g > > > > it/log/?h=scrub-object-tracking>`_, > > > > +and the > > > > +`repair scheduling > > > > +< > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g > > > > it/log/?h=scrub-repair-scheduling>`_ > > > > +improvement series. > > > > + > > > > +Checking Names for Confusable Unicode Sequences > > > > +----------------------------------------------- > > > > + > > > > +If ``xfs_scrub`` succeeds in validating the filesystem metadata > > > > by > > > > the end of > > > > +phase 4, it moves on to phase 5, which checks for > > > > suspicious-looking > > > > names in > > > > +the filesystem. > > > > +These names consist of the filesystem label, names in directory > > > > entries, and > > > > +the names of extended attributes. > > > > +Like most Unix filesystems, XFS imposes the sparest of > > > > constraints > > > > on the > > > > +contents of a name -- slashes and null bytes are not allowed in > > > > directory > > > > +entries; and null bytes are not allowed in extended attributes > > > > and > > > maybe say "standard user accessible extended attributes" > > > > "userspace visible"? > That's fine, mostly I meant to exclude parent pointers, but I've seen > other ideas that talk about using xattrs to store binary metadata, so > pptrs may not be the last to do this. Yeah. I think Andrey's fsverity mechanism is preparing to store Merkle tree data in the format: (Merkle tree block number) -> (pile of hashes or whatever) So there's more coming. :) --D > > > > I'll list-ify this too: > > > > Like most Unix filesystems, XFS imposes the sparest of constraints on > > the contents of a name: > > > > - slashes and null bytes are not allowed in directory entries; > > > > - null bytes are not allowed in userspace-visible extended > > attributes; > > > > - null bytes are not allowed in the filesystem label > Ok, I think that works > > > > > > > the > > > > +filesystem label. > > > > +Directory entries and attribute keys store the length of the > > > > name > > > > explicitly > > > > +ondisk, which means that nulls are not name terminators. > > > > +For this section, the term "naming domain" refers to any place > > > > where > > > > names are > > > > +presented together -- all the names in a directory, or all the > > > > attributes of a > > > > +file. > > > > + > > > > +Although the Unix naming constraints are very permissive, the > > > > reality of most > > > > +modern-day Linux systems is that programs work with Unicode > > > > character code > > > > +points to support international languages. > > > > +These programs typically encode those code points in UTF-8 when > > > > interfacing > > > > +with the C library because the kernel expects null-terminated > > > > names. > > > > +In the common case, therefore, names found in an XFS filesystem > > > > are > > > > actually > > > > +UTF-8 encoded Unicode data. 
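In code, the byte-level rules above amount to checks like the following sketch. Both helpers are invented for illustration, and the UTF-8 scan deliberately accepts a superset of well-formed UTF-8 (no overlong or surrogate rejection) to stay short.

/*
 * Toy byte-level name checks for the constraints above.  Ondisk names
 * carry an explicit length, so embedded nulls are representable and
 * must be checked for explicitly.
 */
#include <stdbool.h>
#include <stddef.h>

static bool
dirent_name_ok(const unsigned char *name, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (name[i] == '/' || name[i] == '\0')
			return false;
	return true;
}

/* Loose well-formedness scan for UTF-8 encoded names. */
static bool
name_is_utf8ish(const unsigned char *name, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned int c = name[i], nr;

		if (c < 0x80)
			nr = 0;			/* ASCII */
		else if (c >= 0xc2 && c <= 0xdf)
			nr = 1;			/* two-byte sequence */
		else if (c >= 0xe0 && c <= 0xef)
			nr = 2;			/* three-byte sequence */
		else if (c >= 0xf0 && c <= 0xf4)
			nr = 3;			/* four-byte sequence */
		else
			return false;		/* stray continuation byte */

		if (i + nr >= len)
			return false;		/* truncated sequence */
		for (unsigned int j = 1; j <= nr; j++)
			if ((name[i + j] & 0xc0) != 0x80)
				return false;	/* bad continuation byte */
		i += nr + 1;
	}
	return true;
}
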
> > > > + > > > > +To maximize its expressiveness, the Unicode standard defines > > > > separate code > > > > +points for various characters that render similarly or > > > > identically > > > > in writing > > > > +systems around the world. > > > > +For example, the character "Cyrillic Small Letter A" U+0430 "а" > > > > often renders > > > > +identically to "Latin Small Letter A" U+0061 "a". > > > > + > > > > +The standard also permits characters to be constructed in > > > > multiple > > > > ways -- > > > > +either by using a defined code point, or by combining one code > > > > point > > > > with > > > > +various combining marks. > > > > +For example, the character "Angstrom Sign" U+212B "Å" can also be > > > > expressed > > > > +as "Latin Capital Letter A" U+0041 "A" followed by "Combining > > > > Ring > > > > Above" > > > > +U+030A "◌̊". > > > > +Both sequences render identically. > > > > + > > > > +Like the standards that preceded it, Unicode also defines > > > > various > > > > control > > > > +characters to alter the presentation of text. > > > > +For example, the character "Right-to-Left Override" U+202E can > > > > trick > > > > some > > > > +programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as > > > > "mootxt.png". > > > > +A second category of rendering problems involves whitespace > > > > characters. > > > > +If the character "Zero Width Space" U+200B is encountered in a > > > > file > > > > name, the > > > > +name will render identically to a name that does not have the > > > > zero > > > > width > > > > +space. > > > > + > > > > +If two names within a naming domain have different byte > > > > sequences > > > > but render > > > > +identically, a user may be confused by it. > > > > +The kernel, in its indifference to upper level encoding schemes, > > > > permits this. > > > > +Most filesystem drivers persist the byte sequence names that are > > > > given to them > > > > +by the VFS. > > > > + > > > > +Techniques for detecting confusable names are explained in great > > > > detail in > > > > +sections 4 and 5 of the > > > > +`Unicode Security Mechanisms > > > > <https://unicode.org/reports/tr39/>`_ > > > > +document. > > > I don't know that we need this much detail on character rendering. > > > I > > > think the example above is enough to make the point that character > > > strings can differ in binary, but render the same, so we need to > > > deal > > > with that. So I think that's really all the justification we need > > > for > > > the NFD usage > > > > I want to leave the link in, because TR39 is the canonical source for > > information about confusability detection. That is the location > > where > > the Unicode folks publish everything they currently know on the > > topic. > > Sure, maybe just keep the last line then. > > Allison > > > > > > > +``xfs_scrub``, when it detects UTF-8 encoding in use on a > > > > system, > > > > uses the > > > When ``xfs_scrub`` detects UTF-8 encoding, it uses the... > > > > Changed, thanks. > > > > > > +Unicode normalization form NFD in conjunction with the > > > > confusable > > > > name > > > > +detection component of > > > > +`libicu <https://github.com/unicode-org/icu>`_ > > > > +to identify names within a directory or within a file's extended > > > > attributes that > > > > +could be confused for each other. > > > > +Names are also checked for control characters, non-rendering > > > > characters, and > > > > +mixing of bidirectional characters. 
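For anyone who wants to play with the detector, the pairwise confusability check built on libicu boils down to something like the sketch below, using the document's own Cyrillic example. This is a simplification: the real phase 5 checker also normalizes names to NFD and handles the control and bidirectional cases just mentioned.

/*
 * Minimal pairwise confusable-name check using libicu's spoof checker.
 * Build with something like:
 *   cc demo.c $(pkg-config --cflags --libs icu-i18n icu-uc)
 */
#include <stdio.h>
#include <string.h>
#include <unicode/uspoof.h>

int main(void)
{
	UErrorCode	err = U_ZERO_ERROR;
	USpoofChecker	*sc = uspoof_open(&err);
	/* "paypal", then the same with Cyrillic Small Letter A (U+0430) */
	const char	*latin = "paypal";
	const char	*mixed = "p\xd0\xb0yp\xd0\xb0l";
	int32_t		badness;

	if (U_FAILURE(err))
		return 1;

	badness = uspoof_areConfusableUTF8(sc, latin, (int32_t)strlen(latin),
					   mixed, (int32_t)strlen(mixed), &err);
	if (U_SUCCESS(err) && badness != 0)
		printf("'%s' and '%s' are confusable (0x%x)\n",
				latin, mixed, (unsigned)badness);

	uspoof_close(sc);
	return 0;
}
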
> > > > +All of these potential issues are reported to the system > > > > administrator during > > > > +phase 5. > > > > + > > > > +Media Verification of File Data Extents > > > > +--------------------------------------- > > > > + > > > > +The system administrator can elect to initiate a media scan of > > > > all > > > > file data > > > > +blocks. > > > > +This scan runs after validation of all filesystem metadata (except > > > > for > > > > the summary > > > > +counters) as phase 6. > > > > +The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the > > > > filesystem space map > > > > +to find areas that are allocated to file data fork extents. > > > > +Gaps between data fork extents that are smaller than 64k are > > > > treated as if > > > > +they were data fork extents to reduce the command setup > > > > overhead. > > > > +When the space map scan accumulates a region larger than 32MB, a > > > > media > > > > +verification request is sent to the disk as a directio read of > > > > the > > > > raw block > > > > +device. > > > > + > > > > +If the verification read fails, ``xfs_scrub`` retries with > > > > single-block > > > > reads > > > > +to narrow down the failure to the specific region of the media, > > > > which is > > > > then recorded. > > > > +When it has finished issuing verification requests, it again > > > > uses > > > > the space > > > > +mapping ioctl to map the recorded media errors back to metadata > > > > structures > > > > +and report what has been lost. > > > > +For media errors in blocks owned by files, the lack of parent > > > > pointers means > > > > +that the entire filesystem must be walked to report the file > > > > paths > > > > and offsets > > > > +corresponding to the media error. > > > > > > > This last bit will need to be updated after we come to a decision > > > with > > > the rfc > > > > I'll at least update it since this doc is now pretty deep into the > > pptrs > > stuff: > > > > "For media errors in blocks owned by files, parent pointers can be > > used > > to construct file paths from inode numbers for user-friendly > > reporting." > > > > > Other than that, I think it looks pretty good. > > > > Woot. > > > > --D > > > > > Allison > > > >