Re: [PATCH 05/14] xfs: document the filesystem metadata checking strategy

On Sat, Jan 21, 2023 at 01:38:33AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > Begin the fifth chapter of the online fsck design documentation,
> > where
> > we discuss the details of the data structures and algorithms used by
> > the
> > kernel to examine filesystem metadata and cross-reference it around
> > the
> > filesystem.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  579
> > ++++++++++++++++++++
> >  .../filesystems/xfs-self-describing-metadata.rst   |    1 
> >  2 files changed, 580 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 42e82971e036..f45bf97fa9c4 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -864,3 +864,582 @@ Proposed patchsets include
> >  and
> >  `preservation of sickness info during memory reclaim
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=indirect-health-reporting>`_.
> > +
> > +5. Kernel Algorithms and Data Structures
> > +========================================
> > +
> > +This section discusses the key algorithms and data structures of the
> > kernel
> > +code that provide the ability to check and repair metadata while the
> > system
> > +is running.
> > +The first chapters in this section reveal the pieces that provide
> > the
> > +foundation for checking metadata.
> > +The remainder of this section presents the mechanisms through which
> > XFS
> > +regenerates itself.
> > +
> > +Self Describing Metadata
> > +------------------------
> > +
> > +Starting with XFS version 5 in 2012, XFS updated the format of
> > nearly every
> > +ondisk block header to record a magic number, a checksum, a
> > universally
> > +"unique" identifier (UUID), an owner code, the ondisk address of the
> > block,
> > +and a log sequence number.
> > +When loading a block buffer from disk, the magic number, UUID,
> > owner, and
> > +ondisk address confirm that the retrieved block matches the specific
> > owner of
> > +the current filesystem, and that the information contained in the
> > block is
> > +supposed to be found at the ondisk address.
> > +The first three components enable checking tools to disregard
> > alleged metadata
> > +that doesn't belong to the filesystem, and the fourth component
> > enables the
> > +filesystem to detect lost writes.
> Add...
> 
> "When ever a file system operation modifies a block, the change is
> submitted to the journal as a transaction.  The journal then processes
> these transactions marking them done once they are safely committed to
> the disk"

Ok, I'll add that transition.  Though I'll s/journal/log/ since this is
xfs. :)

> At this point we haven't talked much at all about transactions or logs,
> and we've just barely begun to cover blocks.  I think you at least want
> a quick blip to describe the relation of these two things, or it may
> not be clear why we suddenly jumped into logs.

Point taken.  Thanks for the suggestion.

> > +
> > +The logging code maintains the checksum and the log sequence number
> > of the last
> > +transactional update.
> > +Checksums are useful for detecting torn writes and other mischief
> "Checksums (or crc's) are useful for detecting incomplete or torn
> writes as well as other discrepancies..."

Checksums are a general concept, whereas CRCs denote a particular family
of checksums.  The statement would still apply even if we used a
different family of functions (e.g. erasure codes, cryptographic hash
functions) instead of crc32c.

I will, however, avoid the undefined term 'mischief'.  Thanks for the
correction.

"Checksums are useful for detecting torn writes and other discrepancies
that can be introduced between the computer and its storage devices."
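
If a concrete example helps, the general technique looks like this.  It's
a self-contained toy sketch: the block layout is made up and a trivial
additive checksum stands in for the crc32c that XFS actually uses, but the
zero-the-field-then-compare pattern is the same.

#include <stdint.h>
#include <stdio.h>

/* Toy block; real ondisk headers also carry magic, UUID, owner, address,
 * and LSN fields. */
struct toy_block {
        uint32_t        checksum;       /* covers the block, field zeroed */
        char            payload[60];
};

/* Stand-in for crc32c, for illustration only. */
static uint32_t toy_cksum(const void *buf, size_t len)
{
        const uint8_t *p = buf;
        uint32_t sum = 0;

        while (len--)
                sum = sum * 31 + *p++;
        return sum;
}

/* Compute the checksum with the checksum field itself treated as zero. */
static uint32_t block_cksum(const struct toy_block *b)
{
        struct toy_block tmp = *b;

        tmp.checksum = 0;
        return toy_cksum(&tmp, sizeof(tmp));
}

int main(void)
{
        struct toy_block b = { .payload = "hello, metadata" };

        b.checksum = block_cksum(&b);   /* done at write time */
        b.payload[3] ^= 0x40;           /* simulate a torn or stray write */

        if (block_cksum(&b) != b.checksum)
                printf("checksum mismatch: block is corrupt\n");
        return 0;
}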

> > between the
> > +computer and its storage devices.
> > +Sequence number tracking enables log recovery to avoid applying out
> > of date
> > +log updates to the filesystem.
> > +
> > +These two features improve overall runtime resiliency by providing a
> > means for
> > +the filesystem to detect obvious corruption when reading metadata
> > blocks from
> > +disk, but these buffer verifiers cannot provide any consistency
> > checking
> > +between metadata structures.
> > +
> > +For more information, please see the documentation for
> > +Documentation/filesystems/xfs-self-describing-metadata.rst
> > +
> > +Reverse Mapping
> > +---------------
> > +
> > +The original design of XFS (circa 1993) is an improvement upon 1980s
> > Unix
> > +filesystem design.
> > +In those days, storage density was expensive, CPU time was scarce,
> > and
> > +excessive seek time could kill performance.
> > +For performance reasons, filesystem authors were reluctant to add
> > redundancy to
> > +the filesystem, even at the cost of data integrity.
> > +Filesystem designers in the early 21st century chose different
> > strategies to
> > +increase internal redundancy -- either storing nearly identical
> > copies of
> > +metadata, or more space-efficient techniques such as erasure coding.
> "such as erasure coding which may encode sections of the data with
> redundant symbols and in more than one location"
> 
> That ties it into the next line.  If you go on to talk about a term you
> have not previously defined, I think you want to either define it
> quickly or just drop it altogether.  Right now your goal is to just
> give the reader context, so you want it to move quickly.

How about I shorten it to:

"...or more space-efficient encoding techniques." ?

and end the paragraph there?

> > +Obvious corruptions are typically repaired by copying replicas or
> > +reconstructing from codes.
> > +
> I think I would have just jumped straight from xfs history to modern
> xfs...
> > +For XFS, a different redundancy strategy was chosen to modernize the
> > design:
> > +a secondary space usage index that maps allocated disk extents back
> > to their
> > +owners.
> > +By adding a new index, the filesystem retains most of its ability to
> > scale
> > +well to heavily threaded workloads involving large datasets, since
> > the primary
> > +file metadata (the directory tree, the file block map, and the
> > allocation
> > +groups) remain unchanged.
> > 
> 
> > +Although the reverse-mapping feature increases overhead costs for
> > space
> > +mapping activities just like any other system that improves
> > redundancy, it
> "Like any system that improves redundancy, the reverse-mapping feature
> increases overhead costs for space mapping activities. However, it..."

I like this better.  These two sentences have been changed to read:

"Like any system that improves redundancy, the reverse-mapping feature
increases overhead costs for space mapping activities.  However, it has
two critical advantages: first, the reverse index is key to enabling
online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem
shrinking."

> > +has two critical advantages: first, the reverse index is key to
> > enabling online
> > +fsck and other requested functionality such as filesystem
> > reorganization,
> > +better media failure reporting, and shrinking.
> > +Second, the different ondisk storage format of the reverse mapping
> > btree
> > +defeats device-level deduplication, because the filesystem requires
> > real
> > +redundancy.
> > +
> > +A criticism of adding the secondary index is that it does nothing to
> > improve
> > +the robustness of user data storage itself.
> > +This is a valid point, but adding a new index for file data block
> > checksums
> > +increases write amplification and turns data overwrites into copy-
> > writes, which
> > +age the filesystem prematurely.
> > +In keeping with thirty years of precedent, users who want file data
> > integrity
> > +can supply as powerful a solution as they require.
> > +As for metadata, the complexity of adding a new secondary index of
> > space usage
> > +is much less than adding volume management and storage device
> > mirroring to XFS
> > +itself.
> > +Perfection of RAID and volume management is best left to existing
> > layers in
> > +the kernel.
> I think I would cull the entire above paragraph.  rmap, crc and raid
> all have very different points of redundancy, so criticism that an
> apple is not an orange or vice versa just feels like a shortsighted
> comparison that's probably more of a distraction than anything.
> 
> Sometimes it feels like this document kinda gets off into tangents
> like it's preemptively trying to position itself for an argument
> that hasn't happened yet.

It does!  Each of the many tangents that you've pointed out is a
reaction to some discussion that we've had on the list, at an LSF, or
<cough> from fs nerds sniping on social media.  The reason I
capture all of these offtopic arguments is to discourage people from
wasting time rehashing discussions that were settled long ago.

Admittedly, that is a very defensive reaction on my part...

> But I think it has the effect of pulling the
> reader's attention off topic into an argument they never thought to
> consider in the first place.  The topic of this section is to explain
> what rmap is.  So let's stay on topic and finish laying out that
> groundwork first before getting into how it compares to other solutions.

...and you're right to point out that mentioning these things is
distracting and provides fuel to reignite a flamewar.  At the same time,
I think there's value in identifying the roads not taken, and why.

What if I turned these tangents into explicitly labelled sidebars?
Would that help readers who want to stick to the topic?

> > +
> > +The information captured in a reverse space mapping record is as
> > follows:
> > +
> > +.. code-block:: c
> > +
> > +       struct xfs_rmap_irec {
> > +           xfs_agblock_t    rm_startblock;   /* extent start block */
> > +           xfs_extlen_t     rm_blockcount;   /* extent length */
> > +           uint64_t         rm_owner;        /* extent owner */
> > +           uint64_t         rm_offset;       /* offset within the owner */
> > +           unsigned int     rm_flags;        /* state flags */
> > +       };
> > +
> > +The first two fields capture the location and size of the physical
> > space,
> > +in units of filesystem blocks.
> > +The owner field tells scrub which metadata structure or file inode
> > has been
> > +assigned this space.
> > +For space allocated to files, the offset field tells scrub where the
> > space was
> > +mapped within the file fork.
> > +Finally, the flags field provides extra information about the space
> > usage --
> > +is this an attribute fork extent?  A file mapping btree extent?  Or
> > an
> > +unwritten data extent?
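
To make the fields concrete, here's a made-up record using the struct
quoted above (all numbers invented) for a written data extent of a regular
file:

        struct xfs_rmap_irec example = {
                .rm_startblock  = 768,  /* AG-relative start block */
                .rm_blockcount  = 256,  /* 256 blocks = 1MiB at 4KiB/block */
                .rm_owner       = 131,  /* inode number, for file-owned space */
                .rm_offset      = 0,    /* offset within the file fork, in blocks */
                .rm_flags       = 0,    /* not attr fork, bmbt, or unwritten */
        };

An attr fork extent, a bmap btree block, or an unwritten extent would set
the corresponding state flag, and space owned by metadata uses special
owner codes instead of an inode number.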
> > +
> > +Online filesystem checking judges the consistency of each primary
> > metadata
> > +record by comparing its information against all other space indices.
> > +The reverse mapping index plays a key role in the consistency
> > checking process
> > +because it contains a centralized alternate copy of all space
> > allocation
> > +information.
> > +Program runtime and ease of resource acquisition are the only real
> > limits to
> > +what online checking can consult.
> > +For example, a file data extent mapping can be checked against:
> > +
> > +* The absence of an entry in the free space information.
> > +* The absence of an entry in the inode index.
> > +* The absence of an entry in the reference count data if the file is
> > not
> > +  marked as having shared extents.
> > +* The correspondence of an entry in the reverse mapping information.
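
To make that cross-referencing concrete, the decision logic reduces to
something like the sketch below.  The query helpers are stand-ins that
return canned answers so the sketch compiles and runs; the real scrub code
gathers those answers by walking the free space, inode, refcount, and
reverse mapping btrees, which is the expensive part.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in queries with canned answers, for illustration only. */
static bool claimed_as_free_space(uint64_t bno, uint64_t len) { return false; }
static bool claimed_by_inode_chunks(uint64_t bno, uint64_t len) { return false; }
static bool claimed_as_shared(uint64_t bno, uint64_t len) { return false; }
static bool rmap_lists_owner(uint64_t bno, uint64_t len, uint64_t owner) { return true; }

/* Cross-reference one file data extent mapping against the space indices. */
static bool xref_file_data_extent(uint64_t bno, uint64_t len, uint64_t owner,
                                  bool file_shares_extents)
{
        if (claimed_as_free_space(bno, len))
                return false;           /* mapped space must not be free */
        if (claimed_by_inode_chunks(bno, len))
                return false;           /* ...nor part of an inode chunk */
        if (!file_shares_extents && claimed_as_shared(bno, len))
                return false;           /* ...nor unexpectedly shared */
        if (!rmap_lists_owner(bno, len, owner))
                return false;           /* and the rmap must list this owner */
        return true;
}

int main(void)
{
        printf("xref: %s\n",
               xref_file_data_extent(768, 256, 131, false) ? "ok" : "corrupt");
        return 0;
}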
> > +
> > +A key observation here is that only the reverse mapping can provide
> > a positive
> > +affirmation of correctness if the primary metadata is in doubt.
> if any of the above metadata is in doubt...

Fixed.

> > +The checking code for most primary metadata follows a path similar
> > to the
> > +one outlined above.
> > +
> > +A second observation to make about this secondary index is that
> > proving its
> > +consistency with the primary metadata is difficult.
> 
> > +Demonstrating that a given reverse mapping record exactly
> > corresponds to the
> > +primary space metadata involves a full scan of all primary space
> > metadata,
> > +which is very time intensive.
> "But why?" Wonders the reader. Just jump into an example:
> 
> "In order to verify that an rmap extent does not incorrectly over lap
> with another record, we would need a full scan of all the other
> records, which is time intensive."

I want to shorten it even further:

"Validating that reverse mapping records are correct requires a full
scan of all primary space metadata, which is very time intensive."

> 
> ?
> 
> And then the below is a separate observation, right?

Right.

> > +Scanning activity for online fsck can only use non-blocking lock
> > acquisition
> > +primitives if the locking order is not the regular order as used by
> > the rest of
> > +the filesystem.
> Lastly, it should be noted that most file system operations tend to
> lock primary metadata before locking the secondary metadata.

This isn't accurate -- metadata structures don't have separate locks.
So it's not true to say that we lock primary or secondary metadata.

We /can/ say that file operations lock the inode, then the AGI, then the
AGF; or that directory operations lock the parent and child ILOCKs in
inumber order; and that if scrub wants to take locks in any other order,
it can only do that via trylocks and backoff.
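
The trylock-and-backoff pattern itself is generic.  Here's a rough
userspace illustration, with pthreads standing in for the kernel locking
primitives and placeholder lock names -- regular operations take
inode_lock and then agf_lock, so a scanner that already holds agf_lock may
only *try* for the inode lock:

#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t agf_lock = PTHREAD_MUTEX_INITIALIZER;

static bool scan_with_backoff(void)
{
        for (;;) {
                pthread_mutex_lock(&agf_lock);
                if (pthread_mutex_trylock(&inode_lock) == 0) {
                        /* ... do the cross-referencing work ... */
                        pthread_mutex_unlock(&inode_lock);
                        pthread_mutex_unlock(&agf_lock);
                        return true;
                }
                /* Out-of-order acquisition failed; drop everything and retry. */
                pthread_mutex_unlock(&agf_lock);
                sched_yield();
        }
}

int main(void)
{
        return scan_with_backoff() ? 0 : 1;
}

It also shows why forward progress can't be guaranteed when the system is
busy: the trylock can keep losing the race, exactly as the quoted text
below notes.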

> This
> means that scanning operations that acquire the secondary metadata
> first may need to yield the secondary lock to filesystem operations
> that have already acquired the primary lock. 
> 
> ?
> 
> > +This means that forward progress during this part of a scan of the
> > reverse
> > +mapping data cannot be guaranteed if system load is especially
> > heavy.
> > +Therefore, it is not practical for online check to detect reverse
> > mapping
> > +records that lack a counterpart in the primary metadata.
> Such as <quick list / quick example>
> 
> > +Instead, scrub relies on rigorous cross-referencing during the
> > primary space
> > +mapping structure checks.

I've converted this section into a bullet list:

"There are several observations to make about reverse mapping indices:

"1. Reverse mappings can provide a positive affirmation of correctness if
any of the above primary metadata are in doubt.  The checking code for
most primary metadata follows a path similar to the one outlined above.

"2. Proving the consistency of secondary metadata with the primary
metadata is difficult because that requires a full scan of all primary
space metadata, which is very time intensive.  For example, checking a
reverse mapping record for a file extent mapping btree block requires
locking the file and searching the entire btree to confirm the block.
Instead, scrub relies on rigorous cross-referencing during the primary
space mapping structure checks.

"3. Consistency scans must use non-blocking lock acquisition primitives
if the required locking order is not the same order used by regular
filesystem operations.  This means that forward progress during this
part of a scan of the reverse mapping data cannot be guaranteed if
system load is heavy."

> > +
> 
> The below paragraph sounds like a re-cap?
> 
> "So to recap, reverse mappings also...."

Yep.

> > +Reverse mappings also play a key role in reconstruction of primary
> > metadata.
> > +The secondary information is general enough for online repair to
> > synthesize a
> > +complete copy of any primary space management metadata by locking
> > that
> > +resource, querying all reverse mapping indices looking for records
> > matching
> > +the relevant resource, and transforming the mapping into an
> > appropriate format.
> > +The details of how these records are staged, written to disk, and
> > committed
> > +into the filesystem are covered in subsequent sections.
> I also think the section would be ok if you were to trim off this last
> paragraph too.

Hm.  I still want to set up the expectation that there's more to come.
How about a brief two-sentence transition paragraph:

"In summary, reverse mappings play a key role in reconstruction of
primary metadata.  The details of how these records are staged, written
to disk, and committed into the filesystem are covered in subsequent
sections."

> 
> > +
> > +Checking and Cross-Referencing
> > +------------------------------
> > +
> > +The first step of checking a metadata structure is to examine every
> > record
> > +contained within the structure and its relationship with the rest of
> > the
> > +system.
> > +XFS contains multiple layers of checking to try to prevent
> > inconsistent
> > +metadata from wreaking havoc on the system.
> > +Each of these layers contributes information that helps the kernel
> > to make
> > +five decisions about the health of a metadata structure:
> > +
> > +- Is a part of this structure obviously corrupt
> > (``XFS_SCRUB_OFLAG_CORRUPT``) ?
> > +- Is this structure inconsistent with the rest of the system
> > +  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
> > +- Is there so much damage around the filesystem that cross-
> > referencing is not
> > +  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
> > +- Can the structure be optimized to improve performance or reduce
> > the size of
> > +  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
> > +- Does the structure contain data that is not inconsistent but
> > deserves review
> > +  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
> > +
> > +The following sections describe how the metadata scrubbing process
> > works.
> > +
> > +Metadata Buffer Verification
> > +````````````````````````````
> > +
> > +The lowest layer of metadata protection in XFS are the metadata
> > verifiers built
> > +into the buffer cache.
> > +These functions perform inexpensive internal consistency checking of
> > the block
> > +itself, and answer these questions:
> > +
> > +- Does the block belong to this filesystem?
> > +
> > +- Does the block belong to the structure that asked for the read?
> > +  This assumes that metadata blocks only have one owner, which is
> > always true
> > +  in XFS.
> > +
> > +- Is the type of data stored in the block within a reasonable range
> > of what
> > +  scrub is expecting?
> > +
> > +- Does the physical location of the block match the location it was
> > read from?
> > +
> > +- Does the block checksum match the data?
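
For readers who want to picture the shape of these checks, here's a
self-contained toy sketch.  It is not the real buffer verifier code -- the
header layout, magic value, and UUID are all made up -- but it answers the
first four questions above; the checksum comparison from the earlier
sketch would complete the set.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified self-describing block header; real headers vary by type. */
struct toy_hdr {
        uint32_t        magic;          /* identifies the structure type */
        uint8_t         uuid[16];       /* filesystem UUID */
        uint64_t        owner;          /* owning AG number or inode */
        uint64_t        blkno;          /* ondisk address the block belongs at */
};

#define TOY_BTREE_MAGIC 0x42544545u     /* made-up magic number */

static const uint8_t fs_uuid[16] = { 0xde, 0xad, 0xbe, 0xef };

static bool toy_verify_read(const struct toy_hdr *hdr, uint32_t want_magic,
                            uint64_t want_owner, uint64_t daddr)
{
        if (hdr->magic != want_magic)
                return false;           /* wrong structure type */
        if (memcmp(hdr->uuid, fs_uuid, sizeof(fs_uuid)))
                return false;           /* belongs to another filesystem */
        if (hdr->owner != want_owner)
                return false;           /* wrong owner */
        if (hdr->blkno != daddr)
                return false;           /* misplaced (lost/stray) write */
        return true;
}

int main(void)
{
        struct toy_hdr hdr = {
                .magic = TOY_BTREE_MAGIC,
                .uuid  = { 0xde, 0xad, 0xbe, 0xef },
                .owner = 3,             /* AG 3 */
                .blkno = 12345,
        };

        printf("verify: %s\n",
               toy_verify_read(&hdr, TOY_BTREE_MAGIC, 3, 12345) ? "ok" : "bad");
        return 0;
}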
> > +
> > +The scope of the protections here is very limited -- verifiers can
> > only
> > +establish that the filesystem code is reasonably free of gross
> > corruption bugs
> > +and that the storage system is reasonably competent at retrieval.
> > +Corruption problems observed at runtime cause the generation of
> > health reports,
> > +failed system calls, and in the extreme case, filesystem shutdowns
> > if the
> > +corrupt metadata force the cancellation of a dirty transaction.
> > +
> > +Every online fsck scrubbing function is expected to read every
> > ondisk metadata
> > +block of a structure in the course of checking the structure.
> > +Corruption problems observed during a check are immediately reported
> > to
> > +userspace as corruption; during a cross-reference, they are reported
> > as a
> > +failure to cross-reference once the full examination is complete.
> > +Reads satisfied by a buffer already in cache (and hence already
> > verified)
> > +bypass these checks.
> > +
> > +Internal Consistency Checks
> > +```````````````````````````
> > +
> > +The next higher level of metadata protection is the internal record
> "After the buffer cache, the next level of metadata protection is..."

Changed.  I'll do the same to the next section as well.

> > +verification code built into the filesystem.
> 
> > +These checks are split between the buffer verifiers, the in-
> > filesystem users of
> > +the buffer cache, and the scrub code itself, depending on the amount
> > of higher
> > +level context required.
> > +The scope of checking is still internal to the block.
> > +For performance reasons, regular code may skip some of these checks
> > unless
> > +debugging is enabled or a write is about to occur.
> > +Scrub functions, of course, must check all possible problems.
> I'd put this chunk after the list below.
> 
> > +Either way, these higher level checking functions answer these
> > questions:
> Then this becomes:
> "These higher level checking functions..."

Done.

> > +
> > +- Does the type of data stored in the block match what scrub is
> > expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > read?
> > +
> > +- If the block contains records, do the records fit within the
> > block?
> > +
> > +- If the block tracks internal free space information, is it
> > consistent with
> > +  the record areas?
> > +
> > +- Are the records contained inside the block free of obvious
> > corruptions?
> > +
> > +Record checks in this category are more rigorous and more time-
> > intensive.
> > +For example, block pointers and inumbers are checked to ensure that
> > they point
> > +within the dynamically allocated parts of an allocation group and
> > within
> > +the filesystem.
> > +Names are checked for invalid characters, and flags are checked for
> > invalid
> > +combinations.
> > +Other record attributes are checked for sensible values.
> > +Btree records spanning an interval of the btree keyspace are checked
> > for
> > +correct order and lack of mergeability (except for file fork
> > mappings).
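
As an illustration of the order/mergeability check, a run of free space
records pulled from one btree block has to be sorted, non-overlapping, and
non-adjacent.  This is a simplified standalone sketch, not the scrub code,
which does the same thing incrementally as it walks the btree:

#include <stdbool.h>
#include <stdint.h>

struct free_rec {
        uint32_t        startblock;
        uint32_t        blockcount;
};

static bool free_recs_ok(const struct free_rec *recs, unsigned int nr)
{
        unsigned int i;

        for (i = 1; i < nr; i++) {
                uint64_t prev_end = (uint64_t)recs[i - 1].startblock +
                                    recs[i - 1].blockcount;

                if (recs[i].startblock < prev_end)
                        return false;   /* out of order or overlapping */
                if (recs[i].startblock == prev_end)
                        return false;   /* mergeable with the previous record */
        }
        return true;
}

int main(void)
{
        const struct free_rec good[] = { { 10, 5 }, { 20, 8 } };
        const struct free_rec bad[]  = { { 10, 5 }, { 15, 8 } };  /* mergeable */

        return free_recs_ok(good, 2) && !free_recs_ok(bad, 2) ? 0 : 1;
}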
> > +
> > +Validation of Userspace-Controlled Record Attributes
> > +````````````````````````````````````````````````````
> > +
> > +Various pieces of filesystem metadata are directly controlled by
> > userspace.
> > +Because of this nature, validation work cannot be more precise than
> > checking
> > +that a value is within the possible range.
> > +These fields include:
> > +
> > +- Superblock fields controlled by mount options
> > +- Filesystem labels
> > +- File timestamps
> > +- File permissions
> > +- File size
> > +- File flags
> > +- Names present in directory entries, extended attribute keys, and
> > filesystem
> > +  labels
> > +- Extended attribute key namespaces
> > +- Extended attribute values
> > +- File data block contents
> > +- Quota limits
> > +- Quota timer expiration (if resource usage exceeds the soft limit)
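
Because these values are userspace-controlled, the checks really are just
bounds and plausibility tests.  A tiny illustrative sketch (the struct and
the rule are simplified stand-ins, not the ondisk quota format):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_quota_rec {
        uint64_t        soft_limit;     /* 0 means no soft limit */
        uint64_t        usage;
        int64_t         timer;          /* grace expiration, 0 if unset */
};

/* A grace timer only makes sense once usage exceeds the soft limit. */
static bool quota_rec_plausible(const struct toy_quota_rec *q)
{
        if (q->timer != 0 && q->soft_limit != 0 && q->usage <= q->soft_limit)
                return false;
        return true;
}

int main(void)
{
        const struct toy_quota_rec q = {
                .soft_limit = 100, .usage = 50, .timer = 12345,
        };

        if (!quota_rec_plausible(&q))
                printf("warning: grace timer set below the soft limit\n");
        return 0;
}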
> > +
> > +Cross-Referencing Space Metadata
> > +````````````````````````````````
> > +
> > +The next higher level of checking is cross-referencing records
> > between metadata
> 
> I kinda like the list first so that the reader has an idea of what
> these checks are before getting into discussion about them.  It just
> makes it a little more obvious as to why it's "prohibitively expensive"
> or "dependent on the context of the structure" after having just looked
> at it.

<nod>

> The rest looks good from here.

Woot.  Onto the next reply! :)

--D

> Allison
> 
> > +structures.
> > +For regular runtime code, the cost of these checks is considered to
> > be
> > +prohibitively expensive, but as scrub is dedicated to rooting out
> > +inconsistencies, it must pursue all avenues of inquiry.
> > +The exact set of cross-referencing is highly dependent on the
> > context of the
> > +data structure being checked.
> > +
> > +The XFS btree code has keyspace scanning functions that online fsck
> > uses to
> > +cross reference one structure with another.
> > +Specifically, scrub can scan the key space of an index to determine
> > if that
> > +keyspace is fully, sparsely, or not at all mapped to records.
> > +For the reverse mapping btree, it is possible to mask parts of the
> > key for the
> > +purposes of performing a keyspace scan so that scrub can decide if
> > the rmap
> > +btree contains records mapping a certain extent of physical space
> > without the
> > +sparseness of the rest of the rmap keyspace getting in the way.
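
The three-way answer (empty / sparse / full) can be sketched as follows,
with a plain sorted array standing in for a btree cursor; the real code
answers the same question with btree range queries, and masking the rmap
owner fields is the same idea with a coarser key comparison:

#include <stdint.h>

enum keyspace_fill { KEYSPACE_EMPTY, KEYSPACE_SPARSE, KEYSPACE_FULL };

struct rec {
        uint64_t        start;
        uint64_t        len;
};

/* Classify coverage of the inclusive key range [lo, hi] by sorted,
 * non-overlapping records. */
static enum keyspace_fill
scan_keyspace(const struct rec *recs, unsigned int nr, uint64_t lo, uint64_t hi)
{
        uint64_t next = lo;     /* first key not yet proven covered */
        int found = 0;
        unsigned int i;

        for (i = 0; i < nr; i++) {
                uint64_t start = recs[i].start;
                uint64_t end = recs[i].start + recs[i].len;     /* exclusive */

                if (end <= lo || start > hi)
                        continue;               /* outside the query range */
                found = 1;
                if (start > next)
                        return KEYSPACE_SPARSE; /* gap before this record */
                if (end > next)
                        next = end;
        }
        if (!found)
                return KEYSPACE_EMPTY;
        return next > hi ? KEYSPACE_FULL : KEYSPACE_SPARSE;
}

int main(void)
{
        const struct rec recs[] = { { 0, 10 }, { 10, 10 } };

        /* [0,19] is fully mapped; [0,25] only sparsely. */
        return scan_keyspace(recs, 2, 0, 19) == KEYSPACE_FULL &&
               scan_keyspace(recs, 2, 0, 25) == KEYSPACE_SPARSE ? 0 : 1;
}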
> > +
> > +Btree blocks undergo the following checks before cross-referencing:
> > +
> > +- Does the type of data stored in the block match what scrub is
> > expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > read?
> > +
> > +- Do the records fit within the block?
> > +
> > +- Are the records contained inside the block free of obvious
> > corruptions?
> > +
> > +- Are the name hashes in the correct order?
> > +
> > +- Do node pointers within the btree point to valid block addresses
> > for the type
> > +  of btree?
> > +
> > +- Do child pointers point towards the leaves?
> > +
> > +- Do sibling pointers point across the same level?
> > +
> > +- For each node block record, does the record key accurately reflect
> > the contents
> > +  of the child block?
> > +
> > +Space allocation records are cross-referenced as follows:
> > +
> > +1. Any space mentioned by any metadata structure is cross-
> > referenced as
> > +   follows:
> > +
> > +   - Does the reverse mapping index list only the appropriate owner
> > as the
> > +     owner of each block?
> > +
> > +   - Are none of the blocks claimed as free space?
> > +
> > +   - If these aren't file data blocks, are none of the blocks
> > claimed as space
> > +     shared by different owners?
> > +
> > +2. Btree blocks are cross-referenced as follows:
> > +
> > +   - Everything in class 1 above.
> > +
> > +   - If there's a parent node block, do the keys listed for this
> > block match the
> > +     keyspace of this block?
> > +
> > +   - Do the sibling pointers point to valid blocks?  Of the same
> > level?
> > +
> > +   - Do the child pointers point to valid blocks?  Of the next level
> > down?
> > +
> > +3. Free space btree records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Does the reverse mapping index list no owners of this space?
> > +
> > +   - Is this space not claimed by the inode index for inodes?
> > +
> > +   - Is it not mentioned by the reference count index?
> > +
> > +   - Is there a matching record in the other free space btree?
> > +
> > +4. Inode btree records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Is there a matching record in the free inode btree?
> > +
> > +   - Do cleared bits in the holemask correspond with inode clusters?
> > +
> > +   - Do set bits in the freemask correspond with inode records with
> > zero link
> > +     count?
> > +
> > +5. Inode records are cross-referenced as follows:
> > +
> > +   - Everything in class 1.
> > +
> > +   - Do all the fields that summarize information about the file
> > forks actually
> > +     match those forks?
> > +
> > +   - Does each inode with zero link count correspond to a record in
> > the free
> > +     inode btree?
> > +
> > +6. File fork space mapping records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Is this space not mentioned by the inode btrees?
> > +
> > +   - If this is a CoW fork mapping, does it correspond to a CoW
> > entry in the
> > +     reference count btree?
> > +
> > +7. Reference count records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Within the space subkeyspace of the rmap btree (that is to say,
> > all
> > +     records mapped to a particular space extent and ignoring the
> > owner info),
> > +     are there the same number of reverse mapping records for each
> > block as the
> > +     reference count record claims?
> > +
> > +Proposed patchsets are the series to find gaps in
> > +`refcount btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-refcount-gaps>`_,
> > +`inode btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-inobt-gaps>`_, and
> > +`rmap btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-rmapbt-gaps>`_ records;
> > +to find
> > +`mergeable records
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-mergeable-records>`_;
> > +and to
> > +`improve cross referencing with rmap
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-strengthen-rmap-checking>`_
> > +before starting a repair.
> > +
> > +Checking Extended Attributes
> > +````````````````````````````
> > +
> > +Extended attributes implement a key-value store that enable
> > fragments of data
> > +to be attached to any file.
> > +Both the kernel and userspace can access the keys and values,
> > subject to
> > +namespace and privilege restrictions.
> > +Most typically these fragments are metadata about the file --
> > origins, security
> > +contexts, user-supplied labels, indexing information, etc.
> > +
> > +Names can be as long as 255 bytes and can exist in several different
> > +namespaces.
> > +Values can be as large as 64KB.
> > +A file's extended attributes are stored in blocks mapped by the attr
> > fork.
> > +The mappings point to leaf blocks, remote value blocks, or dabtree
> > blocks.
> > +Block 0 in the attribute fork is always the top of the structure,
> > but otherwise
> > +each of the three types of blocks can be found at any offset in the
> > attr fork.
> > +Leaf blocks contain attribute key records that point to the name and
> > the value.
> > +Names are always stored elsewhere in the same leaf block.
> > +Values that are less than 3/4 the size of a filesystem block are
> > also stored
> > +elsewhere in the same leaf block.
> > +Remote value blocks contain values that are too large to fit inside
> > a leaf.
> > +If the leaf information exceeds a single filesystem block, a dabtree
> > (also
> > +rooted at block 0) is created to map hashes of the attribute names
> > to leaf
> > +blocks in the attr fork.
> > +
> > +Checking an extended attribute structure is not so straightforward
> > due to the
> > +lack of separation between attr blocks and index blocks.
> > +Scrub must read each block mapped by the attr fork and ignore the
> > non-leaf
> > +blocks:
> > +
> > +1. Walk the dabtree in the attr fork (if present) to ensure that
> > there are no
> > +   irregularities in the blocks or dabtree mappings that do not
> > point to
> > +   attr leaf blocks.
> > +
> > +2. Walk the blocks of the attr fork looking for leaf blocks.
> > +   For each entry inside a leaf:
> > +
> > +   a. Validate that the name does not contain invalid characters.
> > +
> > +   b. Read the attr value.
> > +      This performs a named lookup of the attr name to ensure the
> > correctness
> > +      of the dabtree.
> > +      If the value is stored in a remote block, this also validates
> > the
> > +      integrity of the remote value block.
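
Step 2a is the cheap part of that loop; the sort of check involved looks
roughly like this (the length limit is the real 255-byte bound, everything
else is illustrative):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_ATTR_NAME_MAX       255

/* An attr name must be non-empty, bounded, and free of NUL bytes. */
static bool attr_name_valid(const unsigned char *name, size_t len)
{
        size_t i;

        if (len == 0 || len > TOY_ATTR_NAME_MAX)
                return false;
        for (i = 0; i < len; i++)
                if (name[i] == '\0')
                        return false;
        return true;
}

int main(void)
{
        printf("%d\n", attr_name_valid((const unsigned char *)"user.comment", 12));
        return 0;
}

Step 2b is the expensive part, since each lookup exercises the dabtree and
possibly a remote value block.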
> > +
> > +Checking and Cross-Referencing Directories
> > +``````````````````````````````````````````
> > +
> > +The filesystem directory tree is a directed acyclic graph structure,
> > with files
> > +constituting the nodes, and directory entries (dirents) constituting
> > the edges.
> > +Directories are a special type of file containing a set of mappings
> > from a
> > +255-byte sequence (name) to an inumber.
> > +These are called directory entries, or dirents for short.
> > +Each directory file must have exactly one directory pointing to the
> > file.
> > +A root directory points to itself.
> > +Directory entries point to files of any type.
> > +Each non-directory file may have multiple directories point to it.
> > +
> > +In XFS, directories are implemented as a file containing up to three
> > 32GB
> > +partitions.
> > +The first partition contains directory entry data blocks.
> > +Each data block contains variable-sized records associating a user-
> > provided
> > +name with an inumber and, optionally, a file type.
> > +If the directory entry data grows beyond one block, the second
> > partition (which
> > +exists as post-EOF extents) is populated with a block containing
> > free space
> > +information and an index that maps hashes of the dirent names to
> > directory data
> > +blocks in the first partition.
> > +This makes directory name lookups very fast.
> > +If this second partition grows beyond one block, the third partition
> > is
> > +populated with a linear array of free space information for faster
> > +expansions.
> > +If the free space has been separated and the second partition grows
> > again
> > +beyond one block, then a dabtree is used to map hashes of dirent
> > names to
> > +directory data blocks.
> > +
> > +Checking a directory is pretty straightforward:
> > +
> > +1. Walk the dabtree in the second partition (if present) to ensure
> > that there
> > +   are no irregularities in the blocks or dabtree mappings that do
> > not point to
> > +   dirent blocks.
> > +
> > +2. Walk the blocks of the first partition looking for directory
> > entries.
> > +   Each dirent is checked as follows:
> > +
> > +   a. Does the name contain no invalid characters?
> > +
> > +   b. Does the inumber correspond to an actual, allocated inode?
> > +
> > +   c. Does the child inode have a nonzero link count?
> > +
> > +   d. If a file type is included in the dirent, does it match the
> > type of the
> > +      inode?
> > +
> > +   e. If the child is a subdirectory, does the child's dotdot
> > pointer point
> > +      back to the parent?
> > +
> > +   f. If the directory has a second partition, perform a named
> > lookup of the
> > +      dirent name to ensure the correctness of the dabtree.
> > +
> > +3. Walk the free space list in the third partition (if present) to
> > ensure that
> > +   the free spaces it describes are really unused.
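
Here's a compressed sketch of the per-dirent checks in step 2; the record
layouts and the inode summary are stand-ins for the real structures, which
scrub gathers by walking the inode metadata:

#include <stdbool.h>
#include <stdint.h>

#define TOY_NAME_MAX    255

enum toy_ftype { FT_REG, FT_DIR, FT_SYMLINK, FT_UNKNOWN };

struct toy_dirent {
        char            name[TOY_NAME_MAX + 1];
        uint8_t         namelen;
        uint64_t        inumber;
        enum toy_ftype  ftype;
};

struct toy_inode_info {
        bool            allocated;
        uint32_t        nlink;
        enum toy_ftype  ftype;
        uint64_t        dotdot;         /* parent inumber, if a directory */
};

static bool check_dirent(const struct toy_dirent *de,
                         const struct toy_inode_info *child,
                         uint64_t parent_ino)
{
        uint8_t i;

        /* a. no NUL or '/' bytes in the name */
        for (i = 0; i < de->namelen; i++)
                if (de->name[i] == '\0' || de->name[i] == '/')
                        return false;

        /* b + c. the inumber points to an allocated inode with links */
        if (!child->allocated || child->nlink == 0)
                return false;

        /* d. a recorded file type must match the inode */
        if (de->ftype != FT_UNKNOWN && de->ftype != child->ftype)
                return false;

        /* e. a subdirectory's dotdot entry points back at the parent */
        if (child->ftype == FT_DIR && child->dotdot != parent_ino)
                return false;

        return true;
}

int main(void)
{
        const struct toy_dirent de = {
                .name = "src", .namelen = 3, .inumber = 131, .ftype = FT_DIR,
        };
        const struct toy_inode_info child = {
                .allocated = true, .nlink = 2, .ftype = FT_DIR, .dotdot = 128,
        };

        return check_dirent(&de, &child, 128) ? 0 : 1;
}

Step 2f (the named lookup through the dabtree) and step 3 are the costly
parts, for the same reasons as the attr checks above.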
> > +
> > +Checking operations involving :ref:`parents <dirparent>` and
> > +:ref:`file link counts <nlinks>` are discussed in more detail in
> > later
> > +sections.
> > +
> > +Checking Directory/Attribute Btrees
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +As stated in previous sections, the directory/attribute btree
> > (dabtree) index
> > +maps user-provided names to improve lookup times by avoiding linear
> > scans.
> > +Internally, it maps a 32-bit hash of the name to a block offset
> > within the
> > +appropriate file fork.
> > +
> > +The internal structure of a dabtree closely resembles the btrees
> > that record
> > +fixed-size metadata records -- each dabtree block contains a magic
> > number, a
> > +checksum, sibling pointers, a UUID, a tree level, and a log sequence
> > number.
> > +The format of leaf and node records are the same -- each entry
> > points to the
> > +next level down in the hierarchy, with dabtree node records pointing
> > to dabtree
> > +leaf blocks, and dabtree leaf records pointing to non-dabtree blocks
> > elsewhere
> > +in the fork.
> > +
> > +Checking and cross-referencing the dabtree is very similar to what
> > is done for
> > +space btrees:
> > +
> > +- Does the type of data stored in the block match what scrub is
> > expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > read?
> > +
> > +- Do the records fit within the block?
> > +
> > +- Are the records contained inside the block free of obvious
> > corruptions?
> > +
> > +- Are the name hashes in the correct order?
> > +
> > +- Do node pointers within the dabtree point to valid fork offsets
> > for dabtree
> > +  blocks?
> > +
> > +- Do leaf pointers within the dabtree point to valid fork offsets
> > for directory
> > +  or attr leaf blocks?
> > +
> > +- Do child pointers point towards the leaves?
> > +
> > +- Do sibling pointers point across the same level?
> > +
> > +- For each dabtree node record, does the record key accurately reflect
> > the
> > +  contents of the child dabtree block?
> > +
> > +- For each dabtree leaf record, does the record key accurately reflect
> > the
> > +  contents of the directory or attr block?
> > +
> > +Cross-Referencing Summary Counters
> > +``````````````````````````````````
> > +
> > +XFS maintains three classes of summary counters: available
> > resources, quota
> > +resource usage, and file link counts.
> > +
> > +In theory, the amount of available resources (data blocks, inodes,
> > realtime
> > +extents) can be found by walking the entire filesystem.
> > +This would make for very slow reporting, so a transactional
> > filesystem can
> > +maintain summaries of this information in the superblock.
> > +Cross-referencing these values against the filesystem metadata
> > should be a
> > +simple matter of walking the free space and inode metadata in each
> > AG and the
> > +realtime bitmap, but there are complications that will be discussed
> > in
> > +:ref:`more detail <fscounters>` later.
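
Setting the complications aside, the basic cross-reference really is just
a sum-and-compare.  A trivial sketch with invented numbers (the in-flight
reservations that make the real fscounters check interesting are ignored
here):

#include <stdint.h>
#include <stdio.h>

/* Per-AG totals, as gathered by walking each AG's free space metadata. */
struct ag_summary {
        uint64_t        freeblocks;
};

int main(void)
{
        const struct ag_summary ags[] = { { 1000 }, { 800 } };
        const uint64_t sb_fdblocks = 1800;      /* superblock summary counter */
        uint64_t sum = 0;
        unsigned int i;

        for (i = 0; i < 2; i++)
                sum += ags[i].freeblocks;

        if (sum != sb_fdblocks)
                printf("free block summary counter is inconsistent\n");
        return 0;
}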
> > +
> > +:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
> > +checking are sufficiently complicated to warrant separate sections.
> > +
> > +Post-Repair Reverification
> > +``````````````````````````
> > +
> > +After performing a repair, the checking code is run a second time to
> > validate
> > +the new structure, and the results of the health assessment are
> > recorded
> > +internally and returned to the calling process.
> > +This step is critical for enabling system administrators to monitor
> > the status
> > +of the filesystem and the progress of any repairs.
> > +For developers, it is a useful means to judge the efficacy of error
> > detection
> > +and correction in the online and offline checking tools.
> > diff --git a/Documentation/filesystems/xfs-self-describing-
> > metadata.rst b/Documentation/filesystems/xfs-self-describing-
> > metadata.rst
> > index b79dbf36dc94..a10c4ae6955e 100644
> > --- a/Documentation/filesystems/xfs-self-describing-metadata.rst
> > +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
> > @@ -1,4 +1,5 @@
> >  .. SPDX-License-Identifier: GPL-2.0
> > +.. _xfs_self_describing_metadata:
> >  
> >  ============================
> >  XFS Self Describing Metadata
> > 
> 


