On Thu, 2023-02-02 at 11:04 -0800, Darrick J. Wong wrote: > On Sat, Jan 21, 2023 at 01:38:33AM +0000, Allison Henderson wrote: > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > Begin the fifth chapter of the online fsck design documentation, > > > where > > > we discuss the details of the data structures and algorithms used > > > by > > > the > > > kernel to examine filesystem metadata and cross-reference it > > > around > > > the > > > filesystem. > > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > --- > > > .../filesystems/xfs-online-fsck-design.rst | 579 > > > ++++++++++++++++++++ > > > .../filesystems/xfs-self-describing-metadata.rst | 1 > > > 2 files changed, 580 insertions(+) > > > > > > > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst > > > b/Documentation/filesystems/xfs-online-fsck-design.rst > > > index 42e82971e036..f45bf97fa9c4 100644 > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst > > > @@ -864,3 +864,582 @@ Proposed patchsets include > > > and > > > `preservation of sickness info during memory reclaim > > > < > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=indirect-health-reporting>`_. > > > + > > > +5. Kernel Algorithms and Data Structures > > > +======================================== > > > + > > > +This section discusses the key algorithms and data structures of > > > the > > > kernel > > > +code that provide the ability to check and repair metadata while > > > the > > > system > > > +is running. > > > +The first chapters in this section reveal the pieces that > > > provide > > > the > > > +foundation for checking metadata. > > > +The remainder of this section presents the mechanisms through > > > which > > > XFS > > > +regenerates itself. > > > + > > > +Self Describing Metadata > > > +------------------------ > > > + > > > +Starting with XFS version 5 in 2012, XFS updated the format of > > > nearly every > > > +ondisk block header to record a magic number, a checksum, a > > > universally > > > +"unique" identifier (UUID), an owner code, the ondisk address of > > > the > > > block, > > > +and a log sequence number. > > > +When loading a block buffer from disk, the magic number, UUID, > > > owner, and > > > +ondisk address confirm that the retrieved block matches the > > > specific > > > owner of > > > +the current filesystem, and that the information contained in > > > the > > > block is > > > +supposed to be found at the ondisk address. > > > +The first three components enable checking tools to disregard > > > alleged metadata > > > +that doesn't belong to the filesystem, and the fourth component > > > enables the > > > +filesystem to detect lost writes. > > Add... > > > > "Whenever a file system operation modifies a block, the change is > > submitted to the journal as a transaction. The journal then > > processes > > these transactions, marking them done once they are safely committed > > to > > the disk" > > Ok, I'll add that transition. Though I'll s/journal/log/ since this > is > xfs. :) > > > At this point we haven't talked much at all about transactions or > > logs, > > and we've just barely begun to cover blocks. I think you at least > > want > > a quick blip to describe the relation of these two things, or it > > may > > not be clear why we suddenly jumped into logs. > > Point taken. Thanks for the suggestion.
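Since this section keeps referring to the same handful of header fields, it might help to show them gathered in one place. The struct below is illustrative only -- the real ondisk structures (struct xfs_agf, struct xfs_btree_block, and friends) each embed these fields in their own layouts:

	/*
	 * Illustrative sketch of the self-describing fields that v5
	 * filesystems add to nearly every ondisk block header.  Not a
	 * real ondisk structure; each header lays these out differently.
	 */
	struct xfs_selfdesc_sketch {
		__be32	magic;	/* structure type code */
		__be32	crc;	/* checksum of the whole block */
		uuid_t	uuid;	/* filesystem UUID; rejects foreign blocks */
		__be64	owner;	/* AG or inode that owns this block */
		__be64	blkno;	/* ondisk address; catches misdirected writes */
		__be64	lsn;	/* log sequence number of the last write */
	};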
> > > > + > > > +The logging code maintains the checksum and the log sequence > > > number > > > of the last > > > +transactional update. > > > +Checksums are useful for detecting torn writes and other > > > mischief > > "Checksums (or CRCs) are useful for detecting incomplete or torn > > writes as well as other discrepancies..." > > Checksums are a general concept, whereas CRCs denote a particular > family > of checksums. The statement would still apply even if we used a > different family of functions (e.g. erasure codes, cryptographic hash > functions) instead of crc32c. > > I will, however, avoid the undefined term 'mischief'. Thanks for the > correction. > > "Checksums are useful for detecting torn writes and other > discrepancies > that can be introduced between the computer and its storage devices." > > > > between the > > > +computer and its storage devices. > > > +Sequence number tracking enables log recovery to avoid applying > > > out > > > of date > > > +log updates to the filesystem. > > > + > > > +These two features improve overall runtime resiliency by > > > providing a > > > means for > > > +the filesystem to detect obvious corruption when reading > > > metadata > > > blocks from > > > +disk, but these buffer verifiers cannot provide any consistency > > > checking > > > +between metadata structures. > > > + > > > +For more information, please see the documentation for > > > +Documentation/filesystems/xfs-self-describing-metadata.rst > > > + > > > +Reverse Mapping > > > +--------------- > > > + > > > +The original design of XFS (circa 1993) is an improvement upon > > > 1980s > > > Unix > > > +filesystem design. > > > +In those days, storage density was expensive, CPU time was > > > scarce, > > > and > > > +excessive seek time could kill performance. > > > +For performance reasons, filesystem authors were reluctant to > > > add > > > redundancy to > > > +the filesystem, even at the cost of data integrity. > > > +Filesystem designers in the early 21st century chose different > > > strategies to > > > +increase internal redundancy -- either storing nearly identical > > > copies of > > > +metadata, or more space-efficient techniques such as erasure > > > coding. > > "such as erasure coding, which may encode sections of the data with > > redundant symbols and in more than one location" > > > > That ties it into the next line. If you go on to talk about a term > > you > > have not previously defined, I think you want to either define it > > quickly or just drop it altogether. Right now your goal is to > > just > > give the reader context, so you want it to move quickly. > > How about I shorten it to: > > "...or more space-efficient encoding techniques." ? Sure, I think that would be fine > > and end the paragraph there? > > > > +Obvious corruptions are typically repaired by copying replicas > > > or > > > +reconstructing from codes. > > > + > > I think I would have just jumped straight from xfs history to > > modern > > xfs... > > > +For XFS, a different redundancy strategy was chosen to modernize > > > the > > > design: > > > +a secondary space usage index that maps allocated disk extents > > > back > > > to their > > > +owners. > > > +By adding a new index, the filesystem retains most of its > > > ability to > > > scale > > > +well to heavily threaded workloads involving large datasets, > > > since > > > the primary > > > +file metadata (the directory tree, the file block map, and the > > > allocation > > > +groups) remain unchanged.
> > > > > > > > +Although the reverse-mapping feature increases overhead costs > > > for > > > space > > > +mapping activities just like any other system that improves > > > redundancy, it > > "Like any system that improves redundancy, the reverse-mapping > > feature > > increases overhead costs for space mapping activities. However, > > it..." > > I like this better. These two sentences have been changed to read: > > "Like any system that improves redundancy, the reverse-mapping > feature > increases overhead costs for space mapping activities. However, it > has > two critical advantages: first, the reverse index is key to enabling > online fsck and other requested functionality such as free space > defragmentation, better media failure reporting, and filesystem > shrinking." Alrighty, sounds good > > > > +has two critical advantages: first, the reverse index is key to > > > enabling online > > > +fsck and other requested functionality such as filesystem > > > reorganization, > > > +better media failure reporting, and shrinking. > > > +Second, the different ondisk storage format of the reverse > > > mapping > > > btree > > > +defeats device-level deduplication, because the filesystem > > > requires > > > real > > > +redundancy. > > > + > > > +A criticism of adding the secondary index is that it does > > > nothing to > > > improve > > > +the robustness of user data storage itself. > > > +This is a valid point, but adding a new index for file data > > > block > > > checksums > > > +increases write amplification and turns data overwrites into > > > copy- > > > writes, which > > > +age the filesystem prematurely. > > > +In keeping with thirty years of precedent, users who want file > > > data > > > integrity > > > +can supply as powerful a solution as they require. > > > +As for metadata, the complexity of adding a new secondary index > > > of > > > space usage > > > +is much less than adding volume management and storage device > > > mirroring to XFS > > > +itself. > > > +Perfection of RAID and volume management is best left to > > > existing > > > layers in > > > +the kernel. > > I think I would cull the entire above paragraph. rmap, CRC and > > RAID > > all have very different points of redundancy, so criticism that an > > apple is not an orange or vice versa just feels like a shortsighted > > comparison that's probably more of a distraction than anything. > > > > Sometimes it feels like this document kinda gets off into tangents > > like it's preemptively trying to position itself for an argument > > that hasn't happened yet. > > It does! Each of the many tangents that you've pointed out is a > reaction to some discussion that we've had on the list, or at an > LSF, or <cough> fs nerds sniping on social media. The reason I > capture all of these offtopic arguments is to discourage people from > wasting time rehashing discussions that were settled long ago. > > Admittedly, that is a very defensive reaction on my part... > > > But I think it has the effect of pulling the > > reader's attention off topic into an argument they never thought to > > consider in the first place. The topic of this section is to > > explain > > what rmap is. So let's stay on topic and finish laying out that > > groundwork > > first before getting into how it compares to other solutions > > ...and you're right to point out that mentioning these things is > distracting and provides fuel to reignite a flamewar. At the same > time, > I think there's value in identifying the roads not taken, and why.
> > What if I turned these tangents into explicitly labelled sidebars? > Would that help readers who want to stick to the topic? > Sure, I think that would be a reasonable compromise > > > + > > > +The information captured in a reverse space mapping record is as > > > follows: > > > + > > > +.. code-block:: c > > > + > > > + struct xfs_rmap_irec { > > > + xfs_agblock_t rm_startblock; /* extent start > > > block > > > */ > > > + xfs_extlen_t rm_blockcount; /* extent length */ > > > + uint64_t rm_owner; /* extent owner */ > > > + uint64_t rm_offset; /* offset within > > > the > > > owner */ > > > + unsigned int rm_flags; /* state flags */ > > > + }; > > > + > > > +The first two fields capture the location and size of the > > > physical > > > space, > > > +in units of filesystem blocks. > > > +The owner field tells scrub which metadata structure or file > > > inode > > > has been > > > +assigned this space. > > > +For space allocated to files, the offset field tells scrub where > > > the > > > space was > > > +mapped within the file fork. > > > +Finally, the flags field provides extra information about the > > > space > > > usage -- > > > +is this an attribute fork extent? A file mapping btree extent? > > > Or > > > an > > > +unwritten data extent? > > > + > > > +Online filesystem checking judges the consistency of each > > > primary > > > metadata > > > +record by comparing its information against all other space > > > indices. > > > +The reverse mapping index plays a key role in the consistency > > > checking process > > > +because it contains a centralized alternate copy of all space > > > allocation > > > +information. > > > +Program runtime and ease of resource acquisition are the only > > > real > > > limits to > > > +what online checking can consult. > > > +For example, a file data extent mapping can be checked against: > > > + > > > +* The absence of an entry in the free space information. > > > +* The absence of an entry in the inode index. > > > +* The absence of an entry in the reference count data if the > > > file is > > > not > > > + marked as having shared extents. > > > +* The correspondence of an entry in the reverse mapping > > > information. > > > + > > > +A key observation here is that only the reverse mapping can > > > provide > > > a positive > > > +affirmation of correctness if the primary metadata is in doubt. > > if any of the above metadata is in doubt... > > Fixed. > > > > +The checking code for most primary metadata follows a path > > > similar > > > to the > > > +one outlined above. > > > + > > > +A second observation to make about this secondary index is that > > > proving its > > > +consistency with the primary metadata is difficult. > > > > > +Demonstrating that a given reverse mapping record exactly > > > corresponds to the > > > +primary space metadata involves a full scan of all primary space > > > metadata, > > > +which is very time intensive. > > "But why?" wonders the reader. Just jump into an example: > > > > "In order to verify that an rmap extent does not incorrectly overlap > > with another record, we would need a full scan of all the other > > records, which is time intensive." > > I want to shorten it even further: > > "Validating that reverse mapping records are correct requires a full > scan of all primary space metadata, which is very time intensive." Ok, I think that sounds fine > > > > > ? > > > > And then the below is a separate observation right? > > Right.
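(For readers wondering what actually goes in rm_owner and rm_flags: the flags are a small bitfield, and the owner is either an inode number or a special code for AG metadata. From my reading of fs/xfs/libxfs/xfs_rmap.h and xfs_format.h, roughly:

	/* rm_flags bits for struct xfs_rmap_irec */
	#define XFS_RMAP_ATTR_FORK	(1 << 0)	/* mapping is in the attr fork */
	#define XFS_RMAP_BMBT_BLOCK	(1 << 1)	/* file bmap btree block */
	#define XFS_RMAP_UNWRITTEN	(1 << 2)	/* unwritten data extent */

	/*
	 * rm_owner is an inode number for file data, or one of the special
	 * owner codes for AG metadata: XFS_RMAP_OWN_FS (fixed AG headers),
	 * XFS_RMAP_OWN_LOG (internal log), XFS_RMAP_OWN_AG (free space and
	 * rmap btrees), XFS_RMAP_OWN_INOBT (inode btrees),
	 * XFS_RMAP_OWN_INODES (inode chunks), XFS_RMAP_OWN_REFC (refcount
	 * btree), and XFS_RMAP_OWN_COW (copy on write staging extents).
	 */

Consult the headers for the authoritative values.)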
> > > > +Scanning activity for online fsck can only use non-blocking lock > > > acquisition > > > +primitives if the locking order is not the regular order as used > > > by > > > the rest of > > > +the filesystem. > > Lastly, it should be noted that most file system operations tend to > > lock primary metadata before locking the secondary metadata. > > This isn't accurate -- metadata structures don't have separate locks. > So it's not true to say that we lock primary or secondary metadata. > > We /can/ say that file operations lock the inode, then the AGI, then > the > AGF; or that directory operations lock the parent and child ILOCKs in > inumber order; and that if scrub wants to take locks in any other > order, > it can only do that via trylocks and backoff. I see, ok maybe giving one or both of those examples is clearer then > > > This > > means that scanning operations that acquire the secondary metadata > > first may need to yield the secondary lock to filesystem operations > > that have already acquired the primary lock. > > > > ? > > > > > +This means that forward progress during this part of a scan of > > > the > > > reverse > > > +mapping data cannot be guaranteed if system load is especially > > > heavy. > > > +Therefore, it is not practical for online check to detect > > > reverse > > > mapping > > > +records that lack a counterpart in the primary metadata. > > Such as <quick list / quick example> > > > > > +Instead, scrub relies on rigorous cross-referencing during the > > > primary space > > > +mapping structure checks. > > I've converted this section into a bullet list: > > "There are several observations to make about reverse mapping > indices: > > "1. Reverse mappings can provide a positive affirmation of > correctness if > any of the above primary metadata are in doubt. The checking code > for > most primary metadata follows a path similar to the one outlined > above. > > "2. Proving the consistency of secondary metadata with the primary > metadata is difficult because that requires a full scan of all > primary > space metadata, which is very time intensive. For example, checking > a > reverse mapping record for a file extent mapping btree block requires > locking the file and searching the entire btree to confirm the block. > Instead, scrub relies on rigorous cross-referencing during the > primary > space mapping structure checks. > > "3. Consistency scans must use non-blocking lock acquisition > primitives > if the required locking order is not the same order used by regular > filesystem operations. This means that forward progress during this > part of a scan of the reverse mapping data cannot be guaranteed if > system load is heavy." Ok, I think that reads cleaner > > > > + > > > > The below paragraph sounds like a re-cap? > > > > "So to recap, reverse mappings also...." > > Yep. > > > > +Reverse mappings also play a key role in reconstruction of > > > primary > > > metadata. > > > +The secondary information is general enough for online repair to > > > synthesize a > > > +complete copy of any primary space management metadata by > > > locking > > > that > > > +resource, querying all reverse mapping indices looking for > > > records > > > matching > > > +the relevant resource, and transforming the mapping into an > > > appropriate format. > > > +The details of how these records are staged, written to disk, > > > and > > > committed > > > +into the filesystem are covered in subsequent sections. 
> > I also think the section would be ok if you were to trim off this > > last > > paragraph too. > > Hm. I still want to set up the expectation that there's more to > come. > How about a brief two-sentence transition paragraph: > > "In summary, reverse mappings play a key role in reconstruction of > primary metadata. The details of how these records are staged, > written > to disk, and committed into the filesystem are covered in subsequent > sections." Ok, I think that's a cleaner wrap up > > > > > > + > > > +Checking and Cross-Referencing > > > +------------------------------ > > > + > > > +The first step of checking a metadata structure is to examine > > > every > > > record > > > +contained within the structure and its relationship with the > > > rest of > > > the > > > +system. > > > +XFS contains multiple layers of checking to try to prevent > > > inconsistent > > > +metadata from wreaking havoc on the system. > > > +Each of these layers contributes information that helps the > > > kernel > > > to make > > > +five decisions about the health of a metadata structure: > > > + > > > +- Is a part of this structure obviously corrupt > > > (``XFS_SCRUB_OFLAG_CORRUPT``) ? > > > +- Is this structure inconsistent with the rest of the system > > > + (``XFS_SCRUB_OFLAG_XCORRUPT``) ? > > > +- Is there so much damage around the filesystem that cross- > > > referencing is not > > > + possible (``XFS_SCRUB_OFLAG_XFAIL``) ? > > > +- Can the structure be optimized to improve performance or > > > reduce > > > the size of > > > + metadata (``XFS_SCRUB_OFLAG_PREEN``) ? > > > +- Does the structure contain data that is not inconsistent but > > > deserves review > > > + by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ? > > > + > > > +The following sections describe how the metadata scrubbing > > > process > > > works. > > > + > > > +Metadata Buffer Verification > > > +```````````````````````````` > > > + > > > +The lowest layer of metadata protection in XFS is the metadata > > > verifiers built > > > +into the buffer cache. > > > +These functions perform inexpensive internal consistency > > > checking of > > > the block > > > +itself, and answer these questions: > > > + > > > +- Does the block belong to this filesystem? > > > + > > > +- Does the block belong to the structure that asked for the > > > read? > > > + This assumes that metadata blocks only have one owner, which > > > is > > > always true > > > + in XFS. > > > + > > > +- Is the type of data stored in the block within a reasonable > > > range > > > of what > > > + scrub is expecting? > > > + > > > +- Does the physical location of the block match the location it > > > was > > > read from? > > > + > > > +- Does the block checksum match the data? > > > + > > > +The scope of the protections here is very limited -- verifiers > > > can > > > only > > > +establish that the filesystem code is reasonably free of gross > > > corruption bugs > > > +and that the storage system is reasonably competent at > > > retrieval. > > > +Corruption problems observed at runtime cause the generation of > > > health reports, > > > +failed system calls, and in the extreme case, filesystem > > > shutdowns > > > if the > > > +corrupt metadata forces the cancellation of a dirty transaction. > > > + > > > +Every online fsck scrubbing function is expected to read every > > > ondisk metadata > > > +block of a structure in the course of checking the structure.
> > > +Corruption problems observed during a check are immediately > > > reported > > > to > > > +userspace as corruption; during a cross-reference, they are > > > reported > > > as a > > > +failure to cross-reference once the full examination is > > > complete. > > > +Reads satisfied by a buffer already in cache (and hence already > > > verified) > > > +bypass these checks. > > > + > > > +Internal Consistency Checks > > > +``````````````````````````` > > > + > > > +The next higher level of metadata protection is the internal > > > record > > "After the buffer cache, the next level of metadata protection > > is..." > > Changed. I'll do the same to the next section as well. > > > > +verification code built into the filesystem. > > > > > +These checks are split between the buffer verifiers, the in- > > > filesystem users of > > > +the buffer cache, and the scrub code itself, depending on the > > > amount > > > of higher > > > +level context required. > > > +The scope of checking is still internal to the block. > > > +For performance reasons, regular code may skip some of these > > > checks > > > unless > > > +debugging is enabled or a write is about to occur. > > > +Scrub functions, of course, must check all possible problems. > > I'd put this chunk after the list below. > > > > > +Either way, these higher level checking functions answer these > > > questions: > > Then this becomes: > > "These higher level checking functions..." > > Done. > > > > + > > > +- Does the type of data stored in the block match what scrub is > > > expecting? > > > + > > > +- Does the block belong to the owning structure that asked for > > > the > > > read? > > > + > > > +- If the block contains records, do the records fit within the > > > block? > > > + > > > +- If the block tracks internal free space information, is it > > > consistent with > > > + the record areas? > > > + > > > +- Are the records contained inside the block free of obvious > > > corruptions? > > > + > > > +Record checks in this category are more rigorous and more time- > > > intensive. > > > +For example, block pointers and inumbers are checked to ensure > > > that > > > they point > > > +within the dynamically allocated parts of an allocation group > > > and > > > within > > > +the filesystem. > > > +Names are checked for invalid characters, and flags are checked > > > for > > > invalid > > > +combinations. > > > +Other record attributes are checked for sensible values. > > > +Btree records spanning an interval of the btree keyspace are > > > checked > > > for > > > +correct order and lack of mergeability (except for file fork > > > mappings). > > > + > > > +Validation of Userspace-Controlled Record Attributes > > > +```````````````````````````````````````````````````` > > > + > > > +Various pieces of filesystem metadata are directly controlled by > > > userspace. > > > +Because of this nature, validation work cannot be more precise > > > than > > > checking > > > +that a value is within the possible range. 
> > > +These fields include: > > > + > > > +- Superblock fields controlled by mount options > > > +- Filesystem labels > > > +- File timestamps > > > +- File permissions > > > +- File size > > > +- File flags > > > +- Names present in directory entries, extended attribute keys, > > > and > > > filesystem > > > + labels > > > +- Extended attribute key namespaces > > > +- Extended attribute values > > > +- File data block contents > > > +- Quota limits > > > +- Quota timer expiration (if resource usage exceeds the soft > > > limit) > > > + > > > +Cross-Referencing Space Metadata > > > +```````````````````````````````` > > > + > > > +The next higher level of checking is cross-referencing records > > > between metadata > > > > I kinda like the list first so that the reader has an idea of what > > these checks are before getting into discussion about them. It > > just > > makes it a little more obvious as to why it's "prohibitively > > expensive" > > or "dependent on the context of the structure" after having just > > looked > > at it > > <nod> > > > The rest looks good from here. > > Woot. Onto the next reply! :) > > --D > > > Allison > > > > > +structures. > > > +For regular runtime code, the cost of these checks is considered > > > to > > > be > > > +prohibitively expensive, but as scrub is dedicated to rooting > > > out > > > +inconsistencies, it must pursue all avenues of inquiry. > > > +The exact set of cross-referencing is highly dependent on the > > > context of the > > > +data structure being checked. > > > + > > > +The XFS btree code has keyspace scanning functions that online > > > fsck > > > uses to > > > +cross reference one structure with another. > > > +Specifically, scrub can scan the key space of an index to > > > determine > > > if that > > > +keyspace is fully, sparsely, or not at all mapped to records. > > > +For the reverse mapping btree, it is possible to mask parts of > > > the > > > key for the > > > +purposes of performing a keyspace scan so that scrub can decide > > > if > > > the rmap > > > +btree contains records mapping a certain extent of physical > > > space > > > without the > > > +sparseness of the rest of the rmap keyspace getting in the way. > > > + > > > +Btree blocks undergo the following checks before cross- > > > referencing: > > > + > > > +- Does the type of data stored in the block match what scrub is > > > expecting? > > > + > > > +- Does the block belong to the owning structure that asked for > > > the > > > read? > > > + > > > +- Do the records fit within the block? > > > + > > > +- Are the records contained inside the block free of obvious > > > corruptions? > > > + > > > +- Are the name hashes in the correct order? > > > + > > > +- Do node pointers within the btree point to valid block > > > addresses > > > for the type > > > + of btree? > > > + > > > +- Do child pointers point towards the leaves? > > > + > > > +- Do sibling pointers point across the same level? > > > + > > > +- For each node block record, does the record key accurately > > > reflect > > > the contents > > > + of the child block? > > > + > > > +Space allocation records are cross-referenced as follows: > > > + > > > +1. Any space mentioned by any metadata structure is cross- > > > referenced as > > > + follows: > > > + > > > + - Does the reverse mapping index list only the appropriate > > > owner > > > as the > > > + owner of each block? > > > + > > > + - Are none of the blocks claimed as free space?
> > > + > > > + - If these aren't file data blocks, are none of the blocks > > > claimed as space > > > + shared by different owners? > > > + > > > +2. Btree blocks are cross-referenced as follows: > > > + > > > + - Everything in class 1 above. > > > + > > > + - If there's a parent node block, do the keys listed for this > > > block match the > > > + keyspace of this block? > > > + > > > + - Do the sibling pointers point to valid blocks? Of the same > > > level? > > > + > > > + - Do the child pointers point to valid blocks? Of the next > > > level > > > down? > > > + > > > +3. Free space btree records are cross-referenced as follows: > > > + > > > + - Everything in class 1 and 2 above. > > > + > > > + - Does the reverse mapping index list no owners of this > > > space? > > > + > > > + - Is this space not claimed by the inode index for inodes? > > > + > > > + - Is it not mentioned by the reference count index? > > > + > > > + - Is there a matching record in the other free space btree? > > > + > > > +4. Inode btree records are cross-referenced as follows: > > > + > > > + - Everything in class 1 and 2 above. > > > + > > > + - Is there a matching record in the free inode btree? > > > + > > > + - Do cleared bits in the holemask correspond with inode > > > clusters? > > > + > > > + - Do set bits in the freemask correspond with inode records > > > with > > > zero link > > > + count? > > > + > > > +5. Inode records are cross-referenced as follows: > > > + > > > + - Everything in class 1. > > > + > > > + - Do all the fields that summarize information about the file > > > forks actually > > > + match those forks? > > > + > > > + - Does each inode with zero link count correspond to a record > > > in > > > the free > > > + inode btree? > > > + > > > +6. File fork space mapping records are cross-referenced as > > > follows: > > > + > > > + - Everything in class 1 and 2 above. > > > + > > > + - Is this space not mentioned by the inode btrees? > > > + > > > + - If this is a CoW fork mapping, does it correspond to a CoW > > > entry in the > > > + reference count btree? > > > + > > > +7. Reference count records are cross-referenced as follows: > > > + > > > + - Everything in class 1 and 2 above. > > > + > > > + - Within the space subkeyspace of the rmap btree (that is to > > > say, > > > all > > > + records mapped to a particular space extent and ignoring > > > the > > > owner info), > > > + are there the same number of reverse mapping records for > > > each > > > block as the > > > + reference count record claims? > > > + > > > +Proposed patchsets are the series to find gaps in > > > +`refcount btree > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=scrub-detect-refcount-gaps>`_, > > > +`inode btree > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=scrub-detect-inobt-gaps>`_, and > > > +`rmap btree > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=scrub-detect-rmapbt-gaps>`_ records; > > > +to find > > > +`mergeable records > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=scrub-detect-mergeable-records>`_; > > > +and to > > > +`improve cross referencing with rmap > > > +< > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/ > > > log/?h=scrub-strengthen-rmap-checking>`_ > > > +before starting a repair.
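To make the "class 1" checks in the lists above concrete, cross-referencing one extent of metadata space looks roughly like the sketch below. The helper names are hypothetical stand-ins (the kernel spreads this logic across its scrub cross-referencing helpers); the shape of the logic is the point:

	/*
	 * Sketch of class 1 cross-referencing for an extent of space that
	 * should be owned by exactly one metadata structure.  Helper and
	 * type names are made up for illustration.
	 */
	static void
	xref_metadata_extent_sketch(
		struct scrub_ctx	*sc,	/* hypothetical scrub context */
		xfs_agblock_t		agbno,
		xfs_extlen_t		len,
		uint64_t		owner)
	{
		/* the rmap btree must list only the expected owner */
		if (!rmap_lists_only_owner(sc, agbno, len, owner))
			mark_xcorrupt(sc);

		/* the free space btrees must not claim any of these blocks */
		if (free_space_claims(sc, agbno, len))
			mark_xcorrupt(sc);

		/* metadata blocks must never appear in the refcount btree */
		if (refcount_says_shared(sc, agbno, len))
			mark_xcorrupt(sc);
	}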
> > > + > > > +Checking Extended Attributes > > > +```````````````````````````` > > > + > > > +Extended attributes implement a key-value store that enables > > > fragments of data > > > +to be attached to any file. > > > +Both the kernel and userspace can access the keys and values, > > > subject to > > > +namespace and privilege restrictions. > > > +Most typically these fragments are metadata about the file -- > > > origins, security > > > +contexts, user-supplied labels, indexing information, etc. > > > + > > > +Names can be as long as 255 bytes and can exist in several > > > different > > > +namespaces. > > > +Values can be as large as 64KB. > > > +A file's extended attributes are stored in blocks mapped by the > > > attr > > > fork. > > > +The mappings point to leaf blocks, remote value blocks, or > > > dabtree > > > blocks. > > > +Block 0 in the attribute fork is always the top of the > > > structure, > > > but otherwise > > > +each of the three types of blocks can be found at any offset in > > > the > > > attr fork. > > > +Leaf blocks contain attribute key records that point to the name > > > and > > > the value. > > > +Names are always stored elsewhere in the same leaf block. > > > +Values that are less than 3/4 the size of a filesystem block are > > > also stored > > > +elsewhere in the same leaf block. > > > +Remote value blocks contain values that are too large to fit > > > inside > > > a leaf. > > > +If the leaf information exceeds a single filesystem block, a > > > dabtree > > > (also > > > +rooted at block 0) is created to map hashes of the attribute > > > names > > > to leaf > > > +blocks in the attr fork. > > > + > > > +Checking an extended attribute structure is not so > > > straightforward > > > due to the > > > +lack of separation between attr blocks and index blocks. > > > +Scrub must read each block mapped by the attr fork and ignore > > > the > > > non-leaf > > > +blocks: > > > + > > > +1. Walk the dabtree in the attr fork (if present) to ensure that > > > there are no > > > + irregularities in the blocks or dabtree mappings that do not > > > point to > > > + attr leaf blocks. > > > + > > > +2. Walk the blocks of the attr fork looking for leaf blocks. > > > + For each entry inside a leaf: > > > + > > > + a. Validate that the name does not contain invalid > > > characters. > > > + > > > + b. Read the attr value. > > > + This performs a named lookup of the attr name to ensure > > > the > > > correctness > > > + of the dabtree. > > > + If the value is stored in a remote block, this also > > > validates > > > the > > > + integrity of the remote value block. > > > + > > > +Checking and Cross-Referencing Directories > > > +`````````````````````````````````````````` > > > + > > > +The filesystem directory tree is a directed acyclic graph > > > structure, > > > with files > > > +constituting the nodes, and directory entries (dirents) > > > constituting > > > the edges. > > > +Directories are a special type of file containing a set of > > > mappings > > > from a > > > +255-byte sequence (name) to an inumber. > > > +These are called directory entries, or dirents for short. > > > +Each directory file must have exactly one directory pointing to > > > the > > > file. > > > +A root directory points to itself. > > > +Directory entries point to files of any type. > > > +Each non-directory file may have multiple directories pointing to > > > it. > > > + > > > +In XFS, directories are implemented as a file containing up to > > > three > > > 32GB > > > +partitions.
> > > +The first partition contains directory entry data blocks. > > > +Each data block contains variable-sized records associating a > > > user- > > > provided > > > +name with an inumber and, optionally, a file type. > > > +If the directory entry data grows beyond one block, the second > > > partition (which > > > +exists as post-EOF extents) is populated with a block containing > > > free space > > > +information and an index that maps hashes of the dirent names to > > > directory data > > > +blocks in the first partition. > > > +This makes directory name lookups very fast. > > > +If this second partition grows beyond one block, the third > > > partition > > > is > > > +populated with a linear array of free space information for > > > faster > > > +expansions. > > > +If the free space has been separated and the second partition > > > grows > > > again > > > +beyond one block, then a dabtree is used to map hashes of dirent > > > names to > > > +directory data blocks. > > > + > > > +Checking a directory is pretty straightforward: > > > + > > > +1. Walk the dabtree in the second partition (if present) to > > > ensure > > > that there > > > + are no irregularities in the blocks or dabtree mappings that > > > do > > > not point to > > > + dirent blocks. > > > + > > > +2. Walk the blocks of the first partition looking for directory > > > entries. > > > + Each dirent is checked as follows: > > > + > > > + a. Does the name contain no invalid characters? > > > + > > > + b. Does the inumber correspond to an actual, allocated inode? > > > + > > > + c. Does the child inode have a nonzero link count? > > > + > > > + d. If a file type is included in the dirent, does it match > > > the > > > type of the > > > + inode? > > > + > > > + e. If the child is a subdirectory, does the child's dotdot > > > pointer point > > > + back to the parent? > > > + > > > + f. If the directory has a second partition, perform a named > > > lookup of the > > > + dirent name to ensure the correctness of the dabtree. > > > + > > > +3. Walk the free space list in the third partition (if present) > > > to > > > ensure that > > > + the free spaces it describes are really unused. > > > + > > > +Checking operations involving :ref:`parents <dirparent>` and > > > +:ref:`file link counts <nlinks>` are discussed in more detail in > > > later > > > +sections. > > > + > > > +Checking Directory/Attribute Btrees > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > + > > > +As stated in previous sections, the directory/attribute btree > > > (dabtree) index > > > +maps user-provided names to improve lookup times by avoiding > > > linear > > > scans. > > > +Internally, it maps a 32-bit hash of the name to a block offset > > > within the > > > +appropriate file fork. > > > + > > > +The internal structure of a dabtree closely resembles the btrees > > > that record > > > +fixed-size metadata records -- each dabtree block contains a > > > magic > > > number, a > > > +checksum, sibling pointers, a UUID, a tree level, and a log > > > sequence > > > number. > > > +The format of leaf and node records is the same -- each entry > > > points to the > > > +next level down in the hierarchy, with dabtree node records > > > pointing > > > to dabtree > > > +leaf blocks, and dabtree leaf records pointing to non-dabtree > > > blocks > > > elsewhere > > > +in the fork.
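Since everything in the dabtree is keyed by that 32-bit name hash, it may be worth showing the hash function itself. This is a from-memory rendering of xfs_da_hashname() -- a rolling xor with a 7-bit-per-byte rotate -- so treat it as a sketch and defer to fs/xfs/libxfs/xfs_da_btree.c for the authoritative version:

	/* Sketch of xfs_da_hashname(); see fs/xfs/libxfs/xfs_da_btree.c. */
	static inline uint32_t rol32(uint32_t word, unsigned int shift)
	{
		return (word << shift) | (word >> (32 - shift));
	}

	uint32_t
	xfs_da_hashname(const uint8_t *name, int namelen)
	{
		uint32_t hash;

		/* fold four bytes at a time, rotating the accumulator */
		for (hash = 0; namelen >= 4; namelen -= 4, name += 4)
			hash = (name[0] << 21) ^ (name[1] << 14) ^
			       (name[2] << 7) ^ (name[3] << 0) ^
			       rol32(hash, 7 * 4);

		/* now fold in the 0-3 leftover bytes */
		switch (namelen) {
		case 3:
			return (name[0] << 14) ^ (name[1] << 7) ^
			       (name[2] << 0) ^ rol32(hash, 7 * 3);
		case 2:
			return (name[0] << 7) ^ (name[1] << 0) ^
			       rol32(hash, 7 * 2);
		case 1:
			return (name[0] << 0) ^ rol32(hash, 7 * 1);
		default: /* namelen == 0 */
			return hash;
		}
	}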
> > > + > > > +Checking and cross-referencing the dabtree is very similar to > > > what > > > is done for > > > +space btrees: > > > + > > > +- Does the type of data stored in the block match what scrub is > > > expecting? > > > + > > > +- Does the block belong to the owning structure that asked for > > > the > > > read? > > > + > > > +- Do the records fit within the block? > > > + > > > +- Are the records contained inside the block free of obvious > > > corruptions? > > > + > > > +- Are the name hashes in the correct order? > > > + > > > +- Do node pointers within the dabtree point to valid fork > > > offsets > > > for dabtree > > > + blocks? > > > + > > > +- Do leaf pointers within the dabtree point to valid fork > > > offsets > > > for directory > > > + or attr leaf blocks? > > > + > > > +- Do child pointers point towards the leaves? > > > + > > > +- Do sibling pointers point across the same level? > > > + > > > +- For each dabtree node record, does the record key accurately > > > reflect > > > the > > > + contents of the child dabtree block? > > > + > > > +- For each dabtree leaf record, does the record key accurately > > > reflect > > > the > > > + contents of the directory or attr block? > > > + > > > +Cross-Referencing Summary Counters > > > +`````````````````````````````````` > > > + > > > +XFS maintains three classes of summary counters: available > > > resources, quota > > > +resource usage, and file link counts. > > > + > > > +In theory, the amount of available resources (data blocks, > > > inodes, > > > realtime > > > +extents) can be found by walking the entire filesystem. > > > +This would make for very slow reporting, so a transactional > > > filesystem can > > > +maintain summaries of this information in the superblock. > > > +Cross-referencing these values against the filesystem metadata > > > should be a > > > +simple matter of walking the free space and inode metadata in > > > each > > > AG and the > > > +realtime bitmap, but there are complications that will be > > > discussed > > > in > > > +:ref:`more detail <fscounters>` later. > > > + > > > +:ref:`Quota usage <quotacheck>` and :ref:`file link count > > > <nlinks>` > > > +checking are sufficiently complicated to warrant separate > > > sections. > > > + > > > +Post-Repair Reverification > > > +`````````````````````````` > > > + > > > +After performing a repair, the checking code is run a second > > > time to > > > validate > > > +the new structure, and the results of the health assessment are > > > recorded > > > +internally and returned to the calling process. > > > +This step is critical for enabling system administrators to > > > monitor > > > the status > > > +of the filesystem and the progress of any repairs. > > > +For developers, it is a useful means to judge the efficacy of > > > error > > > detection > > > +and correction in the online and offline checking tools. > > > diff --git a/Documentation/filesystems/xfs-self-describing- > > > metadata.rst b/Documentation/filesystems/xfs-self-describing- > > > metadata.rst > > > index b79dbf36dc94..a10c4ae6955e 100644 > > > --- a/Documentation/filesystems/xfs-self-describing-metadata.rst > > > +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst > > > @@ -1,4 +1,5 @@ > > > .. SPDX-License-Identifier: GPL-2.0 > > > +.. _xfs_self_describing_metadata: > > > > > > ============================ > > > XFS Self Describing Metadata > > > > >