On Thu, Aug 11, 2022 at 11:17:42AM +1000, Dave Chinner wrote: > On Sun, Aug 07, 2022 at 11:30:33AM -0700, Darrick J. Wong wrote: > > +Reverse Mapping > > +--------------- > > + > > +The original design of XFS (circa 1993) is an improvement upon 1980s Unix > > +filesystem design. > > +In those days, storage density was expensive, CPU time was scarce, and > > +excessive seek time could kill performance. > > +For performance reasons, filesystem authors were reluctant to add redundancy to > > +the filesystem, even at the cost of data integrity. > > +Filesystems designers in the early 21st century choose different strategies to > > +increase internal redundancy -- either storing nearly identical copies of > > +metadata, or more space-efficient techniques such as erasure coding. > > +Obvious corruptions are typically repaired by copying replicas or > > +reconstructing from codes. > > + > > +For XFS, a different redundancy strategy was chosen to modernize the design: > > +a secondary space usage index that maps allocated disk extents back to their > > +owners. > > +By adding a new index, the filesystem retains most of its ability to scale > > +well to heavily threaded workloads involving large datasets, since the primary > > +file metadata (the directory tree, the file block map, and the allocation > > +groups) remain unchanged. > > +Although the reverse-mapping feature increases overhead costs for space > > +mapping activities just like any other system that improves redundancy, it > > +has two critical advantages: first, the reverse index is key to enabling online > > +fsck and other requested functionality such as filesystem reorganization, > > +better media failure reporting, and shrinking. > > +Second, the different ondisk storage format of the reverse mapping btree > > +defeats device-level deduplication, because the filesystem requires real > > +redundancy. > > "defeats device-level deduplication" took me a bit of thought to > follow what you were trying to describe here. > > I think what you mean is this: > > "Second, the reverse mapping metadata uses a different on-disk > format for storage. This prevents storage device level deduplication > from removing redundant metadata the filesystem stores for recovery > purposes. Hence we will always have two physical copies of the > metadata we need for recovery in the storage device, regardless of > the technologies it implements." Correct. I'll work that in. > > +A criticism of adding the secondary index is that it does nothing to improve > > +the robustness of user data storage itself. > > +This is a valid point, but adding a new index for file data block checksums > > +increases write amplification and turns data overwrites into copy-writes, which > > +age the filesystem prematurely. > > And can come with a significant IO performance penalty. Right, I shouldn't be assuming that people can draw the connection between write amplification/aging fs metadata and slow performance. > > +Checking Extended Attributes > > +```````````````````````````` > > + > > +Extended attributes implement a key-value store that enable fragments of data > > +to be attached to any file. > > +Both the kernel and userspace can access the keys and values, subject to > > +namespace and privilege restrictions. > > +Most typically these fragments are metadata about the file -- origins, security > > +contexts, user-supplied labels, indexing information, etc. > > + > > +Names can be as long as 255 bytes and can exist in several different > > +namespaces. > > +Values can be as large as 64KB. > > +A file's extended attributes are stored in blocks mapped by the attr fork. > > +The mappings point to leaf blocks, remote value blocks, or dabtree blocks. > > +Block 0 in the attribute fork is always the top of the structure, but otherwise > > +each of the three types of blocks can be found at any offset in the attr fork. > > +Leaf blocks contain attribute key records that point to the name and the value. > > +Names are always stored elsewhere in the same leaf block. > > +Values that are less than 3/4 the size of a filesystem block are also stored > > +elsewhere in the same leaf block. > > +Remote value blocks contain values that are too large to fit inside a leaf. > > +If the leaf information exceeds a single filesystem block, a dabtree (also > > +rooted at block 0) is created to map hashes of the attribute names to leaf > > +blocks in the attr fork. > > + > > +Checking an extended attribute structure is not so straightfoward due to the > > +lack of separation between attr blocks and index blocks. > > +Scrub must read each block mapped by the attr fork and ignore the non-leaf > > +blocks: > > + > > +1. Walk the dabtree in the attr fork (if present) to ensure that there are no > > + irregularities in the blocks or dabtree mappings that do not point to > > + attr leaf blocks. > > + > > +2. Walk the blocks of the attr fork looking for leaf blocks. > > + For each entry inside a leaf: > > + > > + a. Validate that the name does not contain invalid characters. > > + > > + b. Read the attr value. > > + This performs a named lookup of the attr name to ensure the correctness > > + of the dabtree. > > + If the value is stored in a remote block, this also validates the > > + integrity of the remote value block. > > I'm assuming that checking remote xattr blocks involves checking the headers > are all valid, the overall length is correct, they have valid rmap > entries, etc? Yep. The headers are checked by reading the attr value; and the space mappings are checked by the attr fork scanner. > > +Checking and Cross-Referencing Directories > > +`````````````````````````````````````````` > > + > > +The filesystem directory tree is a directed acylic graph structure, with files > > +constituting the nodes, and directory entries (dirents) constituting the edges. > > "with hashed file names consituting the nodes" ? > > +Directories are a special type of file containing a set of mappings from a > > +255-byte sequence (name) to an inumber. > > +These are called directory entries, or dirents for short. > > +Each directory file must have exactly one directory pointing to the file. > > +A root directory points to itself. > > +Directory entries point to files of any type. > > "point to inodes" > > > +Each non-directory file may have multiple directories point to it. > > "Each non directory inode may have multiple dirents point to it" Both fixed. > > + > > +In XFS, directories are implemented as a file containing up to three 32GB > > +partitions. > > "implemented as offset-based file data" Fixed. > > +The first partition contains directory entry data blocks. > > +Each data block contains variable-sized records associating a user-provided > > +name with an inumber and, optionally, a file type. > > +If the directory entry data grows beyond one block, the second partition (which > > +exists as post-EOF extents) is populated with a block containing free space > > +information and an index that maps hashes of the dirent names to directory data > > +blocks in the first partition. > > +This makes directory name lookups very fast. > > "This is known as a "LEAF 1" format directory." Good addition! > > +If this second partition grows beyond one block, the third partition is > > +populated with a linear array of free space information for faster > > +expansions. > > s/for faster expansions/to speed up first fit searches when inserting > new dirents/. > > "This is known as a "NODE" format directory." Ok. > > +If the free space has been separated and the second partition grows again > > +beyond one block, then a dabtree is used to map hashes of dirent names to > > +directory data blocks. > > The second partition is changed to a dabtree format during NODE > format conversion - it's just the single root block form. It grows > as a btree from there by normal split/grow btree operations at that > point. Hence I'm not sure this needs to be mentioned as a separate > step - the node form conversion should mention the second partition > transitions to the dabtree index format, and it's all obvious from > there... I put that nugget in because *I* keep forgetting that there are directories with a single block in the second partition that doesn't have the DA3_NODE magicnum. ;) How about: "This is known as a 'NODE' format directory. The second partition contains a dabtree that ranges in size from a single block to many levels." > > +Checking a directory is pretty straightfoward: > > + > > +1. Walk the dabtree in the second partition (if present) to ensure that there > > + are no irregularities in the blocks or dabtree mappings that do not point to > > + dirent blocks. > > + > > +2. Walk the blocks of the first partition looking for directory entries. > > + Each dirent is checked as follows: > > + > > + a. Does the name contain no invalid characters? > > + > > + b. Does the inumber correspond to an actual, allocated inode? > > + > > + c. Does the child inode have a nonzero link count? > > + > > + d. If a file type is included in the dirent, does it match the type of the > > + inode? > > + > > + e. If the child is a subdirectory, does the child's dotdot pointer point > > + back to the parent? > > + > > + f. If the directory has a second partition, perform a named lookup of the > > + dirent name to ensure the correctness of the dabtree. > > + > > +3. Walk the free space list in the third partition (if present) to ensure that > > + the free spaces it describes are really unused. > > Actaully, they should reflect the largest free space in the given > block, so this has to be checked against the bests array in the > given block. > > Also, if we are walking the directory blocks, we should be checking > the empty ranges for valid tags and sizes, and that the largest 3 > holes in the block are correctly ordered in the bests array, too. <nod> I think the code actually does that, but I got lazy with writing here. :/ > > + > > +Checking operations involving :ref:`parents <dirparent>` and > > +:ref:`file link counts <nlinks>` are discussed in more detail in later > > +sections. > > + > > +Checking Directory/Attribute Btrees > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > + > > +As stated in previous sections, the directory/attribute btree (dabtree) index > > +maps user-provided names to improve lookup times by avoiding linear scans. > > +Internally, it maps a 32-bit hash of the name to a block offset within the > > +appropriate file fork. > > + > > +The internal structure of a dabtree closely resembles the btrees that record > > +fixed-size metadata records -- each dabtree block contains a magic number, a > > +checksum, sibling pointers, a UUID, a tree level, and a log sequence number. > > +The format of leaf and node records are the same -- each entry points to the > > +next level down in the hierarchy, with dabtree node records pointing to dabtree > > +leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere > > +in the fork. > > + > > +Checking and cross-referencing the dabtree is very similar to what is done for > > +space btrees: > > + > > +- Does the type of data stored in the block match what scrub is expecting? > > + > > +- Does the block belong to the owning structure that asked for the read? > > + > > +- Do the records fit within the block? > > + > > +- Are the records contained inside the block free of obvious corruptions? > > + > > +- Are the name hashes in the correct order? > > + > > +- Do node pointers within the dabtree point to valid fork offsets for dabtree > > + blocks? > > + > > +- Do leaf pointers within the dabtree point to valid fork offsets for directory > > + or attr leaf blocks? > > + > > +- Do child pointers point towards the leaves? > > + > > +- Do sibling pointers point across the same level? > > + > > +- For each dabtree node record, does the record key accurate reflect the > > + contents of the child dabtree block? > > + > > +- For each dabtree leaf record, does the record key accurate reflect the > > + contents of the directory or attr block? > > This last one is checking that they point to a valid dirent and > check that the hash of the dirent name actually matches the hash > that points to it? Yes. --D > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx