Add chapters on the operation of the reverse mapping btree and future things we could do with rmap data. Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> --- .../allocation_groups.asciidoc | 24 +- design/XFS_Filesystem_Structure/docinfo.xml | 14 + .../reconstruction.asciidoc | 53 +++++ design/XFS_Filesystem_Structure/rmapbt.asciidoc | 223 ++++++++++++++++++++ .../xfs_filesystem_structure.asciidoc | 4 5 files changed, 310 insertions(+), 8 deletions(-) create mode 100644 design/XFS_Filesystem_Structure/reconstruction.asciidoc create mode 100644 design/XFS_Filesystem_Structure/rmapbt.asciidoc diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc index 0633175..29e61b4 100644 --- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc +++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc @@ -12,6 +12,7 @@ Each AG has the following characteristics: * A super block describing overall filesystem info * Free space management * Inode allocation and tracking + * Reverse block-mapping index (optional) Having multiple AGs allows XFS to handle most operations in parallel without degrading performance as the number of concurrent accesses increases. @@ -379,6 +380,12 @@ it doesn't understand the flag. Free inode B+tree. Each allocation group contains a B+tree to track inode chunks containing free inodes. This is a performance optimization to reduce the time required to allocate inodes. + +| +XFS_SB_FEAT_RO_COMPAT_RMAPBT+ | +Reverse mapping B+tree. Each allocation group contains a B+tree containing +records mapping AG blocks to their owners. See the section about +xref:Reconstruction[reconstruction] for more details. + |===== *sb_features_incompat*:: @@ -529,9 +536,7 @@ struct xfs_agf { __be32 agf_seqno; __be32 agf_length; __be32 agf_roots[XFS_BTNUM_AGF]; - __be32 agf_spare0; __be32 agf_levels[XFS_BTNUM_AGF]; - __be32 agf_spare1; __be32 agf_flfirst; __be32 agf_fllast; __be32 agf_flcount; @@ -550,9 +555,10 @@ struct xfs_agf { }; ---- -The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 2: -index 0 for the free space B+tree indexed by block number; and index 1 for the -free space B+tree indexed by extent size. +The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 3: +index 0 for the free space B+tree indexed by block number; index 1 for the free +space B+tree indexed by extent size; and index 2 for the reverse-mapping +B+tree. *agf_magicnum*:: Specifies the magic number for the AGF sector: ``XAGF'' (0x58414746). @@ -570,11 +576,13 @@ this could be less than the +sb_agblocks+ value. It is this value that should be used to determine the size of the AG. *agf_roots*:: -Specifies the block number for the root of the two free space B+trees. +Specifies the block number for the root of the two free space B+trees and the +reverse-mapping B+tree, if enabled. *agf_levels*:: -Specifies the level or depth of the two free space B+trees. For a fresh AG, this -will be one, and the ``roots'' will point to a single leaf of level 0. +Specifies the level or depth of the two free space B+trees and the +reverse-mapping B+tree, if enabled. For a fresh AG, this value will be one, +and the ``roots'' will point to a single leaf of level 0. *agf_flfirst*:: Specifies the index of the first ``free list'' block. Free lists are covered in diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml index ba97809..19eb135 100644 --- a/design/XFS_Filesystem_Structure/docinfo.xml +++ b/design/XFS_Filesystem_Structure/docinfo.xml @@ -108,4 +108,18 @@ </simplelist> </revdescription> </revision> + <revision> + <revnumber>3.14</revnumber> + <date>November 2015</date> + <author> + <firstname>Darrick</firstname> + <surname>Wong</surname> + <email></email> + </author> + <revdescription> + <simplelist> + <member>Document the reverse-mapping btree.</member> + </simplelist> + </revdescription> + </revision> </revhistory> diff --git a/design/XFS_Filesystem_Structure/reconstruction.asciidoc b/design/XFS_Filesystem_Structure/reconstruction.asciidoc new file mode 100644 index 0000000..f172e0f --- /dev/null +++ b/design/XFS_Filesystem_Structure/reconstruction.asciidoc @@ -0,0 +1,53 @@ +[[Reconstruction]] += Metadata Reconstruction + +[NOTE] +This is a theoretical discussion of how reconstruction could work; none of this +is implemented as of 2015. + +A simple UNIX filesystem can be thought of in terms of a directed acyclic graph. +To a first approximation, there exists a root directory node, which points to +other nodes. Those other nodes can themselves be directories or they can be +files. Each file, in turn, points to data blocks. + +XFS adds a few more details to this picture: + +* The real root(s) of an XFS filesystem are the allocation group headers +(superblock, AGF, AGI, AGFL). +* Each allocation group’s headers point to various per-AG B+trees (free space, +inode, free inodes, free list, etc.) +* The free space B+trees point to unused extents; +* The inode B+trees point to blocks containing inode chunks; +* All superblocks point to the root directory and the log; +* Hardlinks mean that multiple directories can point to a single file node; +* File data block pointers are indexed by file offset; +* Files and directories can have a second collection of pointers to data blocks +which contain extended attributes; +* Large directories require multiple data blocks to store all the subpointers; +* Still larger directories use high-offset data blocks to store a B+tree of +hashes to directory entries; +* Large extended attribute forks similarly use high-offset data blocks to store +a B+tree of hashes to attribute keys; and +* Symbolic links can point to data blocks. + +The beauty of this massive graph structure is that under normal circumstances, +everything known to the filesystem is discoverable (access controls +notwithstanding) from the root. The major weakness of this structure of course +is that breaking a edge in the graph can render entire subtrees inaccessible. ++xfs_repair+ “recovers” from broken directories by scanning for unlinked inodes +and connecting them to +/lost+found+, but this isn’t sufficiently general to +recover from breaks in other parts of the graph structure. Wouldn’t it be +useful to have back pointers as a secondary data structure? The current repair +strategy is to reconstruct whatever can be rebuilt, but to scrap anything that +doesn't check out. + +The xref:Reverse_Mapping_Btree[reverse-mapping B+tree] fills in part of the +puzzle. Since it contains copies of every entry in each inode’s data and +attribute forks, we can fix a corrupted block map with these records. +Furthermore, if the inode B+trees become corrupt, it is possible to visit all +inode chunks using the reverse-mapping data. Should XFS ever gain the ability +to store parent directory information in each inode, it also becomes possible +to resurrect damaged directory trees, which should reduce the complaints about +inodes ending up in +/lost+found+. Everything else in the per-AG primary +metadata can already be reconstructed via +xfs_repair+. Hopefully, +reconstruction will not turn out to be a fool's errand. diff --git a/design/XFS_Filesystem_Structure/rmapbt.asciidoc b/design/XFS_Filesystem_Structure/rmapbt.asciidoc new file mode 100644 index 0000000..d9ccd27 --- /dev/null +++ b/design/XFS_Filesystem_Structure/rmapbt.asciidoc @@ -0,0 +1,223 @@ +[[Reverse_Mapping_Btree]] +== Reverse-Mapping B+tree + +[NOTE] +This data structure is under construction! Details may change. + +If the feature is enabled, each allocation group has its own reverse +block-mapping B+tree, which grows in the free space like the free space +B+trees. As mentioned in the chapter about +xref:Reconstruction[reconstruction], this data structure is another piece of +the puzzle necessary to reconstruct the data or attribute fork of a file from +reverse-mapping records; we can also use it to double-check allocations to +ensure that we are not accidentally cross-linking blocks, which can cause +severe damage to the filesystem. + +This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_RMAPBT+ +feature is enabled. The feature requires a version 5 filesystem. + +Each record in the reverse-mapping B+tree has the following structure: + +[source, c] +---- +struct xfs_rmap_rec { + __be32 rm_startblock; + __be32 rm_unwritten:1; + __be32 rm_blockcount:31; + __be64 rm_owner; + __be64 rm_fork:1; + __be64 rm_bmbt:1; + __be64 rm_offset:62; +}; +---- + +*rm_startblock*:: +AG block number of this record. + +*rm_unwritten*:: +A flag indicating that the extent is unwritten. This corresponds to the flag in +the xref:Data_Extents[extent record] format which means +XFS_EXT_UNWRITTEN+. + +*rm_blockcount*:: +The length of this extent. + +*rm_owner*:: +A 64-bit number describing the owner of this extent. This is typically the +absolute inode number, but can also correspond to one of the following: + +.Special owner values +[options="header"] +|===== +| Value | Description +| +XFS_RMAP_OWN_NULL+ | No owner. This should never appear on disk. +| +XFS_RMAP_OWN_UNKNOWN+ | Unknown owner; for EFI recovery. This should never appear on disk. +| +XFS_RMAP_OWN_FS+ | Allocation group headers +| +XFS_RMAP_OWN_LOG+ | XFS log blocks +| +XFS_RMAP_OWN_AG+ | Per-allocation group B+tree blocks. This means free space B+tree blocks, blocks on the freelist, and reverse-mapping B+tree blocks. +| +XFS_RMAP_OWN_INOBT+ | Per-allocation group inode B+tree blocks. This includes free inode B+tree blocks. +| +XFS_RMAP_OWN_INODES+ | Inode chunks +| +XFS_RMAP_OWN_REFC+ | Per-allocation group refcount B+tree blocks. This will be used for reflink support. +|===== + +*rm_fork*:: +If +rm_owner+ describes an inode, this can be 1 if this record is for an +attribute fork. + +*rm_bmbt*:: +If +rm_owner+ describes an inode, this can be 1 to signify that this record is +for a block map B+tree block. In this case, +rm_offset+ has no meaning. + +*rm_offset*:: +The 62-bit logical file block offset, if +rm_owner+ describes an inode. +Meaningless otherwise. + +[NOTE] +The single-bit flag values +rm_unwritten+, +rm_fork+, and +rm_bmbt+ are packed +into the larger fields in the C structure definition. + +[NOTE] +For the moment, there is a requirement that all records in the data or +attribute forks must match exactly with the corresponding entry in the +reverse-mapping B+tree. This may be lifted in future versions of the patchset. + +For the reverse-mapping B+tree, the key definition is larger than the usual AG +block number. On a classic XFS filesystem, each block has only one owner, which +means that +rm_startblock+ is sufficient to uniquely identify each record. +However, shared block support (refink) on XFS breaks that assumption; now +filesystem blocks can be linked to any logical block offset of any file inode. +Therefore, the key must include the owner and offset information to preserve the +1 to 1 relation between key and record. The key has the following structure: + +[source, c] +---- +struct xfs_rmap_key { + __be32 rm_startblock; + __be64 rm_owner; + __be64 rm_fork:1; + __be64 rm_bmbt:1; + __be64 rm_offset:62; +}; +---- + +* As the reference counting is AG relative, all the block numbers are only +32-bits. +* The +bb_magic+ value is "RMB3" (0x524d4233). +* The +xfs_btree_sblock_t+ header is used for intermediate B+tree node as well +as the leaves. + +=== xfs_db rmapbt Example + +This example shows a reverse-mapping B+tree from a freshly formatted root +filesystem: + +---- +xfs_db> agi 0 +xfs_db> addr rmaproot +xfs_db> p +magic = 0x524d4233 +level = 1 +numrecs = 43 +leftsib = null +rightsib = null +bno = 56 +lsn = 0x3000004c8 +uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4 +owner = 0 +crc = 0x7cf8be6f (correct) +keys[1-43] = [startblock,owner,offset] + 1:[0,-3,0] 2:[417,285,0] 3:[829,499,0] 4:[1292,710,0] 5:[32215,-5,0] + 6:[34083,1161,0] 7:[34896,256191,0] + ... + 41:[50998,326734,0] 42:[51431,327010,0] 43:[51611,327112,0] +ptrs[1-43] = 1:5 2:6 3:8 4:9 5:10 6:11 7:418 ... 41:46377 42:48784 43:49522 +xfs_db> addr ptrs[17] +xfs_db> p +magic = 0x524d4233 +level = 0 +numrecs = 168 +leftsib = 36284 +rightsib = 37617 +bno = 294760 +lsn = 0x200002761 +uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4 +owner = 0 +crc = 0x2dad3fbe (correct) +recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock] + 1:[40326,1,259615,0,0,0,0] 2:[40327,1,-5,0,0,0,0] + 3:[40328,2,259618,0,0,0,0] 4:[40330,1,259619,0,0,0,0] + ... + 127:[40540,1,324266,0,0,0,0] 128:[40541,1,324266,8388608,0,0,0] + 129:[40542,2,324266,1,0,0,0] 130:[40544,32,-7,0,0,0,0] +---- + +Several interesting things pop out here. The first record shows that inode +259,615 has mapped AG block 40,326 at offset 0. We confirm this by looking at +the block map for that inode: + +---- +xfs_db> inode 259615 +xfs_db> bmap +data offset 0 startblock 40326 (0/40326) count 1 flag 0 +---- + +Next, notice records 127 and 128, which describe neighboring AG blocks that are +mapped to non-contiguous logical blocks in inode 324,266. Given the logical +offset of 8,388,608 we surmise that this is a leaf directory, but let us +confirm: + +---- +xfs_db> inode 324266 +xfs_db> p core.mode +core.mode = 040755 +xfs_db> bmap +data offset 0 startblock 40540 (0/40540) count 1 flag 0 +data offset 1 startblock 40542 (0/40542) count 2 flag 0 +data offset 3 startblock 40576 (0/40576) count 1 flag 0 +data offset 8388608 startblock 40541 (0/40541) count 1 flag 0 +xfs_db> p core.mode +core.mode = 0100644 +xfs_db> dblock 0 +xfs_db> p dhdr.hdr.magic +dhdr.hdr.magic = 0x58444433 +xfs_db> dblock 8388608 +xfs_db> p lhdr.info.hdr.magic +lhdr.info.hdr.magic = 0x3df1 +---- + +Indeed, this inode 324,266 appears to be a leaf directory, as it has regular +directory data blocks at low offsets, and a single leaf block. + +Notice further the two reverse-mapping records with negative owners. An owner +of -7 corresponds to +XFS_RMAP_OWN_INODES+, which is an inode chunk, and an +owner code of -5 corresponds to +XFS_RMAP_OWN_AG+, which covers free space +B+trees and free space. Let's see if block 40,544 is part of an inode chunk: + +---- +xfs_db> blockget +xfs_db> fsblock 40544 +xfs_db> blockuse +block 40544 (0/40544) type inode +xfs_db> stack +1: + byte offset 166068224, length 4096 + buffer block 324352 (fsbno 40544), 8 bbs + inode 324266, dir inode 324266, type data +xfs_db> type inode +xfs_db> p +core.magic = 0x494e +---- + +Our suspicions are confirmed. Let's also see if 40,327 is part of a free space +tree: + +---- +xfs_db> fsblock 40327 +xfs_db> blockuse +block 40327 (0/40327) type btrmap +xfs_db> type rmapbt +xfs_db> p +magic = 0x524d4233 +---- + +As you can see, the reverse block-mapping B+tree is an important secondary +metadata structure, which can be used to reconstruct damaged primary metadata. diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc index 0ab4906..4d089b6 100644 --- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc +++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc @@ -48,6 +48,8 @@ include::overview.asciidoc[] include::metadata_integrity.asciidoc[] +include::reconstruction.asciidoc[] + include::common_types.asciidoc[] include::magic.asciidoc[] @@ -62,6 +64,8 @@ Global Structures include::allocation_groups.asciidoc[] +include::rmapbt.asciidoc[] + include::journaling_log.asciidoc[] include::internal_inodes.asciidoc[] _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs