The reverse mapping btree now comes in two flavors: a fat one for
reflink filesystems supporting overlapped interval queries and a thin
one for filesystems that don't share blocks.  Document the new on-disk
formats.

Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
---
 design/XFS_Filesystem_Structure/docinfo.xml     |   16 +++
 design/XFS_Filesystem_Structure/magic.asciidoc  |    1 
 design/XFS_Filesystem_Structure/rmapbt.asciidoc |  108 +++++++++++++++++++++--
 3 files changed, 116 insertions(+), 9 deletions(-)

diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index 009376f..7d32260 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -138,4 +138,20 @@
 				</simplelist>
 			</revdescription>
 		</revision>
+		<revision>
+			<revnumber>3.1415</revnumber>
+			<date>March 2016</date>
+			<author>
+				<firstname>Darrick</firstname>
+				<surname>Wong</surname>
+				<email></email>
+			</author>
+			<revdescription>
+				<simplelist>
+					<member>Move the b+tree discussion to a separate chapter.</member>
+					<member>Discuss overlapping interval b+trees.</member>
+					<member>Document the reverse mapping btree changes when reflink is enabled.</member>
+				</simplelist>
+			</revdescription>
+		</revision>
 	</revhistory>
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index 7caf20e..5ce19a5 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -45,6 +45,7 @@ relevant chapters.
Magic numbers tend to have consistent locations:
 | +XFS_ATTR3_LEAF_MAGIC+ | 0x3bee | | xref:Leaf_Attributes[Leaf Attribute], v5 only
 | +XFS_ATTR3_RMT_MAGIC+ | 0x5841524d | XARM | xref:Remote_Values[Remote Attribute Value], v5 only
 | +XFS_RMAP_CRC_MAGIC+ | 0x524d4233 | RMB3 | xref:Reverse_Mapping_Btree[Reverse Mapping B+tree], v5 only
+| +XFS_RMAPX_CRC_MAGIC+ | 0x34524d42 | 4RMB | xref:Reverse_Mapping_Btree[Reverse Mapping B+tree], v5 only
 | +XFS_REFC_CRC_MAGIC+ | 0x52334643 | R3FC | xref:Reference_Count_Btree[Reference Count B+tree], v5 only
 
 |=====
 
diff --git a/design/XFS_Filesystem_Structure/rmapbt.asciidoc b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
index 2be28fa..bfdc74e 100644
--- a/design/XFS_Filesystem_Structure/rmapbt.asciidoc
+++ b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
@@ -81,18 +81,40 @@ For the moment, there is a requirement that all records in the data or attribute
 forks must match exactly with the corresponding entry in the reverse-mapping
 B+tree.  This may be lifted in future versions of the patchset.
 
-For the reverse-mapping B+tree, the key definition is larger than the usual AG
-block number.  On a classic XFS filesystem, each block has only one owner, which
-means that +rm_startblock+ is sufficient to uniquely identify each record.
-However, shared block support (reflink) on XFS breaks that assumption; now
-filesystem blocks can be linked to any logical block offset of any file inode.
-Therefore, the key must include the owner and offset information to preserve the
-1 to 1 relation between key and record.  The key has the following structure:
+=== Reverse Mapping B+tree without Shared Blocks
+
+For the reverse-mapping B+tree on a filesystem that does not support sharing
+file data blocks, we can uniquely identify each record using only the per-AG
+block number.
The key has the following structure:
 
 [source, c]
 ----
 struct xfs_rmap_key {
 	__be32		rm_startblock;
+};
+----
+
+* As the reverse mapping B+tree is AG relative, all the block numbers are only
+32-bits.
+* The +bb_magic+ value is "RMB3" (0x524d4233).
+* The +xfs_btree_sblock_t+ header is used for intermediate B+tree node as well
+as the leaves.
+
+=== Reverse Mapping B+tree with Shared Blocks
+
+For the reverse-mapping B+tree on a filesystem that supports sharing of file
+data blocks, the key definition is larger than the usual AG block number.  On a
+classic XFS filesystem, each block has only one owner, which means that
++rm_startblock+ is sufficient to uniquely identify each record.  However,
+shared block support (reflink) on XFS breaks that assumption; now filesystem
+blocks can be linked to any logical block offset of any file inode.  Therefore,
+the key must include the owner and offset information to preserve the 1 to 1
+relation between key and record.  The key has the following structure:
+
+[source, c]
+----
+struct xfs_rmapx_key {
+	__be32		rm_startblock;
 	__be64		rm_owner;
 	__be64		rm_fork:1;
 	__be64		rm_bmbt:1;
@@ -102,9 +124,17 @@ struct xfs_rmap_key {
 
 * As the reference counting is AG relative, all the block numbers are only
 32-bits.
-* The +bb_magic+ value is "RMB3" (0x524d4233).
+* The +bb_magic+ value is "4RMB" (0x34524d42).
 * The +xfs_btree_sblock_t+ header is used for intermediate B+tree node as well
 as the leaves.
+* Each pointer is associated with two keys.  The first of these is the "low
+key", which is the key of the smallest record accessible through the pointer.
+This low key has the same meaning as the key in all other btrees.  The second
+key is the "high key", which is the largest key that can be used to access
+any of the records underneath the pointer.
Recall that each record
+in the reverse mapping b+tree describes an interval of physical blocks mapped
+to an interval of logical file block offsets; therefore, it makes sense that
+a range of keys can be used to find a record.
 
 === xfs_db rmapbt Example
 
@@ -112,7 +142,7 @@ This example shows a reverse-mapping B+tree from a freshly formatted root
 filesystem:
 
 ----
-xfs_db> agi 0
+xfs_db> agf 0
 xfs_db> addr rmaproot
 xfs_db> p
 magic = 0x524d4233
@@ -222,3 +252,63 @@ magic = 0x524d4233
 
 As you can see, the reverse block-mapping B+tree is an important secondary
 metadata structure, which can be used to reconstruct damaged primary metadata.
+Now let's look at an extended rmap btree:
+
+----
+xfs_db> agf 0
+xfs_db> addr rmaproot
+xfs_db> p
+magic = 0x34524d42
+level = 1
+numrecs = 5
+leftsib = null
+rightsib = null
+bno = 6368
+lsn = 0x100000d1b
+uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+owner = 0
+crc = 0x8d4ace05 (correct)
+keys[1-5] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,offset_hi,attrfork_hi,bmbtblock_hi]
+1:[0,-3,0,0,0,705,132,681,0,0]
+2:[24,5761,0,0,0,548,5761,524,0,0]
+3:[24,5929,0,0,0,380,5929,356,0,0]
+4:[24,6097,0,0,0,212,6097,188,0,0]
+5:[24,6277,0,0,0,807,-7,0,0,0]
+ptrs[1-5] = 1:5 2:771 3:9 4:10 5:11
+----
+
+The second pointer stores both the low key [24,5761,0,0,0] and the high key
+[548,5761,524,0,0], which means that we can expect block 771 to contain records
+starting at physical block 24, inode 5761, offset zero; and that one of the
+records can be used to find a reverse mapping for physical block 548, inode
+5761, and offset 524:
+
+----
+xfs_db> addr ptrs[2]
+xfs_db> p
+magic = 0x34524d42
+level = 0
+numrecs = 168
+leftsib = 5
+rightsib = 9
+bno = 6168
+lsn = 0x100000d1b
+uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+owner = 0
+crc = 0xd58eff0e (correct)
+recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+1:[24,525,5761,0,0,0,0]
+2:[24,524,5762,0,0,0,0]
+3:[24,523,5763,0,0,0,0]
+...
+166:[24,360,5926,0,0,0,0]
+167:[24,359,5927,0,0,0,0]
+168:[24,358,5928,0,0,0,0]
+----
+
+Observe that the first record in the block starts at physical block 24, inode
+5761, offset zero, just as we expected.  Note that this first record is also
+indexed by the highest key as provided in the node block; physical block 548,
+inode 5761, offset 524 is the very last block mapped by this record.  Furthermore,
+note that record 168, despite being the last record in this block, has a lower
+maximum key (physical block 381, inode 5928, offset 357) than the first record.
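To make the low key / high key relationship in the patch concrete, here is a minimal sketch of how the two keys could be derived from leaf records. It is not the kernel or xfsprogs implementation: the `rmap_rec` and `rmap_key` structures and all four helper functions are hypothetical, fields are host-endian, and the attrfork/bmbtblock flags and the special negative owners seen in the dump are ignored. It assumes only what the text states, namely that a record maps `blockcount` physical blocks starting at `startblock` to the same number of file offsets starting at `offset`, and that keys order by startblock, then owner, then offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian view of one leaf record (flags omitted). */
struct rmap_rec {
	uint32_t startblock;	/* first AG block of the extent */
	uint32_t blockcount;	/* length of the extent in blocks */
	uint64_t owner;		/* owning inode number */
	uint64_t offset;	/* first logical file block offset */
};

/* Hypothetical host-endian key, ordered by (startblock, owner, offset). */
struct rmap_key {
	uint32_t startblock;
	uint64_t owner;
	uint64_t offset;
};

/* Low key: where the record's interval begins. */
static struct rmap_key rmap_low_key(const struct rmap_rec *r)
{
	struct rmap_key k = { r->startblock, r->owner, r->offset };
	return k;
}

/* High key: the last physical block and last file offset the record maps. */
static struct rmap_key rmap_high_key(const struct rmap_rec *r)
{
	struct rmap_key k;

	assert(r->blockcount > 0);
	k.startblock = r->startblock + r->blockcount - 1;
	k.owner = r->owner;
	k.offset = r->offset + r->blockcount - 1;
	return k;
}

/* Three-way comparison in key order. */
static int rmap_key_cmp(const struct rmap_key *a, const struct rmap_key *b)
{
	if (a->startblock != b->startblock)
		return a->startblock < b->startblock ? -1 : 1;
	if (a->owner != b->owner)
		return a->owner < b->owner ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* High key for a pointer: the largest high key among the n records below it. */
static struct rmap_key rmap_block_high_key(const struct rmap_rec *recs, size_t n)
{
	struct rmap_key best, cur;
	size_t i;

	assert(n > 0);
	best = rmap_high_key(&recs[0]);
	for (i = 1; i < n; i++) {
		cur = rmap_high_key(&recs[i]);
		if (rmap_key_cmp(&cur, &best) > 0)
			best = cur;
	}
	return best;
}
```

Applied to the first record in the leaf dump, [24,525,5761,0], `rmap_high_key()` yields startblock 24 + 525 - 1 = 548 and offset 0 + 525 - 1 = 524, which matches the high key [548,5761,524] stored alongside ptrs[2] in the node block. Because the high key is a maximum over all records rather than a property of the last one, the scan in `rmap_block_high_key()` has to visit every record in the block.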