From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> --- .../filesystems/xfs/ondisk/allocation_groups.rst | 1 .../filesystems/xfs/ondisk/refcountbt.rst | 154 ++++++++++++++++++++ 2 files changed, 155 insertions(+) create mode 100644 Documentation/filesystems/xfs/ondisk/refcountbt.rst diff --git a/Documentation/filesystems/xfs/ondisk/allocation_groups.rst b/Documentation/filesystems/xfs/ondisk/allocation_groups.rst index 296c520a2fbb..622cef4577bb 100644 --- a/Documentation/filesystems/xfs/ondisk/allocation_groups.rst +++ b/Documentation/filesystems/xfs/ondisk/allocation_groups.rst @@ -1381,3 +1381,4 @@ None of the XFS per-AG B+trees are involved with real time files. It is not possible for real time files to share data blocks. .. include:: rmapbt.rst +.. include:: refcountbt.rst diff --git a/Documentation/filesystems/xfs/ondisk/refcountbt.rst b/Documentation/filesystems/xfs/ondisk/refcountbt.rst new file mode 100644 index 000000000000..c76f4ec7e543 --- /dev/null +++ b/Documentation/filesystems/xfs/ondisk/refcountbt.rst @@ -0,0 +1,154 @@ +.. SPDX-License-Identifier: CC-BY-SA-3.0+ + +Reference Count B+tree +~~~~~~~~~~~~~~~~~~~~~~ + +To support the sharing of file data blocks (reflink), each allocation group +has its own reference count B+tree, which grows in the allocated space like +the inode B+trees. This data could be gleaned by performing an interval query +of the reverse-mapping B+tree, but doing so would come at a huge performance +penalty. Therefore, this data structure is a cache of computable information. + +This B+tree is only present if the XFS\_SB\_FEAT\_RO\_COMPAT\_REFLINK feature +is enabled. The feature requires a version 5 filesystem. + +Each record in the reference count B+tree has the following structure: + +.. code:: c + + struct xfs_refcount_rec { + __be32 rc_startblock; + __be32 rc_blockcount; + __be32 rc_refcount; + }; + +**rc\_startblock** + AG block number of this record. The high bit is set for all records + referring to an extent that is being used to stage a copy on write + operation. This reduces recovery time during mount operations. The + reference count of these staging events must only be 1. + +**rc\_blockcount** + The length of this extent. + +**rc\_refcount** + Number of mappings of this filesystem extent. + +Node pointers are an AG relative block pointer: + +.. code:: c + + struct xfs_refcount_key { + __be32 rc_startblock; + }; + +- As the reference counting is AG relative, all the block numbers are only + 32-bits. + +- The bb\_magic value is "R3FC" (0x52334643). + +- The xfs\_btree\_sblock\_t header is used for intermediate B+tree node as + well as the leaves. + +xfs\_db refcntbt Example +^^^^^^^^^^^^^^^^^^^^^^^^ + +For this example, an XFS filesystem was populated with a root filesystem and a +deduplication program was run to create shared blocks: + +:: + + xfs_db> agf 0 + xfs_db> addr refcntroot + xfs_db> p + magic = 0x52334643 + level = 1 + numrecs = 6 + leftsib = null + rightsib = null + bno = 36892 + lsn = 0x200004ec2 + uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae + owner = 0 + crc = 0x75f35128 (correct) + keys[1-6] = [startblock] 1:[14] 2:[65633] 3:[65780] 4:[94571] 5:[117201] 6:[152442] + ptrs[1-6] = 1:7 2:25836 3:25835 4:18447 5:18445 6:18449 + xfs_db> addr ptrs[3] + xfs_db> p + magic = 0x52334643 + level = 0 + numrecs = 80 + leftsib = 25836 + rightsib = 18447 + bno = 51670 + lsn = 0x200004ec2 + uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae + owner = 0 + crc = 0xc3962813 (correct) + recs[1-80] = [startblock,blockcount,refcount,cowflag] + 1:[65780,1,2,0] 2:[65781,1,3,0] 3:[65785,2,2,0] 4:[66640,1,2,0] + 5:[69602,4,2,0] 6:[72256,16,2,0] 7:[72871,4,2,0] 8:[72879,20,2,0] + 9:[73395,4,2,0] 10:[75063,4,2,0] 11:[79093,4,2,0] 12:[86344,16,2,0] + ... + 80:[35235,10,1,1] + +Notice record 80. The copy on write flag is set and the reference count is 1, +which indicates that the extent 35,235 - 35,244 are being used to stage a copy +on write activity. The "cowflag" field is the high bit of rc\_startblock. + +Record 6 in the reference count B+tree for AG 0 indicates that the AG extent +starting at block 72,256 and running for 16 blocks has a reference count of 2. +This means that there are two files sharing the block: + +:: + + xfs_db> blockget -n + xfs_db> fsblock 72256 + xfs_db> blockuse + block 72256 (0/72256) type rldata inode 25169197 + +The blockuse type changes to "rldata" to indicate that the block is shared +data. Unfortunately, blockuse only tells us about one block owner. If we +happen to have enabled the reverse-mapping B+tree, we can use it to find all +inodes that own this block: + +:: + + xfs_db> agf 0 + xfs_db> addr rmaproot + ... + xfs_db> addr ptrs[3] + ... + xfs_db> addr ptrs[7] + xfs_db> p + magic = 0x524d4233 + level = 0 + numrecs = 22 + leftsib = 65057 + rightsib = 65058 + bno = 291478 + lsn = 0x200004ec2 + uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae + owner = 0 + crc = 0xed7da3f7 (correct) + recs[1-22] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock] + 1:[68957,8,3201,0,0,0,0] 2:[68965,4,25260953,0,0,0,0] + ... + 18:[72232,58,3227,0,0,0,0] 19:[72256,16,25169197,24,0,0,0] + 20:[72290,75,3228,0,0,0,0] 21:[72365,46,3229,0,0,0,0] + +Records 18 and 19 intersect the block 72,256; they tell us that inodes 3,227 +and 25,169,197 both claim ownership. Let us confirm this: + +:: + + xfs_db> inode 25169197 + xfs_db> bmap + data offset 0 startblock 12632259 (3/49347) count 24 flag 0 + data offset 24 startblock 72256 (0/72256) count 16 flag 0 + data offset 40 startblock 12632299 (3/49387) count 18 flag 0 + xfs_db> inode 3227 + xfs_db> bmap + data offset 0 startblock 72232 (0/72232) count 58 flag 0 + +Inodes 25,169,197 and 3,227 both contain mappings to block 0/72,256.