From: Darrick J. Wong <djwong@xxxxxxxxxx>

Update the ondisk format documentation to reflect the realtime refcount btree.

Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
---
 .../internal_inodes.asciidoc | 5 -
 .../journaling_log.asciidoc | 9 +
 design/XFS_Filesystem_Structure/magic.asciidoc | 1
 design/XFS_Filesystem_Structure/realtime.asciidoc | 5 -
 .../XFS_Filesystem_Structure/rtrefcountbt.asciidoc | 173 ++++++++++++++++++++
 5 files changed, 190 insertions(+), 3 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc

diff --git a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
index 7da0cdf6..0fc758c6 100644
--- a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
+++ b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
@@ -31,6 +31,7 @@ of those inodes have been deallocated and may be reused by future features.
 | xref:Real-Time_Bitmap_Inode[Realtime Bitmap] | /realtime/bitmap
 | xref:Real-Time_Summary_Inode[Realtime Summary] | /realtime/summary
 | xref:Real_time_Reverse_Mapping_Btree[Realtime Reverse Mapping B+tree] | /realtime/*.rmap
+| xref:Real_time_Refcount_Btree[Realtime Reference Count B+tree] | /realtime/*.refcount
 |=====
 
 Metadata files are flagged by the +XFS_DIFLAG2_METADATA+ flag in the
@@ -325,4 +326,6 @@ xref:Real-Time_Bitmap_Inode[Bitmap Inode] and the
 xref:Real-Time_Summary_Inode[Summary Inode].
 
 Each realtime group can allocate one inode to managing a
-xref:Real_time_Reverse_Mapping_Btree[reverse-index of space] usage.
+xref:Real_time_Reverse_Mapping_Btree[reverse-index of space] usage, and
+a second one to manage xref:Real_time_Refcount_Btree[reference counts] of space
+usage.
diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
index 52513b18..2c7b7383 100644
--- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
+++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
@@ -569,6 +569,15 @@ reverse mapping operation we want. The upper three bytes are flag bits.
 | +XFS_REFCOUNT_EXTENT_FREE_COW+ | Unreserve an extent for staging copy on write.
 |=====
 
+.Reference count update log intent flags
+[options="header"]
+|=====
+| Value | Description
+| +XFS_REFCOUNT_EXTENT_REALTIME+ | Extent describes a range of blocks on the
+realtime volume. The range must be aligned to the realtime extent size,
+because extents cannot be partially shared.
+|=====
+
 The ``reference count update intent'' operation comes first; it tells the log
 that XFS wants to update some reference counts. This record is crucial for
 correct log recovery because it enables us to spread a complex metadata update
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index c83f59a2..19ee52b8 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -49,6 +49,7 @@ relevant chapters.
Magic numbers tend to have consistent locations:
 | +XFS_REFC_CRC_MAGIC+ | 0x52334643 | R3FC | xref:Reference_Count_Btree[Reference Count B+tree], v5 only
 | +XFS_MD_MAGIC+ | 0x5846534d | XFSM | xref:Metadata_Dumps[Metadata Dumps]
 | +XFS_RTSB_MAGIC+ | 0x58524750 | XRGP | xref:Realtime_Groups[Realtime Groups]
+| +XFS_RTREFC_CRC_MAGIC+ | 0x52434e54 | RCNT | xref:Real_time_Refcount_Btree[Real-Time Reference Count B+tree], v5 only
 |=====
 
 The magic numbers for log items are at offset zero in each log item, but items
diff --git a/design/XFS_Filesystem_Structure/realtime.asciidoc b/design/XFS_Filesystem_Structure/realtime.asciidoc
index 77d947ba..5d0b47a2 100644
--- a/design/XFS_Filesystem_Structure/realtime.asciidoc
+++ b/design/XFS_Filesystem_Structure/realtime.asciidoc
@@ -14,8 +14,7 @@ By placing the real time device (and the journal) on separate high-performance
 storage devices, it is possible to reduce most of the unpredictability in I/O
 response times that come from metadata operations.
 
-None of the XFS per-AG B+trees are involved with real time files. It is not
-possible for real time files to share data blocks.
+None of the XFS per-AG B+trees are involved with real time files.
 
 [[Real-Time_Bitmap_Inode]]
 == Free Space Bitmap Inode
@@ -233,3 +232,5 @@ meta_uuid = c52adb8a-48a6-4325-b251-d4dcb30889ea
 ----
 
 include::rtrmapbt.asciidoc[]
+
+include::rtrefcountbt.asciidoc[]
diff --git a/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc b/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc
new file mode 100644
index 00000000..617badbf
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc
@@ -0,0 +1,173 @@
+[[Real_time_Refcount_Btree]]
+=== Reference Count B+tree
+
+If the reflink and real-time storage device features are enabled, each
+real-time group has its own reference count B+tree.
+
+As mentioned in the chapter about xref:Reflink_Deduplication[sharing data
+blocks], this data structure is necessary to track how many times each extent
+in the realtime volume has been mapped. This is how the copy-on-write code
+determines what to do when a realtime file is written.
+
+This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_REFLINK+ feature is
+enabled and a real time device is present. The feature requires a version 5
+filesystem.
+
+The rtgroup reference count B+tree is rooted in an inode's data fork; the inode
+number can be found by resolving the path +/realtime/$rgno.refcount+ in the
+metadata directory tree. The B+tree blocks themselves are stored
+in the regular filesystem. The structures used for an inode's B+tree root are:
+
+[source, c]
+----
+struct xfs_rtrefcount_root {
+	__be16			bb_level;
+	__be16			bb_numrecs;
+};
+----
+
+* If the B+tree contains only a single level, the ondisk data fork area begins
+with a +xfs_rtrefcount_root+ header followed by an array of +xfs_refcount_rec+
+leaf records.
+
+* Otherwise, the ondisk data fork area begins with the +xfs_rtrefcount_root+
+header and is followed first by an array of +xfs_refcount_key+ values and then
+an array of +xfs_rtrefcount_ptr_t+ values. The size of both arrays is
+specified by the header's +bb_numrecs+ value.
+
+* The root node in the inode can only contain up to 28 leaf records or
+key/pointer pairs for a standard 512 byte inode before a new level of nodes is
+added between the root and the leaves. +di_forkoff+ should always be zero,
+because there are no extended attributes.
+
+Each record in an rtgroup reference count B+tree has the same structure as a
+record in an AG reference count B+tree:
+
+[source, c]
+----
+struct xfs_refcount_rec {
+	__be32			rc_startblock;
+	__be32			rc_blockcount;
+	__be32			rc_refcount;
+};
+----
+
+*rc_startblock*::
+rtgroup block number of this record. Note that reference count records are
+tracked in units of realtime blocks, not realtime extents.
+However, records must be aligned to the realtime extent size in accordance with
+the existing realtime extent handling strategy. The high bit
+(+XFS_REFC_COW_FLAG+) is set for all records referring to an extent that is
+being used to stage a copy on write operation. This reduces recovery time
+during mount operations. The reference count of these staging extents must
+be exactly 1.
+
+*rc_blockcount*::
+The length of this extent, in filesystem blocks.
+
+*rc_refcount*::
+Number of times this extent has been shared.
+
+The key has the following structure:
+
+[source, c]
+----
+struct xfs_refcount_key {
+	__be32			rc_startblock;
+};
+----
+
+* All block numbers are 32-bit rtgroup device block numbers, though the
+key should be aligned to the realtime extent size.
+
+* The +bb_magic+ value is ``RCNT'' (0x52434e54).
+
+* The +struct xfs_btree_lblock+ header is used for intermediate B+tree nodes as
+well as the leaves.
+
+==== xfs_db rtrefcountbt Example
+
+This example shows a real-time reference count B+tree from a freshly
+populated filesystem. One directory tree has been reflinked:
+
+----
+xfs_db> path -m /realtime/0.refcount
+xfs_db> p
+core.magic = 0x494e
+core.mode = 0100000
+core.version = 3
+core.format = 6 (refcount)
+...
+v3.inumber = 134
+v3.uuid = 23d157a4-8ca7-4fca-8782-637dc6746105
+v3.reflink = 0
+v3.cowextsz = 0
+v3.dax = 0
+v3.bigtime = 1
+v3.nrext64 = 1
+v3.metadata = 1
+u3.rtrefcbt.level = 1
+u3.rtrefcbt.numrecs = 2
+u3.rtrefcbt.keys[1-2] = [startblock,cowflag]
+1:[4,0]
+2:[344,0]
+u3.rtrefcbt.ptrs[1-2] = 1:8 2:9
+----
+
+Notice that this is a two-level refcount btree; we must continue towards the
+leaf level.
+
+----
+xfs_db> addr u3.rtrefcbt.ptrs[2]
+xfs_db> p
+magic = 0x52434e54
+level = 0
+numrecs = 170
+leftsib = 8
+rightsib = null
+bno = 72
+lsn = 0
+uuid = 23d157a4-8ca7-4fca-8782-637dc6746105
+owner = 134
+crc = 0x21e04c3 (correct)
+recs[1-170] = [startblock,blockcount,refcount,cowflag]
+1:[344,1,2,0]
+2:[346,1,2,0]
+3:[348,1,2,0]
+4:[350,1,2,0]
+5:[352,1,2,0]
+6:[354,1,2,0]
+...
+----
+
+This indicates that realtime block 354 is shared. Let's use the realtime
+reverse mapping information to find which files are sharing these blocks:
+
+----
+xfs_db> fsmap -r 354 354
+0: 0/1 len 682 owner 10015 offset 0 bmbt 0 attrfork 0 extflag 0
+1: 0/354 len 1 owner 10014 offset 353 bmbt 0 attrfork 0 extflag 0
+----
+
+It looks as though inodes 10,014 and 10,015 share this block. Let us confirm
+this by navigating to those inodes and dumping the data fork mappings:
+
+----
+xfs_db> inode 10015
+xfs_db> p core.realtime
+core.realtime = 1
+xfs_db> bmap
+data offset 0 startblock 1 (0/1) count 682 flag 0
+xfs_db> inode 10014
+xfs_db> p core.realtime
+core.realtime = 1
+xfs_db> bmap 350 10
+data offset 351 startblock 352 (0/352) count 1 flag 0
+data offset 353 startblock 354 (0/354) count 1 flag 0
+data offset 355 startblock 356 (0/356) count 1 flag 0
+data offset 357 startblock 358 (0/358) count 1 flag 0
+data offset 359 startblock 360 (0/360) count 1 flag 0
+----
+
+Notice that both inodes have their realtime flags set, and both of them map
+a data fork extent to the same realtime block 354.
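
Illustration only, not part of the patch: a minimal C sketch of how a userspace tool might decode one 12-byte ondisk xfs_refcount_rec as described in the rtrefcountbt chapter above. Every demo_ name is hypothetical and exists in neither the kernel nor xfsprogs. The sketch assumes, following the text, that all three fields are big-endian, that the high bit of rc_startblock is the copy-on-write staging flag, that staging records carry a reference count of 1, and that the start block is aligned to the realtime extent size (passed in by the caller as rextsize, in blocks).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption from the text: the high bit of rc_startblock marks CoW staging. */
#define DEMO_REFC_COW_FLAG	(UINT32_C(1) << 31)

/* Hypothetical host-endian view of one record; not a kernel structure. */
struct demo_refcount_rec {
	uint32_t	startblock;	/* rtgroup block number, flag bit cleared */
	uint32_t	blockcount;	/* extent length in blocks */
	uint32_t	refcount;	/* number of owners of this extent */
	bool		cow;		/* true if this is a CoW staging extent */
};

/* Read one big-endian 32-bit value from an unaligned byte buffer. */
static uint32_t demo_get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the 12 raw bytes of an ondisk record into host byte order. */
static struct demo_refcount_rec demo_decode_rec(const uint8_t raw[12])
{
	struct demo_refcount_rec rec;
	uint32_t start;

	assert(raw);
	start = demo_get_be32(raw);

	rec.cow = (start & DEMO_REFC_COW_FLAG) != 0;
	rec.startblock = start & ~DEMO_REFC_COW_FLAG;
	rec.blockcount = demo_get_be32(raw + 4);
	rec.refcount = demo_get_be32(raw + 8);
	return rec;
}

/*
 * Sanity checks taken from the prose only: a nonzero length, a start block
 * aligned to the realtime extent size, and a refcount of exactly 1 for
 * staging extents.  A real verifier would check considerably more.
 */
static bool demo_rec_is_plausible(const struct demo_refcount_rec *rec,
				  uint32_t rextsize)
{
	if (rextsize == 0 || rec->blockcount == 0)
		return false;
	if (rec->startblock % rextsize != 0)
		return false;
	if (rec->cow && rec->refcount != 1)
		return false;
	return true;
}
```

Feeding demo_decode_rec() the bytes 00 00 01 62 00 00 00 01 00 00 00 02 yields startblock 354, blockcount 1, refcount 2 and no CoW flag, which matches record 6 in the xfs_db leaf dump above.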