From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> --- Documentation/filesystems/xfs/ondisk/btrees.rst | 197 ++++++++++++++++++++++ Documentation/filesystems/xfs/ondisk/globals.rst | 2 2 files changed, 199 insertions(+) create mode 100644 Documentation/filesystems/xfs/ondisk/btrees.rst diff --git a/Documentation/filesystems/xfs/ondisk/btrees.rst b/Documentation/filesystems/xfs/ondisk/btrees.rst new file mode 100644 index 000000000000..4b4fc6b8e7f4 --- /dev/null +++ b/Documentation/filesystems/xfs/ondisk/btrees.rst @@ -0,0 +1,197 @@ +.. SPDX-License-Identifier: CC-BY-SA-3.0+ + +Fixed Length Record B+trees +--------------------------- + +XFS uses b+trees to index all metadata records. This well known data structure +is used to provide efficient random and sequential access to metadata records +while minimizing seek times. There are two btree formats: a short format for +records pertaining to a single allocation group, since all block pointers in +an AG are 32-bits in size; and a long format for records pertaining to a file, +since file data can have 64-bit block offsets. Each b+tree block is either a +leaf node containing records, or an internal node containing keys and pointers +to other b+tree blocks. The tree consists of a root block which may point to +some number of other blocks; blocks in the bottom level of the b+tree contains +only records. + +Leaf blocks of both types of b+trees have the same general format: a header +describing the data in the block, and an array of records. The specific header +formats are given in the next two sections, and the record format is provided +by the b+tree client itself. The generic b+tree code does not have any +specific knowledge of the record format. + +:: + + +--------+------------+------------+ + | header | record | records... | + +--------+------------+------------+ + +Internal node blocks of both types of b+trees also have the same general +format: a header describing the data in the block, an array of keys, and an +array of pointers. Each pointer may be associated with one or two keys. The +first key uniquely identifies the first record accessible via the leftmost +path down the branch of the tree. + +If the records in a b+tree are indexed by an interval, then a range of keys +can uniquely identify a single record. For example, if a record covers blocks +12-16, then any one of the keys 12, 13, 14, 15, or 16 return the same record. +In this case, the key for the record describing "12-16" is 12. If none of the +records overlap, we only need to store one key. + +This is the format of a standard b+tree node: + +:: + + +--------+---------+---------+---------+---------+ + | header | key | keys... | ptr | ptrs... | + +--------+---------+---------+---------+---------+ + +If the b+tree records do not overlap, performing a b+tree lookup is simple. +Start with the root. If it is a leaf block, perform a binary search of the +records until we find the record with a lower key than our search key. If the +block is a node block, perform a binary search of the keys until we find a key +lower than our search key, then follow the pointer to the next block. Repeat +until we find a record. + +However, if b+tree records contain intervals and are allowed to overlap, the +internal nodes of the b+tree become larger: + +:: + + +--------+---------+----------+---------+-------------+---------+---------+ + | header | low key | high key | low key | high key... | ptr | ptrs... | + +--------+---------+----------+---------+-------------+---------+---------+ + +The low keys are exactly the same as the keys in the non-overlapping b+tree. +High keys, however, are a little different. Recall that a record with a key +consisting of an interval can be referenced by a number of keys. Since the low +key of a record indexes the low end of that key range, the high key indexes +the high end of the key range. Returning to the example above, the high key +for the record describing "12-16" is 16. The high key recorded in a b+tree +node is the largest of the high keys of all records accessible under the +subtree rooted by the pointer. For a level 1 node, this is the largest high +key in the pointed-to leaf node; for any other node, this is the largest of +the high keys in the pointed-to node. + +Nodes and leaves use the same magic numbers. + +Short Format B+trees +~~~~~~~~~~~~~~~~~~~~ + +Each allocation group uses a "short format" B+tree to index various +information about the allocation group. The structure is called short format +because all block pointers are AG block numbers. The trees use the following +header: + +.. code:: c + + struct xfs_btree_sblock { + __be32 bb_magic; + __be16 bb_level; + __be16 bb_numrecs; + __be32 bb_leftsib; + __be32 bb_rightsib; + + /* version 5 filesystem fields start here */ + __be64 bb_blkno; + __be64 bb_lsn; + uuid_t bb_uuid; + __be32 bb_owner; + __le32 bb_crc; + }; + +**bb\_magic** + Specifies the magic number for the per-AG B+tree block. + +**bb\_level** + The level of the tree in which this block is found. If this value is 0, + this is a leaf block and contains records; otherwise, it is a node block + and contains keys and pointers. Level values increase towards the root. + +**bb\_numrecs** + Number of records in this block. + +**bb\_leftsib** + AG block number of the left sibling of this B+tree node. + +**bb\_rightsib** + AG block number of the right sibling of this B+tree node. + +**bb\_blkno** + FS block number of this B+tree block. + +**bb\_lsn** + Log sequence number of the last write to this block. + +**bb\_uuid** + The UUID of this block, which must match either sb\_uuid or sb\_meta\_uuid + depending on which features are set. + +**bb\_owner** + The AG number that this B+tree block ought to be in. + +**bb\_crc** + Checksum of the B+tree block. + +Long Format B+trees +~~~~~~~~~~~~~~~~~~~ + +Long format B+trees are similar to short format B+trees, except that their +block pointers are 64-bit filesystem block numbers instead of 32-bit AG block +numbers. Because of this, long format b+trees can be (and usually are) rooted +in an inode’s data or attribute fork. The nodes and leaves of this B+tree use +the xfs\_btree\_lblock declaration: + +.. code:: c + + struct xfs_btree_lblock { + __be32 bb_magic; + __be16 bb_level; + __be16 bb_numrecs; + __be64 bb_leftsib; + __be64 bb_rightsib; + + /* version 5 filesystem fields start here */ + __be64 bb_blkno; + __be64 bb_lsn; + uuid_t bb_uuid; + __be64 bb_owner; + __le32 bb_crc; + __be32 bb_pad; + }; + +**bb\_magic** + Specifies the magic number for the btree block. + +**bb\_level** + The level of the tree in which this block is found. If this value is 0, + this is a leaf block and contains records; otherwise, it is a node block + and contains keys and pointers. + +**bb\_numrecs** + Number of records in this block. + +**bb\_leftsib** + FS block number of the left sibling of this B+tree node. + +**bb\_rightsib** + FS block number of the right sibling of this B+tree node. + +**bb\_blkno** + FS block number of this B+tree block. + +**bb\_lsn** + Log sequence number of the last write to this block. + +**bb\_uuid** + The UUID of this block, which must match either sb\_uuid or sb\_meta\_uuid + depending on which features are set. + +**bb\_owner** + The AG number that this B+tree block ought to be in. + +**bb\_crc** + Checksum of the B+tree block. + +**bb\_pad** + Pads the structure to 64 bytes. diff --git a/Documentation/filesystems/xfs/ondisk/globals.rst b/Documentation/filesystems/xfs/ondisk/globals.rst index baabe867609d..0c919e561fb1 100644 --- a/Documentation/filesystems/xfs/ondisk/globals.rst +++ b/Documentation/filesystems/xfs/ondisk/globals.rst @@ -2,3 +2,5 @@ Global Structures ================= + +.. include:: btrees.rst