Re: [PATCH 5/6] xfsdocs: reverse-mapping btree documentation

On Wed, Jun 29, 2016 at 04:03:07PM +0200, Carlos Maiolino wrote:
> On Thu, Jun 16, 2016 at 06:48:39PM -0700, Darrick J. Wong wrote:
> > Add chapters on the operation of the reverse mapping btree and future
> > things we could do with rmap data.
> > 
> > v2: Add magic number to the table.
> > 
> 
> Hi Darrick.
> 
> Just FYI, I'm not sure why, but this patch came to the list in base64 format,
> although the remaining ones are in text-plain.

It's probably the curly quotes, I'll have a look.

--D

> 
> 
> > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > ---
> >  .../allocation_groups.asciidoc                     |   24 +-
> >  design/XFS_Filesystem_Structure/docinfo.xml        |   17 +
> >  .../journaling_log.asciidoc                        |  130 +++++++++
> >  design/XFS_Filesystem_Structure/magic.asciidoc     |    3 
> >  .../reconstruction.asciidoc                        |   53 +++
> >  design/XFS_Filesystem_Structure/rmapbt.asciidoc    |  305 ++++++++++++++++++++
> >  .../xfs_filesystem_structure.asciidoc              |    4 
> >  7 files changed, 528 insertions(+), 8 deletions(-)
> >  create mode 100644 design/XFS_Filesystem_Structure/reconstruction.asciidoc
> >  create mode 100644 design/XFS_Filesystem_Structure/rmapbt.asciidoc
> > 
> > 
> > diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> > index 55bbc50..8ced83a 100644
> > --- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> > +++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> > @@ -12,6 +12,7 @@ Each AG has the following characteristics:
> >           * A super block describing overall filesystem info
> >           * Free space management
> >           * Inode allocation and tracking
> > +         * Reverse block-mapping index (optional)
> >  
> >  Having multiple AGs allows XFS to handle most operations in parallel without
> >  degrading performance as the number of concurrent accesses increases.
> > @@ -379,6 +380,12 @@ it doesn't understand the flag.
> >  Free inode B+tree.  Each allocation group contains a B+tree to track inode chunks
> >  containing free inodes.  This is a performance optimization to reduce the time
> >  required to allocate inodes.
> > +
> > +| +XFS_SB_FEAT_RO_COMPAT_RMAPBT+ |
> > +Reverse mapping B+tree.  Each allocation group contains a B+tree containing
> > +records mapping AG blocks to their owners.  See the section about
> > +xref:Reconstruction[reconstruction] for more details.
> > +
> >  |=====
> >  
> >  *sb_features_incompat*::
> > @@ -529,9 +536,7 @@ struct xfs_agf {
> >       __be32              agf_seqno;
> >       __be32              agf_length;
> >       __be32              agf_roots[XFS_BTNUM_AGF];
> > -     __be32              agf_spare0;
> >       __be32              agf_levels[XFS_BTNUM_AGF];
> > -     __be32              agf_spare1;
> >       __be32              agf_flfirst;
> >       __be32              agf_fllast;
> >       __be32              agf_flcount;
> > @@ -550,9 +555,10 @@ struct xfs_agf {
> >  };
> >  ----
> >  
> > -The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 2:
> > -index 0 for the free space B+tree indexed by block number; and index 1 for the
> > -free space B+tree indexed by extent size.
> > +The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 3:
> > +index 0 for the free space B+tree indexed by block number; index 1 for the free
> > +space B+tree indexed by extent size; and index 2 for the reverse-mapping
> > +B+tree.
> >  
> >  *agf_magicnum*::
> >  Specifies the magic number for the AGF sector: ``XAGF'' (0x58414746).
> > @@ -570,11 +576,13 @@ this could be less than the +sb_agblocks+ value. It is this value that should
> >  be used to determine the size of the AG.
> >  
> >  *agf_roots*::
> > -Specifies the block number for the root of the two free space B+trees.
> > +Specifies the block number for the root of the two free space B+trees and the
> > +reverse-mapping B+tree, if enabled.
> >  
> >  *agf_levels*::
> > -Specifies the level or depth of the two free space B+trees. For a fresh AG, this
> > -will be one, and the ``roots'' will point to a single leaf of level 0.
> > +Specifies the level or depth of the two free space B+trees and the
> > +reverse-mapping B+tree, if enabled.  For a fresh AG, this value will be one,
> > +and the ``roots'' will point to a single leaf of level 0.
> >  
> >  *agf_flfirst*::
> >  Specifies the index of the first ``free list'' block. Free lists are covered in
> > diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
> > index cc5596d..44f944a 100644
> > --- a/design/XFS_Filesystem_Structure/docinfo.xml
> > +++ b/design/XFS_Filesystem_Structure/docinfo.xml
> > @@ -122,4 +122,21 @@
> >  			</simplelist>
> >  		</revdescription>
> >  	</revision>
> > +	<revision>
> > +		<revnumber>3.141</revnumber>
> > +		<date>June 2016</date>
> > +		<author>
> > +			<firstname>Darrick</firstname>
> > +			<surname>Wong</surname>
> > +			<email></email>
> > +		</author>
> > +		<revdescription>
> > +			<simplelist>
> > +				<member>Document the reverse-mapping btree.</member>
> > +				<member>Move the b+tree info to a separate chapter.</member>
> > +				<member>Discuss overlapping interval b+trees.</member>
> > +				<member>Discuss new log items for atomic updates.</member>
> > +			</simplelist>
> > +		</revdescription>
> > +	</revision>
> >  </revhistory>
> > diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
> > index 67d209f..0fbbcee 100644
> > --- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
> > +++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
> > @@ -209,6 +209,8 @@ magic number to distinguish themselves.  Buffer data items only appear after
> >  | +XFS_LI_DQUOT+		| 0x123d        | xref:Quota_Update_Log_Item[Update Quota]
> >  | +XFS_LI_QUOTAOFF+		| 0x123e        | xref:Quota_Off_Log_Item[Quota Off]
> >  | +XFS_LI_ICREATE+		| 0x123f        | xref:Inode_Create_Log_Item[Inode Creation]
> > +| +XFS_LI_RUI+			| 0x1240        | xref:RUI_Log_Item[Reverse Mapping Update Intent]
> > +| +XFS_LI_RUD+			| 0x1241        | xref:RUD_Log_Item[Reverse Mapping Update Done]
> >  |=====
> >  
> >  [[Log_Transaction_Headers]]
> > @@ -386,6 +388,134 @@ Variable-length array of extents to be freed.  The array length is given by
> >  +xfs_extent_32_t+; this can be determined from the log item size (+oh_len+) and
> >  the number of extents (+efd_nextents+).
> >  
> > +[[RUI_Log_Item]]
> > +=== Reverse Mapping Update Intent
> > +
> > +The next two operation types work together to handle deferred reverse mapping
> > +updates.  Naturally, the mappings to be updated can be expressed in terms of
> > +mapping extents:
> > +
> > +[source, c]
> > +----
> > +struct xfs_map_extent {
> > +     __uint64_t                me_owner;
> > +     __uint64_t                me_startblock;
> > +     __uint64_t                me_startoff;
> > +     __uint32_t                me_len;
> > +     __uint32_t                me_flags;
> > +};
> > +----
> > +
> > +*me_owner*::
> > +Owner of this reverse mapping.  See the values in the section about
> > +xref:Reverse_Mapping_Btree[reverse mapping] for more information.
> > +
> > +*me_startblock*::
> > +Filesystem block of this mapping.
> > +
> > +*me_startoff*::
> > +Logical block offset of this mapping.
> > +
> > +*me_len*::
> > +The length of this mapping.
> > +
> > +*me_flags*::
> > +The lower byte of this field is a type code indicating what sort of
> > +reverse mapping operation we want.  The upper three bytes are flag bits.
> > +
> > +.Reverse mapping update log intent types
> > +[options="header"]
> > +|=====
> > +| Value				| Description
> > +| +XFS_RMAP_EXTENT_MAP+		| Add a reverse mapping for file data.
> > +| +XFS_RMAP_EXTENT_MAP_SHARED+	| Add a reverse mapping for file data for a file with shared blocks.
> > +| +XFS_RMAP_EXTENT_UNMAP+	| Remove a reverse mapping for file data.
> > +| +XFS_RMAP_EXTENT_UNMAP_SHARED+	| Remove a reverse mapping for file data for a file with shared blocks.
> > +| +XFS_RMAP_EXTENT_CONVERT+	| Convert a reverse mapping for file data between unwritten and normal.
> > +| +XFS_RMAP_EXTENT_CONVERT_SHARED+	| Convert a reverse mapping for file data between unwritten and normal for a file with shared blocks.
> > +| +XFS_RMAP_EXTENT_ALLOC+	| Add a reverse mapping for non-file data.
> > +| +XFS_RMAP_EXTENT_FREE+	| Remove a reverse mapping for non-file data.
> > +|=====
> > +
> > +.Reverse mapping update log intent flags
> > +[options="header"]
> > +|=====
> > +| Value				| Description
> > +| +XFS_RMAP_EXTENT_ATTR_FORK+	| Extent is for the attribute fork.
> > +| +XFS_RMAP_EXTENT_BMBT_BLOCK+	| Extent is for a block mapping btree block.
> > +| +XFS_RMAP_EXTENT_UNWRITTEN+	| Extent is unwritten.
> > +|=====
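
For anyone reading along, the split of +me_flags+ described above can be sketched in C as follows. This is only a sketch: the mask values are derived from the prose (type code in the low byte, flag bits in the upper three bytes), and the names ME_TYPE_MASK, ME_FLAGS_MASK, me_type(), me_flag_bits() and me_pack() are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed masks, derived only from the prose: low byte = type code. */
#define ME_TYPE_MASK   0x000000FFu
/* The remaining upper three bytes are treated as flag bits. */
#define ME_FLAGS_MASK  0xFFFFFF00u

/* Return the operation type code stored in the low byte of me_flags. */
static inline uint32_t me_type(uint32_t me_flags)
{
	return me_flags & ME_TYPE_MASK;
}

/* Return only the flag bits stored in the upper three bytes. */
static inline uint32_t me_flag_bits(uint32_t me_flags)
{
	return me_flags & ME_FLAGS_MASK;
}

/* Combine a type code and flag bits back into one me_flags word. */
static inline uint32_t me_pack(uint32_t type, uint32_t flag_bits)
{
	assert((type & ~ME_TYPE_MASK) == 0);       /* type must fit in one byte */
	assert((flag_bits & ~ME_FLAGS_MASK) == 0); /* flags must avoid the low byte */
	return type | flag_bits;
}
```

Keeping the two halves behind separate accessors means a caller never has to remember which bits belong to the type code and which to the flags.
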
> > +
> > +The ``rmap update intent'' operation comes first; it tells the log that XFS
> > +wants to update some reverse mappings.  This record is crucial for correct log
> > +recovery because it enables us to spread a complex metadata update across
> > +multiple transactions while ensuring that an update interrupted by a crash
> > +midway through will be replayed fully during log recovery.
> > +
> > +[source, c]
> > +----
> > +struct xfs_rui_log_format {
> > +     __uint16_t                rui_type;
> > +     __uint16_t                rui_size;
> > +     __uint32_t                rui_nextents;
> > +     __uint64_t                rui_id;	
> > +     struct xfs_map_extent     rui_extents[1];
> > +};
> > +----
> > +
> > +*rui_type*::
> > +The signature of an RUI operation, 0x1240.  This value is in host-endian order,
> > +not big-endian like the rest of XFS.
> > +
> > +*rui_size*::
> > +Size of this log item.  Should be 1.
> > +
> > +*rui_nextents*::
> > +Number of reverse mappings.
> > +
> > +*rui_id*::
> > +A 64-bit number that binds the corresponding RUD log item to this RUI log item.
> > +
> > +*rui_extents*::
> > +Variable-length array of reverse mappings to update.
> > +
> > +[[RUD_Log_Item]]
> > +=== Completion of Reverse Mapping Updates
> > +
> > +The ``reverse mapping update done'' operation complements the ``reverse mapping
> > +update intent'' operation.  This second operation indicates that the update
> > +actually happened, so that log recovery needn't replay the update.  The RUD and
> > +the actual updates are typically found in a new transaction following the
> > +transaction in which the RUI was logged.
> > +
> > +[source, c]
> > +----
> > +struct xfs_rud_log_format {
> > +      __uint16_t               rud_type;
> > +      __uint16_t               rud_size;
> > +      __uint32_t               rud_nextents;
> > +      __uint64_t               rud_rui_id;
> > +      struct xfs_map_extent    rud_extents[1];
> > +};
> > +----
> > +
> > +*rud_type*::
> > +The signature of an RUD operation, 0x1241.  This value is in host-endian order,
> > +not big-endian like the rest of XFS.
> > +
> > +*rud_size*::
> > +Size of this log item.  Should be 1.
> > +
> > +*rud_nextents*::
> > +Number of reverse mappings.
> > +
> > +*rud_rui_id*::
> > +A 64-bit number that binds the corresponding RUI log item to this RUD log item.
> > +
> > +*rud_extents*::
> > +Variable-length array of reverse mappings.  The array length is given by
> > ++rud_nextents+.
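
To illustrate how the id binds the two items together, here is a minimal sketch of the bookkeeping a recovery pass might do: remember every RUI it sees, mark it done when a RUD carrying the same id turns up, and treat whatever is left as work to replay. The struct pending_rui type and both helper names are hypothetical; this is not the actual log recovery code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory summary of one intent seen during a log scan. */
struct pending_rui {
	uint64_t rui_id;   /* copied from rui_id in the RUI log item */
	int      done;     /* set once a RUD with a matching id is seen */
};

/* Mark the RUI whose id matches rud_rui_id as done; return 1 if found. */
static int rui_mark_done(struct pending_rui *ruis, size_t n, uint64_t rud_rui_id)
{
	assert(ruis != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!ruis[i].done && ruis[i].rui_id == rud_rui_id) {
			ruis[i].done = 1;
			return 1;
		}
	}
	return 0;
}

/* Count intents that never saw a matching RUD and so still need replay. */
static size_t rui_count_unfinished(const struct pending_rui *ruis, size_t n)
{
	size_t count = 0;

	assert(ruis != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!ruis[i].done)
			count++;
	return count;
}
```
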
> > +
> >  [[Inode_Log_Item]]
> >  === Inode Updates
> >  
> > diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
> > index 301cfa0..10fd15f 100644
> > --- a/design/XFS_Filesystem_Structure/magic.asciidoc
> > +++ b/design/XFS_Filesystem_Structure/magic.asciidoc
> > @@ -44,6 +44,7 @@ relevant chapters.  Magic numbers tend to have consistent locations:
> >  | +XFS_ATTR_LEAF_MAGIC+		| 0xfbee	|     	| xref:Leaf_Attributes[Leaf Attribute]
> >  | +XFS_ATTR3_LEAF_MAGIC+	| 0x3bee	|     	| xref:Leaf_Attributes[Leaf Attribute], v5 only
> >  | +XFS_ATTR3_RMT_MAGIC+		| 0x5841524d	| XARM	| xref:Remote_Values[Remote Attribute Value], v5 only
> > +| +XFS_RMAP_CRC_MAGIC+		| 0x524d4233	| RMB3	| xref:Reverse_Mapping_Btree[Reverse Mapping B+tree], v5 only
> >  |=====
> >  
> >  The magic numbers for log items are at offset zero in each log item, but items
> > @@ -61,6 +62,8 @@ are not aligned to blocks.
> >  | +XFS_LI_DQUOT+		| 0x123d        |       | xref:Quota_Update_Log_Item[Update Quota Log Item]
> >  | +XFS_LI_QUOTAOFF+		| 0x123e        |       | xref:Quota_Off_Log_Item[Quota Off Log Item]
> >  | +XFS_LI_ICREATE+		| 0x123f        |       | xref:Inode_Create_Log_Item[Inode Creation Log Item]
> > +| +XFS_LI_RUI+			| 0x1240        |       | xref:RUI_Log_Item[Reverse Mapping Update Intent]
> > +| +XFS_LI_RUD+			| 0x1241        |       | xref:RUD_Log_Item[Reverse Mapping Update Done]
> >  |=====
> >  
> >  = Theoretical Limits
> > diff --git a/design/XFS_Filesystem_Structure/reconstruction.asciidoc b/design/XFS_Filesystem_Structure/reconstruction.asciidoc
> > new file mode 100644
> > index 0000000..f172e0f
> > --- /dev/null
> > +++ b/design/XFS_Filesystem_Structure/reconstruction.asciidoc
> > @@ -0,0 +1,53 @@
> > +[[Reconstruction]]
> > += Metadata Reconstruction
> > +
> > +[NOTE]
> > +This is a theoretical discussion of how reconstruction could work; none of this
> > +is implemented as of 2015.
> > +
> > +A simple UNIX filesystem can be thought of in terms of a directed acyclic graph.
> > +To a first approximation, there exists a root directory node, which points to
> > +other nodes.  Those other nodes can themselves be directories or they can be
> > +files.  Each file, in turn, points to data blocks.
> > +
> > +XFS adds a few more details to this picture:
> > +
> > +* The real root(s) of an XFS filesystem are the allocation group headers
> > +(superblock, AGF, AGI, AGFL).
> > +* Each allocation group’s headers point to various per-AG B+trees (free space,
> > +inode, free inodes, free list, etc.)
> > +* The free space B+trees point to unused extents;
> > +* The inode B+trees point to blocks containing inode chunks;
> > +* All superblocks point to the root directory and the log;
> > +* Hardlinks mean that multiple directories can point to a single file node;
> > +* File data block pointers are indexed by file offset;
> > +* Files and directories can have a second collection of pointers to data blocks
> > +which contain extended attributes;
> > +* Large directories require multiple data blocks to store all the subpointers;
> > +* Still larger directories use high-offset data blocks to store a B+tree of
> > +hashes to directory entries;
> > +* Large extended attribute forks similarly use high-offset data blocks to store
> > +a B+tree of hashes to attribute keys; and
> > +* Symbolic links can point to data blocks.
> > +
> > +The beauty of this massive graph structure is that under normal circumstances,
> > +everything known to the filesystem is discoverable (access controls
> > +notwithstanding) from the root.  The major weakness of this structure of course
> > +is that breaking an edge in the graph can render entire subtrees inaccessible.
> > ++xfs_repair+ “recovers” from broken directories by scanning for unlinked inodes
> > +and connecting them to +/lost+found+, but this isn’t sufficiently general to
> > +recover from breaks in other parts of the graph structure.  Wouldn’t it be
> > +useful to have back pointers as a secondary data structure?  The current repair
> > +strategy is to reconstruct whatever can be rebuilt, but to scrap anything that
> > +doesn't check out.
> > +
> > +The xref:Reverse_Mapping_Btree[reverse-mapping B+tree] fills in part of the
> > +puzzle.  Since it contains copies of every entry in each inode’s data and
> > +attribute forks, we can fix a corrupted block map with these records.
> > +Furthermore, if the inode B+trees become corrupt, it is possible to visit all
> > +inode chunks using the reverse-mapping data.  Should XFS ever gain the ability
> > +to store parent directory information in each inode, it also becomes possible
> > +to resurrect damaged directory trees, which should reduce the complaints about
> > +inodes ending up in +/lost+found+.  Everything else in the per-AG primary
> > +metadata can already be reconstructed via +xfs_repair+.  Hopefully,
> > +reconstruction will not turn out to be a fool's errand.
> > diff --git a/design/XFS_Filesystem_Structure/rmapbt.asciidoc b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
> > new file mode 100644
> > index 0000000..a8a210b
> > --- /dev/null
> > +++ b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
> > @@ -0,0 +1,305 @@
> > +[[Reverse_Mapping_Btree]]
> > +== Reverse-Mapping B+tree
> > +
> > +[NOTE]
> > +This data structure is under construction!  Details may change.
> > +
> > +If the feature is enabled, each allocation group has its own reverse
> > +block-mapping B+tree, which grows in the free space like the free space
> > +B+trees.  As mentioned in the chapter about
> > +xref:Reconstruction[reconstruction], this data structure is another piece of
> > +the puzzle necessary to reconstruct the data or attribute fork of a file from
> > +reverse-mapping records; we can also use it to double-check allocations to
> > +ensure that we are not accidentally cross-linking blocks, which can cause
> > +severe damage to the filesystem.
> > +
> > +This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_RMAPBT+
> > +feature is enabled.  The feature requires a version 5 filesystem.
> > +
> > +Each record in the reverse-mapping B+tree has the following structure:
> > +
> > +[source, c]
> > +----
> > +struct xfs_rmap_rec {
> > +     __be32                     rm_startblock;
> > +     __be32                     rm_blockcount;
> > +     __be64                     rm_owner;
> > +     __be64                     rm_fork:1;
> > +     __be64                     rm_bmbt:1;
> > +     __be64                     rm_unwritten:1;
> > +     __be64                     rm_unused:7;
> > +     __be64                     rm_offset:54;
> > +};
> > +----
> > +
> > +*rm_startblock*::
> > +AG block number of this record.
> > +
> > +*rm_blockcount*::
> > +The length of this extent.
> > +
> > +*rm_owner*::
> > +A 64-bit number describing the owner of this extent.  This is typically the
> > +absolute inode number, but can also correspond to one of the following:
> > +
> > +.Special owner values
> > +[options="header"]
> > +|=====
> > +| Value				| Description
> > +| +XFS_RMAP_OWN_NULL+           | No owner.  This should never appear on disk.
> > +| +XFS_RMAP_OWN_UNKNOWN+        | Unknown owner; for EFI recovery.  This should never appear on disk.
> > +| +XFS_RMAP_OWN_FS+             | Allocation group headers
> > +| +XFS_RMAP_OWN_LOG+            | XFS log blocks
> > +| +XFS_RMAP_OWN_AG+             | Per-allocation group B+tree blocks.  This means free space B+tree blocks, blocks on the freelist, and reverse-mapping B+tree blocks.
> > +| +XFS_RMAP_OWN_INOBT+          | Per-allocation group inode B+tree blocks.  This includes free inode B+tree blocks.
> > +| +XFS_RMAP_OWN_INODES+         | Inode chunks
> > +|=====
> > +
> > +*rm_fork*::
> > +If +rm_owner+ describes an inode, this can be 1 if this record is for an
> > +attribute fork.
> > +
> > +*rm_bmbt*::
> > +If +rm_owner+ describes an inode, this can be 1 to signify that this record is
> > +for a block map B+tree block.  In this case, +rm_offset+ has no meaning.
> > +
> > +*rm_unwritten*::
> > +A flag indicating that the extent is unwritten.  This corresponds to the flag in
> > +the xref:Data_Extents[extent record] format which means +XFS_EXT_UNWRITTEN+.
> > +
> > +*rm_offset*::
> > +The 54-bit logical file block offset, if +rm_owner+ describes an inode.
> > +Meaningless otherwise.
> > +
> > +[NOTE]
> > +The single-bit flag values +rm_unwritten+, +rm_fork+, and +rm_bmbt+ are packed
> > +into the larger fields in the C structure definition.
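
Since the note says the flag bits are packed into the larger field, a small C sketch of one possible decoding may help. It assumes the 64-bit value has already been converted from big-endian to CPU order, and that the bitfields are laid out most-significant-bit first in the order shown in the structure (attribute fork, bmbt block, unwritten, seven unused bits, then the 54-bit offset). The RM_OFF_* names and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: flags in the top three bits, offset in the low 54 bits. */
#define RM_OFF_ATTR_FORK   (1ULL << 63)
#define RM_OFF_BMBT_BLOCK  (1ULL << 62)
#define RM_OFF_UNWRITTEN   (1ULL << 61)
#define RM_OFF_MASK        ((1ULL << 54) - 1)

/* Unpacked, CPU-order view of the packed offset word. */
struct rmap_offset_fields {
	uint64_t offset;      /* 54-bit logical file block offset */
	int      attr_fork;   /* record is for the attribute fork */
	int      bmbt_block;  /* record is for a block map B+tree block */
	int      unwritten;   /* extent is unwritten */
};

/* Split a CPU-order packed word into its flag bits and offset. */
static struct rmap_offset_fields rmap_unpack_offset(uint64_t packed)
{
	struct rmap_offset_fields f;

	f.offset     = packed & RM_OFF_MASK;
	f.attr_fork  = (packed & RM_OFF_ATTR_FORK) != 0;
	f.bmbt_block = (packed & RM_OFF_BMBT_BLOCK) != 0;
	f.unwritten  = (packed & RM_OFF_UNWRITTEN) != 0;
	return f;
}

/* Rebuild the packed word; the seven unused bits are left as zero. */
static uint64_t rmap_pack_offset(const struct rmap_offset_fields *f)
{
	uint64_t packed;

	assert((f->offset & ~RM_OFF_MASK) == 0); /* offset must fit in 54 bits */
	packed = f->offset;
	if (f->attr_fork)
		packed |= RM_OFF_ATTR_FORK;
	if (f->bmbt_block)
		packed |= RM_OFF_BMBT_BLOCK;
	if (f->unwritten)
		packed |= RM_OFF_UNWRITTEN;
	return packed;
}
```
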
> > +
> > +The key has the following structure:
> > +
> > +[source, c]
> > +----
> > +struct xfs_rmap_key {
> > +     __be32                     rm_startblock;
> > +     __be64                     rm_owner;
> > +     __be64                     rm_fork:1;
> > +     __be64                     rm_bmbt:1;
> > +     __be64                     rm_reserved:1;
> > +     __be64                     rm_unused:7;
> > +     __be64                     rm_offset:54;
> > +};
> > +----
> > +
> > +For the reverse-mapping B+tree on a filesystem that supports sharing of file
> > +data blocks, the key definition is larger than the usual AG block number.  On a
> > +classic XFS filesystem, each block has only one owner, which means that
> > ++rm_startblock+ is sufficient to uniquely identify each record.  However,
> > +shared block support (reflink) on XFS breaks that assumption; now filesystem
> > +blocks can be linked to any logical block offset of any file inode.  Therefore,
> > +the key must include the owner and offset information to preserve the 1 to 1
> > +relation between key and record.
> > +
> > +* As the reverse-mapping B+tree is AG relative, all the block numbers are only
> > +32 bits.
> > +* The +bb_magic+ value is "RMB3" (0x524d4233).
> > +* The +xfs_btree_sblock_t+ header is used for intermediate B+tree nodes as well
> > +as the leaves.
> > +* Each pointer is associated with two keys.  The first of these is the "low
> > +key", which is the key of the smallest record accessible through the pointer.
> > +This low key has the same meaning as the key in all other btrees.  The second
> > +key is the "high key": for each record underneath the pointer, take the largest
> > +key that can be used to access it; the high key is the maximum of those.  Recall
> > +that each record in the reverse mapping b+tree describes an interval of physical
> > +blocks mapped to an interval of logical file block offsets; therefore, it makes
> > +sense that a range of keys can be used to find a record.
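
The low key / high key idea can be sketched in C as below. This is an illustration under assumptions, not the btree code: keys are compared field by field in the order startblock, owner, offset; the flag bits are ignored; owners are compared as unsigned 64-bit values; and the struct rmap_key_cpu type and function names are hypothetical. The range test is only a filter, since a pointer whose key range covers the query may still turn out to hold no matching record.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU-order key with the flag bits already stripped. */
struct rmap_key_cpu {
	uint32_t startblock;
	uint64_t owner;
	uint64_t offset;
};

/* Three-way compare of two unsigned values: -1, 0 or 1. */
static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/* Compare keys field by field: startblock, then owner, then offset. */
static int rmap_key_cmp(const struct rmap_key_cpu *a,
			const struct rmap_key_cpu *b)
{
	int c = cmp_u64(a->startblock, b->startblock);

	if (c)
		return c;
	c = cmp_u64(a->owner, b->owner);
	if (c)
		return c;
	return cmp_u64(a->offset, b->offset);
}

/* A pointer is worth following only if low <= query <= high. */
static int rmap_ptr_may_contain(const struct rmap_key_cpu *low,
				const struct rmap_key_cpu *high,
				const struct rmap_key_cpu *query)
{
	assert(rmap_key_cmp(low, high) <= 0); /* a node's low key never exceeds its high key */
	return rmap_key_cmp(low, query) <= 0 && rmap_key_cmp(query, high) <= 0;
}
```
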
> > +
> > +=== xfs_db rmapbt Example
> > +
> > +This example shows a reverse-mapping B+tree from a freshly populated root
> > +filesystem:
> > +
> > +----
> > +xfs_db> agf 0
> > +xfs_db> addr rmaproot
> > +xfs_db> p
> > +magic = 0x524d4233
> > +level = 1
> > +numrecs = 43
> > +leftsib = null
> > +rightsib = null
> > +bno = 56
> > +lsn = 0x3000004c8
> > +uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
> > +owner = 0
> > +crc = 0x7cf8be6f (correct)
> > +keys[1-43] = [startblock,owner,offset]
> > +keys[1-43] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,
> > +	     offset_hi,attrfork_hi,bmbtblock_hi]
> > +        1:[0,-3,0,0,0,351,4418,66,0,0]
> > +        2:[417,285,0,0,0,827,4419,2,0,0]
> > +        3:[829,499,0,0,0,2352,573,55,0,0]
> > +        4:[1292,710,0,0,0,32168,262923,47,0,0]
> > +        5:[32215,-5,0,0,0,34655,2365,3411,0,0]
> > +        6:[34083,1161,0,0,0,34895,265220,1,0,1]
> > +        7:[34896,256191,0,0,0,36522,-9,0,0,0]
> > +        ...
> > +        41:[50998,326734,0,0,0,51430,-5,0,0,0]
> > +        42:[51431,327010,0,0,0,51600,325722,11,0,0]
> > +        43:[51611,327112,0,0,0,94063,23522,28375272,0,0]
> > +ptrs[1-43] = 1:5 2:6 3:8 4:9 5:10 6:11 7:418 ... 41:46377 42:48784 43:49522
> > +----
> > +
> > +We arbitrarily pick pointer 17 to traverse downwards:
> > +
> > +----
> > +xfs_db> addr ptrs[17]
> > +xfs_db> p
> > +magic = 0x524d4233
> > +level = 0
> > +numrecs = 168
> > +leftsib = 36284
> > +rightsib = 37617
> > +bno = 294760
> > +lsn = 0x200002761
> > +uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
> > +owner = 0
> > +crc = 0x2dad3fbe (correct)
> > +recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
> > +        1:[40326,1,259615,0,0,0,0] 2:[40327,1,-5,0,0,0,0]
> > +        3:[40328,2,259618,0,0,0,0] 4:[40330,1,259619,0,0,0,0]
> > +        ...
> > +        127:[40540,1,324266,0,0,0,0] 128:[40541,1,324266,8388608,0,0,0]
> > +        129:[40542,2,324266,1,0,0,0] 130:[40544,32,-7,0,0,0,0]
> > +----
> > +
> > +Several interesting things pop out here.  The first record shows that inode
> > +259,615 has mapped AG block 40,326 at offset 0.  We confirm this by looking at
> > +the block map for that inode:
> > +
> > +----
> > +xfs_db> inode 259615
> > +xfs_db> bmap
> > +data offset 0 startblock 40326 (0/40326) count 1 flag 0
> > +----
> > +
> > +Next, notice records 127 and 128, which describe neighboring AG blocks that are
> > +mapped to non-contiguous logical blocks in inode 324,266.  Given the logical
> > +offset of 8,388,608 we surmise that this is a leaf directory, but let us
> > +confirm:
> > +
> > +----
> > +xfs_db> inode 324266
> > +xfs_db> p core.mode
> > +core.mode = 040755
> > +xfs_db> bmap
> > +data offset 0 startblock 40540 (0/40540) count 1 flag 0
> > +data offset 1 startblock 40542 (0/40542) count 2 flag 0
> > +data offset 3 startblock 40576 (0/40576) count 1 flag 0
> > +data offset 8388608 startblock 40541 (0/40541) count 1 flag 0
> > +xfs_db> p core.mode
> > +core.mode = 0100644
> > +xfs_db> dblock 0
> > +xfs_db> p dhdr.hdr.magic
> > +dhdr.hdr.magic = 0x58444433
> > +xfs_db> dblock 8388608
> > +xfs_db> p lhdr.info.hdr.magic
> > +lhdr.info.hdr.magic = 0x3df1
> > +----
> > +
> > +Indeed, this inode 324,266 appears to be a leaf directory, as it has regular
> > +directory data blocks at low offsets, and a single leaf block.
> > +
> > +Notice further the two reverse-mapping records with negative owners.  An owner
> > +of -7 corresponds to +XFS_RMAP_OWN_INODES+, which is an inode chunk, and an
> > +owner code of -5 corresponds to +XFS_RMAP_OWN_AG+, which covers free space B+trees,
> > +the free list, and the rmap B+tree.  Let's see if block 40,544 is part of an inode chunk:
> > +
> > +----
> > +xfs_db> blockget
> > +xfs_db> fsblock 40544
> > +xfs_db> blockuse
> > +block 40544 (0/40544) type inode
> > +xfs_db> stack
> > +1:
> > +        byte offset 166068224, length 4096
> > +        buffer block 324352 (fsbno 40544), 8 bbs
> > +        inode 324266, dir inode 324266, type data
> > +xfs_db> type inode
> > +xfs_db> p
> > +core.magic = 0x494e
> > +----
> > +
> > +Our suspicions are confirmed.  Let's also see which per-AG B+tree block 40,327
> > +belongs to:
> > +
> > +----
> > +xfs_db> fsblock 40327
> > +xfs_db> blockuse
> > +block 40327 (0/40327) type btrmap
> > +xfs_db> type rmapbt
> > +xfs_db> p
> > +magic = 0x524d4233
> > +----
> > +
> > +As you can see, the reverse block-mapping B+tree is an important secondary
> > +metadata structure, which can be used to reconstruct damaged primary metadata.
> > +Now let's look at an extended rmap btree:
> > +
> > +----
> > +xfs_db> agf 0
> > +xfs_db> addr rmaproot
> > +xfs_db> p
> > +magic = 0x34524d42
> > +level = 1
> > +numrecs = 5
> > +leftsib = null
> > +rightsib = null
> > +bno = 6368
> > +lsn = 0x100000d1b
> > +uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
> > +owner = 0
> > +crc = 0x8d4ace05 (correct)
> > +keys[1-5] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,offset_hi,attrfork_hi,bmbtblock_hi]
> > +1:[0,-3,0,0,0,705,132,681,0,0]
> > +2:[24,5761,0,0,0,548,5761,524,0,0]
> > +3:[24,5929,0,0,0,380,5929,356,0,0]
> > +4:[24,6097,0,0,0,212,6097,188,0,0]
> > +5:[24,6277,0,0,0,807,-7,0,0,0]
> > +ptrs[1-5] = 1:5 2:771 3:9 4:10 5:11
> > +----
> > +
> > +The second pointer stores both the low key [24,5761,0,0,0] and the high key
> > +[548,5761,524,0,0], which means that we can expect block 771 to contain records
> > +starting at physical block 24, inode 5761, offset zero; and that one of the
> > +records can be used to find a reverse mapping for physical block 548, inode
> > +5761, and offset 524:
> > +
> > +----
> > +xfs_db> addr ptrs[2]
> > +xfs_db> p
> > +magic = 0x34524d42
> > +level = 0
> > +numrecs = 168
> > +leftsib = 5
> > +rightsib = 9
> > +bno = 6168
> > +lsn = 0x100000d1b
> > +uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
> > +owner = 0
> > +crc = 0xd58eff0e (correct)
> > +recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
> > +1:[24,525,5761,0,0,0,0]
> > +2:[24,524,5762,0,0,0,0]
> > +3:[24,523,5763,0,0,0,0]
> > +...
> > +166:[24,360,5926,0,0,0,0]
> > +167:[24,359,5927,0,0,0,0]
> > +168:[24,358,5928,0,0,0,0]
> > +----
> > +
> > +Observe that the first record in the block starts at physical block 24, inode
> > +5761, offset zero, just as we expected.  Note that this first record is also
> > +indexed by the highest key as provided in the node block; physical block 548,
> > +inode 5761, offset 524 is the very last block mapped by this record.  Furthermore,
> > +note that record 168, despite being the last record in this block, has a lower
> > +maximum key (physical block 381, inode 5928, offset 357) than the first record.
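
The arithmetic behind a record's highest key can be sketched in C as follows. It assumes a file data extent, where both the physical block and the logical offset advance by one per block, so the last block of the record sits blockcount - 1 past the start in both dimensions. The types and the rmap_high_key() name are hypothetical, and the sketch does not handle records whose offset is meaningless (non-inode owners, bmbt blocks).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU-order copy of a leaf record, flags omitted. */
struct rmap_rec_cpu {
	uint32_t startblock;
	uint32_t blockcount;
	uint64_t owner;
	uint64_t offset;
};

/* Hypothetical CPU-order key, flags omitted. */
struct rmap_key_cpu {
	uint32_t startblock;
	uint64_t owner;
	uint64_t offset;
};

/* Highest key that still lands inside the extent described by rec. */
static struct rmap_key_cpu rmap_high_key(const struct rmap_rec_cpu *rec)
{
	struct rmap_key_cpu key;
	uint32_t span;

	assert(rec->blockcount > 0); /* an empty extent has no last block */
	span = rec->blockcount - 1;

	key.startblock = rec->startblock + span; /* last physical block */
	key.owner      = rec->owner;
	key.offset     = rec->offset + span;     /* last logical block */
	return key;
}
```

Fed record 1 from the dump above, [24,525,5761,0], this yields physical block 548, inode 5761, offset 524, which matches the high key stored in the node block. Applying the same arithmetic to record 168 gives a smaller high key than record 1.
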
> > diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> > index 62502b3..1b8658d 100644
> > --- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> > +++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> > @@ -48,6 +48,8 @@ include::overview.asciidoc[]
> >  
> >  include::metadata_integrity.asciidoc[]
> >  
> > +include::reconstruction.asciidoc[]
> > +
> >  include::common_types.asciidoc[]
> >  
> >  include::magic.asciidoc[]
> > @@ -66,6 +68,8 @@ include::btrees.asciidoc[]
> >  
> >  include::allocation_groups.asciidoc[]
> >  
> > +include::rmapbt.asciidoc[]
> > +
> >  include::journaling_log.asciidoc[]
> >  
> >  include::internal_inodes.asciidoc[]
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@xxxxxxxxxxx
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> -- 
> Carlos
