Document the new sparse inodes feature and how it affects the inobt records.

Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
---
 .../allocation_groups.asciidoc              |  157 ++++++++++++++++++++
 design/XFS_Filesystem_Structure/docinfo.xml |    1
 2 files changed, 155 insertions(+), 3 deletions(-)

diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 845b359..ca3210c 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -388,6 +388,18 @@ Directory file type. Each directory entry tracks the type of the inode to which
 the entry points. This is a performance optimization to remove the need to
 load every inode into memory to iterate a directory.
 
+| +XFS_SB_FEAT_INCOMPAT_SPINODES+ |
+Sparse inodes. This feature relaxes the requirement to allocate inodes in
+chunks of 64. When the free space is heavily fragmented, there might exist
+plenty of free space but not enough contiguous free space to allocate a new
+inode chunk. With this feature, the user can continue to create files until
+all free space is exhausted.
+
+Unused space in the inode B+tree records is used to track which parts of the
+inode chunk are not inodes.
+
+See the chapter on xref:Sparse_Inodes[Sparse Inodes] for more information.
+
 | +XFS_SB_FEAT_INCOMPAT_META_UUID+ |
 Metadata UUID. The UUID stamped into each metadata block must match the value
 in +sb_meta_uuid+. This enables the administrator to change +sb_uuid+ at will
@@ -977,9 +989,9 @@ Specifies the number of levels in the free inode B+tree.
 [[Inode_Btrees]]
 == Inode B+trees
 
-Inodes are allocated in chunks of 64, and a B+tree is used to track these chunks
-of inodes as they are allocated and freed. The block containing root of the
-B+tree is defined by the AGI's +agi_root+ value. If the
+Inodes are traditionally allocated in chunks of 64, and a B+tree is used to
+track these chunks as they are allocated and freed. The block containing the
+root of the B+tree is defined by the AGI's +agi_root+ value. If the
 +XFS_SB_FEAT_RO_COMPAT_FINOBT+ feature is enabled, a second B+tree is used to
 track the chunks containing free inodes; this is an optimization to speed up
 inode allocation.
@@ -1111,6 +1123,145 @@ recs[1] = [startino,freecount,free] 1:[5792,9,0xff80000000000000]
 Observe also that the AGI's +agi_newino+ points to this chunk, which has never
 been fully allocated.
 
+[[Sparse_Inodes]]
+== Sparse Inodes
+
+As mentioned in the previous section, XFS allocates inodes in chunks of 64. If
+there are no free extents large enough to hold a full chunk of 64 inodes, the
+inode allocation fails and XFS claims to have run out of space. On a
+filesystem with highly fragmented free space, this can lead to out-of-space
+errors long before the filesystem runs out of free blocks.
+
+The sparse inode feature tracks inode chunks in the inode B+tree as if they
+were full chunks but uses some previously unused bits in the freecount field to
+track which parts of the inode chunk are not allocated for use as inodes. This
+allows XFS to allocate inodes one block at a time if absolutely necessary.
+
+The inode and free inode B+trees operate in the same manner as they do without
+the sparse inode feature; the B+tree header for the nodes and leaves uses the
++xfs_btree_sblock+ structure, which is the same as the header used in the
+xref:AG_Free_Space_Btrees[AGF B+trees].
+
+Leaves contain an array of the following structure:
+
+[source,c]
+----
+struct xfs_inobt_rec {
+     __be32                    ir_startino;
+     __be16                    ir_holemask;
+     __u8                      ir_count;
+     __u8                      ir_freecount;
+     __be64                    ir_free;
+};
+----
+
+*ir_startino*::
+The lowest-numbered inode in this chunk, rounded down to the nearest multiple
+of 64, even if the start of this chunk is sparse.
+
+*ir_holemask*::
+A 16 element bitmap showing which parts of the chunk are not allocated to
+inodes. Each bit represents four inodes; if a bit is marked here, the
+corresponding bits in ir_free must also be marked.
+
+*ir_count*::
+Number of inodes allocated to this chunk.
+
+*ir_freecount*::
+Number of free inodes in this chunk.
+
+*ir_free*::
+A 64 element bitmap showing which inodes in this chunk are not available for
+allocation.
+
+==== xfs_db Sparse Inode AGI Example
+
+This example derives from an AG that has been deliberately fragmented. The
+inode B+tree:
+
+----
+xfs_db> agi 0
+xfs_db> p
+magicnum = 0x58414749
+versionnum = 1
+seqno = 0
+length = 6400
+count = 10432
+root = 2381
+level = 2
+freecount = 0
+newino = 14912
+dirino = null
+unlinked[0-63] =
+uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
+lsn = 0x600000ac4
+crc = 0xef550dbc (correct)
+free_root = 4
+free_level = 1
+----
+
+This AGI was formatted on a v5 filesystem; notice the extra v5 fields. So far
+everything else looks much the same as always.
+
+----
+xfs_db> addr root
+magic = 0x49414233
+level = 1
+numrecs = 2
+leftsib = null
+rightsib = null
+bno = 19048
+lsn = 0x50000192b
+uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
+owner = 0
+crc = 0xd98cd2ca (correct)
+keys[1-2] = [startino] 1:[128] 2:[35136]
+ptrs[1-2] = 1:3 2:2380
+xfs_db> addr ptrs[1]
+xfs_db> p
+magic = 0x49414233
+level = 0
+numrecs = 159
+leftsib = null
+rightsib = 2380
+bno = 24
+lsn = 0x600000ac4
+uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
+owner = 0
+crc = 0x836768a6 (correct)
+recs[1-159] = [startino,holemask,count,freecount,free]
+        1:[128,0,64,0,0]
+        2:[14912,0xff,32,0,0xffffffff]
+        3:[15040,0,64,0,0]
+        4:[15168,0xff00,32,0,0xffffffff00000000]
+        5:[15296,0,64,0,0]
+        6:[15424,0xff,32,0,0xffffffff]
+        7:[15552,0,64,0,0]
+        8:[15680,0xff00,32,0,0xffffffff00000000]
+        9:[15808,0,64,0,0]
+        10:[15936,0xff,32,0,0xffffffff]
+----
+
+Here we see the difference in the inode B+tree records. For example, in record
+2, we see that the holemask has a value of 0xff. This means that the first
+thirty-two inodes in this chunk record do not actually map to inode blocks; the
+first inode in this chunk is actually inode 14944:
+
+----
+xfs_db> inode 14912
+Metadata corruption detected at block 0x3a40/0x2000
+...
+Metadata CRC error detected for ino 14912
+xfs_db> p core.magic
+core.magic = 0
+xfs_db> inode 14944
+xfs_db> p core.magic
+core.magic = 0x494e
+----
+
+The chunk record also indicates that this chunk has 32 inodes, and that the
+missing inodes are also ``free''.
+
 [[Real-time_Devices]]
 == Real-time Devices
 
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index 6189fd6..ba97809 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -104,6 +104,7 @@
 				<member>Discuss metadata integrity.</member>
 				<member>Document the free inode B+tree.</member>
 				<member>Create an index of magic numbers.</member>
+				<member>Document sparse inodes.</member>
 			</simplelist>
 		</revdescription>
 	</revision>

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs