Fix typos and grammatical errors in the text. Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> --- .../allocation_groups.asciidoc | 128 +++++++++++-------- .../XFS_Filesystem_Structure/data_extents.asciidoc | 64 +++++---- .../XFS_Filesystem_Structure/directories.asciidoc | 136 ++++++++++++-------- design/XFS_Filesystem_Structure/docinfo.xml | 16 ++ .../extended_attributes.asciidoc | 96 ++++++++------ .../internal_inodes.asciidoc | 33 +++-- .../XFS_Filesystem_Structure/ondisk_inode.asciidoc | 83 +++++++----- 7 files changed, 319 insertions(+), 237 deletions(-) diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc index a86274c..680f90c 100644 --- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc +++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc @@ -3,9 +3,8 @@ XFS filesystems are divided into a number of equally sized chunks called Allocation Groups. Each AG can almost be thought of as an individual filesystem -that maintains it's own space usage. Each AG can be up to one terabyte in size -(512 bytes * 2^31^), regardless of the underlying -device's sector size. +that maintains its own space usage. Each AG can be up to one terabyte in size +(512 bytes × 2^31^), regardless of the underlying device's sector size. Each AG has the following characteristics: @@ -14,15 +13,15 @@ Each AG has the following characteristics: * Inode allocation and tracking Having multiple AGs allows XFS to handle most operations in parallel without -degrading performance as the number of concurrent accessing increases. +degrading performance as the number of concurrent accesses increases. -The only global information maintained by the first AG (primary) is free space +The only global information maintained by the first AG (primary) is free space across the filesystem and total inode counts. 
If the +XFS_SB_VERSION2_LAZYSBCOUNTBIT+ flag is set in the superblock, these are only updated on-disk when the filesystem is cleanly unmounted (umount or shutdown). -Immediately after a mkfs.xfs, the primary AG has the following disk layout the -subsequent AGs do not have any inodes allocated: +Immediately after a +mkfs.xfs+, the primary AG has the following disk layout; +the subsequent AGs do not have any inodes allocated: .Allocation group layout image::images/6.png[] @@ -32,9 +31,10 @@ Each of these structures are expanded upon in the following sections. [[Superblocks]] == Superblocks -Each AG starts with a superblock. The first one is the primary superblock that -stores aggregate AG information. Secondary superblocks are only used by -xfs_repair when the primary superblock has been corrupted. +Each AG starts with a superblock. The first one, in AG 0, is the primary +superblock which stores aggregate AG information. Secondary superblocks are +only used by xfs_repair when the primary superblock has been corrupted. A +superblock is one sector in length. The superblock is defined by the following structure. The description of each field follows. @@ -93,7 +93,7 @@ struct xfs_sb }; ---- *sb_magicnum*:: -Identifies the filesystem. It's value is +XFS_SB_MAGIC = 0x58465342 "XFSB"+. +Identifies the filesystem. Its value is +XFS_SB_MAGIC = 0x58465342 "XFSB"+. *sb_blocksize*:: The size of a basic unit of space allocation in bytes. Typically, this is 4096 @@ -116,11 +116,14 @@ the UUID instead of device name. *sb_logstart*:: First block number for the journaling log if the log is internal (ie. not on a separate disk device). For an external log device, this will be zero (the log -will also start on the first block on the log device). +will also start on the first block on the log device). The identity of the log +devices is not recorded in the filesystem, but the UUIDs of the filesystem and +the log device are compared to prevent corruption. 
*sb_rootino*:: -Root inode number for the filesystem. Typically, this is 128 when using a -4KB block size. +Root inode number for the filesystem. Normally, the root inode is at the +start of the first possible inode chunk in AG 0. This is 128 when using a 4KB +block size. *sb_rbmino*:: Bitmap inode for real-time extents. @@ -147,9 +150,9 @@ Number of blocks for the journaling log. *sb_versionnum*:: Filesystem version number. This is a bitmask specifying the features enabled when creating the filesystem. Any disk checking tools or drivers that do not -recognize any set bits must not operate upon the filesystem. Most of the flagsi -indicate features introduced over time. The value must be 4 including the -following flags: +recognize any set bits must not operate upon the filesystem. Most of the flags +indicate features introduced over time. If the value of the lower nibble is 4, +the higher bits indicate feature flags as follows: .Version 4 Superblock version flags [options="header"] @@ -175,8 +178,8 @@ Set if the sb_features2 field in the superblock contains more flags. |===== *sb_sectsize*:: -Specifies the underlying disk sector size in bytes. Majority of the time, this -is 512 bytes. This determines the minimum I/O alignment including Direct I/O. +Specifies the underlying disk sector size in bytes. Typically this is 512 or +4096 bytes. This determines the minimum I/O alignment, especially for direct I/O. *sb_inodesize*:: Size of the inode in bytes. The default is 256 (2 inodes per standard sector) @@ -258,6 +261,13 @@ Quota flags. It can be a combination of the following flags: *sb_flags*:: Miscellaneous flags. +.Superblock flags +[options="header"] +|===== +| Flag | Description +| +XFS_SBF_READONLY+ | Only read-only mounts allowed. +|===== + *sb_shared_vn*:: Reserved and must be zero ("vn" stands for version number). @@ -300,17 +310,29 @@ primary superblock when the filesystem is cleanly unmounted. | +XFS_SB_VERSION2_ATTR2BIT+ | Extended attributes version 2. 
Making a filesystem with this optimises the inode -layout of extended attributes. +layout of extended attributes. See the section about +xref:Extended_Attribute_Versions[extended attribute versions] for more +information. | +XFS_SB_VERSION2_PARENTBIT+ | Parent pointers. All inodes must have an extended attribute that points back to its parent inode. The primary purpose for this information is in backup systems. + +| +XFS_SB_VERSION2_PROJID32BIT+ | +32-bit Project ID. Inodes can be associated with a project ID number, which +can be used to enforce disk space usage quotas for a particular group of +directories. This flag indicates that project IDs can be 32 bits in size. + +| +XFS_SB_VERSION2_FTYPE+ | +Directory file type. Each directory entry records the type of the inode to +which the entry points. This speeds up directory iteration by removing the +need to load every inode into memory. |===== === xfs_db Superblock Example -A filesystem is made on a single SATA disk with the following command: +A filesystem is made on a single disk with the following command: ---- # mkfs.xfs -i attr=2 -n size=16384 -f /dev/sda7 @@ -385,7 +407,7 @@ One B+tree tracks space by block number, the second by the size of the free space block. This scheme allows XFS to quickly find free space near a given block or of a given size. -All block numbers, indexes and counts are AG relative. +All block numbers, indexes, and counts are AG relative. [[AG_Free_Space_Block]] === AG Free Space Block @@ -414,8 +436,9 @@ struct xfs_agf { }; ---- -The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 2, -index 0 for the count B+tree and index 1 for the size B+tree. +The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 2: +index 0 for the free space B+tree indexed by block number; and index 1 for the +free space B+tree indexed by extent size. *agf_magicnum*:: Specifies the magic number for the AGF sector: "XAGF" (0x58414746). 
@@ -428,7 +451,7 @@ Specifies the AG number for the sector. *agf_length*:: Specifies the size of the AG in filesystem blocks. For all AGs except the last, -This must be equal to the superblock's +sb_agblocks+ value. For the last AG, +this must be equal to the superblock's +sb_agblocks+ value. For the last AG, this could be less than the +sb_agblocks+ value. It is this value that should be used to determine the size of the AG. @@ -459,14 +482,13 @@ Specifies the number of blocks of longest contiguous free space in the AG. Specifies the number of blocks used for the free space B+trees. This is only used if the +XFS_SB_VERSION2_LAZYSBCOUNTBIT+ bit is set in +sb_features2+. -[[AG_Free_Space_Btrees]] -=== AG Free Space B+trees +[[Short_Format_Btrees]] +=== Short Format B+trees -The two Free Space B+trees store a sorted array of block offset and block -counts in the leaves of the B+tree. The first B+tree is sorted by the offset, -the second by the count or size. - -The trees use the following header: +Each allocation group uses a ``short format'' B+tree to index various +information about the allocation group. The structure is called short format +because all block pointers are AG block numbers. The trees use the following +header: [source, c] ---- @@ -479,6 +501,13 @@ struct xfs_btree_sblock { }; ---- +[[AG_Free_Space_Btrees]] +=== AG Free Space B+trees + +The two Free Space B+trees store a sorted array of block offset and block +counts in the leaves of the B+tree. The first B+tree is sorted by the offset, +the second by the count or size. + Leaf nodes contain a sorted array of offset/count pairs which are also used for node keys: @@ -531,7 +560,7 @@ including inodes, data, directories and extended attributes. With a freshly made filesystem, 4 blocks are reserved immediately after the free space B+tree root blocks (blocks 4 to 7). As they are used up as the free space fragments, additional blocks will be reserved from the AG and added to the free -list array. 
+list array. This size may increase as features are added. As the free list array is located within a single sector, a typical device will have space for 128 elements in the array (512 bytes per sector, 4 bytes per AG @@ -545,8 +574,8 @@ values. The array is managed as a circular list. .AG Free List layout image::images/16.png[] -The presence of these reserved block guarantees that the free space B+trees can -be updated if any blocks are freed by extent changes in a full AG. +The presence of these reserved blocks guarantees that the free space B+trees +can be updated if any blocks are freed by extent changes in a full AG. ==== xfs_db AGF Example @@ -584,8 +613,8 @@ bno[0-127] = 0:4 1:5 2:6 3:7 4:83342 5:83343 6:83344 7:83345 8:83346 9:83347 26:80205 27:83344 ---- -The free space B+tree sorted by block offset, the root block is from the AGF's -+bnoroot+ value: +The root block of the free space B+tree sorted by block offset is found in the +AGF's +bnoroot+ value: ---- xfs_db> fsblock 7 @@ -602,11 +631,11 @@ ptrs[1-4] = 1:2 2:83347 3:6 4:4 ---- Blocks 2, 83347, 6 and 4 contain the leaves for the free space B+tree by -starting block. Block 2 would contain offsets 16 up to but not including 184586 +starting block. Block 2 would contain offsets 12 up to but not including 184586 while block 4 would have all offsets from 511629 to the end of the AG. -The free space B+tree sorted by block count, the root block is from the AGF's -+cntroot+ value: +The root block of the free space B+tree sorted by block count is found in the +AGF's +cntroot+ value: ---- xfs_db> fsblock 83343 @@ -642,7 +671,7 @@ recs[1-344] = [startblock,blockcount] 342:[513712,755] 343:[230317,258229] 344:[538795,3384327] ---- -The longest block count must be the same as the AGF's +longest+ value. +The longest block count (3384327) must be the same as the AGF's +longest+ value. [[AG_Inode_Management]] == AG Inode Management @@ -659,7 +688,7 @@ structures. 
Absolute inode numbers include the AG number in the high bits, above the bits used for the AG relative inode number. Absolute inode numbers are found in -xref:Directories[directory] entries. +xref:Directories[directory] entries and the superblock. .Inode number formats image::images/18.png[] @@ -713,10 +742,10 @@ Specifies the number of levels in the inode B+tree. Specifies the number of free inodes in the AG. *agi_newino*:: -Specifies AG relative inode number most recently allocated. +Specifies AG-relative inode number of the most recently allocated chunk. *agi_dirino*:: -Deprecated and not used, it's always set to NULL (-1). +Deprecated and not used, this is always set to NULL (-1). *agi_unlinked[64]*:: Hash table of unlinked (deleted) inodes that are still being referenced. Refer @@ -734,6 +763,8 @@ The B+tree header for the nodes and leaves use the +xfs_btree_sblock+ structure which is the same as the header used in the xref:AG_Free_Space_Btrees[AGF B+trees]. +The magic number of the inode B+tree is ``IABT'' (0x49414254). + Leaves contain an array of the following structure: [source,c] @@ -755,20 +786,15 @@ struct xfs_inobt_key { typedef __be32 xfs_inobt_ptr_t; ---- -For the leaf entries, +ir_startino+ specifies the starting inode number for the -chunk, +ir_freecount+ specifies the number of free entries in the chuck, and the -+ir_free+ is a 64 element bit array specifying which entries are free in the -chunk. 
- The following diagram illustrates a single level inode B+tree: -.Single Level inode b+tree +.Single Level inode B+tree image::images/20a.png[] And a 2-level inode B+tree: -.Multi-Level inode b+tree +.Multi-Level inode B+tree image::images/20b.png[] diff --git a/design/XFS_Filesystem_Structure/data_extents.asciidoc b/design/XFS_Filesystem_Structure/data_extents.asciidoc index 7850165..a09fcc2 100644 --- a/design/XFS_Filesystem_Structure/data_extents.asciidoc +++ b/design/XFS_Filesystem_Structure/data_extents.asciidoc @@ -1,18 +1,18 @@ [[Data_Extents]] = Data Extents -XFS allocates space for a file using extents: starting location and length. XFS -extents also specify the file's logical starting offset for a file. This allows -a files extent map to automatically support sparse files (i.e. "holes" in the -file). A flag is also used to specify if the extent has been preallocated and -not yet been written to (unwritten extent). +XFS manages space using extents, which are defined as a starting location and +length. A fork in an XFS inode maps a logical offset to a space extent. This +enables a file's extent map to support sparse files (i.e. "holes" in the file). +A flag is also used to specify if the extent has been preallocated but has not +yet been written (unwritten extent). A file can have more than one extent if one chunk of contiguous disk space is not available for the file. As a file grows, the XFS space allocator will -attempt to keep space contiguous and merge extents. If more than one file is +attempt to keep space contiguous and to merge extents. If more than one file is being allocated space in the same AG at the same time, multiple extents for the -files will occur as the extents get interleaved. The effect of this can vary -depending on the extent allocator used in the XFS driver. +files will occur as the extent allocations interleave. The effect of this can +vary depending on the extent allocator used in the XFS driver. 
An extent is 128 bits in size and uses the following packed layout: @@ -48,15 +48,16 @@ typedef enum { Some other points about extents: -* The +xfs_bmbt_rec_32_t+ and +xfs_bmbt_rec_64_t+ structures are effectively +* The +xfs_bmbt_rec_32_t+ and +xfs_bmbt_rec_64_t+ structures were effectively the same as +xfs_bmbt_rec_t+, just different representations of the same 128 -bits in on-disk big endian format. +bits in on-disk big endian format. +xfs_bmbt_rec_32_t+ was removed and ++xfs_bmbt_rec_64_t+ renamed to +xfs_bmbt_rec_t+ some time ago. * When a file is created and written to, XFS will endeavour to keep the extents within the same AG as the inode. It may use a different AG if the AG is busy or there is no space left in it. -* If a file is zero bytes long, it will have no extents, +di_nblocks+ and +* If a file is zero bytes long, it will have no extents and +di_nblocks+ and +di_nexents+ will be zero. Any file with data will have at least one extent, and each extent can use from 1 to over 2 million blocks (2^21^) on the filesystem. For a default 4KB block size filesystem, a single extent can be up to 8GB in @@ -72,20 +73,20 @@ efficiently. [[Extent_List]] == Extent List -Local extents are where the entire extent array is stored within the inode's -data fork itself. This is the most optimal in terms of speed and resource -consumption. The trade-off is the file can only have a few extents before the -inode runs out of space. +If the entire extent list is short enough to fit within the inode's fork +region, we say that the fork is in ``extent list'' format. This is the most +optimal in terms of speed and resource consumption. The trade-off is the file +can only have a few extents before the inode runs out of space. -The "data fork" of the inode contains an array of extents, the size of the array -determined by the inode's +di_nextents+ value. +The data fork of the inode contains an array of extents; the size of the array +is determined by the inode's +di_nextents+ value. 
.Inode data fork extent layout image::images/32.png[] The number of extents that can fit in the inode depends on the inode size and +di_forkoff+. For a default 256 byte inode with no extended attributes, a file -can up to 19 extents with this format. Beyond this, extents have to use the +can have up to 9 extents with this format. Beyond this, extents have to use the B+tree format. === xfs_db Inode Data Fork Extents Example @@ -129,7 +130,7 @@ u.bmx[0-2] = [startoff,startblock,blockcount,extentflag] 2:[4050,35481,2025,0] ---- -Raw disk version of the inode with the third extent highlighted (+di_u+ always +Raw disk version of the inode with the third extent highlighted (+di_u+ starts at offset 0x64): [subs="quotes"] @@ -193,9 +194,9 @@ u.bmx[0-1] = [startoff,startblock,blockcount,extentflag] [[Btree_Extent_List]] == B+tree Extent List -Beyond the simple extent array, to efficiently manage large extent maps, XFS -uses B+trees. The root node of the B+tree is stored in the inode's data fork. -All block pointers for extent B+trees are 64-bit absolute block numbers. +To manage extent maps that cannot fit in the inode fork area, XFS uses long +format B+trees. The root node of the B+tree is stored in the inode's data +fork. All block pointers for extent B+trees are 64-bit absolute block numbers. For a single level B+tree, the root node points to the B+tree's leaves. Each leaf occupies one filesystem block and contains a header and an array of extents @@ -204,11 +205,12 @@ forward) block pointers to adjacent leaves. For a standard 4KB filesystem block, a leaf can contain up to 254 extents before a B+tree rebalance is triggered. For a multi-level B+tree, the root node points to other B+tree nodes which -eventually point to the extent leaves. B+tree keys are based on the file's -offset. The nodes at each level in the B+tree point to the adjacent nodes. +eventually point to the extent leaves. B+tree keys are based on the file's +offset and have pointers to the next level down. 
Nodes at each level in the +B+tree also have pointers to the adjacent nodes. The base B+tree node is used for extents, directories and extended attributes. -The structures used for inode's B+tree root are: +The structures used for an inode's B+tree root are: [source, c] ---- @@ -222,15 +224,18 @@ struct xfs_bmbt_key { typedef xfs_fsblock_t xfs_bmbt_ptr_t, xfs_bmdr_ptr_t; ---- -* On disk, the B+tree node starts with the +xfs_bmbr_block_t+ header followed by +* On disk, the B+tree node starts with the +xfs_bmdr_block_t+ header followed by an array of +xfs_bmbt_key_t+ values and then an array of +xfs_bmbt_ptr_t+ values. The size of both arrays is specified by the header's +bb_numrecs+ value. -* The root node in the inode can only contain up to 19 key/pointer pairs for a +* The root node in the inode can only contain up to 9 key/pointer pairs for a standard 256 byte inode before a new level of nodes is added between the root and the leaves. This will be less if +di_forkoff+ is not zero (i.e. attributes are in use on the inode). +[[Long_Format_Btrees]] +== Long Format B+trees + The subsequent nodes and leaves of the B+tree use the +xfs_btree_lblock+ declaration: @@ -253,10 +258,7 @@ a 4096 byte filesystem block). * For leaves, an array of +xfs_bmbt_rec+ extents follow the +xfs_btree_lblock+ header. -* Nodes and leaves use the same value for +bb_magic+: - -[source, c] -#define XFS_BMAP_MAGIC 0x424d4150 /* 'BMAP' */ +* Nodes and leaves use the same value for +bb_magic+. * The +bb_level+ value determines if the node is an intermediate node or a leaf. Leaves have a +bb_level+ of zero, nodes are one or greater. 
diff --git a/design/XFS_Filesystem_Structure/directories.asciidoc b/design/XFS_Filesystem_Structure/directories.asciidoc index 7d84117..3521749 100644 --- a/design/XFS_Filesystem_Structure/directories.asciidoc +++ b/design/XFS_Filesystem_Structure/directories.asciidoc @@ -9,17 +9,18 @@ The term "block" in this section will refer to directory blocks, not filesystem blocks unless otherwise specified. The size of a "directory block" is defined by the xref:Superblocks[superblock's] -+sb_dirblklog+ value. The size in bytes = +sb_blocksize+ * 2^sb_dirblklog^. ++sb_dirblklog+ value. The size in bytes = +sb_blocksize+ × 2^sb_dirblklog^. For example, if +sb_blocksize+ = 4096 and +sb_dirblklog+ = 2, the directory block size is 16384 bytes. Directory blocks are always allocated in multiples based on +sb_dirblklog+. Directory blocks cannot be more that 65536 bytes in size. All directory entries contain the following "data": -* Entry's name (counted string consisting of a single byte +namelen+ followed by -+name+ consisting of an array of 8-bit chars without a NULL terminator). +* The entry's name (counted string consisting of a single byte +namelen+ +followed by +name+ consisting of an array of 8-bit chars without a NULL +terminator). -* Entry's absolute inode number (<xref linkend="Inode_Numbers"/>), which are +* The entry's absolute xref:Inode_Numbers[inode number], which are always 64 bits (8 bytes) in size except a special case for shortform directories. @@ -51,18 +52,19 @@ typedef __uint32_t xfs_dir2_dataptr_t; * Directory entries are stored within the inode. -* Only data stored is the name, inode # and offset, no "leaf" or "freespace index" -information is required as an inode can only store a few entries. +* The only data stored is the name, inode number, and offset. No "leaf" or +"freespace index" information is required as an inode can only store a few +entries. * "." is not stored (as it's in the inode itself), and ".." is a dedicated +parent+ field in the header. 
-* The number of directories that can be stored in an inode depends on the inode -size (<xref linkend="On-disk_Inode"/>), the number of entries, the length of the -entry names and extended attribute data. +* The number of directories that can be stored in an inode depends on the +xref:On-disk_Inode[inode] size, the number of entries, the length of the entry +names, and extended attribute data. -* Once the number of entries exceed the space available in the inode, the format -is converted to a "Block Directory". +* Once the number of entries exceeds the space available in the inode, the +format is converted to a xref:Block_Directories[block directory]. * Shortform directory data is packed as tightly as possible on the disk with the remaining space zeroed: @@ -82,6 +84,7 @@ typedef struct xfs_dir2_sf_entry { __uint8_t namelen; xfs_dir2_sf_off_t offset; __uint8_t name[1]; + __uint8_t ftype; xfs_dir2_inou_t inumber; } xfs_dir2_sf_entry_t; ---- @@ -94,8 +97,9 @@ numbers for the directory fit in 4 bytes (32 bits) or not. If all inode numbers fit in 4 bytes, the header's +count+ value specifies the number of entries in the directory and +i8count+ will be zero. If any inode number exceeds 4 bytes, all inode numbers will be 8 bytes in size and the header's +i8count+ value -specifies the number of entries and count will be zero. The following union -covers the shortform inode number structure: +specifies the number of entries requiring larger inodes. +i4count+ is still +the number of entries. The following union covers the shortform inode number +structure: [source, c] ---- @@ -222,7 +226,7 @@ b0: 72 61 6d 65 30 30 30 30 30 33 2e 74 73 74 01 80 rame000003.tst.. c0: 00 84 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ ---- -TODO: 8-byte inode number example</section> +TODO: 8-byte inode number example [[Block_Directories]] @@ -233,12 +237,10 @@ data is moved into a new single directory block outside the inode. 
The inode's format is changed from "local" to "extent". Following is a list of points about block directories. - <itemizedlist> - * All directory data is stored within the one directory block, including "." and ".." entries which are mandatory. -* The block also contains "leaf" and "freespace index " information. +* The block also contains "leaf" and "freespace index" information. * The location of the block is defined by the inode's in-core xref:Extent_List[extent list]: the +di_u.u_bmx[0]+ value. The file offset in @@ -273,6 +275,7 @@ typedef struct xfs_dir2_data_entry { xfs_ino_t inumber; __uint8_t namelen; __uint8_t name[1]; + __uint8_t ftype; xfs_dir2_data_off_t tag; } xfs_dir2_data_entry_t; typedef struct xfs_dir2_data_unused { @@ -293,10 +296,12 @@ typedef struct xfs_dir2_block_tail { .Block directory layout image::images/43.png[] +* The magic number in the header is "XD2B" (0x58443242). + * The +tag+ in the +xfs_dir2_data_entry_t+ structure stores its offset from the start of the block. -* Start of a free space region is marked with the +xfs_dir2_data_unused_t+ +* The start of a free space region is marked with the +xfs_dir2_data_unused_t+ structure where the +freetag+ is +0xffff+. The +freetag+ and +length+ overwrites the +inumber+ for an entry. The +tag+ is located at +length - sizeof(tag)+ from the start of the +unused+ entry on-disk. @@ -321,8 +326,8 @@ structure, contains an array of hash/address pairs for quickly looking up a name by a hash value. Hash values are covered by the introduction to directories. The +address+ on-disk is the offset into the block divided by 8 (+XFS_DIR2_DATA_ALIGN+). Hash/address pairs are stored on disk to optimise -lookup speed for large directories. If they were not stored, the hashes have to -be calculated for all entries each time a lookup occurs in a directory. +lookup speed for large directories. If they were not stored, the hashes would +have to be calculated for all entries each time a lookup occurs in a directory. 
=== xfs_db Block Directory Example @@ -529,9 +534,12 @@ allocate a new block for the leaf and freespace index information. * The "leaf" block has a special offset defined by +XFS_DIR2_LEAF_OFFSET+. Currently, this is 32GB and in the extent view, a block offset of -32GB/sb_blocksize. On a 4KB block filesystem, this is 0x800000 (8388608 +32GB / +sb_blocksize+. On a 4KB block filesystem, this is 0x800000 (8388608 decimal). +* Blocks with directory entries ("data" extents) have the magic number "XD2D" +(0x58443244). + * The "data" extents have a new header (no "leaf" data): [source, c] @@ -562,9 +570,12 @@ typedef struct xfs_dir2_leaf_tail { } xfs_dir2_leaf_tail_t; ---- -* The leaves use the +xfs_da_blkinfo_t+ filesystem block header. This header is -used for directory and xref:Extended_Attributes[extended attribute] leaves and -B+tree nodes: +[[Directory_Attribute_Block_Header]] +=== Directory and Attribute Block Headers + +* Leaf nodes in directories and xref:Extended_Attributes[extended attributes] +use the +xfs_da_blkinfo_t+ filesystem block header. The structure appears as +follows: [source, c] ---- @@ -576,11 +587,13 @@ typedef struct xfs_da_blkinfo { } xfs_da_blkinfo_t; ---- +* The magic number of the leaf block is +XFS_DIR2_LEAF1_MAGIC+ (0xd2f1). + * The size of the +ents+ array is specified by +hdr.count+. -* The size of the bests array is specified by the tail.bestcount which is also the -number of "data" blocks for the directory. The bests array maintains each data -block's +bestfree[0].length+ value. +* The size of the +bests+ array is specified by the +tail.bestcount+, which is +also the number of "data" blocks for the directory. The bests array maintains +each data block's +bestfree[0].length+ value. 
.Leaf directory free entry detail image::images/48.png[] @@ -588,7 +601,7 @@ image::images/48.png[] === xfs_db Leaf Directory Example For this example, a directory was created with 256 entries (frame000000.tst to -frame000255.tst) and then deleted some files (frame00005*, frame00018* and +frame000255.tst). Some files were deleted (frame00005*, frame00018* and frame000240.tst) to show free list characteristics. ---- @@ -840,15 +853,19 @@ each "data" block. This is not possible with more than one leaf. * The "data" blocks stay the same as leaf directories. -* The "leaf" blocks eventually change into a B+tree with the generic B+tree header -pointing to directory "leaves" as described in Leaf Directories. The top-level -blocks are called "nodes". It can exist in a state where there is still a single -leaf block before it's split. Interpretation of the node vs. leaf blocks has to -be performed by inspecting the magic value in the header. The combined -leaf/freeindex blocks has a magic value of +XFS_DIR2_LEAF1_MAGIC (0xd2f1)+, a -node directory's leaf/leaves have a magic value of +XFS_DIR2_LEAFN_MAGIC -(0xd2ff)+ and intermediate nodes have a magic value of +XFS_DA_NODE_MAGIC -(0xfebe)+. +* After the "freeindex" data moves to its own block, it is possible for the +leaf data to fit within a single leaf block. This single leaf block has a +magic number of +XFS_DIR2_LEAFN_MAGIC+ (0xd2ff). + +* The "leaf" blocks eventually change into a B+tree with the generic B+tree +header pointing to directory "leaves" as described in +xref:Leaf_Directories[Leaf Directories]. Blocks with leaf data still have the ++LEAFN_MAGIC+ magic number as outlined above. The top-level tree blocks are +called "nodes" and have a magic number of +XFS_DA_NODE_MAGIC+ (0xfebe). + +* Distinguishing between a combined leaf/freeindex block (+LEAF1_MAGIC+), a +leaf-only block (+LEAFN_MAGIC+), and a btree node block (+NODE_MAGIC+) can only +be done by examining the magic number. 
* The new "freeindex" block(s) only contains the bests for each data block. @@ -869,11 +886,17 @@ typedef struct xfs_dir2_free { ---- * The location of the leaf blocks can be in any order, the only way to determine -the appropriate is by the node block hash/before values. Given a hash to lookup, +the appropriate is by the node block hash/before values. Given a hash to look up, you read the node's +btree+ array and first +hashval+ in the array that exceeds the given hash and it can then be found in the block pointed to by the +before+ value. +[[Directory_Attribute_Internal_Node]] +=== Directory and Attribute Internal Nodes + +The hashing B+tree of a directory or an extended attribute fork uses nodes with +the following format: + [source, c] ---- typedef struct xfs_da_intnode { @@ -901,9 +924,9 @@ directory to have a hole at the start. * The freeindex's +hdr.nvalid+ should always be the same as the number of allocated data directory blocks containing name/inode data and will always be -less than or equal to +hdr.nused. hdr.nused+ should be the same as the index of -the last data directory block plus one (i.e. when the last data block is freed, -+nused+ and +nvalid+ are decremented). +less than or equal to +hdr.nused+. The value of +hdr.nused+ should be the same +as the index of the last data directory block plus one (i.e. when the last data +block is freed, +nused+ and +nvalid+ are decremented). .Node directory layout image::images/54.png[] @@ -940,9 +963,10 @@ u.bmx[0-7] = [startoff,startblock,blockcount,extentflag] 0:[0,7368,4,0] As can already be observed, all extents are allocated is multiples of 4 blocks. -Blocks 0 to 19 (16+4-1) are used for the data. Looking at blocks 16-19, it can -seen that it's the same as the single-leaf format, except the +length+ values -are a lot larger to accommodate the increased directory block size: +Blocks 0 to 19 (16+4-1) are used for directory data blocks. 
Looking at blocks +16-19, we can see that it's the same as the single-leaf format, except the ++length+ values are a lot larger to accommodate the increased directory block +size: ---- xfs_db> dblock 16 @@ -994,11 +1018,11 @@ nhdr.level = 1 nbtree[0-1] = [hashval,before] 0:[0xa3a440ac,8388616] 1:[0xf3a440bc,8388612] ---- -The following leaf blocks have been allocated once as XFS knows it needs at two -blocks when allocating a B+tree, so the length is 8 fsblocks. For all hashes -< 0xa3a440ac, they are located in the directory offset 8388616 and hashes -below 0xf3a440bc are in offset 8388612. Hashes above f3a440bc don't exist in -this directory. +The two following leaf blocks were allocated as part of the directory's +conversion to node format. All hashes less than 0xa3a440ac are located at +directory offset 8,388,616, and hashes less than 0xf3a440bc are located at +directory offset 8,388,612. Hashes greater than or equal to 0xf3a440bc don't exist +in this directory. ---- xfs_db> dblock 8388616 @@ -1075,8 +1099,7 @@ fbests[0-4] = 0:0x10 1:0x10 2:0x10 3:0x10 4:0x3f50 Like the Leaf Directory, each of the +fbests+ values correspond to each data block's +bestfree[0].length+ value. -The raw disk layout, old data is not cleared after the array. The fbests array -is highlighted: +The +fbests+ array is highlighted in a raw block dump: [subs="quotes"] ---- @@ -1095,15 +1118,13 @@ TODO: Example with a hole in the middle When the extent map in an inode grows beyond the inode's space, the inode format is changed to a "btree". The inode contains a filesystem block point to the B+tree extent map for the directory's blocks. The B+tree extents contain the -extent map for the "data", "node", "leaf" and "freeindex" information as +extent map for the "data", "node", "leaf", and "freeindex" information as described in Node Directories. Refer to the previous section on B+tree xref:Btree_Extent_List[Data Extents] for more information on XFS B+tree extents.
-The following situations and changes can apply over Node Directories, and apply -here as inode extents generally cannot contain the number of directory blocks -that B+trees can handle: +The following properties apply to both node and B+tree directories: * The node/leaf trees can be more than one level deep. @@ -1219,8 +1240,9 @@ nbtree[0-318] = [hashval,before] 0:[0x70b14711,8388919] ... ---- The leaves at each the end of a node always point to the end leaves in adjacent -nodes. Directory block 8388928 forward pointer is to block 8388919, and vice -versa as highlighted in the following example: +nodes. Directory block 8388928 has a forward pointer to block 8388919 and block +8388919 has a previous pointer to block 8388928, as highlighted in the +following example: [subs="quotes"] ---- diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml index cb5ffe7..856c01d 100644 --- a/design/XFS_Filesystem_Structure/docinfo.xml +++ b/design/XFS_Filesystem_Structure/docinfo.xml @@ -1,7 +1,9 @@ <subtitle>3rd Edition</subtitle> +<!-- <abstract> <para>This book documents the XFS Filesystem Structure</para> </abstract> +--> <corpauthor> </corpauthor> <copyright> @@ -69,4 +71,18 @@ </simplelist> </revdescription> </revision> + <revision> + <revnumber>3</revnumber> + <date>October 2015</date> + <author> + <firstname>Darrick</firstname> + <surname>Wong</surname> + <email></email> + </author> + <revdescription> + <simplelist> + <member>Miscellaneous fixes.</member> + </simplelist> + </revdescription> + </revision> </revhistory> diff --git a/design/XFS_Filesystem_Structure/extended_attributes.asciidoc b/design/XFS_Filesystem_Structure/extended_attributes.asciidoc index 747217c..18a4568 100644 --- a/design/XFS_Filesystem_Structure/extended_attributes.asciidoc +++ b/design/XFS_Filesystem_Structure/extended_attributes.asciidoc @@ -1,20 +1,20 @@ [[Extended_Attributes]] = Extended Attributes -Extended attributes implement the ability for a 
user to attach name:value pairs -to inodes within the XFS filesystem. They could be used to store +Extended attributes enable users and administrators to attach (name: value) +pairs to inodes within the XFS filesystem. They could be used to store meta-information about the file. -The attribute names can be up to 256 bytes in length, terminated by the first 0 +Attribute names can be up to 256 bytes in length, terminated by the first 0 byte. The intent is that they be printable ASCII (or other character set) names -for the attribute. The values can be up to 64KB of arbitrary binary data. Some -XFS internal attributes (eg. parent pointers) use non-printable names for the -attribute. +for the attribute. The values can contain up to 64KB of arbitrary binary data. +Some XFS internal attributes (eg. parent pointers) use non-printable names for +the attribute. Access Control Lists (ACLs) and Data Migration Facility (DMF) use extended attributes to store their associated metadata with an inode. -XFS uses two disjoint attribute name spaces associated with every inode. They +XFS uses two disjoint attribute name spaces associated with every inode. These are the root and user address spaces. The root address space is accessible only to the superuser, and then only by specifying a flag argument to the function call. Other users will not see or be able to modify attributes in the root @@ -27,15 +27,10 @@ To set or delete extended attributes, use the +setfattr+ command. ACLs control should use the +getfacl+ and +setfacl+ commands. XFS attributes supports three namespaces: "user", "trusted" (or "root" using -IRIX terminology) and "secure". +IRIX terminology), and "secure". -The location of the attribute fork in the inode's literal area is specified by -the +di_forkoff+ value in the inode's core. If this value is zero, the inode -does not contain any extended attributes. 
Non-zero, the byte offset into the -literal area = +di_forkoff * 8+, which also determines the 2048 byte maximum -size for an inode. Attributes must be allocated on a 64-bit boundary on the disk -except shortform attributes (they are tightly packed). To determine the offset -into the inode itself, add 100 (0x64) to +di_forkoff * 8+. +See the section about xref:Extended_Attribute_Versions[extended attributes] in +the inode for instructions on how to calculate the location of the attributes. The following four sections describe each of the on-disk formats. @@ -45,7 +40,7 @@ The following four sections describe each of the on-disk formats. When the all extended attributes can fit within the inode's attribute fork, the inode's +di_aformat+ is set to "local" and the attributes are stored in the -inode's literal area starting at offset +di_forkoff * 8+. +inode's literal area starting at offset +di_forkoff × 8+. Shortform attributes use the following structures: @@ -67,28 +62,39 @@ typedef struct xfs_attr_sf_hdr xfs_attr_sf_hdr_t; typedef struct xfs_attr_sf_entry xfs_attr_sf_entry_t; ---- -.Short form attribute layout -image::images/64.png[] - +*totsize*:: +Total size of the attribute structure in bytes. -* +namelen+ and +valuelen+ specify the size of the two byte arrays containing the -name and value pairs. +valuelen+ is zero for extended attributes with no value. +*count*:: +The number of entries that can be found in this structure. -* +nameval[]+ is a single array where it's size is the sum of +namelen+ and -+valuelen+. The names and values are not null terminated on-disk. The value -immediately follows the name in the array. +*namelen* and *valuelen*:: +These values specify the size of the two byte arrays containing the name and +value pairs. +valuelen+ is zero for extended attributes with no value. -* +flags+ specifies the namespace for the attribute (0 = "user"): +*nameval[]*:: +A single array whose size is the sum of +namelen+ and +valuelen+. 
The names and +values are not null terminated on-disk. The value immediately follows the name +in the array. +[[Attribute_Flags]] +*flags*:: +A combination of the following: .Attribute Namespaces [options="header"] |===== | Flag | Description +| 0 | The attribute's namespace is "user". | +XFS_ATTR_ROOT+ | The attribute's namespace is "trusted". | +XFS_ATTR_SECURE+ | The attribute's namespace is "secure". +| +XFS_ATTR_INCOMPLETE+ | This attribute is being modified. +| +XFS_ATTR_LOCAL+ | The attribute value is contained within this block. |===== +.Short form attribute layout +image::images/64.png[] + === xfs_db Short Form Attribute Example A file is created and two attributes are set: @@ -315,8 +321,9 @@ The first part of the "leaf" contains an array of fixed size hash/index pairs with the flags stored as well. The remaining part of the leaf block contains the array name/value pairs, where each element varies in length. -Each leaf is based on the +xfs_da_blkinfo_t+ block header declared in Leaf -Directories. The structure encapsulating all other structures in the +Each leaf is based on the +xfs_da_blkinfo_t+ block header declared in the +section about xref:Directory_Attribute_Block_Header[directories]. The structure +encapsulating all other structures in the attribute block is +xfs_attr_leafblock_t+. The structures involved are: @@ -365,18 +372,14 @@ typedef struct xfs_attr_leafblock { xfs_attr_leaf_name_remote_t valuelist; } xfs_attr_leafblock_t; ---- -</programlisting> - - Each leaf header uses the following magic number: -[source, c] -#define XFS_ATTR_LEAF_MAGIC 0xfbee +Each leaf header uses the magic number +XFS_ATTR_LEAF_MAGIC+ (0xfbee). The hash/index elements in the +entries[]+ array are packed from the top of the block. Name/values grow from the bottom but are not packed. The freemap contains run-length-encoded entries for the free bytes after the +entries[]+ array, but only the three largest runs are stored (smaller runs are dropped). 
When the -+freemap+ doesn't show enough space for an allocation, name/value area is ++freemap+ doesn't show enough space for an allocation, the name/value area is compacted and allocation is tried again. If there still isn't enough space, then the block is split. The name/value structures (both local and remote versions) must be 32-bit aligned. @@ -400,7 +403,7 @@ lookup, the actual name string must be compared. An "incomplete" bit is also used for attribute flags. It shows that an attribute is in the middle of being created and should not be shown to the user if we crash during the time that the bit is set. The bit is cleared when attribute -has finished being setup. This is done because some large attributes cannot +has finished being set up. This is done because some large attributes cannot be created inside a single transaction. === xfs_db Leaf Attribute Example @@ -529,15 +532,18 @@ When the number of attributes exceeds the space that can fit in one filesystem block (ie. hash, flag, name and local values), the first attribute block becomes the root of a B+tree where the leaves contain the hash/name/value information that was stored in a single leaf block. The inode's attribute format itself -remains extent based. The nodes use the +xfs_da_intnode_t+ structure introduced -in Node Directories. - -The location of the attribute leaf blocks can be in any order, the only way to -determine the appropriate is by the node block hash/before values. Given a hash -to lookup, you read the node's btree array and first +hashval+ in the array that -exceeds the given hash and it can then be found in the block pointed to by the +remains extent based. The nodes use the +xfs_da_intnode_t+ or ++xfs_da3_intnode_t+ structures introduced in the section about +xref:Directory_Attribute_Internal_Node[directories]. + +The location of the attribute leaf blocks can be in any order. The only way to +find an attribute is by walking the node block hash/before values. 
Given a hash +to look up, search the node's btree array for the first +hashval+ in the array +that exceeds the given hash. The entry is in the block pointed to by the +before+ value. +Each attribute node block has a magic number of +XFS_DA_NODE_MAGIC+ (0xfebe). + .Node attribute layout image::images/72.png[] @@ -679,11 +685,11 @@ When the attribute's extent map in an inode grows beyond the available space, the inode's attribute format is changed to a "btree". The inode contains root node of the extent B+tree which then address the leaves that contains the extent arrays for the attribute data. The attribute data itself in the allocated -filesystem blocks use the same layout and structures as described in Node -Attributes. +filesystem blocks use the same layout and structures as described in +xref:Node_Attributes[Node Attributes]. -Refer to the previous section on B+tree Data Extents for more information on XFS -B+tree extents. +Refer to the previous section on xref:Btree_Extent_List[B+tree Data Extents] for +more information on XFS B+tree extents. === xfs_db B+tree Attribute Example diff --git a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc index b69bea2..a926857 100644 --- a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc +++ b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc @@ -28,8 +28,9 @@ multiplied by the size of +xfs_dqblk_t+ (136 bytes). .Quota inode layout image::images/76.png[] -Quota information stored in the two inodes (in data extents) are an array of the -+xfs_dqblk+ structure where there is one instance for each ID in the system: +Quota information is stored in the data extents of the two reserved quota +inodes as an array of the +xfs_dqblk+ structures, where there is one array +element for each ID in the system: [source, c] ---- @@ -67,7 +68,7 @@ Specifies the signature where these two bytes are 0x4451 (+XFS_DQUOT_MAGIC+), or "DQ" in ASCII. 
*d_version*:: -Specifies the structure version, currently this is one (+XFS_DQUOT_VERSION+). +The structure version, currently this is 1 (+XFS_DQUOT_VERSION+). *d_flags*:: Specifies which type of ID the structure applies to: @@ -84,33 +85,33 @@ The ID for the quota structure. This will be a uid, gid or projid based on the value of +d_flags+. *d_blk_hardlimit*:: -Specifies the hard limit for the number of filesystem blocks the ID can own. The +The hard limit for the number of filesystem blocks the ID can own. The ID will not be able to use more space than this limit. If it is attempted, +ENOSPC+ will be returned. *d_blk_softlimit*:: -Specifies the soft limit for the number of filesystem blocks the ID can own. +The soft limit for the number of filesystem blocks the ID can own. The ID can temporarily use more space than by +d_blk_softlimit+ up to +d_blk_hardlimit+. If the space is not freed by the time limit specified by ID zero's +d_btimer+ value, the ID will be denied more space until the total blocks owned goes below +d_blk_softlimit+. *d_ino_hardlimit*:: -Specifies the hard limit for the number of inodes the ID can own. The ID will +The hard limit for the number of inodes the ID can own. The ID will not be able to create or own any more inodes if +d_icount+ reaches this value. *d_ino_softlimit*:: -Specifies the soft limit for the number of inodes the ID can own. The ID can -temporarily create or own more inodes than specified by d_ino_softlimit up to -d_ino_hardlimit. If the inode count is not reduced by the time limit specified -by ID zero's d_itimer value, the ID will be denied from creating or owning more -inodes until the count goes below d_ino_softlimit. +The soft limit for the number of inodes the ID can own. The ID can +temporarily create or own more inodes than specified by +d_ino_softlimit+ up to ++d_ino_hardlimit+. 
If the inode count is not reduced by the time limit specified +by ID zero's +d_itimer+ value, the ID will be denied from creating or owning more +inodes until the count goes below +d_ino_softlimit+. *d_bcount*:: -Specifies how many filesystem blocks are actually owned by the ID. +How many filesystem blocks are actually owned by the ID. *d_icount*:: -Specifies how many inodes are actually owned by the ID. +How many inodes are actually owned by the ID. *d_itimer*:: Specifies the time when the ID's +d_icount+ exceeded +d_ino_softlimit+. The soft @@ -130,18 +131,18 @@ is reset back to zero. Specifies how many times a warning has been issued. Currently not used. *d_rtb_hardlimit*:: -Specifies the hard limit for the number of real-time blocks the ID can own. The +The hard limit for the number of real-time blocks the ID can own. The ID cannot own more space on the real-time subvolume beyond this limit. *d_rtb_softlimit*:: -Specifies the soft limit for the number of real-time blocks the ID can own. The +The soft limit for the number of real-time blocks the ID can own. The ID can temporarily own more space than specified by +d_rtb_softlimit+ up to +d_rtb_hardlimit+. If +d_rtbcount+ is not reduced by the time limit specified by ID zero's +d_rtbtimer value+, the ID will be denied from owning more space until the count goes below +d_rtb_softlimit+. *d_rtbcount*:: -Specifies how many real-time blocks are currently owned by the ID. +How many real-time blocks are currently owned by the ID. *d_rtbtimer*:: Specifies the time when the ID's +d_rtbcount+ exceeded +d_rtb_softlimit+. 
The diff --git a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc index 7262178..a887f8e 100644 --- a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc +++ b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc @@ -1,8 +1,8 @@ [[On-disk_Inode]] = On-disk Inode -All files, directories and links are stored on disk with inodes and descend from -the root inode with it's number defined in the xref:Superblocks[superblock]. The +All files, directories, and links are stored on disk with inodes and descend from +the root inode with its number defined in the xref:Superblocks[superblock]. The previous section on xref:AG_Inode_Management[AG Inode Management] describes the allocation and management of inodes on disk. This section describes the contents of inodes themselves. @@ -12,10 +12,10 @@ An inode is divided into 3 parts: .On-disk inode sections image::images/23.png[] -* The core contains what the inode represents, stat data and information +* The core contains what the inode represents, stat data, and information describing the data and attribute forks. -* The +di_u+ "data fork" contains normal data related to the inode. It's contents +* The +di_u+ "data fork" contains normal data related to the inode. Its contents depends on the file type specified by +di_core.di_mode+ (eg. regular file, directory, link, etc) and how much information is contained in the file which determined by +di_core.di_format+. The following union to represent this data is @@ -34,7 +34,7 @@ union { } di_u; ---- -* The di_a "attribute fork" contains extended attributes. Its layout is +* The +di_a+ "attribute fork" contains extended attributes. Its layout is determined by the +di_core.di_aformat+ value. 
Its representation is declared as follows: @@ -82,7 +82,8 @@ struct xfs_dinode_core { __uint32_t di_gid; __uint32_t di_nlink; __uint16_t di_projid; - __uint8_t di_pad[8]; + __uint16_t di_projid_hi; + __uint8_t di_pad[6]; __uint16_t di_flushiter; xfs_timestamp_t di_atime; xfs_timestamp_t di_mtime; @@ -102,7 +103,7 @@ struct xfs_dinode_core { ---- *di_magic*:: -The inode signature where these two bytes are 0x494e, or "IN" in ASCII. +The inode signature; these two bytes are 0x494e, or "IN" in ASCII. *di_mode*:: Specifies the mode access bits and type of file using the standard S_Ixxx values @@ -115,12 +116,11 @@ values in the inode core. Initially, inodes are created as v1 but can be converted on the fly to v2 when required. *di_format*:: - Specifies the format of the data fork in conjunction with the +di_mode+ type. This can be one of several values. For directories and links, it can be "local" -where all metadata associated with the file is within the inode, "extents" where +where all metadata associated with the file is within the inode; "extents" where the inode contains an array of extents to other filesystem blocks which contain -the associated metadata or data or "btree" where the inode contains a B+tree +the associated metadata or data; or "btree" where the inode contains a B+tree root node which points to filesystem blocks containing the metadata or data. Migration between the formats depends on the amount of metadata associated with the inode. "dev" is used for character and block devices while "uuid" is @@ -150,15 +150,18 @@ Specifies the owner's GID of the inode. *di_nlink*:: Specifies the number of links to the inode from directories. This is maintained -for both inode versions for current versions of XFS. Old versions of XFS did not -support v2 inodes, and therefore this value was never updated and was classed as -reserved space (part of +di_pad+). +for both inode versions for current versions of XFS. Prior to v2 inodes, this +field was part of +di_pad+. 
*di_projid*:: Specifies the owner's project ID in v2 inodes. An inode is converted to v2 if the project ID is set. This value must be zero for v1 inodes. -*di_pad[8]*:: +*di_projid_hi*:: +Specifies the high 16 bits of the owner's project ID in v2 inodes, if the ++XFS_SB_VERSION2_PROJID32BIT+ feature is set; and zero otherwise. + +*di_pad[6]*:: Reserved, must be zero. *di_flushiter*:: @@ -167,8 +170,8 @@ Incremented on flush. *di_atime*:: Specifies the last access time of the files using UNIX time conventions the -following structure. This value maybe undefined if the filesystem is mounted -with the "noatime" option. +following structure. This value may be undefined if the filesystem is mounted +with the "noatime" option. XFS supports timestamps with nanosecond resolution: [source, c] ---- @@ -240,7 +243,6 @@ following values: [options="header"] |===== | Flag | Description -| +XFS_SB_VERSION_ATTRBIT+ | Set if any inode have extended attributes. | +XFS_DIFLAG_REALTIME+ | The inode's data is located on the real-time device. | +XFS_DIFLAG_PREALLOC+ | The inode's extents have been preallocated. | +XFS_DIFLAG_NEWRTBM+ | @@ -269,6 +271,12 @@ For directory inodes, new inodes inherit the +di_extsize+ value. | +XFS_DIFLAG_NODEFRAG+ | Specifies the inode is to be ignored when defragmenting the filesystem. +| +XFS_DIFLAG_FILESTREAMS+ | +Use the filestream allocator. The filestreams allocator allows a directory to +reserve an entire allocation group for exclusive use by files created in that +directory. Files in other directories cannot use AGs reserved by other +directories. + |===== *di_gen*:: @@ -280,16 +288,15 @@ can change by unlinking and creating a new file that reuses the inode. [[Unlinked_Pointer]] == Unlinked Pointer -The +di_next_unlinked+ value in the inode is used to track inodes that have been -unlinked (deleted) but which are still referenced. 
When an inode is unlinked and -there is still an outstanding reference, the inode is added to one of the -xref:AG_Inode_Management[AGI's] +agi_unlinked+ hash buckets. The AGI unlinked -bucket points to an inode and the +di_next_unlinked+ value points to the next -inode in the chain. The last inode in the chain has +di_next_unlinked+ set to -NULL (-1). +The +di_next_unlinked+ value in the inode is used to track inodes that have +been unlinked (deleted) but are still held open by a program. When an inode is +in this state, the inode is added to one of the xref:AG_Inode_Management[AGI's] ++agi_unlinked+ hash buckets. The AGI unlinked bucket points to an inode and the ++di_next_unlinked+ value points to the next inode in the chain. The last inode +in the chain has +di_next_unlinked+ set to NULL (-1). Once the last reference is released, the inode is removed from the unlinked hash -chain, and +di_next_unlinked+ is set to NULL. In the case of a system crash, XFS +chain and +di_next_unlinked+ is set to NULL. In the case of a system crash, XFS recovery will complete the unlink process for any inodes found in these lists. The only time the unlinked fields can be seen to be used on disk is either on an @@ -372,8 +379,8 @@ This is accessed by casting the return value from +XFS_DFORK_DPTR+ to +char*+. block, the inode contains the extents to these filesystem blocks (+xfs_bmbt_rec_t*+). -Details for symbolic links is covered in the xref:Symbolic_Links[Symbolic Links] -later on. +Details for symbolic links are covered in the section about +xref:Symbolic_Links[Symbolic Links]. [[Other_File_Types]] === Other File Types @@ -390,16 +397,16 @@ For character and block devices (+S_IFCHR+ and +S_IFBLK+), cast the value from The attribute fork in the inode always contains the location of the extended attributes associated with the inode.
-The location of the attribute fork in the inode's literal area (offset 100 to -the end of the inode) is specified by the +di_forkoff+ value in the inode's -core. If this value is zero, the inode does not contain any extended attributes. -Non-zero, the byte offset into the literal area = +di_forkoff+ * 8, which also -determines the 2048 byte maximum size for an inode. Attributes must be allocated -on a 64-bit boundary on the disk. To access the extended attributes in code, use -the +XFS_DFORK_PTR+ macro specifying +XFS_ATTR_FORK+ for the "which" parameter. -Alternatively, the +XFS_DFORK_APTR+ macro can be used. +The location of the attribute fork in the inode's literal area is specified by +the +di_forkoff+ value in the inode's core. If this value is zero, the inode +does not contain any extended attributes. If non-zero, the attribute fork's +byte offset into the literal area can be computed from +di_forkoff × 8+. +Attributes must be allocated on a 64-bit boundary on the disk. To access the +extended attributes in code, use the +XFS_DFORK_PTR+ macro specifying ++XFS_ATTR_FORK+ for the "which" parameter. Alternatively, the +XFS_DFORK_APTR+ +macro can be used. -Which structure in the attribute fork is used depends on the +di_aformat+ value +The structure of the attribute fork depends on the +di_aformat+ value in the inode. It can be one of the following values: * +XFS_DINODE_FMT_LOCAL+: The extended attributes are contained entirely within @@ -431,7 +438,7 @@ space is split between +di_u+ and +di_a+ forks which also determines how the With "attr1" attributes, the +di_forkoff+ is set to somewhere in the middle of the space between the core and end of the inode and never changes (which has the effect of artificially limiting the space for data information). As the data -fork grows, when it gets to +di_forkoff+, it will move the data to the level +fork grows, when it gets to +di_forkoff+, it will move the data to the next format level (ie. local < extent < btree). 
If very little space is used for either attributes or data, then a good portion of the available inode space is wasted with this version. @@ -446,3 +453,5 @@ The following diagram compares the two versions: .Extended attribute layouts image::images/30.png[] +Note that because +di_forkoff+ is an 8-bit value measuring units of 8 bytes, +the maximum size of an inode is 2^8^ × 2^3^ = 2^11^ = 2048 bytes.
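For reviewers checking the di_forkoff arithmetic in the last hunk, here is a minimal sketch of the calculation in C. It is not code from the XFS tree. The helper names and the LITERAL_AREA_START constant are made up for illustration, and the value 100 (0x64) is taken from the sentence this patch removes from extended_attributes.asciidoc ("add 100 (0x64) to di_forkoff * 8"), so it only holds for the inode layout that sentence describes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the literal area starts 100 (0x64) bytes into the inode,
 * as stated in the text this patch removes. Hypothetical name. */
#define LITERAL_AREA_START 100u

/* di_forkoff counts units of 8 bytes (2^3). */
#define FORKOFF_UNIT 8u

/* Byte offset of the attribute fork within the literal area.
 * A di_forkoff of zero means the inode has no extended attributes. */
static inline unsigned attr_fork_literal_offset(uint8_t di_forkoff)
{
    return (unsigned)di_forkoff * FORKOFF_UNIT;
}

/* Byte offset of the attribute fork from the start of the inode.
 * Callers must first check that an attribute fork exists. */
static inline unsigned attr_fork_inode_offset(uint8_t di_forkoff)
{
    assert(di_forkoff != 0);
    return LITERAL_AREA_START + attr_fork_literal_offset(di_forkoff);
}

/* The bound quoted in the patch: an 8-bit count of 8-byte units
 * gives 2^8 * 2^3 = 2^11 bytes. */
static inline unsigned forkoff_addressable_bytes(void)
{
    return (1u << 8) * FORKOFF_UNIT;
}
```

With a made-up di_forkoff of 15, the attribute fork would sit 120 bytes into the literal area, which is 220 bytes into the inode under the assumption above.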