From: Dave Chinner <dchinner@xxxxxxxxxx>

When a filesystem ages or when certain workloads dominate the storage
capacity of the filesystem, it can become difficult to find contiguous
free space in the filesystem and hence inode allocation can fail long
before the filesystem is out of space.

To avoid this problem, we need to be able to use smaller extents in the
filesystem to hold inodes than the size needed to hold a full chunk. To
enable this, we need to keep track of the region of the inode chunk that
has actually been allocated in the inode allocation record itself.

The inobt record contains a free inode count field that uses 32 bits of
space, but has a maximum possible value of 64. Hence there are many bits
in the field that we can repurpose for an "allocated regions" mask.

To simplify the implementation and checking of the field, split the 32
bit field into an 8 bit count variable in the same location as the
existing count (i.e. the LSB of the 32 bit variable, remembering that
XFS is big endian on disk), an 8 bit pad field and a 16 bit mask field
that contains the allocated extent tracking.

As we have 16 bits in the mask covering a 64 inode chunk, each bit
represents 4 inodes and hence that defines the minimum allocation size
we can support. In all cases, this will limit the largest contiguous
allocation required to 2 blocks for a new chunk, as the minimum
filesystem block size is limited by mkfs to being twice the inode size.
In most common configurations, a single block will contain more than 4
inodes and so this isn't a major limitation at all.

Hence during extent allocation for the inode chunk, if we cannot find an
aligned and contiguous extent, we can settle for something that is as
large as possible and mask off the region that we weren't able to
allocate. When freeing the chunk, we'll also know what extent we need to
free. And for untrusted inode number lookup, we can determine if the
inode number falls into the invalid part of the chunk.

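As a rough illustration of the field split and the 4-inodes-per-bit
mapping described above, here is a minimal userspace sketch. It is not
part of the patch: every name prefixed with sketch_ is hypothetical, the
input is assumed to have already been converted from big endian to host
endian, and the choice that mask bit 0 covers inodes 0-3 of the chunk is
an assumption, as the description does not fix the bit ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants; the names are hypothetical, not from the patch. */
#define SKETCH_INODES_PER_CHUNK		64
#define SKETCH_ALLOC_MASK_BITS		16
#define SKETCH_INODES_PER_MASK_BIT \
	(SKETCH_INODES_PER_CHUNK / SKETCH_ALLOC_MASK_BITS)	/* 4 */

/*
 * Host-endian view of the three fields carved out of the old 32 bit
 * ir_freecount. On disk the mask comes first, so the count stays in
 * the least significant byte of the old big endian field.
 */
struct sketch_inobt_fields {
	uint16_t	alloc_mask;	/* bit set = region is NOT allocated */
	uint8_t		pad;
	uint8_t		freecount;
};

/* Split the old 32 bit value (already in host endian) into its parts. */
static struct sketch_inobt_fields sketch_decode(uint32_t old_freecount)
{
	struct sketch_inobt_fields f;

	f.alloc_mask = (uint16_t)(old_freecount >> 16);
	f.pad = (uint8_t)(old_freecount >> 8);
	f.freecount = (uint8_t)old_freecount;
	return f;
}

/*
 * Is the inode at this offset within the chunk backed by real space?
 * Assumes mask bit N covers inode offsets 4N .. 4N+3.
 */
static bool sketch_inode_is_allocated(uint16_t alloc_mask, unsigned int offset)
{
	assert(offset < SKETCH_INODES_PER_CHUNK);
	return !(alloc_mask & (1u << (offset / SKETCH_INODES_PER_MASK_BIT)));
}
```

Under these assumptions, an old-format record with a free count of 64
decodes to a zero mask, i.e. a fully allocated chunk, which is what keeps
existing records valid. An untrusted lookup would reject any inode whose
offset maps to a set mask bit.
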
Further, to avoid needing to do multiple extent allocations for "sparse"
inode chunks, if we allocate an extent that overlaps an existing partial
inode chunk, we can simply update the mask and free count to indicate
that there are multiple valid extents in the chunk. This gives us a
potential route for partial inode chunks to be made whole via ongoing
filesystem modification or a forced scan once space has been made
available.

To make this as close to transparent as possible, use a mask bit value
of 0 to indicate that there are valid inodes in this location, and a
value of 1 to indicate that it is an invalid region. This means that the
filesystem will be backwards compatible with existing kernels and
userspace up until the first partial chunk is allocated. At that point,
we need to set an incompatible feature flag as older kernels and
userspace are unable to interpret the value in the "free inodes" field
correctly. This also means that if we scan the inode btrees and
determine that there are no partial inode chunks, we can remove the
feature bit...

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/xfs_ialloc_btree.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index 3ac36b76..75ee794 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -48,7 +48,9 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
  */
 typedef struct xfs_inobt_rec {
 	__be32		ir_startino;	/* starting inode number */
-	__be32		ir_freecount;	/* count of free inodes (set bits) */
+	__be16		ir_alloc_mask;
+	__u8		ir_pad;
+	__u8		ir_freecount;
 	__be64		ir_free;	/* free inode mask */
 } xfs_inobt_rec_t;
 
-- 
1.8.3.2
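
The mask and free count update described in the commit message, for an
allocation that overlaps an existing partial chunk, could look roughly
like the following userspace sketch. It is not part of the patch: all
names prefixed with sketch_ are hypothetical, the record is shown in
host endian, and the matching update to the ir_free bitmap is left out
for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INODES_PER_MASK_BIT	4	/* 64 inodes / 16 mask bits */

/* Hypothetical host-endian view of the relevant record fields. */
struct sketch_sparse_rec {
	uint16_t	alloc_mask;	/* bit set = 4 inode region is invalid */
	uint8_t		freecount;	/* free inodes in the valid regions */
};

/*
 * Mark the regions named in new_regions (bit set = newly allocated) as
 * valid and account for their inodes as free. Regions that were already
 * valid are ignored. Returns the number of inodes added to the record.
 */
static unsigned int sketch_fill_regions(struct sketch_sparse_rec *rec,
					uint16_t new_regions)
{
	uint16_t filled = rec->alloc_mask & new_regions;
	unsigned int added = 0;
	uint16_t m;

	/* Count the regions being filled, one set bit at a time. */
	for (m = filled; m; m &= (uint16_t)(m - 1))
		added += SKETCH_INODES_PER_MASK_BIT;

	assert(rec->freecount + added <= 64);
	rec->alloc_mask &= (uint16_t)~filled;
	rec->freecount = (uint8_t)(rec->freecount + added);
	return added;
}

/* A record needs the incompat feature flag only while its mask is non-zero. */
static bool sketch_rec_is_sparse(const struct sketch_sparse_rec *rec)
{
	return rec->alloc_mask != 0;
}
```

A scan that finds sketch_rec_is_sparse() false for every record would
correspond to the case where the feature bit could be removed again.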