[RFD 06/17] xfs: partial inode chunk allocation

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 12 Aug 2013 23:19:56 +1000

From: Dave Chinner <dchinner@xxxxxxxxxx>

When a filesystem ages or when certain workloads dominate the storage capacity
of the filesystem, it can become difficult to find contiguous free space in the
filesystem and hence inode allocation can fail long before the filesystem is out
of space.

To avoid this problem, we need to be able to use smaller extents in the
filesystem to hold inodes than the size needed to hold a full chunk. To enable
this, we need to keep track of the region of the inode chunk that has actually
been allocated in the inode allocation record itself. The inobt record contains
a free inode count field that uses 32 bits of space, but has a maximum possible
value of 64. Hence there are many bits in the field that we can repurpose for
an "allocated regions" mask.

To simplify the implementation and checking of the field, split the 32 bit field
into an 8 bit count variable in the same location as the existing count (i.e.
the LSB of the 32 bit variable, remembering that XFS is big endian on disk), an 8
bit pad field and a 16 bit mask field that contains the allocated extent
tracking.
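
As a rough userspace illustration of the layout described above (not kernel
code), the sketch below stores an old-style 32 bit big endian free count and
reads it back through the split layout. The names sketch_inobt_counts,
sketch_put_be32 and sketch_from_old are hypothetical, and plain byte arrays
stand in for the kernel's __be16/__u8 types.

```c
#include <stdint.h>
#include <string.h>

/*
 * Userspace stand-in for the proposed split of the 32 bit field.
 * Every member is a byte, so the struct is 4 bytes with no padding
 * and mirrors the on-disk byte order directly.
 */
struct sketch_inobt_counts {
	uint8_t	alloc_mask[2];	/* 16 bit mask, stored big endian */
	uint8_t	pad;		/* unused, zero */
	uint8_t	freecount;	/* free inode count, at most 64 */
};

/* Encode a 32 bit value as 4 raw big endian bytes, as on disk. */
static void sketch_put_be32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}

/* Reinterpret the old on-disk bytes using the new layout. */
static struct sketch_inobt_counts sketch_from_old(uint32_t old_freecount)
{
	uint8_t raw[4];
	struct sketch_inobt_counts c;

	sketch_put_be32(raw, old_freecount);
	memcpy(&c, raw, sizeof(c));
	return c;
}
```

Because the old count never exceeds 64, it always lands entirely in the last
byte, so a record written by an existing kernel reads back with the same count
and a zero mask.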

As we have 16 bits in the mask, each bit represents 4 inodes and hence that
defines the minimum allocation size we can support. In all cases, this will
limit the largest contiguous allocation required to 2 blocks for a new chunk, as the
minimum filesystem block size is limited by mkfs to being twice the inode size.
In most common configurations, a single block will contain more than 4
inodes and so this isn't a major limitation at all.
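
The arithmetic above can be sketched as follows. This is only an
illustration: the SKETCH_* constants and sketch_min_alloc_blocks() are
hypothetical names, and the sizes used in the usage note are example values
rather than a statement of what mkfs accepts.

```c
#include <stdint.h>

#define SKETCH_INODES_PER_CHUNK	64
#define SKETCH_MASK_BITS	16
/* Each mask bit covers 64 / 16 = 4 inodes. */
#define SKETCH_INODES_PER_BIT	(SKETCH_INODES_PER_CHUNK / SKETCH_MASK_BITS)

/*
 * Number of filesystem blocks needed to hold the inodes covered by a
 * single mask bit, i.e. the smallest allocation the mask can describe.
 */
static uint32_t sketch_min_alloc_blocks(uint32_t inode_size,
					uint32_t block_size)
{
	uint32_t bytes = SKETCH_INODES_PER_BIT * inode_size;

	/* round up to whole blocks */
	return (bytes + block_size - 1) / block_size;
}
```

With a block size of twice the inode size (say 256 byte inodes in 512 byte
blocks) this gives 2 blocks; with 256 byte inodes in 4096 byte blocks it gives
a single block.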

Hence during extent allocation for the inode chunk, if we cannot find an aligned
and contiguous extent, we can settle for something that is as large as possible
and mask off the region that we weren't able to allocate. When freeing the
chunk, we'll also know what extent we need to free. And for untrusted inode
number lookup, we can determine if the inode number falls into the invalid part
of the chunk.
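
A minimal sketch of the untrusted lookup check, assuming the mask has already
been converted to host endian and that bit N of the mask covers inodes 4N to
4N+3 of the chunk. That bit ordering is an assumption made for illustration,
as is the helper name sketch_ino_is_valid().

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INODES_PER_CHUNK	64
#define SKETCH_INODES_PER_BIT	4

/*
 * Return true if ino lies inside the chunk starting at startino and in
 * a region that has inodes behind it. A set mask bit marks a region
 * that was never allocated.
 */
static bool sketch_ino_is_valid(uint32_t startino, uint16_t alloc_mask,
				uint32_t ino)
{
	uint32_t offset;

	if (ino < startino || ino - startino >= SKETCH_INODES_PER_CHUNK)
		return false;	/* not in this chunk at all */

	offset = ino - startino;
	return !(alloc_mask & (1u << (offset / SKETCH_INODES_PER_BIT)));
}
```

For example, with a mask of 0xFF00 only the first 32 inodes of the chunk are
accepted and the remaining 32 are rejected as invalid.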

Further, to avoid needing to do multiple extent allocations for "sparse" inode
chunks, if we allocate an extent that overlaps an existing partial inode chunk,
we can simply update the mask and free count to indicate that there are multiple
valid extents in the chunk. This gives us a potential route for partial inode
chunks to be made whole via ongoing filesystem modification or a forced scan
once space has been made available.
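
The mask and free count update described above might look something like the
sketch below. It assumes a host endian copy of the record, that the newly
backed regions contain only free inodes, and the same hypothetical bit
ordering of 4 inodes per bit starting from bit 0. struct sketch_rec and
sketch_fill_regions() are illustrative names only.

```c
#include <stdint.h>

#define SKETCH_MASK_BITS	16
#define SKETCH_INODES_PER_BIT	4

/* Host endian copy of the fields being updated. */
struct sketch_rec {
	uint16_t	alloc_mask;	/* set bit = region not allocated */
	uint8_t		freecount;	/* free inodes in valid regions */
};

/*
 * Mark regions [first, first + count) as backed by a newly allocated
 * extent. Regions that were already valid are left untouched, so the
 * free count only grows for regions that actually change state.
 */
static void sketch_fill_regions(struct sketch_rec *rec, unsigned int first,
				unsigned int count)
{
	unsigned int i;

	for (i = first; i < first + count && i < SKETCH_MASK_BITS; i++) {
		uint16_t bit = (uint16_t)(1u << i);

		if (rec->alloc_mask & bit) {
			rec->alloc_mask &= (uint16_t)~bit;
			rec->freecount += SKETCH_INODES_PER_BIT;
		}
	}
}
```

Once every bit has been cleared this way the record is indistinguishable from
a fully allocated chunk.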

To make this as close to transparent as possible, use a value of 0 to indicate
that there are valid inodes in this location, and a value of 1 to indicate that
it is an invalid region. This means that the filesystem will be backwards
compatible with existing kernels and userspace up until the first partial chunk
is allocated. At that point, we need to set an incompatible feature flag as
older kernels and userspace are unable to interpret the value in the "free
inodes" field correctly. This also means that if we scan the inode btrees and
determine that there are no partial inode chunks, we can remove the feature
bit...
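
To illustrate why the incompatible flag is only needed once a partial chunk
exists, the sketch below computes what an older kernel would read from the
32 bit field. The helper names are hypothetical and the values are host
endian.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The value an old kernel sees when it reads the whole field as a
 * single 32 bit count: the mask occupies the top 16 bits, the pad the
 * next 8, and the count the low 8.
 */
static uint32_t sketch_old_view(uint16_t alloc_mask, uint8_t freecount)
{
	return ((uint32_t)alloc_mask << 16) | freecount;
}

/* A record is partial if any region is marked as not allocated. */
static bool sketch_rec_is_partial(uint16_t alloc_mask)
{
	return alloc_mask != 0;
}
```

With a zero mask the old view is exactly the free count, so nothing changes
for existing kernels. Any set mask bit pushes the old view far beyond 64,
which is what an older kernel would misinterpret.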

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/xfs_ialloc_btree.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index 3ac36b76..75ee794 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -48,7 +48,9 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
  */
 typedef struct xfs_inobt_rec {
 	__be32		ir_startino;	/* starting inode number */
-	__be32		ir_freecount;	/* count of free inodes (set bits) */
+	__be16		ir_alloc_mask;	/* allocated region mask, 0 = valid */
+	__u8		ir_pad;		/* unused padding */
+	__u8		ir_freecount;	/* count of free inodes (set bits) */
 	__be64		ir_free;	/* free inode mask */
 } xfs_inobt_rec_t;
 
-- 
1.8.3.2
