Re: [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 9 Feb 2015 08:57:10 +1100

On Sun, Feb 08, 2015 at 11:06:01AM -0500, Brian Foster wrote:
> On Sat, Feb 07, 2015 at 10:16:15AM +1100, Dave Chinner wrote:
> > On Fri, Feb 06, 2015 at 02:52:51PM -0500, Brian Foster wrote:
> > > The inode btrees track 64 inodes per record, regardless of inode size.
> > > Thus, inode chunks on disk vary in size depending on the size of the
> > > inodes. This creates a contiguous allocation requirement for new inode
> > > chunks that can be difficult to satisfy on an aged and fragmented (free
> > > space) filesystem.
> > > 
> > > The inode record freecount currently uses 4 bytes on disk to track the
> > > free inode count. With a maximum freecount value of 64, only one byte is
> > > required. Convert the freecount field to a single byte and use two of
> > > the remaining 3 higher order bytes left for the hole mask field. Use
> > > the final leftover byte for the total count field.
> > > 
> > > The hole mask field tracks holes in the chunks of physical space that
> > > the inode record refers to. This facilitates the sparse allocation of
> > > inode chunks when contiguous chunks are not available and allows the
> > > inode btrees to identify what portions of the chunk contain valid
> > > inodes. The total count field contains the total number of valid inodes
> > > referred to by the record. This can also be deduced from the hole mask.
> > > The count field provides clarity and redundancy for internal record
> > > verification.
> > > 
> > > Note that both fields are initialized to zero to maintain backwards
> > > compatibility with existing filesystems (e.g., the higher order bytes of
> > > freecount are always 0). Tracking holes means that the hole mask is
> > > initialized to zero and thus remains "valid" in accordance with a
> > > non-sparse inode fs when no sparse chunks are physically allocated.
> > > Update the inode record management functions to handle the new fields
> > > and initialize to zero.
> > > 
> > > [XXX: The count field breaks backwards compatibility with !sparseinode
> > > fs. Should we reconsider the addition of total count or the idea of
> > > converting back and forth between sparse inode fs with a feature bit?]
> > > 
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h       |  8 ++++++--
> > >  fs/xfs/libxfs/xfs_ialloc.c       | 12 ++++++++++--
> > >  fs/xfs/libxfs/xfs_ialloc_btree.c |  4 +++-
> > >  3 files changed, 19 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 26e5d92..6c2f1be 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -1295,13 +1295,17 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
> > >   */
> > >  typedef struct xfs_inobt_rec {
> > >  	__be32		ir_startino;	/* starting inode number */
> > > -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > > +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> > > +	__u8		ir_count;	/* total inode count */
> > > +	__u8		ir_freecount;	/* count of free inodes (set bits) */
> > >  	__be64		ir_free;	/* free inode mask */
> > >  } xfs_inobt_rec_t;
> > 
> > I think I'd prefer to see a union here so that we explicitly state
> > what the difference in the on-disk format is. i.e. similar to how we
> > express the difference in long and short btree block headers.
> > 
> > struct xfs_inobt_rec {
> > 	__be32		ir_startino;	/* starting inode number */
> > 	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > 	union {
> > 		struct {
> > 			__be32	ir_freecount;
> > 		} f;
> > 		struct {
> > 			__be16	ir_holemask;	/* hole mask for sparse chunks */
> > 			__u8	ir_count;	/* total inode count */
> > 			__u8	ir_freecount;	/* count of free inodes (set bits) */
> > 		} sp;
> > 	}
> > 	__be64		ir_free;	/* free inode mask */
> > };
> > 
> > This will prevent us from using the wrong method of
> > referencing/modifying the record because we now need to be explicit
> > in how we modify it...
> > 
> 
> I agree in principle. This is definitely explicit in that there are two
> variants of this structure that must be handled in a particular way.
> That said, 'sparse' and 'full' aren't quite logical differentiators
> given that we now have the ir_count field and that it is used whenever
> sparse inode support is enabled. In other words, a sparse enabled fs'
> always uses the 'sp' variant of the inobt record, regardless of whether
> the record happens to sparse.

Sure, that's fine for the in memory code, but we've always been
explicit about the disk format and marshalling to/from it,
especially in the cases where the structure on disk can have
multiple formats.

> > > @@ -82,7 +84,9 @@ xfs_inobt_update(
> > >  	union xfs_btree_rec	rec;
> > >  
> > >  	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
> > > -	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
> > > +	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
> > > +	rec.inobt.ir_count = irec->ir_count;
> > > +	rec.inobt.ir_freecount = irec->ir_freecount;
> > >  	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
> > >  	return xfs_btree_update(cur, &rec);
> > >  }
> > 
> > Hmmm - perhaps a similar set of helpers for sparse inode enabled
> > filesystems
> > 
> 
> We could do that for the insert/update helpers, but note again the
> separate helpers would not split along the lines of sparse vs full
> chunks. Even if we were to change the design such that they do, that
> would have th effect of complicating some of the subsequent code. For

I suspect it might, but I really don't like the idea of writing
fields that don't actually exist (and hence are always zero) in
the on-disk format to disk. See, for example, the v5 superblock
field writes are conditional in xfs_sb_to_disk, the v3 inode field
writes are conditional in xfs_dinode_to_disk, etc.

> example, the merge/update code currently has no reason to care about
> whether it has created a full chunk out of multiple sparse chunks. This
> would require more code to make that distinction simply to pick the
> correct api, for something that can easily be done with a simple generic
> helper. The same goes for things like the finobt, where now
> lookup/update/insert has to make distinctions depending on the type of
> inode record.

That was another thing that wasn't clear to me - does the finobt
record format change on disk? I don't think it does, as it only
holds a free count and a free inode bitmask, so it doesn't care
which bits of the chunk are allocated or not....

> An alternate approach could be to stick with the generic helpers, but
> try and tweak them to use the appropriate on-disk format depending on
> the record. Then again, how can one really identify one from the other
> in the lookup or _get_rec() scenarios?

Solving these problem was exactly why I suggested different helpers
- the btree ops structure that is chosen for the cursor is in a
place where we know what format we are using, and that avoids
needing "if (xfs_sb_version...)" checks in the helpers to determine
what format we need to use.

> > > @@ -118,6 +124,8 @@ xfs_inobt_insert_rec(
> > >  	xfs_inofree_t		free,
> > >  	int			*stat)
> > >  {
> > > +	cur->bc_rec.i.ir_holemask = 0;
> > > +	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
> > >  	cur->bc_rec.i.ir_freecount = freecount;
> > >  	cur->bc_rec.i.ir_free = free;
> > >  	return xfs_btree_insert(cur, stat);
> > 
> > That would make this sort of thing very clear - this doesn't look
> > like it would work for a sparse chunk record...
> 
> Right... but this is just a plumbing patch and effectively a placeholder
> for the subsequent patches that implement the mechanism. I suppose the
> api could have been pulled back sooner into this patch, but I'd rather
> not reshuffle things just for that intermediate state. That context
> probably wasn't quite clear to me at the time.
> 
> Before I go one way or another here with regard to the on-disk data
> structure, care to take a look at the subsequent patch(es) that use
> these helpers? Patch 10 in particular pretty much sums up how these
> helpers are used for sparse inode record management.

Yup, I'll get to it.

> FWIW, comments on the generic bitmap stuff is also appreciated as that's
> another area I'm not totally convinved is done right. :)

Why use the generic bitmap stuff rather than the existing XFS code?
If that's just an in-memory change to the free mask processing, then
it can be done separately to the sparse inode functionality, and
probably should be done first in the series....

-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs