Re: [PATCH] [RFC] xfs: filesystem expansion design documentation

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 25 Jul 2024 11:42:41 +1000

On Wed, Jul 24, 2024 at 05:38:46PM -0500, Eric Sandeen wrote:
> On 7/24/24 5:12 PM, Darrick J. Wong wrote:
> > On Wed, Jul 24, 2024 at 05:06:58PM -0500, Eric Sandeen wrote:
> >> On 7/21/24 6:01 PM, Dave Chinner wrote:
> >>> +The solution should already be obvious: we can exploit the sparseness of FSBNO
> >>> +addressing to allow AGs to grow to 1TB (maximum size) simply by configuring
> >>> +sb_agblklog appropriately at mkfs.xfs time. Hence if we have 16MB AGs (minimum
> >>> +size) and sb_agblklog = 30 (1TB max AG size), we can expand the AG size up
> >>> +to their maximum size before we start appending new AGs.
> >>
> >> there's a check in xfs_validate_sb_common() that tests whether sg_agblklog is
> >> really the next power of two from sb_agblklog:
> >>
> >> sbp->sb_agblklog != xfs_highbit32(sbp->sb_agblocks - 1) + 1
> >>
> >> so I think the proposed idea would require a feature flag, right?
> >>
> >> That might make it a little trickier as a drop-in replacement for cloud
> >> providers because these new expandable filesystem images would only work on
> >> kernels that understand the (trivial) new feature, unless I'm missing
> >> something.

Yes, I think that's the case.

I don't see this as a showstopper - golden images that VMs get
cloned from tend to be specific to the distro release they contain,
so as long as both the orchestration nodes and the distro kernel
within the image support the feature bit it will just work, yes?

This essentially makes it an image build time decision to support
the expand feature.  i.e. if the fs in the image has the feature bit
set, deployment can use expand, otherwise skip it and go straight to
grow.

> > agblklog and agblocks would both be set correctly if you did mkfs.xfs -d
> > agsize=1T on a small image; afaict it's only mkfs that cares that
> > dblocks >= agblocks.

I don't think we can change sb_agblocks like this 

Fundamentally, sb_agblocks is the physical length of AGs, not the
theoretical maximum size. Setting it to the maximum possible size
fails in all sorts of ways for multiple AG filesystems
(__xfs_ag_block_count() is the simplest example to point out), and
even on single AG filesystems it will be problematic.

sb->sb_agblklog is not a physical size - it is decoupled from the
physical size of the AG so it can be used to calculate the address
space locations of inodes and FSBNOs. Hence we can change
sb_agblklog without changing sb_agblocks and other physical lengths
because it doesn't change anything to do with the physical layout of
the filesystem.

> Yes, in a single AG image like you've suggested.

I'm not so sure about that.

We're not so lucky with code that validate against the AG header
lengths or just use sb_agblocks blindly.

e.g. for single AG fs's to work, they need to use
__xfs_ag_block_count() everywhere to get the size of the AG and have
it return sb_dblocks instead.  We have code that assumes that
sb_agblocks is the highest AGBNO possible in an AG and doesn't take
into account that the size of the runt AG (e.g.
xfs_alloc_ag_max_usable() is one of these).

I've always understood there to be an implicit requirement for
sb_agblocks and agf->agf_length/agi->agi_length to be the same and
define the actual maximum valid block number in the AG.

I don't think we can change that, and I don't think we should try to
change this even for single AG filesystems. We don't even support
single AG filesystems. And I don't think we should try to change
such a fundamental physical layout definition for what would be a
filesystem format that only physically exists temporarily in a cloud
deployment context.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx