Re: [PATCH] [RFC] xfs: filesystem expansion design documentation

On Wed, Jul 24, 2024 at 05:50:18PM -0500, Eric Sandeen wrote:
> On 7/24/24 4:08 PM, Darrick J. Wong wrote:
> > On Wed, Jul 24, 2024 at 10:46:15AM +1000, Dave Chinner wrote:
> 
> ...
> 
> > Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
> > we allowed people to create single-AG filesystems with large(ish)
> > sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
> > size and copy your 2GB of data into the filesystem.  At deploy time,
> > growfs will expand AG 0 to 100G and add new AGs after that, same as it
> > does now.
> 
> And that could be done online...
> 
> > I think all we'd need is to add a switch to mkfs to tell it that it's
> > creating one of these gold master images, which would disable this
> > check:
> > 
> > 	if (agsize > dblocks) {
> > 		fprintf(stderr,
> > 	_("agsize (%lld blocks) too big, data area is %lld blocks\n"),
> > 			(long long)agsize, (long long)dblocks);
> > 		usage();
> > 	}
> 
> (plus removing the single-ag check)
> 
> > and set a largeish default AG size.  We might want to set a compat bit
> > so that xfs_repair won't complain about the single AG.
> > 
> > Yes, there are drawbacks, like the lack of redundant superblocks.  But
> > if growfs really runs at firstboot, then the deployed customer system
> > will likely have more than 1 AG and therefore be fine.
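To make the quoted mkfs change concrete, here is a rough sketch of the
size check gated on a gold-master switch.  The function name and the
gold_master flag are hypothetical; no such mkfs option exists today, and
the real check lives inline in mkfs and calls usage() on failure.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if mkfs should accept this AG size.
 * gold_master stands in for the proposed (not yet existing) mkfs switch
 * that permits an AG larger than the data area of the image.
 */
static bool agsize_acceptable(uint64_t agsize, uint64_t dblocks,
			      bool gold_master)
{
	if (gold_master)
		return true;		/* image will be grown at deploy time */
	return agsize <= dblocks;	/* the existing rule */
}
```

With gold_master set, an AG size far beyond the 2GB data area passes;
without it, the same geometry is rejected as it is now.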
> 
> Other drawbacks are that you've fixed the AG size, so if you don't grow
> past the AG size you picked at mkfs time, you've still got only one
> superblock in the deployed image.

Yes, that is a significant drawback. :)

> i.e. if you set it to 100G, you're OK if you're growing to 300-400G.
> If you are only growing to 50G, not so much.

Yes, though the upside of this counter-proposal is that it can be done
today with relatively few code changes.  Dave's proposal requires storage
devices and the kernel to support accelerated remapping, which is going
to take some time and conversations with vendors.

That said, I agree with Dave that his proposal probably results in
files spread more evenly around the disk.
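
To put numbers on the 100G example quoted above, here is a back of the
envelope sketch of how many AGs (and hence how many backup superblocks)
you end up with for a given final size.  The helper names and the 4k
block size are assumptions for illustration, and it ignores any
minimum-size rule for a short trailing AG.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4k filesystem blocks, purely for illustration. */
static uint64_t gib_to_blocks(uint64_t gib)
{
	return gib * (UINT64_C(1) << 30) / 4096;
}

/* Number of AGs needed to cover dblocks, rounding up. */
static uint64_t ag_count(uint64_t dblocks, uint64_t agblocks)
{
	assert(agblocks > 0);
	return (dblocks + agblocks - 1) / agblocks;
}

/* Every AG after the first carries a secondary superblock. */
static uint64_t backup_superblocks(uint64_t dblocks, uint64_t agblocks)
{
	return ag_count(dblocks, agblocks) - 1;
}
```

With a 100G AG size, growing to 300G yields three AGs and two backup
superblocks; stopping at 50G leaves one AG and no backups at all.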

But let's think about this -- would it be advantageous for a freshly
deployed system to have a lot of contiguous space at the end?

If the expand(ed) image is a root filesystem, then the existing content
isn't going to change a whole lot, right?  And if we're really launching
into the nopets era, then the system gets redeployed every quarter with
the latest OS update.

(Not that I do that; I'm still a grumpy Debian greybeard with too many
pets.)

OTOH, do you (or Dave) anticipate needing to expandfs an empty data
partition in the deployed image?  A common pattern amongst our software
is to send out a ~16G root fs image which is deployed into a VM with a
~250G boot volume and a 100TB data volume.  The firstboot process runs
growfs to expand the rootfs by another ~235G, then it formats a fresh
xfs onto the 100TB volume.

The performance of the freshly formatted data partition is most
important, and we've spent years showing that layout and performance are
better if you do the fresh format.  So I don't think we're going to go
back to expanding data partitions.

> (and vice versa - if you optimize for gaining superblocks, you have to
> pick a fairly small AG size, then run the risk of growing thousands of them)
>
> In other words, it requires choices at mkfs time, whereas Dave's proposal
> lets those choices be made per system, at "expand" time, when the desired
> final size is known.

If you only have one AG, then the agnumber segment of the FSBNO will be
zero.  IOWs, you can increase agblklog on a single-AG fs because there
are no FSBNOs that need re-encoding.  You can even decrease it, so long
as you don't go below the size of the fs.

The ability to adjust goes away as soon as you hit two AGs.

Adjusting agblklog would require some extension to the growfs ioctl.
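
A minimal sketch of why that works, assuming the usual packing of the AG
number above the low agblklog bits of the FSBNO.  The helper names here
are made up for illustration and are not the kernel's macros.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers: pack an AG number and an AG-relative block
 * number into one filesystem block number.  The AG number occupies the
 * bits above the low agblklog bits.
 */
static uint64_t fsbno_encode(uint32_t agno, uint32_t agbno,
			     unsigned int agblklog)
{
	assert(agblklog <= 31);
	assert(agbno < (UINT64_C(1) << agblklog));
	return ((uint64_t)agno << agblklog) | agbno;
}

static uint32_t fsbno_to_agno(uint64_t fsbno, unsigned int agblklog)
{
	return (uint32_t)(fsbno >> agblklog);
}

static uint32_t fsbno_to_agbno(uint64_t fsbno, unsigned int agblklog)
{
	return (uint32_t)(fsbno & ((UINT64_C(1) << agblklog) - 1));
}
```

For agno == 0 the encoded value is just agbno, so it is the same for any
agblklog large enough to hold it.  For agno >= 1 the value depends on
agblklog, which is why every FSBNO would need re-encoding once a second
AG exists.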

> (And, you start right out of the gate with poorly distributed data and inodes,
> though I'm not sure how much that'd matter in practice.)

On fast storage it probably doesn't matter.  OTOH, Dave's proposal does
mean that the log stays in the middle of the disk, which might be
advantageous if you /are/ running on spinning rust.

> (I'm not sure the ideas are even mutually exclusive; I think you could have
> a single AG image with dblocks << agblocks << 2^agblklog, and a simple
> growfs adds agblocks-sized AGs, whereas an "expand" could adjust agblocks,
> then growfs to add more?)

Yes.

> > As for validating the integrity of the GM image, well, maybe the vendor
> > should enable fsverity. ;)
> 
> And host it on ext4, LOL.

I think we can land fsverity in the same timeframe as whatever we land
on for implementing xfs_explode^Wexpandfs.  Probably sooner.

--D

> -Eric
> 



