On Tue, Aug 27, 2024 at 10:53:16AM -0400, Brian Foster wrote:
> On Mon, Aug 26, 2024 at 02:17:20PM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 26, 2024 at 12:08:13PM -0400, Brian Foster wrote:
> > > On Thu, Aug 22, 2024 at 06:15:02PM -0700, Darrick J. Wong wrote:
> > > > On Mon, Aug 12, 2024 at 09:56:52AM -0400, Brian Foster wrote:
> > > > > Tweak a few checks to facilitate experimentation with an agcount=1 filesystem format with a larger agsize than the filesystem data size. The purpose of this is to POC a filesystem image mode format for XFS that better supports the typical cloud filesystem image deployment use case, where a very small fs image is created and then immediately grown orders of magnitude in size once deployed to container environments. The large grow size delta produces filesystems with excessive AG counts, which leads to various other functional problems that derive from this sort of pathological geometry.
> > > > >
> > > > > To experiment with this patch, format a small fs with something like the following:
> > > > >
> > > > >   mkfs.xfs -f -lsize=64m -dsize=512m,agcount=1,agsize=8g <imgfile>
> > > > >
> > > > > Increase the underlying image file size, mount and grow. The filesystem will grow according to the format-time AG size, as if the AG were a typical runt AG on a traditional multi-AG fs.
> > > > >
> > > > > This means that the filesystem remains at an AG count of 1 until the fs size grows beyond the AG size. Since the typical deployment workflow is an immediate very small -> very large, one-time grow, the image fs can set a reasonable default or configurable AG size (based on user input) that ensures deployed filesystems end up in a generally supportable geometry (i.e. with multiple AGs for superblock redundancy) before seeing production workloads.
> > > > >
> > > > > Further optional changes are possible on the kernel side to help provide some simple guardrails against misuse of this mechanism. For example, the kernel could do anything from warning/failing to restricting runtime functionality on an insufficient grow. The image mode itself could set a backwards-incompat feature bit that requires a mount option to enable full functionality (with the exception of growfs). More discussion is required to determine whether this provides a usable solution for the common cloud workflows that exhibit this problem, and what the right interface and/or limitations are to ensure it is used correctly.
> > > > >
> > > > > Not-Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > > ---
> > > > >
> > > > > Hi all,
> > > > >
> > > > > This is a followup to the idea Darrick brought up in the expansion discussion here [1]. I poked through the code a bit and found it somewhat amusing how little stood in the way of experimenting with this, so I threw it against an fstests run over the weekend. I see around 10 test failures, most of which look like simple failures related to either not expecting agcount == 1 filesystems or my generally hacky/experimental changes. A couple require a bit more investigation to properly characterize before I would consider this fully sane.
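For reference, with the patch applied, the experiment described above boils down to a command sequence along these lines; the image path, mount point and final size are placeholders, and the resulting geometry is what the description predicts rather than captured output:

  # truncate -s 512m img.xfs
  # mkfs.xfs -f -lsize=64m -dsize=512m,agcount=1,agsize=8g img.xfs
  # truncate -s 32g img.xfs       # the "deployment" resize
  # mount -o loop img.xfs /mnt
  # xfs_growfs /mnt
  # xfs_info /mnt                 # expect agcount=4 with 8g AGs

Growing the image only to (or below) the 8g format-time agsize should instead leave the fs at agcount=1.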
> > > > >
> > > > > I'm posting this separately from the expansion discussion to hopefully avoid further conflating the two. My current sense is that if this turns out to be a fundamentally workable approach, the mkfs interface would look more like 'mkfs --image-size 40g ...' and the kernel side may grow some optional guardrail logic as mentioned above and in the previous discussion here [2], but others might have different ideas.
> > > > >
> > > > > Darrick, you originally raised this idea and then Eric brought up some legitimate technical concerns in the expansion design thread. I'm curious if either of you have any further thoughts/ideas on this.
> > > >
> > > > Well, it /does/ seem to work as intended -- you can create a filesystem with a 100G agsize even on an 8G filesystem. We've been over the drawbacks of this approach vs. Dave's (i.e. explodefs is really only a one-time event) so I won't rehash that here.
> > >
> > > Yeah.. I view this as kind of an incremental step in the broader expandfs topic to opportunistically solve a real (and lingering) problem with a minimal amount of complexity.
> > >
> > > > But it does solve the much more constrained problem of disk images deployed from a gold master image undergoing a massive onetime expansion during firstboot. The lack of secondary superblocks and xfs_repair warnings are concerning, but to that, I have these things to say:
> > > >
> > > > a. If it's a gold master image from a trusted distributor, they should be installing verity information on all the read-mostly files so that one can validate the vendor's signatures. The fs metadata being corrupt is moot if fsverity fails.
> > >
> > > This crossed my mind at one point as well. I thought these sorts of image repo setups at least had per-image checksum files or whatever that would hopefully somewhat mitigate this sort of problem.
> > >
> > > > b. We can turn off the xfs_repair hostility to single-AG filesystems so that people who /do/ want to fsck their goldmaster images don't get annoying warnings. We /could/ add a new compat flag to say it's ok to be single-AG, or we could hang that on the fsverity sb flag.
> > > >
> > > > c. Over on our end, we know the minimum cloud boot volume size -- it's 47GB. My canned OL9 image seems to come with a 36GB root filesystem. My guess is that most of our users will say yes to the instance creator asking them if they want more space (a princely 67GB minimum if you're stingy like I am!) so I think one could tweak the image creator to spit out a 36GB AG XFS, knowing that only the Ferengi users *won't* end up with a double-AG rootfs.
> > >
> > > Do you mean that the image size is 37GB, or it's grown to that size at some point during the deployment? If the latter, any idea on the ballpark size of a typical install image before it's deployed and grown?
> >
> > I think the actual gold master images have ~37G root filesystems. I didn't choose to expand my stingy OL8 image at firstboot, and it has 4 AGs like you'd expect. Most of the time the deployment control panel tries to convince you to upgrade to 67G, if not ~200G.
>
> Interesting, that seems notably larger than I'd expect. Wouldn't this concern you if you just ended up with a 37GB single AG fs?
Not that much if I had fsverity to ensure that the deployed files hadn't been tampered with.

> Do you know how much of that might be free space on the image side?

$ xfs_info /
meta-data=/dev/mapper/ocivolume-root isize=512    agcount=4, agsize=2324736 blks
         =                           sectsz=4096  attr=2, projid32bit=1
         =                           crc=1        finobt=1, sparse=1, rmapbt=0
         =                           reflink=1
data     =                           bsize=4096   blocks=9298944, imaxpct=25
         =                           sunit=0      swidth=0 blks
naming   =version 2                  bsize=4096   ascii-ci=0, ftype=1
log      =internal log               bsize=4096   blocks=4540, version=2
         =                           sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                       extsz=4096   blocks=0, rtextents=0

$ df /
Filesystem                 1K-blocks     Used Available Use% Mounted on
/dev/mapper/ocivolume-root  37177616 14264712  22912904  39% /

So I think the 9G AGs aren't that much of a problem.  OTOH

> I.e., I'm wondering if the size is that large because you really need the space for data, or rather as a workaround where it could otherwise be formatted much smaller if the deployment side could grow it out to well beyond 37GB without having issues.

It depends on the workload -- vanilla OL images only give you a rootfs and assume that you'll create a separate higher iops block volume for actual data storage. $database automates provisioning and autotuning the separate block device for you.

Another problem is that we don't necessarily know how many CPUs the system will have. I've seen a system with ~200 cores all pounding on a fs with four AGs (oops! lots of AGF contention!) or 10000 AGs (all cpus pounding on the radix tree).

IOWs, the deploy image could be a lot smaller, but apparently they're 48GB in size. My guess is that these are sparse images, since the billable size is about a tenth of that.

> With a 37GB 4xAG fs, that's still closer to ~9GB AGs, which doesn't seem all that bad from a grow perspective. IIRC the insane filesystems I've seen tend to have much smaller (< 1GB maybe?) AG sizes. Have you seen grow issues with these ~9GB AG size fs'?

Not on our homegrown images. The problems crop up when people try to import images from outside $cloud and generate those minified images with 1G AGs. Unfortunately some of those people aren't even outside the org, but I'll get to them... ;)

> > > > IOWs, this more or less works for the usecase that generates the most internal complaints about xfs sucking. As I said earlier, I've not heard about people wanting to 10000x growfs after firstboot; usually they only grow incrementally as they decide that extra $$$$$ is worth another 10GB.
> > >
> > > This is my understanding as well.
> > >
> > > > I think this is worth prototyping a bit more. Do you?
> > >
> > > I do as well..
> > >
> > > Eric and I had a random discussion about this the other week and one of the concerns he brought up is the risk that some deployment might grow and still end up on a single AG fs, because maybe the deployment just didn't provide sufficient storage as expected for the image. This was part of the reason I brought up the whole image mode feature bit in the expandfs discussion, because we could always do something to enforce that fs functionality is hobbled until sufficiently grown to at least 2xAGs and thus clear the image mode feature bit.
> > >
> > > That said, another thing I played around a bit with after posting this was the ability to just grow the AG size at runtime. I.e., for a single AG fs, it's fairly easy to just change the AG size at growfs time.
> > > I currently have that prototyped by using a separate transaction to avoid some verifier failure madness I didn't want to sift through to prove the concept, but this could also be prototyped as a separate, optional GROWFS_AGSIZE ioctl without much additional fuss.
> >
> > What if we rev the growfsdata ioctl to allow the user to specify an agsize? Userspace (i.e. xfs_growfs) can then try to set the AG size, or if that fails, fall back to the current growfsdata.
> >
>
> Either way seems reasonable. I think we're mostly talking about the same thing. For a functional prototype I'd be more concerned with just showing whether it works and then, if so, whether it's useful/flexible enough for the intended use cases.

Agreed!

> Brian
>
> > > So for example, suppose that we just created an image file fs with something like "mkfs.xfs -d agcount=1 -lsize=64m <file>," where the superblock AG size is not actually larger than the single AG. Then we update xfs_growfs to look at the current block device size, make a decision on an ideal AG size from that, and run an ioctl(GROWFS_AGSIZE, agsize) followed by the typical ioctl(GROWFS_DATA, bdev_size). The GROWFS_AGSIZE can either update the sb AG size as appropriate before the grow, or just fail and we fall back to default growfs behavior.
> > >
> > > That would allow a bit more dynamic runtime behavior and maybe ensure we end up with a more typical 4xAG fs in most cases. Dave had pointed out that there might be concerns around AG sizing logic, but this would exist in userspace and could be lifted straight from mkfs if need be. TBH, at first look it seemed to be mostly alignment (i.e. stripe unit, etc.) related stuff, so I'd probably just punt and let GROWFS_AGSIZE fail if any such unsupported fields are set. Worst case we just fall back to existing behavior.
> >
> > I wouldn't have a problem with hoisting that to libxfs and sharing with the kernel.
> >
> > > Thoughts on that? If that sounds interesting enough I can follow up with a v2 prototype along those lines. Appreciate the discussion.
> > >
> > > Brian
> > >
> > > P.S. Semi-random thought after writing up the above.. it occurred to me that we already have a provisional shrink mechanism, so in theory perhaps we could also do something like "mkfs.xfs --image-mode <file>" where we just hardcoded agcount=1,agsize=1TB, and then let userspace shrink to the appropriate agsize (the delta < 0 && agsize > fssize case) and then grow out normally from there. Maybe that's just another way of doing the same thing, but something else to think about at least..
> >
> > <shrug> I'd rather do the ag resizing explicitly than implicitly if someone says to shrink (or grow) and the agcount is zero and the magic compat flag is set.
> >
> > --D
> >
> > > > --D
> > > >
> > > > > Brian
> > > > >
> > > > > [1] https://lore.kernel.org/linux-xfs/20240721230100.4159699-1-david@xxxxxxxxxxxxx/
> > > > > [2] https://lore.kernel.org/linux-xfs/ZqzMay58f0SvdWxV@bfoster/
> > > > >
> > > > >  mkfs/xfs_mkfs.c | 11 +++++------
> > > > >  1 file changed, 5 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> > > > > index 6d2469c3c..50a874a03 100644
> > > > > --- a/mkfs/xfs_mkfs.c
> > > > > +++ b/mkfs/xfs_mkfs.c
> > > > > @@ -325,8 +325,7 @@ static struct opt_params dopts = {
> > > > >  	},
> > > > >  	.subopt_params = {
> > > > >  		{ .index = D_AGCOUNT,
> > > > > -		  .conflicts = { { &dopts, D_AGSIZE },
> > > > > -				 { &dopts, D_CONCURRENCY },
> > > > > +		  .conflicts = { { &dopts, D_CONCURRENCY },
> > > > >  				 { NULL, LAST_CONFLICT } },
> > > > >  		  .minval = 1,
> > > > >  		  .maxval = XFS_MAX_AGNUMBER,
> > > > > @@ -368,8 +367,7 @@ static struct opt_params dopts = {
> > > > >  		  .defaultval = SUBOPT_NEEDS_VAL,
> > > > >  		},
> > > > >  		{ .index = D_AGSIZE,
> > > > > -		  .conflicts = { { &dopts, D_AGCOUNT },
> > > > > -				 { &dopts, D_CONCURRENCY },
> > > > > +		  .conflicts = { { &dopts, D_CONCURRENCY },
> > > > >  				 { NULL, LAST_CONFLICT } },
> > > > >  		  .convert = true,
> > > > >  		  .minval = XFS_AG_MIN_BYTES,
> > > > > @@ -1233,7 +1231,7 @@ validate_ag_geometry(
> > > > >  		usage();
> > > > >  	}
> > > > >
> > > > > -	if (agsize > dblocks) {
> > > > > +	if (agsize > dblocks && agcount != 1) {
> > > > >  		fprintf(stderr,
> > > > >  	_("agsize (%lld blocks) too big, data area is %lld blocks\n"),
> > > > >  			(long long)agsize, (long long)dblocks);
> > > > > @@ -2703,7 +2701,8 @@ validate_supported(
> > > > >  	 * Filesystems should not have fewer than two AGs, because we need to
> > > > >  	 * have redundant superblocks.
> > > > >  	 */
> > > > > -	if (mp->m_sb.sb_agcount < 2) {
> > > > > +	if (mp->m_sb.sb_agcount < 2 &&
> > > > > +	    mp->m_sb.sb_agblocks <= mp->m_sb.sb_dblocks) {
> > > > >  		fprintf(stderr,
> > > > >  _("Filesystem must have at least 2 superblocks for redundancy!\n"));
> > > > >  		usage();
> > > > > --
> > > > > 2.45.0
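To make the growfs side of the discussion above a little more concrete, here is a rough userspace sketch of the flow Brian describes (pick an AG size, try to set it, then grow). It is only a sketch, assuming the xfsprogs headers: XFS_IOC_FSGEOMETRY and XFS_IOC_FSGROWFSDATA are the existing ioctls, while the GROWFS_AGSIZE step does not exist yet and is represented by a stub; the function names and fallback policy are placeholders, not a proposed interface.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* XFS_IOC_FSGEOMETRY, XFS_IOC_FSGROWFSDATA */

/*
 * Stand-in for the proposed GROWFS_AGSIZE primitive.  No such ioctl
 * exists today, so this always fails and the caller falls back to a
 * plain grow using the format-time agsize.
 */
static int
try_set_agsize(
	int		fd,
	uint64_t	new_agblocks)
{
	(void)fd;
	(void)new_agblocks;
	errno = EOPNOTSUPP;
	return -1;
}

/*
 * Grow the data device to new_dblocks (fs blocks), attempting to widen
 * the AG size first if the fs still has only one AG.
 */
static int
grow_data(
	int		fd,		/* fd on the mounted fs */
	uint64_t	new_dblocks,
	uint64_t	new_agblocks)
{
	struct xfs_fsop_geom	geo;
	struct xfs_growfs_data	in;

	if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0)
		return -1;

	/* Only a single-AG fs would have its AG size rewritten. */
	if (geo.agcount == 1 && try_set_agsize(fd, new_agblocks) < 0)
		fprintf(stderr,
			"agsize update unsupported, growing with agsize=%u\n",
			geo.agblocks);

	in.newblocks = new_dblocks;
	in.imaxpct = geo.imaxpct;	/* keep the current imaxpct */
	return ioctl(fd, XFS_IOC_FSGROWFSDATA, &in);
}

In the flow described above, xfs_growfs would presumably derive new_agblocks from the mkfs AG sizing heuristics and new_dblocks from the (possibly resized) block device before calling something like grow_data().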