Re: [PATCH] [RFC] xfs: filesystem expansion design documentation

On Mon, Jul 22, 2024 at 09:01:00AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> xfs-expand is an attempt to address the container/vm orchestration
> image issue where really small XFS filesystems are grown to massive
> sizes via xfs_growfs and end up with really insane, suboptimal
> geometries.
> 
> Rather than growing a filesystem by appending AGs, expanding a
> filesystem is based on allowing existing AGs to be expanded to
> maximum sizes first. If further growth is needed, then the
> traditional "append more AGs" growfs mechanism is triggered.
> 
> This document describes the structure of an XFS filesystem needed to
> achieve this expansion, as well as the design of userspace tools
> needed to make the mechanism work.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  Documentation/filesystems/xfs/index.rst       |   1 +
>  .../filesystems/xfs/xfs-expand-design.rst     | 312 ++++++++++++++++++
>  2 files changed, 313 insertions(+)
>  create mode 100644 Documentation/filesystems/xfs/xfs-expand-design.rst
> 
> diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
> index ab66c57a5d18..cb570fc886b2 100644
> --- a/Documentation/filesystems/xfs/index.rst
> +++ b/Documentation/filesystems/xfs/index.rst
> @@ -12,3 +12,4 @@ XFS Filesystem Documentation
>     xfs-maintainer-entry-profile
>     xfs-self-describing-metadata
>     xfs-online-fsck-design
> +   xfs-expand-design
> diff --git a/Documentation/filesystems/xfs/xfs-expand-design.rst b/Documentation/filesystems/xfs/xfs-expand-design.rst
> new file mode 100644
> index 000000000000..fffc0b44518d
> --- /dev/null
> +++ b/Documentation/filesystems/xfs/xfs-expand-design.rst
> @@ -0,0 +1,312 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +XFS Filesystem Expansion Design
> +===============================
> +
> +Background
> +==========
> +
> +XFS has long been able to grow the size of the filesystem dynamically whilst
> +mounted. The functionality has been used extensively over the past 3 decades
> +for managing filesystems on expandable storage arrays, but over the past decade
> +there has been significant growth in filesystem image based orchestration
> +frameworks that require expansion of the filesystem image during deployment.
> +
> +These frameworks want the initial image to be as small as possible to minimise
> +the cost of deployment, but then want that image to scale to whatever size the
> +deployment requires. This means that the base image can be as small as a few
> +hundred megabytes and be expanded on deployment to tens of terabytes.
> +
> +Growing a filesystem by 4-5 orders of magnitude is a long way outside the scope
> +of the original xfs_growfs design requirements. It was designed for users who
> +were adding physical storage to already large storage arrays; a single order of
> +magnitude in growth was considered a very large expansion.
> +
> +As a result, we have a situation where growing a filesystem works well up to a
> +certain point, yet we have orchestration frameworks that allow users to expand
> +filesystems a long way past this point without them being aware of the issues
> +it will cause them further down the track.

Ok, same growfs-on-deploy problem that we have.  Though, the minimum OCI
boot volume size is ~47GB so at least we're not going from 2G -> 200G.
Usually.

> +Scope
> +=====
> +
> +The need to expand filesystems with a geometry optimised for small storage
> +volumes onto much larger storage volumes results in a large filesystem with
> +poorly optimised geometry. Growing a small XFS filesystem by several orders of
> +magnitude results in a filesystem with many small allocation groups (AGs). This is
> +bad for allocation efficiency, contiguous free space management, allocation
> +performance as the filesystem fills, and so on. The filesystem will also end up
> +with a very small journal for the size of the filesystem which can limit the
> +metadata performance and concurrency in the filesystem drastically.
> +
> +These issues are a result of the filesystem growing algorithm. It is an
> +append-only mechanism which takes advantage of the fact we can safely initialise
> +the metadata for new AGs beyond the end of the existing filesystem without
> +impacting runtime behaviour. Those newly initialised AGs can then be enabled
> +atomically by running a single transaction to expose that newly initialised
> +space to the running filesystem.
> +
> +As a result, the growing algorithm is a fast, transparent, simple and crash-safe
> +algorithm that can be run while the filesystem is mounted. It's a very good
> +algorithm for growing a filesystem on a block device that has had new physical
> +storage appended to its LBA space.
> +
> +However, this algorithm shows its limitations when we move to system deployment
> +via filesystem image distribution. These deployments optimise the base
> +filesystem image for minimal size to minimise the time and cost of deploying
> +them to the newly provisioned system (be it VM or container). They rely on the
> +filesystem's ability to grow the filesystem to the size of the destination
> +storage during the first system bringup when they tailor the deployed filesystem
> +image for its intended purpose and identity.
> +
> +If the deployed system has substantial storage provisioned, this means the
> +filesystem image will be expanded by multiple orders of magnitude during the
> +system initialisation phase, and this is where the existing append-based growing
> +algorithm falls apart. This is the issue that this design seeks to resolve.

I very much appreciate the scope definition here.  I also very much
appreciate starting off with a design document!  Thank you.

<snip out parts I'm already familiar with>

> +Optimising Physical AG Realignment
> +==================================
> +
> +The elephant in the room at this point in time is the fact that we have to
> +physically move data around to expand AGs. While this makes AG size expansion
> +prohibitive for large filesystems, they should already have large AGs and so
> +using the existing grow mechanism will continue to be the right tool to use for
> +expanding them.
> +
> +However, for small filesystems and filesystem images in the order of hundreds of
> +MB to a few GB in size, the cost of moving data around is much more tolerable.
> +If we can optimise the IO patterns to be purely sequential, offload the movement
> +to the hardware, or even use address space manipulation APIs to minimise the
> +cost of this movement, then resizing AGs via realignment becomes even more
> +appealing.
> +
> +Realigning AGs must avoid overwriting parts of AGs that have not yet been
> +realigned. That means we can't realign the AGs from AG 1 upwards - doing so will
> +overwrite parts of AG2 before we've realigned that data. Hence realignment must
> +be done from the highest AG first, and work downwards.
> +
> +Moving the data within an AG could be optimised to be space usage aware, similar
> +to what xfs_copy does to build sparse filesystem images. However, the space
> +optimised filesystem images aren't going to have a lot of free space in them,
> +and what there is may be quite fragmented. Hence doing free space aware copying
> +of relatively full small AGs may be IOPS intensive. Given we are talking about
> +AGs in the typical size range from 64-512MB, doing a sequential copy of the
> +entire AG isn't going to take very long on any storage. If we have to do several
> +hundred seeks in that range to skip free space, then copying the free space will
> +cost less than the seeks and the partial RAID stripe writes that small IOs will
> +cause.
> +
> +Hence the simplest, sequentially optimised data moving algorithm will be:
> +
> +.. code-block:: c
> +
> +	for (agno = sb_agcount - 1; agno > 0; agno--) {
> +		src = (loff_t)agno * sb_agblocks;
> +		dst = (loff_t)agno * new_agblocks;
> +		copy_file_range(fd, &src, fd, &dst, sb_agblocks, 0);
> +	}
> +
> +This also leads to optimisation via server side or block device copy offload
> +infrastructure. Instead of streaming the data through kernel buffers, the copy
> +is handed to the server/hardware, which moves the data internally as quickly as
> +possible.
> +
> +For filesystem images held in files and, potentially, on sparse storage devices
> +like dm-thinp, we don't even need to copy the data.  We can simply insert holes
> +into the underlying mapping at the appropriate place.  For filesystem images,
> +this is:
> +
> +.. code-block:: c
> +
> +	len = new_agblocks - sb_agblocks;
> +	for (agno = sb_agcount - 1; agno > 0; agno--) {
> +		src = agno * sb_agblocks;
> +		fallocate(fd, FALLOC_FL_INSERT_RANGE, src, len);
> +	}
> +
> +Then the filesystem image can be copied to the destination block device in an
> +efficient manner (i.e. skipping holes in the image file).

Does dm-thinp support insert range?  In the worst case (copy_file_range,
block device doesn't support xcopy) this results in a pagecache copy of
nearly all of the filesystem, doesn't it?

What about the log?  If sb_agblocks increases, that can cause
transaction reservations to increase, which also increases the minimum
log size.  If mkfs is careful, then I suppose xfs_expand could move the
log and make it bigger?  Or does mkfs create a log as if sb_agblocks
were 1TB, which will make the deployment image bigger?

Also, perhaps xfs_expand is a good opportunity to stamp a new uuid into
the superblock and set the metauuid bit?

I think the biggest difficulty for us (OCI) is that our block storage is
some sort of software defined storage system that exposes iscsi and
virtio-scsi endpoints.  For this to work, we'd have to have an
INSERT_RANGE SCSI command that the VM could send to the target and have
the device resize.  Does that exist today?

> +Hence there are several different realignment strategies that can be used to
> +optimise the expansion of the filesystem. The optimal strategy will ultimately
> +depend on how the orchestration software sets up the filesystem for
> +configuration at first boot. The userspace xfs expansion tool should be able to
> +support all these mechanisms directly so that higher level infrastructure
> +can simply select the option that best suits the installation being performed.
> +
> +
> +Limitations
> +===========
> +
> +This document describes an offline mechanism for expanding the filesystem
> +geometry. It doesn't add new AGs, just expands the existing AGs. If the
> +filesystem needs to be made larger than maximally sized AGs can address, then
> +a subsequent online xfs_growfs operation is still required.
> +
> +For container/vm orchestration software, this isn't a huge issue as they
> +generally grow the image from within the initramfs context on first boot. That
> +is currently a "mount; xfs_growfs" operation pair; adding expansion to this
> +would simply require adding expansion before the mount. i.e. first boot becomes
> +a "xfs_expand; mount; xfs_growfs" operation. Depending on the eventual size of
> +the target filesystem, the xfs_growfs operation may be a no-op.

I don't know about your cloud, but ours seems to optimize vm deploy
times very heavily.  Right now their firstboot payload calls xfs_admin
to change the fs uuid, mounts the fs, and then growfs's it into the
container.

Adding another pre-mount firstboot program (and one that potentially
might do a lot of IO) isn't going to be popular with them.  The vanilla
OL8 images that you can deploy from seem to consume ~12GB at first boot,
and that's before installing anything else.  Large Well Known Database
Products use quite a bit more... though at least those appliances format
a /data partition at deploy time and leave the rootfs alone.

> +Whether expansion can be done online is an open question. AG expansion changes
> +fundamental constants that are calculated at mount time (e.g. maximum AG btree
> +heights), and so an online expand would need to recalculate many internal
> +constants that are used throughout the codebase. This seems like a complex
> +problem to solve and isn't really necessary for the use case we need to address,
> +so online expansion remains a potential future enhancement that requires a lot
> +more thought.

<nod> There are a lot of moving pieces, online explode sounds hard.

--D

> -- 
> 2.45.1
> 
> 



