Re: [TOPIC LPC] Filesystem Shrink

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 9 Sep 2021 12:34:58 +1000

On Wed, Sep 08, 2021 at 08:31:24PM +0300, Amir Goldstein wrote:
> On Wed, Sep 8, 2021 at 10:51 AM Allison Henderson
> <allison.henderson@xxxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > Earlier this month I had sent out a lpc micro conference proposal for
> > file system shrink.  It sounds like the talk is of interest, but folks
> > recommended I forward the discussion to fsdevel for more feed back.
> > Below is the abstract for the talk:
> >
> >
> > File system shrink allows a file system to be reduced in size by some
> > specified size blocks as long as the file system has enough unallocated
> > space to do so.  This operation is currently unsupported in xfs.  Though
> > a file system can be backed up and recreated in smaller sizes, this is
> > not functionally the same as an in place resize.  Implementing this
> > feature is costly in terms of developer time and resources, so it is
> > important to consider the motivations to implement this feature.  This
> > talk would aim to discuss any user stories for this feature.  What are
> > the possible cases for a user needing to shrink the file system after
> > creation, and by how much?  Can these requirements be satisfied with a
> > simpler mkfs option to backup an existing file system into a new but
> > smaller filesystem?

That has been the traditional answer - create a new block device,
mkfs, xfsdump/xfsrestore and off you go. It has the benefit of only
needing to read/write data once, but has the downside that it does
not keep extent sharing information (reflink/dedupe) intact.

The problem I hear about most regularly these days is management of
cloudy stuff, where there aren't new block devices and/or storage
space available, so this mechanism is not really an option. The
ideal solution for these environments is sparse storage plus fstrim
to release the space the filesystem isn't using, which doesn't
require shrink at all. That seems difficult in practice, though,
because cloudy management interfaces don't seem to have a concept of
storage consumption vs assigned filesystem capacity....

> > In the cases of creating a rootfs, will a protofile
> > suffice?  If the shrink feature is needed, we should further discuss the
> > APIs that users would need.
> >
> > Beyond the user stories, it is also worth discussing implementation
> > challenges.  Reflink and parent pointers can assist in facilitating

"Reflink, reverse mapping and parent pointers"....

> > shrink operations, but is it reasonable to make them requirements for
> > shrink?

If the filesystem metadata is larger than what can be cached in
memory, then the only way of doing a performant shrink operation is
to have rmap (for GETFSMAP queries) and parent pointers (for path
name reconstruction).

Indeed, GETFSMAP is the only way we can find owners of all the
metadata in the AG that needs to be moved as some per-inode metadata
can be otherwise invisible to userspace (e.g. BMBT blocks).  Finding
such metadata without rmapbt support requires GETFSMAP to implement
a brute-force in-kernel used space scanner to identify such blocks
to report their owners. That's a lot of new code just to replicate
what rmapbt already does...

Really, though, userspace should just rely on having GETFSMAP tell
it everything that needs to move. Support for filesystems that
don't have rmapbt, and so need dumb, brute force searches to
provide the information, can be added in future as it isn't
actually required to implement a working shrink algorithm.
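
As a rough sketch of the userspace side (not real shrink code):
assume the GETFSMAP records have already been fetched and reduced to
(physical byte offset, length, owner) tuples. The Mapping type, the
owner strings and the function name are made up for illustration;
the real interface is the FS_IOC_GETFSMAP ioctl and struct fsmap.
Working out what has to move is then just an overlap test against
the new end of the filesystem:

```python
from typing import NamedTuple

# Hypothetical, simplified stand-in for one GETFSMAP record.
class Mapping(NamedTuple):
    physical: int   # byte offset on the device
    length: int     # length in bytes
    owner: str      # e.g. "inode 128", "rmapbt", or FREE

FREE = "free"       # assumed marker for free space records

def mappings_to_move(mappings, new_size_bytes):
    """Return every in-use mapping that extends past the new end of the fs."""
    return [
        m for m in mappings
        if m.owner != FREE and m.physical + m.length > new_size_bytes
    ]
```

Anything this returns is something the mover has to relocate before
the space can be removed, whether or not userspace could have found
it by walking the directory tree.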

As for reflink, it's been the default for a few years now, so making
shrink require it so that it can do atomic data movement in
userspace without any additional kernel support requirements doesn't
seem particularly bothersome to me...

> > Gathering feedback and addressing these challenges will help
> > guide future development efforts for this feature.
> >
> >
> > Comments and feedback are appreciated!
> > Thanks!
> >
> 
> Hi Allison,
> 
> That sounds like an interesting topic for discussion.
> It reminds me of a cool proposal that Dave posted a while back [1]
> about limiting the thin provisioned disk usage of xfs.

That's a different kettle of fish altogether - it allows for the
filesystem to grow and shrink logically, not physically, and has a
fundamental requirement for a sparse block device to decouple the
filesystem LBAs from the physical storage LBAs. In the extreme,
the filesystem still needs a physical shrink operation if the user
requires the sparse device size to change....

> I imagine that online shrinking would involve limiting new block
> allocations to a certain blockdev offset (or AG) am I right?

Sort of.

We do need to limit new _user_ allocations (data and metadata) in
AGs that we are going to shrink away. However, we still need to be
able to atomically move data and metadata out of those AGs, and that
may require allocation of new AG internal metadata to make it
possible. e.g. modifying freespace, rmaps, refcounts, etc can all
require allocation of new btree blocks in the offline AG.
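
To make the distinction concrete, here is a toy model of that policy.
The allocation "kinds" and the function name are invented for the
example and do not correspond to any kernel interface:

```python
# Hypothetical allocation classes, for illustration only.
AG_INTERNAL_KINDS = {"freespace_btree", "rmap_btree", "refcount_btree"}

def allocation_allowed(ag, kind, offline_ags):
    """Model of the rule: offline AGs refuse new user allocations but
    still allow the AG internal metadata needed to move things out."""
    if ag not in offline_ags:
        return True                   # normal AG, no restriction
    return kind in AG_INTERNAL_KINDS  # offline AG: internal metadata only
```

So a file data allocation aimed at an offline AG gets pushed
elsewhere, while a btree split caused by freeing an extent in that
same AG still works.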

> I wonder, how is statfs() going to present the available/free blocks
> information in that state?

No matter what we do, it will be "wrong" for someone.

In the current design, visible filesystem size does not change until
the final stage where the physical space is atomically removed via a
recoverable transaction.  There are several reasons for this, not
the least of which is that turning off allocation is intended to be
used by more than just shrink. e.g. an AG could be offline for
repair, etc.
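
For reference, this is about all an application gets to see through
statfs(). A small sketch using Python's os.statvfs(); the helper name
is mine and any mounted path will do:

```python
import os

def visible_space(path):
    """Report the space figures statfs()/statvfs() exposes for a mount."""
    st = os.statvfs(path)
    return {
        "total_bytes": st.f_blocks * st.f_frsize,  # visible filesystem size
        "free_bytes": st.f_bfree * st.f_frsize,    # free, including reserved
        "avail_bytes": st.f_bavail * st.f_frsize,  # free to unprivileged users
    }
```

In the design described above, none of these figures would reflect
the shrink until the final transaction commits.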

As it is, ENOSPC can already happen when there is heaps of free
space available in the filesystem.  e.g. reflink copies can fail with
ENOSPC because there isn't space in the AG for the new AG internal
refcount or rmap records to be recorded in the relevant AG btrees.

Indeed, the only way we are going to know if shrink cannot move all the
data out of the AGs we want to shrink away is to have all the other
AGs hit "AG full and no other allocation candidate" ENOSPC
conditions during data movement.

e.g. we start a shrink by checking if there's space available in the
lower AGs for all the data that needs to be moved (via
XFS_IOC_AG_GEOMETRY) so we know it should succeed. But if the user
starts consuming space after this check, there's every chance that
the shrink is going to fail because there is no longer enough space
available in the lower AGs to move all the data.
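
A minimal sketch of that up-front check, assuming the per-AG used and
free block counts have already been collected (in practice via
XFS_IOC_AG_GEOMETRY; here they are just lists indexed by AG number,
and the function name is hypothetical):

```python
def shrink_should_fit(ag_used, ag_free, first_removed_ag):
    """True if the used blocks in the AGs being removed fit into the
    free blocks of the AGs that remain. Only a snapshot: the answer
    can be stale as soon as anyone allocates space."""
    blocks_to_move = sum(ag_used[first_removed_ag:])
    room_in_lower_ags = sum(ag_free[:first_removed_ag])
    return blocks_to_move <= room_in_lower_ags
```

With 80 blocks to move and 90 free in the lower AGs the check passes,
but a user writing another 20 blocks into the lower AGs afterwards is
enough to make the same shrink fail part way through. It also ignores
any extra space the move itself consumes.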

Changing what statfs() reports isn't going to fix/prevent problems
like this...

> If high blocks are presented as free then users may encounter
> surprising ENOSPC.
> If all high blocks are presented as used, then removing files
> in high space, won't free up available disk space.

Yup. And if you present them as used, the userspace data movement
algorithm may not be able to make progress even when there is still
internal space available in the remaining AGs that could be used.

> There is an option to reduce total size and present the high blocks
> as over committed disk usage, but that is going to be weird...

Not to mention complex to account for and incredibly fragile to
maintain.

Of course, I haven't really even mentioned shrink failure semantics.
If the data movement fails because of a transient ENOSPC condition,
should the applications even be aware that a shrink was in progress?

> Have you spent any time considering these user visible
> implications?

An awful lot, in fact. Physically shrinking an active filesystem
cannot be done instantly, and so there are always going to be
situations where the behaviour we choose is going to be the wrong
choice for some user. Remember that the data movement part of a
physical shrink operation could take hours, days or even weeks to
complete; this is the dominating user visible implication of
physical shrinking...

The likelihood of a physical shrink failing is quite high - data
movement to empty physical space is not guaranteed to succeed.
There's all sorts of complexity around moving shared data extents
(reflink/deduped copies), and doing so can actually increase
filesystem space usage during a shrink (transiently as well as
permanently).
That can result in a shrink failing even though there's technically
enough free space in the lower AGs to complete the shrink...
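
One way to see the permanent part, assuming a naive mover that copies
each owner's view of a shared extent separately and never re-shares
the result (an assumption for illustration, not a description of any
existing tool; the function name is made up):

```python
def blocks_needed_to_move(extent_blocks, owners, preserve_sharing):
    """Space needed in the lower AGs to relocate one shared extent.

    preserve_sharing=True models a mover that writes the data once and
    points every owner at the new copy; False models per-owner copies
    that break sharing."""
    if preserve_sharing:
        return extent_blocks
    return extent_blocks * owners
```

A 1000 block extent shared by four files needs 1000 blocks in the
first case and 4000 in the second, so a free space check based on
physical usage alone underestimates what a sharing-unaware move
requires.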

So when you take into account the likelihood of failure, transient
ENOSPC conditions during a shrink, the heavy impact on performance
the data movement will have, the difficulty in doing atomic
relocation on actively modified files and directories, etc, the
answer to all these problems is "don't run shrink on production
filesystems". i.e. "Online" only means the filesystem is mounted
while the shrink runs, not that it's something you run in
production...

With that in mind, worrying about how applications react to shrink
changing the allocation patterns and the amount of space available
is pretty much the least of my concerns at this point in time...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx