On Tue, Dec 17, 2024 at 05:06:55AM +0100, Christoph Hellwig wrote:
> On Mon, Dec 16, 2024 at 05:27:53PM -0800, Darrick J. Wong wrote:
> > > lot more work to move them and generates more metadata vs moving unshared
> > > blocks. That being said it at least handles reflinks, which this currently
> > > doesn't. I'll take a look at it for ideas on implementing shared block
> > > support for the GC code.
> >
> > Hrmm. For defragmenting free space, I thought it was best to move the
> > most highly shared extents first to increase the likelihood that the new
> > space allocation would be contiguous and not contribute to bmbt
> > expansion.
>
> How does moving a highly shared extent vs a less shared extent help
> with keeping free space contiguous? What matters for that in a non-zoned
> interface is that the extent is between two free space or soon to be
> free space extents, but the amount of sharing shouldn't really matter.

It might help if I mention that the clearspace code I wrote is given a
range of device daddrs to evacuate, so it tries to make *that range*
contiguous and free, possibly at the expense of other parts of the
filesystem. Initially I wrote it to support evacuating near EOFS so
that you could shrink the filesystem, but Ted and others mentioned that
it can be more generally useful to recover after some database
compresses its table files and fragments the free space.

So I'm not defragmenting in the xfs_fsr sense, and maybe I should just
call it free space evacuation.

If the daddr range you want to evac contains a 200MB extent shared
1000 times and 10,000 fragmented 8k blocks, you might want to move the
200MB extent (and all 1000 mappings) first to try to keep that
contiguous. If moving the 8k fragments fails, at least you cleared out
200MB of it.

> > For zone gc we have to clear out the whole rtgroup and we don't have a
> > /lot/ of control so maybe that matters less. OTOH we know how much
> > space we can get out of the zone, so
>
> But yes, independent of the above question, freespace for the zone
> allocator is always very contiguous.

> > <nod> I'd definitely give the in-kernel gc a means to stop the userspace
> > gc if the zone runs out of space and it clearly isn't making progress.
> > The tricky part is how do we give the userspace gc one of the "gc
> > zones"?
>
> Yes. And how do we kill it when it doesn't act in time? How do we
> even ensure it acts in time? How do we deal with userspace GC not
> running or getting killed?
>
> I have to say all my experiments with user space call-ups for activity
> triggered by kernel fast path and memory reclaim activity have been
> overwhelmingly negative. I won't NAK any of it if someone wants to
> experiment, but I don't plan to spend my time on it.

<nod> That was mostly built on the speculation that on a device with
130,000 zones, there probably aren't so many writer threads that we
couldn't add another gc process to clean out a few zones. But that's
all highly speculative food for the roadmap.
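
Going back to the evacuation ordering question for a minute, here's a
rough sketch of the policy I mean. To be clear, none of this is the
actual clearspace code -- evac_candidate and move_extent() are made-up
stand-ins purely to illustrate "most heavily shared extents first,
little fragments last":

/*
 * Rough sketch of that ordering policy, nothing more -- none of this is
 * the actual clearspace code.  evac_candidate and move_extent() are
 * made-up stand-ins just to show "most heavily shared extents first,
 * little fragments last".
 */
#include <stdio.h>
#include <stdlib.h>

struct evac_candidate {
	unsigned long long	daddr;		/* extent start, in daddrs */
	unsigned long long	len;		/* extent length, in daddrs */
	unsigned long long	nr_owners;	/* mappings sharing this extent */
};

/* Stand-in for relocating one extent and updating all of its mappings. */
static int move_extent(const struct evac_candidate *c)
{
	printf("move daddr %llu len %llu owners %llu\n",
			c->daddr, c->len, c->nr_owners);
	return 0;
}

/* Most heavily shared (then largest) extents sort to the front. */
static int cmp_most_shared_first(const void *a, const void *b)
{
	const struct evac_candidate *ca = a, *cb = b;

	if (ca->nr_owners != cb->nr_owners)
		return cb->nr_owners > ca->nr_owners ? 1 : -1;
	if (ca->len != cb->len)
		return cb->len > ca->len ? 1 : -1;
	return 0;
}

static void evacuate_range(struct evac_candidate *c, size_t nr)
{
	size_t i;

	qsort(c, nr, sizeof(*c), cmp_most_shared_first);

	/*
	 * Move the big shared extents before the little fragments; if the
	 * fragments fail, most of the target range is already clear.
	 */
	for (i = 0; i < nr; i++)
		if (move_extent(&c[i]) < 0)
			fprintf(stderr, "skipping daddr %llu\n", c[i].daddr);
}

int main(void)
{
	struct evac_candidate range[] = {
		{ .daddr = 8192,   .len = 409600, .nr_owners = 1000 },
		{ .daddr = 430080, .len = 16,     .nr_owners = 1 },
	};

	evacuate_range(range, sizeof(range) / sizeof(range[0]));
	return 0;
}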

> > Ah, right! Would you mind putting that in a comment somewhere?
>
> Will do.
>
> > > 1 device XFS configurations we'll hit a metadata write error sooner
> > > or later and shut the file system down, but with an external RT device
> > > we don't and basically never shut down which is rather problematic.
> > > So I'm tempted to add code to (at least optionally) shut down after
> > > data write errors.
> >
> > It would be kinda nice if we could report write(back) errors via
> > fanotify, but that's buried so deep in the filesystems that it seems
> > tricky.
>
> Reporting that is more useful than just the shutdown would be.
> How we get it on the other hand might be a bit hard.

Yeah. The experimental healthmon code further down in my dev tree
explores that a little, but we'll see how everyone reacts to it. ;)

Also: while I was poking around with Felipe's ficlone/swapon test it
occurred to me -- does freezing the fs actually get the zonegc kthread
to finish up whatever work is in-flight at that moment?

--D
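
P.S. On the fanotify thought above: FAN_FS_ERROR (added in 5.16)
already delivers superblock-scoped error events to userspace, so
whatever the kernel plumbing for writeback errors would end up looking
like, the consumer end would be roughly the below. Rough sketch only,
untested; "/mnt" is just a placeholder mount point and you need 5.16+
uapi headers for the error info record:

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

int main(void)
{
	char buf[4096];
	ssize_t len;
	int fd;

	/* FAN_FS_ERROR requires an FID-reporting notification group... */
	fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);
	if (fd < 0)
		err(1, "fanotify_init");

	/* ...and a filesystem-scope mark; "/mnt" is just a placeholder. */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_FS_ERROR, AT_FDCWD, "/mnt") < 0)
		err(1, "fanotify_mark /mnt");

	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		struct fanotify_event_metadata *m = (void *)buf;

		for (; FAN_EVENT_OK(m, len); m = FAN_EVENT_NEXT(m, len)) {
			char *p = (char *)(m + 1);
			char *end = (char *)m + m->event_len;

			if (!(m->mask & FAN_FS_ERROR))
				continue;

			/* walk the info records, print the error one */
			while (p < end) {
				struct fanotify_event_info_header *h =
							(void *)p;

				if (h->info_type ==
						FAN_EVENT_INFO_TYPE_ERROR) {
					struct fanotify_event_info_error *e =
							(void *)h;

					printf("fs error %d, %u so far\n",
							e->error,
							e->error_count);
				}
				p += h->len;
			}
		}
	}
	return 0;
}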