Re: Highly reflinked and fragmented considered harmful?

On Wed, May 11, 2022 at 01:58:13PM +1000, Chris Dunlop wrote:
> On Wed, May 11, 2022 at 12:52:35PM +1000, Dave Chinner wrote:
> > On Wed, May 11, 2022 at 12:16:57PM +1000, Chris Dunlop wrote:
> > > Out of interest, would this work also reduce the time spent mounting
> > > in my case? I.e. would a lot of the work from my recovery mount be
> > > punted off to a background thread?
> > 
> > No. Log recovery will punt the remaining inodegc work to background
> > threads so it might get slightly parallelised, but we have a hard
> > barrier between completing log recovery work and completing the
> > mount process at the moment. Hence we wait for inodegc to complete
> > before log recovery is marked as complete.
> > 
> > In theory we could allow background inodegc to bleed into active
> > duty once log recovery has processed all the unlinked lists, but
> > that's a change of behaviour that would require a careful audit of
> > the last part of the mount path to ensure that it is safe to be
> > running concurrent background operations whilst completing mount
> > state updates.
> > 
> > This hasn't been on my radar at all up until now, but I'll have a
> > think about it next time I look at those bits of recovery. I suspect
> > that probably won't be far away - I have a set of unlinked inode
> > list optimisations that rework the log recovery infrastructure near
> > the top of my current work queue, so I will be in that general
> > vicinity over the next few weeks...
> 
> I'll keep an eye out.
> 
> > Regardless, the inodegc work is going to be slow on your system no
> > matter what we do because of the underlying storage layout. What we
> > need to do is try to remove all the places where stuff can get
> > blocked on inodegc completion, but that is somewhat complex because
> > we still need to be able to throttle queue depths in various
> > situations.
> 
> That reminds me of something I've been wondering about for obvious reasons:
> for workloads where metadata operations are dominant, do you have any
> ponderings on allowing AGs to be put on fast storage whilst the bulk data is
> on molasses storage?

If you're willing to give up reflink and pretty much all the
allocation optimisations for storage locality that make spinning
disks perform OK, then you can do this right now with a realtime
device as the user data store. You still have AGs, but they will
contain metadata only - your bulk data storage device is the
realtime device.
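
As a rough sketch of what that looks like (device names here are
invented, and reflink has to be explicitly disabled because mkfs.xfs
won't allow it with a realtime section):

  # metadata and log on the fast device, all file data on the slow one
  mkfs.xfs -m reflink=0 -d rtinherit=1 \
           -r rtdev=/dev/slow_bulk /dev/fast_meta
  mount -o rtdev=/dev/slow_bulk /dev/fast_meta /mnt

rtinherit=1 flags the root directory so new files are allocated on
the realtime device by default; without it you'd have to set the RT
bit per directory or file with xfs_io.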

This has downsides. You give up reflink. You give up rmap. You give
up allocation concurrency. You give up btree-indexed free space,
which means giving up the ability to find optimal free spaces.
Allocation algorithms are optimised for deterministic, bounded
overhead behaviour (the "realtime" aspect of the RT device), so you
give up smart, context-aware allocation algorithms. The list goes
on.

Reflink and rmap support for the realtime device are in the pipeline
(not likely to be added in the near term), but solutions for any of
the other issues are not - they are intrinsic behaviours of the
realtime device architecture.

However, there's no real way to separate data in AGs from metadata
in AGs - they share the same address space, and there's no simple
way to keep them apart and map different parts of an AG to different
storage devices. That would require a fair chunk of slicing and
dicing at the DM level, and then we'd have a whole new set of
problems to deal with when AGs run out of metadata space because of
reflink and/or rmap metadata explosions...
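
Just to give a flavour of the DM gymnastics: even in a fantasy world
where all the metadata conveniently lived in, say, the first 1GiB of
each 100GiB AG (it doesn't - that's the point), you'd be stitching
the address space together with a pair of linear targets per AG,
something like (devices and numbers invented, 512 byte sectors):

  # dmsetup table: <start> <length> linear <device> <offset>
  # AG 0: first 1GiB on fast storage, remaining 99GiB on slow
  0         2097152   linear /dev/fast 0
  2097152   207618048 linear /dev/slow 2097152
  # AG 1: same again, and so on for every AG...
  209715200 2097152   linear /dev/fast 2097152
  211812352 207618048 linear /dev/slow 211812352

fed to dmsetup create. And it still falls over the moment the
metadata in any AG grows past its fast slice, which is exactly the
reflink/rmap explosion problem above.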

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx


