On Wed, May 11, 2022 at 12:16:57PM +1000, Chris Dunlop wrote:
> On Wed, May 11, 2022 at 11:36:54AM +1000, Dave Chinner wrote:
> > I think that's the thing that some people have missed in this
> > thread - I've known for a while now the scope of problems blocking
> > flushes from statfs() can cause - any issue with background
> > inodegc not making progress can deadlock the filesystem. I've lost
> > count of the number of times I had inodegc go wrong or crash and
> > the only possible recovery was to power cycle because nothing
> > could be executed from the command line.
> >
> > That's because statfs() appears to be used in some commonly used
> > library function and so many utility programs on the system will
> > get stuck and be unable to run when inodegc dies, deadlocks, or
> > takes a real long time to make progress.
> >
> > Hence I didn't need to do any analysis of Chris' system to know
> > exactly what was going on - I've seen it many, many times myself
> > and have work in progress that avoids those issues with inodegc
> > work that never completes.
> >
> > IOWs, everything is understood, fixes are already written, and
> > there's no actual threat of data loss or corruption from this
> > issue. It's just lots of stuff getting stuck behind a long running
> > operation...
>
> Your patches are stuck behind other long running priorities, or the
> patches address an issue of stuff getting stuck? Or, of course,
> both? ;-)

I was referring to the way your system got blocked up - lots of stuff
stuck behind a long running operation. :)

As for priorities, the queue/flushing fixes are part of a body of work
in progress that I benched for 3 months when I agreed to take over
maintenance of the kernel code for the current dev cycle. My
priorities over recent weeks have been to get everyone else's work
reviewed, tested, improved, and merged, not to progress my own
projects.

As such, I wasn't considering changing inodegc queueing in the next
merge window at all - it would just be part of the larger body of work
when it was ready. Your problems have made it apparent that we really
need those problems fixed ASAP (i.e. the next merge cycle), hence the
priority bump....

> Out of interest, would this work also reduce the time spent mounting
> in my case? I.e. would a lot of the work from my recovery mount be
> punted off to a background thread?

No. Log recovery will punt the remaining inodegc work to background
threads, so it might get slightly parallelised, but we have a hard
barrier between completing log recovery work and completing the mount
process at the moment. Hence we wait for inodegc to complete before
log recovery is marked as complete.

In theory we could allow background inodegc to bleed into active duty
once log recovery has processed all the unlinked lists, but that's a
change of behaviour that would require a careful audit of the last
part of the mount path to ensure it is safe to run concurrent
background operations whilst completing mount state updates.

This hasn't been on my radar at all up until now, but I'll have a
think about it next time I look at those bits of recovery. I suspect
that probably won't be far away - I have a set of unlinked inode list
optimisations that rework the log recovery infrastructure near the top
of my current work queue, so I will be in that general vicinity over
the next few weeks...

Regardless, the inodegc work is going to be slow on your system no
matter what we do because of the underlying storage layout.
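To make the shape of the problem concrete, here's a toy userspace
model of the pattern - it's nothing like the real XFS code, and all
the names in it are made up for illustration - but it shows why a slow
background worker ends up stalling every statfs()-style caller that
flushes and waits on it:

/*
 * Toy model, not kernel code: a background worker drains a queue of
 * deferred work, and a "flush" path (the statfs() analogue) waits for
 * the queue to empty. If the worker is slow or stuck, every flusher
 * is stuck behind it. Build with: cc -pthread toy.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int queued = 100;		/* pending inactivation work items */

static void *gc_worker(void *arg)
{
	for (;;) {
		pthread_mutex_lock(&lock);
		if (queued > 0) {
			queued--;		/* "inactivate" one inode */
			if (queued == 0)
				pthread_cond_broadcast(&drained);
		}
		pthread_mutex_unlock(&lock);
		sleep(1);		/* slow storage: each item takes a while */
	}
	return NULL;
}

/* statfs()-like path: flush and wait for background work to drain. */
static void flush_and_wait(void)
{
	pthread_mutex_lock(&lock);
	while (queued > 0)
		pthread_cond_wait(&drained, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t gc;

	pthread_create(&gc, NULL, gc_worker, NULL);
	printf("statfs caller waiting on background gc...\n");
	flush_and_wait();	/* blocks until the worker catches up */
	printf("done\n");
	return 0;
}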
What we need to do is try to remove all the places where stuff can get
blocked on inodegc completion, but that is somewhat complex because we
still need to be able to throttle queue depths in various situations.

As always, there is plenty of scope for improvement - the question is
working out which improvements are the ones we actually need first...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx