Re: [PATCH 06/11] xfs: deferred inode inactivation

On Wed, Mar 17, 2021 at 03:21:25PM +0000, Christoph Hellwig wrote:
> On Tue, Mar 16, 2021 at 08:47:29AM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> > > Still digesting this.  What trips me up a bit is the huge amount of
> > > duplication vs. the inode reclaim mechanism.  Did you look into
> > > sharing more code there and, if so, what speaks against it?
> > 
> > TBH I didn't look /too/ hard because once upon a time[1] Dave was aiming
> > to replace the inode reclaim tagging and iteration with an lru list walk
> > so I decided not to entangle the two.
> > 
> > [1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@xxxxxxxxxxxxx/
> 
> Well, it isn't just the radix tree tagging, but mostly the
> infrastructure in iget that seems to duplicate a lot of very delicate
> code.
> 
> For the actual inactivation run:  why don't we queue the inodes up for
> inactivation directly, i.e. use the work_struct in the inode to queue
> the inode onto the workqueue and let the workqueue manage the details?
> That also means we can piggyback on flush_work and flush_workqueue to
> force one or more entries out.
> 
> Again I'm not saying I know this is better, but this is something that
> comes to my mind when reading the code.

Hmm.  You mean reuse i_ioend_work (which maybe we should just rename to
i_work) and queue the inodes directly onto the workqueue?  I suppose
that would mean we don't even need the radix tree tag + inode walk...
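
Just so we're talking about the same shape, here's a rough sketch; the
workqueue pointer, the helper names, and the renamed i_work field are
all made up for illustration:

/*
 * Hypothetical per-inode flavor: each inode carries its own work item,
 * so forcing pending inactivations out is just a workqueue flush.
 */
static void xfs_inodegc_worker(struct work_struct *work)
{
        struct xfs_inode *ip = container_of(work, struct xfs_inode, i_work);

        xfs_inactive(ip);
        /* ...then hand the inode over to reclaim; not shown here. */
}

/* Queue an inode for inactivation when the last reference goes away. */
static void xfs_inodegc_queue(struct xfs_inode *ip)
{
        INIT_WORK(&ip->i_work, xfs_inodegc_worker);
        queue_work(ip->i_mount->m_inodegc_wq, &ip->i_work);
}

/* Forcing all pending inactivations piggybacks on the workqueue APIs. */
static void xfs_inodegc_flush(struct xfs_mount *mp)
{
        flush_workqueue(mp->m_inodegc_wq);
}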

I hadn't thought about reusing i_ioend_work, since this patchset
predates the writeback ioend chaining.  The biggest downside I can
think of to doing it that way is that right after an rm -rf, the
unbound gc workqueue will start hundreds of kworkers to deal with the
sudden burst of queued work, but all those workers will end up fighting
each other for (a) log grant space and then (b) the AGI buffer locks,
while everything else on the frontend stalls on the log.
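
For scale, and assuming nothing about how the queue would actually be
set up: an unbound workqueue created with max_active == 0 gets the
WQ_DFL_ACTIVE default of 256 concurrent work items, so a burst of
per-inode work items really can fan out that wide.  The name and flags
below are only illustrative:

/* max_active == 0 means WQ_DFL_ACTIVE (256) work items in flight. */
struct workqueue_struct *gc_wq =
        alloc_workqueue("xfs-inodegc", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);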

The other benefit I can see to keeping the inactivation work as a
per-AG item is that (at least within each AG) we can walk the inodes in
disk order, which probably results in less seeking (note: I haven't
studied this) and might let us free inode cluster buffers sooner in the
rm -rf case.
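
For contrast, the per-AG shape is roughly one work item per AG driving
a tag walk, and the radix tree iteration hands back tagged inodes in
ascending inumber (i.e. disk) order.  The field, tag, and helper names
below are only illustrative, not the actual ones in this series:

/* Per-AG flavor: one worker per AG visits its tagged inodes in order. */
static void xfs_inactive_ag_worker(struct work_struct *work)
{
        struct xfs_perag *pag = container_of(work, struct xfs_perag,
                                             pag_inactive_work);

        /* Radix tree gang lookups return inodes in ascending inumber. */
        xfs_walk_ag_inodes_tagged(pag, XFS_ICI_INACTIVE_TAG,
                                  xfs_inactive_one_inode);
}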

<shrug> Thoughts?

--D



