> So I dunno, ISTM that the current implementation can be hyper efficient
> for some workloads and perhaps unpredictable for others. As Dave already
> alluded to, the tradeoff often times for such hyper efficient structures
> is functional inflexibility, which is kind of what we're seeing between
> the inability to recycle inodes wrt to this topic as well as potential
> optimizations on the whole RCU recycling thing. The only real approach
> that comes to mind for managing this kind of infrastructure short of
> changing data structures is to preserve the ability to drain and quiesce
> it regardless of filesystem state.
>
> For example, I wonder if we could do something like have the percpu
> queues amortize insertion into lock protected perag lists that can be
> managed/processed accordingly rather than complete the entire
> inactivation sequence in percpu context. From there, the perag lists
> could be processed by an unbound/multithreaded workqueue that's maybe
> more aligned with the AG count on the filesystem than cpu count on the
> system.

That per-ag based back end processing is exactly what Darrick's
original proposals did:

https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/

It used radix tree walks run in background kworker threads to find and
batch all the inode inactivation in a given AG. It performed poorly at
scale because it destroyed the CPU affinity between the unlink()
operation and the inactivation of the unlinked inode that happens when
the last reference to the inode is dropped and the inode is evicted
from task syscall exit context.

Regardless of whether there is a per-cpu front end or not, per-ag based
processing will destroy the CPU affinity of the data set being
processed, because we cannot guarantee that the per-ag objects are all
local to the CPU that is processing them....

IOWs, all the per-cpu queues are trying to do is keep the inode
inactivation processing local to the CPU where the inodes are already
hot in cache. The per-cpu queues do not achieve this perfectly in all
cases, but they perform far, far better than the original parallelised
per-AG inactivation processing, which cannot maintain any cache
affinity for the objects being processed at all.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
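
For concreteness, here is a minimal sketch of the shape of the quoted
proposal: per-cpu staging lists that amortise insertion into a
lock-protected per-AG list, which an unbound workqueue then drains.
Every name in it is hypothetical and it is not taken from any posted
patch; it only illustrates the structure under discussion. Note that
the worker runs on whatever CPU the unbound workqueue schedules it on,
which is exactly where the cache affinity Dave describes is lost.

/*
 * Sketch only: hypothetical names, not from any posted XFS patch.
 * Per-cpu staging amortises the per-AG lock; an unbound workqueue
 * drains the per-AG list and performs the actual inactivation.
 */
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

#define INACTIVE_BATCH	32	/* flush to the per-AG list in batches */

struct sketch_pcpu_queue {
	struct list_head	inodes;
	int			nr;
};

struct sketch_perag {
	spinlock_t		inactive_lock;
	struct list_head	inactive_list;	/* inodes awaiting inactivation */
	struct work_struct	inactive_work;	/* queued on the unbound workqueue */
	struct sketch_pcpu_queue __percpu *staging;
};

/* Created with alloc_workqueue(..., WQ_UNBOUND, nr_ags) at mount time. */
static struct workqueue_struct *inactive_wq;

/* Inode eviction context: queue locally, splice under the AG lock in batches. */
static void sketch_queue_inactive(struct sketch_perag *pag, struct list_head *item)
{
	struct sketch_pcpu_queue *q = get_cpu_ptr(pag->staging);

	list_add_tail(item, &q->inodes);
	if (++q->nr >= INACTIVE_BATCH) {
		/* One lock round trip per INACTIVE_BATCH insertions. */
		spin_lock(&pag->inactive_lock);
		list_splice_tail_init(&q->inodes, &pag->inactive_list);
		spin_unlock(&pag->inactive_lock);
		q->nr = 0;
		queue_work(inactive_wq, &pag->inactive_work);
	}
	put_cpu_ptr(pag->staging);
}

/* Unbound worker: runs on an arbitrary CPU, hence the affinity loss. */
static void sketch_inactive_worker(struct work_struct *work)
{
	struct sketch_perag *pag =
		container_of(work, struct sketch_perag, inactive_work);
	LIST_HEAD(batch);

	spin_lock(&pag->inactive_lock);
	list_splice_init(&pag->inactive_list, &batch);
	spin_unlock(&pag->inactive_lock);

	/* ... walk @batch and inactivate each inode here ... */
}

The batching reduces contention on the per-AG lock, but as the sketch
makes visible, the queued inodes are then processed cold on whichever
CPU happens to run the worker, which is the trade-off argued about
above.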