Re: [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration

On Sat, May 23, 2020 at 04:31:31AM -0700, Christoph Hellwig wrote:
> On Fri, May 22, 2020 at 01:50:27PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Now that we have all the dirty inodes attached to the cluster
> > buffer, we don't actually have to do radix tree lookups to find
> > them. Sure, the radix tree is efficient, but walking a linked list
> > of just the dirty inodes attached to the buffer is much better.
> > 
> > We are also no longer dependent on having a locked inode passed into
> > the function to determine where to start the lookup. This means we
> > can drop it from the function call and treat all inodes the same.
> > 
> > We also make xfs_iflush_cluster skip inodes marked with
> > XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
> > actively referencing or that are being re-initialised by inode
> > lookup. If they are actually dirty, they'll get written by a
> > future cluster flush....
> > 
> > We also add a shutdown check after obtaining the flush lock so that
> > we catch inodes that are dirty in memory and may have inconsistent
> > state due to the shutdown in progress. We abort these inodes
> > directly, so they remove themselves from the buffer list and the
> > AIL immediately rather than having to wait for the buffer to be
> > failed and its callbacks run before they are processed correctly.
> 
> I suspect we should just kill off xfs_iflush_cluster with this, as it
> is best split between xfs_iflush and xfs_inode_item_push.  POC patch
> below, but as part of that I noticed some really odd error handling,
> which I'll bring up separately.

That's the exact opposite of the way the follow-on patchset I have goes.
xfs_inode_item_push() goes away entirely because tracking 30+ inodes
in the AIL for a single writeback IO event is amazingly inefficient
and consumes huge amounts of CPU unnecessarily.

IOWs, the original point of pinning the inode cluster buffer
directly at inode dirtying was so we could track dirty inodes in the
AIL via the cluster buffer instead of individually as inodes.  The
AIL drops by a factor of 30 in size, and instead of seeing 10,000
inode item push "success" reports and 300,000 "flushing" reports a
second, we only see the 10,000 success reports.

IOWs, the xfsaild is CPU bound in highly concurrent inode dirtying
workloads because of the massive numbers of inodes we cycle through
it. Tracking them by cluster buffer reduces the CPU overhead of the
xfsaild substantially. The whole "hang on, this solves the OOM kill
problem with async reclaim" was a happy accident - I didn't start
this patch set with the intent of fixing inode reclaim, it started
because inode tracking in the AIL is highly inefficient...
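For anyone following along, the buffer-based dirty inode walk that the
quoted patch description talks about looks roughly like the sketch
below. This is an illustrative approximation, not the code from the
series: the helper names and the exact locking (xfs_ilock_nowait /
xfs_iflock_nowait, xfs_iflush_abort(), xfs_iflush_int()) are
assumptions based on the description and on contemporary XFS code, and
the real patches may well differ.

/*
 * Illustrative sketch only -- not the code from the patch series.
 * Helper names, signatures and the exact locking are approximations
 * based on the description above.
 */
static void
xfs_iflush_cluster_sketch(
	struct xfs_buf		*bp)
{
	struct xfs_log_item	*lip, *n;

	/* Walk only the dirty inodes attached to this cluster buffer. */
	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
		struct xfs_inode *ip =
			container_of(lip, struct xfs_inode_log_item,
				     ili_item)->ili_inode;

		/*
		 * Skip inodes that reclaim is actively referencing or
		 * that lookup is re-initialising; if they are dirty,
		 * a future cluster flush will pick them up.
		 */
		if (xfs_iflags_test(ip, XFS_IRECLAIM))
			continue;

		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
			continue;
		if (!xfs_iflock_nowait(ip)) {	/* flush lock (approx.) */
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			continue;
		}

		/*
		 * Shutdown check after the flush lock is held: abort the
		 * inode directly so it removes itself from the buffer
		 * list and the AIL rather than waiting for buffer failure
		 * callbacks. (Abort helper name/signature approximate;
		 * assume it drops the flush lock.)
		 */
		if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
			xfs_iflush_abort(ip);
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			continue;
		}

		/* Flush the inode's dirty state into the cluster buffer. */
		xfs_iflush_int(ip, bp);		/* helper name approximate */
		xfs_ifunlock(ip);
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
	}
}

The point of the sketch is simply that the iteration is driven off the
buffer's attached log item list rather than a radix tree gang lookup,
so only dirty inodes are ever visited.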

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
