On Sat, May 23, 2020 at 04:31:31AM -0700, Christoph Hellwig wrote: > On Fri, May 22, 2020 at 01:50:27PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > Now that we have all the dirty inodes attached to the cluster > > buffer, we don't actually have to do radix tree lookups to find > > them. Sure, the radix tree is efficient, but walking a linked list > > of just the dirty inodes attached to the buffer is much better. > > > > We are also no longer dependent on having a locked inode passed into > > the function to determine where to start the lookup. This means we > > can drop it from the function call and treat all inodes the same. > > > > We also make xfs_iflush_cluster skip inodes marked with > > XFS_IRECLAIM. This we avoid races with inodes that reclaim is > > actively referencing or are being re-initialised by inode lookup. If > > they are actually dirty, they'll get written by a future cluster > > flush.... > > > > We also add a shutdown check after obtaining the flush lock so that > > we catch inodes that are dirty in memory and may have inconsistent > > state due to the shutdown in progress. We abort these inodes > > directly and so they remove themselves directly from the buffer list > > and the AIL rather than having to wait for the buffer to be failed > > and callbacks run to be processed correctly. > > I suspect we should just kill off xfs_iflush_cluster with this, as it > is best split between xfs_iflush and xfs_inode_item_push. POC patch > below, but as part of that I noticed some really odd error handling, > which I'll bring up separately. That's the exact opposite way the follow-on patchset I have goes. xfs_inode_item_push() goes away entirely because tracking 30+ inodes in the AIL for a single writeback IO event is amazingly inefficient and consumes huge amounts of CPU unnecessarily. IOWs, the original point of pinning the inode cluster buffer directly at inode dirtying was so we could track dirty inodes in the AIL via the cluster buffer instead of individually as inodes. The AIL drops by a factor of 30 in size, and instead of seeing 10,000 inode item push "success" reports and 300,000 "flushing" reports a second, we only see the 10,000 success reports. IOWs, the xfsaild is CPU bound in highly concurrent inode dirtying workloads because of the massive numbers of inodes we cycle through it. Tracking them by cluster buffer reduces the CPU over of the xfsaild substantially. The whole "hang on, this solves the OOM kill problem with async reclaim" was a happy accident - I didn't start this patch set with the intent of fixing inode reclaim, it started because inode tracking in the AIL is highly inefficient... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx