On Fri, Nov 01, 2019 at 10:46:09AM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > When doing async node reclaiming, we grab a batch of inodes that we > are likely able to reclaim and ignore those that are already > flushing. However, when we actually go to reclaim them, the first > thing we do is lock the inode. If we are racing with something > else reclaiming the inode or flushing it because it is dirty, > we block on the inode lock. Hence we can still block kswapd here. > > Further, if we flush an inode, we also cluster all the other dirty > inodes in that cluster into the same IO, flush locking them all. > However, if the workload is operating on sequential inodes (e.g. > created by a tarball extraction) most of these inodes will be > sequntial in the cache and so in the same batch > we've already grabbed for reclaim scanning. > > As a result, it is common for all the inodes in the batch to be > dirty and it is common for the first inode flushed to also flush all > the inodes in the reclaim batch. In which case, they are now all > going to be flush locked and we do not want to block on them. > > Hence, for async reclaim (SYNC_TRYLOCK) make sure we always use > trylock semantics and abort reclaim of an inode as quickly as we can > without blocking kswapd. This will be necessary for the upcoming > conversion to LRU lists for inode reclaim tracking. > > Found via tracing and finding big batches of repeated lock/unlock > runs on inodes that we just flushed by write clustering during > reclaim. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > Reviewed-by: Christoph Hellwig <hch@xxxxxx> > --- Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> > fs/xfs/xfs_icache.c | 23 ++++++++++++++++++----- > 1 file changed, 18 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index edcc3f6bb3bf..189cf423fe8f 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -1104,11 +1104,23 @@ xfs_reclaim_inode( > > restart: > error = 0; > - xfs_ilock(ip, XFS_ILOCK_EXCL); > - if (!xfs_iflock_nowait(ip)) { > - if (!(sync_mode & SYNC_WAIT)) > + /* > + * Don't try to flush the inode if another inode in this cluster has > + * already flushed it after we did the initial checks in > + * xfs_reclaim_inode_grab(). > + */ > + if (sync_mode & SYNC_TRYLOCK) { > + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) > goto out; > - xfs_iflock(ip); > + if (!xfs_iflock_nowait(ip)) > + goto out_unlock; > + } else { > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + if (!xfs_iflock_nowait(ip)) { > + if (!(sync_mode & SYNC_WAIT)) > + goto out_unlock; > + xfs_iflock(ip); > + } > } > > if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { > @@ -1215,9 +1227,10 @@ xfs_reclaim_inode( > > out_ifunlock: > xfs_ifunlock(ip); > +out_unlock: > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > out: > xfs_iflags_clear(ip, XFS_IRECLAIM); > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > /* > * We could return -EAGAIN here to make reclaim rescan the inode tree in > * a short while. However, this just burns CPU time scanning the tree > -- > 2.24.0.rc0 >