On Wed, Oct 09, 2019 at 02:21:16PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> When doing async inode reclaiming, we grab a batch of inodes that we
> are likely able to reclaim and ignore those that are already
> flushing. However, when we actually go to reclaim them, the first
> thing we do is lock the inode. If we are racing with something
> else reclaiming the inode or flushing it because it is dirty,
> we block on the inode lock. Hence we can still block kswapd here.
>
> Further, if we flush an inode, we also cluster all the other dirty
> inodes in that cluster into the same IO, flush locking them all.
> However, if the workload is operating on sequential inodes (e.g.
> created by a tarball extraction) most of these inodes will be
> sequential in the cache and so in the same batch
> we've already grabbed for reclaim scanning.
>
> As a result, it is common for all the inodes in the batch to be
> dirty and it is common for the first inode flushed to also flush all
> the inodes in the reclaim batch. In which case, they are now all
> going to be flush locked and we do not want to block on them.
>
> Hence, for async reclaim (SYNC_TRYLOCK) make sure we always use
> trylock semantics and abort reclaim of an inode as quickly as we can
> without blocking kswapd. This will be necessary for the upcoming
> conversion to LRU lists for inode reclaim tracking.
>
> Found via tracing and finding big batches of repeated lock/unlock
> runs on inodes that we just flushed by write clustering during
> reclaim.

Looks good:

Reviewed-by: Christoph Hellwig <hch@xxxxxx>
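
For anyone following the thread without the patch at hand, the trylock behaviour the commit message describes can be sketched in user-space C. This is only an illustration under stated assumptions, not the XFS code: struct demo_inode, DEMO_SYNC_TRYLOCK, demo_reclaim_inode() and demo_reclaim_batch() are hypothetical names, and pthread mutexes stand in for the inode lock and the flush lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; this is NOT the XFS structure. */
struct demo_inode {
    pthread_mutex_t lock;        /* stands in for the inode lock */
    pthread_mutex_t flush_lock;  /* stands in for the flush lock */
    bool reclaimed;
};

/* Hypothetical flag modelled on SYNC_TRYLOCK from the commit message. */
enum { DEMO_SYNC_TRYLOCK = 1 };

/*
 * Try to reclaim one inode. With DEMO_SYNC_TRYLOCK set, never sleep on
 * a lock: if either lock is busy, back out and report "skipped".
 * Returns true if the inode was reclaimed, false if it was skipped.
 */
static bool demo_reclaim_inode(struct demo_inode *ip, int flags)
{
    if (flags & DEMO_SYNC_TRYLOCK) {
        if (pthread_mutex_trylock(&ip->lock) != 0)
            return false;                    /* someone else holds it */
        if (pthread_mutex_trylock(&ip->flush_lock) != 0) {
            pthread_mutex_unlock(&ip->lock); /* being flushed: abort */
            return false;
        }
    } else {
        /* Synchronous reclaim is allowed to block. */
        pthread_mutex_lock(&ip->lock);
        pthread_mutex_lock(&ip->flush_lock);
    }

    ip->reclaimed = true;
    pthread_mutex_unlock(&ip->flush_lock);
    pthread_mutex_unlock(&ip->lock);
    return true;
}

/* Walk a batch and return how many inodes were actually reclaimed. */
static size_t demo_reclaim_batch(struct demo_inode *batch, size_t n, int flags)
{
    size_t done = 0;

    assert(batch != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (demo_reclaim_inode(&batch[i], flags))
            done++;
    return done;
}
```

In this sketch an inode whose flush lock is already held (as it would be after write clustering flushed the whole batch) is skipped immediately, so the scanning thread never sleeps on it; a later pass can pick it up once the flush completes.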