On Sun, Oct 16, 2016 at 08:24:33PM -0400, Chris Mason wrote:
> On Sun, Oct 16, 2016 at 09:34:54AM +1100, Dave Chinner wrote:
> >On Fri, Oct 14, 2016 at 08:27:24AM -0400, Chris Mason wrote:
> >>
> >>Hi Dave,
> >>
> >>This is part of a series of patches we're growing to fix a perf
> >>regression on a few straggler tiers that are still on v3.10.  In
> >>this case, hadoop had to switch back to v3.10 because v4.x is as
> >>much as 15% slower on recent kernels.
> >>
> >>Between v3.10 and v4.x, kswapd is less effective overall.  This
> >>leads more and more procs to get bogged down in direct reclaim,
> >>using SYNC_WAIT in xfs_reclaim_inodes_ag().
> >>
> >>Since slab shrinking happens very early in direct reclaim, we've
> >>seen systems with 130GB of ram where hundreds of procs are stuck
> >>on the xfs slab shrinker fighting to walk a slab 900MB in size.
> >>They'd have better luck moving on to the page cache instead.
> >
> >We've already scanned the page cache for direct reclaim by the time
> >we get to running the shrinkers. Indeed, the amount of work the
> >shrinkers do is directly controlled by the amount of work done
> >scanning the page cache beforehand....
> >
> >>Also, we're going into direct reclaim much more often than we
> >>should because kswapd is getting stuck on XFS inode locks and
> >>writeback.
> >
> >Where and what locks, exactly?
> 
> This is from v4.0, because all of my newer hosts are trying a
> variety of patched kernels.  But the traces were very similar on
> newer kernels:
> 
> # cat /proc/282/stack
> [<ffffffff812ea2cd>] xfs_buf_submit_wait+0xbd/0x1d0
> [<ffffffff812ea6e4>] xfs_bwrite+0x24/0x60
> [<ffffffff812f18a4>] xfs_reclaim_inode+0x304/0x320
> [<ffffffff812f1b17>] xfs_reclaim_inodes_ag+0x257/0x370
> [<ffffffff812f2613>] xfs_reclaim_inodes_nr+0x33/0x40
> [<ffffffff81300fb9>] xfs_fs_free_cached_objects+0x19/0x20
> [<ffffffff811bb13b>] super_cache_scan+0x18b/0x190
> [<ffffffff8115acc6>] shrink_slab.part.40+0x1f6/0x380
> [<ffffffff8115e9da>] shrink_zone+0x30a/0x320
> [<ffffffff8115f94f>] kswapd+0x51f/0x9e0
> [<ffffffff810886b2>] kthread+0xd2/0xf0
> [<ffffffff81770d88>] ret_from_fork+0x58/0x90
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This one hurts the most.  While kswapd is waiting for IO, all the
> other reclaim it might have been doing is backing up.

Which says two things: neither journal tail pushing nor the
background inode reclaim worker is keeping up with dirty inode
writeback demand. Without knowing why that is occurring, we cannot
solve the problem.

> The other common path is the pag->pag_ici_reclaim_lock lock in
> xfs_reclaim_inodes_ag.  It goes through the trylock loop, doesn't
> free enough, and then waits on the locks for real.

Which is the "prevent hundreds of threads from all issuing inode
writeback concurrently" throttling. Working as designed.
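For reference, the path that trace is sitting in looks roughly like
this in the v4.x code (paraphrased from memory rather than quoted
verbatim, so check your tree for the exact details). The shrinker
callback always passes SYNC_WAIT, which is why kswapd ends up waiting
on the inode buffer writes it issues:

/* Rough sketch of the v4.x shrinker entry points - not verbatim. */

/*
 * superblock shrinker callback - this is what kswapd and direct
 * reclaim reach via super_cache_scan() in the trace above
 */
static long
xfs_fs_free_cached_objects(
	struct super_block	*sb,
	struct shrink_control	*sc)
{
	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
}

long
xfs_reclaim_inodes_nr(
	struct xfs_mount	*mp,
	int			nr_to_scan)
{
	/* kick the background reclaim worker and push the AIL first... */
	xfs_reclaim_work_queue(mp);
	xfs_ail_push_all(mp->m_ail);

	/* ...then reclaim synchronously, blocking on inode writeback */
	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT,
				     &nr_to_scan);
}

The SYNC_WAIT there is the throttle: with it set, xfs_reclaim_inode()
will write back a dirty inode with xfs_bwrite() and wait for the IO,
which is exactly the xfs_buf_submit_wait() frame at the top of your
trace.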
> XFS is also limiting the direct reclaim speed of all the other
> slabs.  We have 15 drives, each with its own filesystem.  But the
> end result of the current system is to bottleneck behind whichever
> FS is slowest at any given moment.

So why is the filesystem slow in 4.0 and not slow at all in 3.10?
And how does a 4.8 kernel compare, given there were major changes to
the mm/ subsystem in this release? i.e. are you chasing a mm/
problem that has already been solved?

> >What XFS is doing is not wrong - the synchronous behaviour is the
> >primary memory reclaim feedback mechanism that prevents reclaim
> >from trashing the working set of clean inodes when under memory
> >pressure. It's also the choke point where we prevent lots of
> >concurrent threads from trying to do reclaim at once, contending
> >on locks and inodes and causing catastrophic IO breakdown because
> >such reclaim results in random IO patterns for inode writeback
> >instead of nice clean ascending offset ordered IO.
> 
> It's also blocking kswapd (and all the other procs directly calling
> shrinkers) on IO.  Either IO it directly issues or IO run by other
> procs.  This causes the conditions that make all the threads want
> to do reclaim at once.

Yup, that's what it's meant to do. As I keep repeating - this
behaviour is indicative of /some other problem/ occurring. If inode
writeback and reclaim are occurring efficiently and correctly, then
kswapd will not throttle like this because it will never block on
IO. And, by design, when we throttle kswapd we effectively throttle
direct reclaim, too.
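To be concrete about what "occurring efficiently" means: the
non-blocking side of inode reclaim is the background worker, which in
the v4.x code looks roughly like the sketch below (again paraphrased
from memory, not a verbatim quote). It runs with SYNC_TRYLOCK only,
so it skips dirty or busy inodes instead of blocking on them; only
the shrinker path adds SYNC_WAIT:

/* Rough sketch of the background inode reclaim worker - not verbatim. */
void
xfs_reclaim_worker(
	struct work_struct *work)
{
	struct xfs_mount *mp = container_of(to_delayed_work(work),
					struct xfs_mount, m_reclaim_work);

	/* non-blocking scan: dirty/busy inodes are skipped, not waited on */
	xfs_reclaim_inodes(mp, SYNC_TRYLOCK);

	/* requeue ourselves to run again after the reclaim interval */
	xfs_reclaim_work_queue(mp);
}

When that worker and the AIL pushing keep up, the shrinker should
find mostly clean inodes and never get anywhere near
xfs_buf_submit_wait().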
> >That commit (which I had to go find because you helpfully didn't
> >quote it) was a7b339f1b869 ("xfs: introduce background inode
> >reclaim work"), which introduced asynchronous background reclaim
> >work and so made the reclaim function able to handle both async and
> >sync reclaim. To maintain the direct reclaim throttling behaviour
> >of the shrinker, that function now needed to be told to be
> >synchronous, hence the addition of the SYNC_WAIT. We didn't
> >introduce sync reclaim with this commit in 2.6.39(!), we've had
> >that behaviour since well before that. Hence if the analysis
> >performed stopped at this point in history, it was flawed.
> 
> This was an RFC because I was RFCing.  We're stuffing kswapd behind
> synchronous IO, and limiting the rate at which we can reclaim pages
> on the system.  I'm happy to toss unpatched kernels into the
> workload and gather stats to help us nail down good behaviour, but
> really I'm asking what those stats might be.

I can't suggest anything right now, because you haven't given me
system/workload details or concrete analysis to base any suggestions
on.

> What we have now is a single synchronous shrinker taking the box
> over.  Nothing happens until XFS gets its inodes down the pipe,
> even when there are a considerable number of other freeable pages
> on the box.
> 
> There are probably a series of other optimizations we can be making
> in the MM code around when the shrinkers are called, and how well
> it deals with the constant churn from this workload.  I want to try
> those too, but we're stuck right now on this one spot.
> 
> An overall description of the hadoop workload:
> 
> Lots of java threads, spanning 15 filesystems, 4T each.

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Things like number of allocation groups, journal size, what the
physical storage is, IO scheduler (or lack of), etc, are all
important here.

> Each thread runs an unpredictable amount of disk and network IO and
> uses massive amounts of CPU.  The threads last for unpredictable
> amounts of time.  These boxes have ~130GB of ram and two sockets of
> CPUs (12 cores per socket, HT enabled).  The files themselves are
> relatively large and are mostly streaming reads/writes.

Again, details. meminfo, iostat, etc, details of how memory usage
and IO patterns change when everything backs up on inode reclaim,
etc, are /really important/ here. I need to be able to reproduce a
similar memory imbalance myself to be able to test any solution we
come up with.

Here are some immediate questions I have from the workload
description:

- If the workload is mostly large files and streaming reads and
  writes, then why are there so many inodes that need writeback that
  reclaim is getting stuck on them?

- Why aren't the inodes getting written back regularly via the
  periodic log work (e.g. via the xfs_ail_push_all() call that
  occurs every 30s)?

- Is there so much data IO that metadata IO is being starved?

- Why does it take a week to manifest - when does the system go out
  of balance, and is there anything in userspace that changes
  behaviour that might trigger it?

- Is there a behavioural step-change in the workload of some
  threads?

- Are you running at near ENOSPC and so maybe hitting some
  filesystem fragmentation level that causes the seek load to slowly
  increase until there's no IOPS left in the storage?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx