Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim

On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> On 11/14/2016 02:27 AM, Dave Chinner wrote:
> >On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
> >>On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
> >>
> >>[ Long stalls from xfs_reclaim_inodes_ag ]
> >>
> >>>XFS has *1* tunable that can change the behaviour of metadata
> >>>writeback. Please try it.
> >>
> >>[ weeks go by, so this email is insanely long ]
> >>
> >>Testing all of this was slow going because two of the three test
> >>boxes I had with the Hadoop configuration started having hardware
> >>problems.  The good news is that while I was adjusting the
> >>benchmark, we lined up access to a bunch of duplicate boxes, so I
> >>can now try ~20 different configurations in parallel.
> >>
> >>My rough benchmark is here:
> >>
> >>git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
> >>
> >>The command line I ended up using was:
> >>
> >>simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i
> >>60 -w 300 -D 2 /hammer/*
> >
> >There's a lightly tested patch below that should do the trick.
> >
> >After 5 minutes on a modified simoop cli on a single filesystem,
> >4.9-rc4+for-next:
> >
> >$ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
> >....
> >Run time: 300 seconds
> >Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
> >Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
> >Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
> >work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
> >alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)
> >
> >With the patch below:
> >
> >Run time: 300 seconds
> >Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
> >Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
> >Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
> >work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
> >alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)
> >
> >Stall rate went to zero and stayed there at the 120s mark of the
> >warmup. Note the p99 difference for read and allocation latency,
> >too.
> >
> >I'll post some graphs tomorrow from my live PCP telemetry that
> >demonstrate the difference in behaviour better than any words...
> 
> Thanks Dave, this is definitely better.  But at least for the
> multi-disk setup, it's not quite there yet.

I can see why now that I've run the test for half an hour to get a
decent idea of the steady state behaviour. I'll come back to that....

> Your patch:
> ___
> Run time: 15535 seconds
> Read latency (p50: 22,708,224) (p95: 34,668,544) (p99: 41,746,432)
> Write latency (p50: 21,200,896) (p95: 34,799,616) (p99: 41,877,504)
> Allocation latency (p50: 11,550,720) (p95: 31,424,512) (p99: 39,518,208)
> work rate = 7.48/sec (avg 8.41/sec) (p50: 8.69) (p95: 9.57) (p99: 9.87)
> alloc stall rate = 14.08/sec (avg: 14.85) (p50: 15.74) (p95: 19.74)
> (p99: 22.04)
> 
> Original patch:
> ___
> Run time: 15474 seconds
> Read latency (p50: 20,414,464) (p95: 29,786,112) (p99: 34,275,328)
> Write latency (p50: 15,155,200) (p95: 25,591,808) (p99: 31,621,120)
> Allocation latency (p50: 7,675,904) (p95: 22,970,368) (p99: 29,523,968)
> work rate = 8.33/sec (avg 10.54/sec) (p50: 10.54) (p95: 10.58) (p99: 10.58)
> alloc stall rate = 37.08/sec (avg: 21.73) (p50: 23.16) (p95: 24.68)
> (p99: 25.00)

Ok, so stall rate is down, but the latencies are still garbage
because stalls are happening.

> I've had this running all day, but the differences stabilized after
> 5-10 minutes.  Everyone's p99s trend higher as the day goes on, but
> the percentage difference stays about the same.

Right - they stabilise after the first few stalls because those
stalls dominate the p99 numbers. For example, with my patch, I saw the
first stall event between 1080 and 1140 seconds - you can see this
clearly in the 3rd chart:

Run time: 1080 seconds
Read latency (p50: 3,035,136) (p95: 4,284,416) (p99: 5,349,376)
Write latency (p50: 27,623,424) (p95: 31,031,296) (p99: 45,154,304)
Allocation latency (p50: 240,384) (p95: 709,632) (p99: 1,308,672)
work rate = 3.75/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
alloc stall rate = 0.65/sec (avg: 0.04) (p50: 0.00) (p95: 0.03) (p99: 0.03)
du thread is done /mnt/scratch
___
Run time: 1140 seconds
Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424)
Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376)
Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960)
work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12)
du thread is running /mnt/scratch

So a single "stall event" blows out the p99 latencies really badly.
This is probably the single most important revelation about this
so far...

> I think the difference between mine and yours is we didn't quite get
> the allocation stalls down to zero, so making tasks wait for the
> shrinker shows up in the end numbers.

Right, but so far we haven't answered the obvious question: what
triggers the stall events?
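
To spell out what "wait for the shrinker" means here: direct reclaim
enters XFS inode reclaim via the superblock shrinker, which lands in
xfs_reclaim_inodes_nr(). Roughly - paraphrased from memory of the
4.9-era code rather than cut and pasted, so don't take the details as
gospel:

long
xfs_reclaim_inodes_nr(
	struct xfs_mount	*mp,
	int			nr_to_scan)
{
	/* kick background inode reclaim and push the AIL to get IO moving */
	xfs_reclaim_work_queue(mp);
	xfs_ail_push_all(mp->m_ail);

	/*
	 * SYNC_WAIT is what makes the reclaiming task block on dirty
	 * inode writeback here; the RFC in the subject line drops it
	 * so the shrinker only trylocks and skips anything it can't
	 * reclaim immediately.
	 */
	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
}

That SYNC_WAIT is the wait in question.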

> For your patch, the stalls from xfs_reclaim_inodes_ag() were about
> the same as the unpatched kernel yesterday.  We still had long tails
> in the 30 second+ category.

Because when a stall happens, it's doing more than just trashing the
inode cache. The 4 charts I've attached in a tarball are:

Chart 1: unpatched kernel, inode/dentry/xfs_inode cache usage
Chart 2: unpatched kernel, memory usage
Chart 3: patched kernel, inode/dentry/xfs_inode cache usage
Chart 4: patched kernel, memory usage

These are taken with a 5s sample period, so we're looking at overall
trends, not fine grained behaviour. Things to note:

	- discontinuities in the graphs indicate an unresponsive
	  machine and samples were lost. Good sign that the machine
	  is stalling on memory allocation
	- dentry/vfs inode counts come from
	  /proc/sys/fs/{dentry,inode}-state (sampled as sketched below)
	- memory usage chart: "other" is userspace and kernel memory
	  other than the page cache ("cached").
	- cached + other + free = 16GB (all memory)
	- "slab"  is a subset of "other", but is split out above
	  16GB so i can easily see salb cache memory usage
	- "dirty" is a subset of "cached", but is split out so I can
	  see how much of the page cache is dirty.
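
If you want to eyeball the same counters without a PCP setup, a quick
and dirty sampler like the sketch below does the job - untested, it
just reads the raw /proc files the charts are derived from (the real
charts came from my PCP telemetry at the same 5s period). It needs to
run as root so /proc/slabinfo is readable:

/* proc-sample.c: dump the raw counters behind the charts every 5s */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* first field of /proc/sys/fs/{dentry,inode}-state is the object count */
static long first_field(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

/* active object count for a named cache from /proc/slabinfo */
static long slab_active_objs(const char *cache)
{
	char line[512], name[64];
	long active = -1, objs;
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* format: name <active_objs> <num_objs> ... */
		if (sscanf(line, "%63s %ld", name, &objs) == 2 &&
		    !strcmp(name, cache)) {
			active = objs;
			break;
		}
	}
	fclose(f);
	return active;
}

int main(void)
{
	char line[256];

	for (;;) {
		printf("dentries %ld, vfs inodes %ld, xfs_inode objs %ld\n",
		       first_field("/proc/sys/fs/dentry-state"),
		       first_field("/proc/sys/fs/inode-state"),
		       slab_active_objs("xfs_inode"));

		/* the memory usage chart inputs, straight from meminfo */
		FILE *f = fopen("/proc/meminfo", "r");
		if (f) {
			while (fgets(line, sizeof(line), f))
				if (!strncmp(line, "MemFree:", 8) ||
				    !strncmp(line, "Cached:", 7) ||
				    !strncmp(line, "Dirty:", 6) ||
				    !strncmp(line, "Slab:", 5))
					fputs(line, stdout);
			fclose(f);
		}
		sleep(5);	/* same 5s period as the charts */
	}
	return 0;
}

Watching the xfs_inode active object count against the vfs inode
count in that output is the same relationship I talk about for
chart 3 below.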

Now, with that in mind, let's compare the graphs. Let's start with
the inode/dentry/xfs_inode cache usage. We can see that the
unpatched kernel has rapid rises and falls in cache footprint. The
vertical falling edges are evidence of allocation stalls. You can
see that it's somewhat regular - the lines stay roughly level, then
we see a deep, sharp valley where the caches are completely freed
and then immediately repopulate.

This generally coincides with the periodic "du scan" starting - it adds
new memory pressure, tipping the kernel into memory reclaim, which
immediately stalls and trashes the cache. Given that there is now
lots of memory free, the du runs at several thousand IOPS (SSDs,
remember) and immediately refills the cache and finishes, producing
no more memory pressure. The cache then sits at the "full" level
until du starts again and we fall off the cliff again.

If we switch to the memory usage graph for the unpatched kernel
(chart 2), we can see that allocation stalls trash more than just
the slab caches - they clean all memory and free all the page cache,
too. i.e. it completely trashes any working set that we might have.
This is not caused by the shrinker - it has no effect on page cache
reclaim. It appears that the page reclaim code is, for some reason,
aggressively freeing the page cache for multiple seconds, even after
there is already noticeable free memory. This is probably an issue
with there being too much direct reclaim stalling due to pressure
(e.g. on dirty pages), and then when the cork is pulled we've got
lots of threads that then run a full direct reclaim scan and that
empties the caches...

/me has always said unbound concurrency in memory reclaim and IO
writeback is a bad thing because you cannot control and limit
behaviour in overload situations like this.

Anyway, let's look at the patched kernel inode cache behaviour -
chart 3. The first thing to note is that there is a very different
pattern - there's a very regular triangular waveform. The trough
occurs when the du thread starts, and the peak occurs when the du
thread stops. IOWs, the inode cache is no longer retaining single
use inodes for extended periods. The waveform is neat
and regular, and it's very clear where the allocation stall at
~1100s occurred - there's a reclaim cliff.

If we look at chart 4 right now - the patched kernel memory usage -
we can see there is a big dip in the page cache, too. So when an
allocation stall event occurs, the bad behaviour is still there.

However, the way the system responds to a stall event is very
different - it doesn't immediately refill the inode cache. It goes
back to the triangle waveform until the next stall occurs.

What is particularly interesting in chart 3 is the vfs.inodes.count
and xfs_inode slab metrics (red and pink lines respectively). The
active xfs_inode slab object count starts off below the vfs inode
count, then over the next 6 cycles the xfs_inode count slowly
increases above the vfs inode count. Then we get an allocation stall
event, and it returns the xfs_inode count to less than the vfs inode
count. The second stall event occurs at a similar point.

The xfs_inode > vfs inode count values indicate that we've got XFS
inodes in the reclaim state and that reclaim of XFS inodes is not
keeping up with the VFS inode cache reclaim rate. This is indicative
of either dirty inodes being reclaimed, or we are not putting enough
pressure on the XFS inode cache and so it's putting additional
pressure on other parts of memory reclaim. This change in balance
seems to correlate to allocation stalls being triggered, so I'm
going to try to remove that imbalance and see if that avoids the
allocation stalls.

IOWs, I'm not using latency stats to guide me - they tell me nothing
useful about what causes the latency. The charts tell me when there
is a reclaim issue - I can see it clear as anything.  If the graphs
are smooth, maintain a consistent relationship to each other and
behave nicely when there are step changes in behaviour then
everything else will take care of itself.

It's easy to get lost in creating fancy histograms with shiny new
tools like ebpf/bcc, but this is mostly wasted effort when it comes
to finding the cause of the problem. Latency histograms don't help
us understand why the latencies occur, nor do they help track down
the cause. The charts I have show that there is an obvious, one-off,
large scale memory reclaim event that is causing the stalls and bad
latencies. Finding the trigger for that large scale event is what
matters, not pretty histograms of its effect...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

Attachment: charts.tar.gz
Description: application/gzip

