On Fri, Jan 27, 2017 at 01:01:01PM +0100, Michal Hocko wrote:
> On Thu 26-01-17 13:50:27, Johannes Weiner wrote:
> > On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> > > On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > > > Direct reclaim has been replaced by kswapd reclaim in pretty much
> > > > all common memory pressure situations, so this code most likely
> > > > doesn't accomplish the described effect anymore. The previous
> > > > patch wakes up flushers for all reclaimers when we encounter
> > > > dirty pages at the tail end of the LRU. Remove the crufty old
> > > > direct reclaim invocation.
> > > >
> > > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > >
> > > In general I like this. I worried at first that if kswapd is
> > > blocked writing pages it won't reach the wakeup_flusher_threads,
> > > but the previous patch handles it.
> > >
> > > Now though, it occurs to me with the last patch that we always
> > > writeout the world when flushing threads. This may not be a great
> > > idea. Consider for example if there is a heavy writer of
> > > short-lived tmp files. In such a case, it is possible for the
> > > files to be truncated before they even hit the disk. However, if
> > > there are multiple "writeout the world" calls, these may now be
> > > hitting the disk. Furthermore, multiple kswapd and direct
> > > reclaimers could all be requested to writeout the world, and each
> > > request unplugs.
> > >
> > > Is it possible to maintain the property of writing back pages
> > > relative to the numbers of pages scanned, or have you determined
> > > already that it's not necessary?
> >
> > That's what I started out with - waking the flushers for nr_taken. I
> > was using a silly test case that wrote < dirty background limit and
> > then allocated a burst of anon memory. When the dirty data is
> > linear, the bigger IO requests are beneficial. They don't exhaust
> > struct request (like kswapd 4k IO routinely does, and
> > SWAP_CLUSTER_MAX is only 32), and they require less frequent
> > plugging.
> >
> > Force-flushing temporary files under memory pressure is a concern -
> > although the most recently dirtied files would get queued last,
> > giving them still some time to get truncated - but I'm wary about
> > splitting the flush requests too aggressively when we DO sustain
> > throngs of dirty pages hitting the reclaim scanners.
>
> I think the above would be helpful in the changelog for future
> reference.

Agreed.
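To make the trade-off described above concrete, here is a rough sketch
of the two flusher wakeup strategies being compared. It is illustrative
only: the kick_flushers() helper and its 'proportional' flag are
invented for the comparison, and it assumes the 4.10-era prototype
void wakeup_flusher_threads(long nr_pages, enum wb_reason reason),
where nr_pages == 0 means "write back everything that is dirty". The
real check lives in the reclaim path where dirty pages are found at the
tail of the LRU.

#include <linux/writeback.h>	/* wakeup_flusher_threads(), WB_REASON_VMSCAN */

/*
 * Hypothetical helper for illustration only -- not the actual patch.
 * nr_taken is the number of pages isolated from the inactive LRU;
 * nr_unqueued_dirty is how many of them are dirty but not yet queued
 * for writeback.
 */
static void kick_flushers(unsigned long nr_taken,
			  unsigned long nr_unqueued_dirty,
			  bool proportional)
{
	/* Only poke the flushers when the whole reclaim batch is dirty. */
	if (nr_unqueued_dirty != nr_taken)
		return;

	if (proportional)
		/*
		 * Cap the request at the reclaim batch, i.e. at most
		 * SWAP_CLUSTER_MAX (32) pages per wakeup.
		 */
		wakeup_flusher_threads(nr_taken, WB_REASON_VMSCAN);
	else
		/*
		 * nr_pages == 0 asks for everything dirty: larger,
		 * better-merged IO requests, at the cost of possibly
		 * writing short-lived tmp files that would otherwise
		 * have been truncated before reaching disk.
		 */
		wakeup_flusher_threads(0, WB_REASON_VMSCAN);
}

The results below were collected with the "writeout the world"
behaviour, i.e. the second branch.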
I backported the series to 4.10-rc5 with one minor conflict and ran a
couple of tests on it. A mix of read/write random workloads didn't show
anything interesting. A write-only database didn't show much difference
in performance, but there were slight reductions in IO -- probably in
the noise. simoop did show big differences, although not as big as I
expected. This is Chris Mason's workload that simulates the VM activity
of hadoop. I won't go through the full details, but over the samples
measured during an hour it reported:

                                       4.10.0-rc5            4.10.0-rc5
                                          vanilla         johannes-v1r1
Amean    p50-Read             21346531.56 (  0.00%) 21697513.24 ( -1.64%)
Amean    p95-Read             24700518.40 (  0.00%) 25743268.98 ( -4.22%)
Amean    p99-Read             27959842.13 (  0.00%) 28963271.11 ( -3.59%)
Amean    p50-Write                1138.04 (  0.00%)      989.82 ( 13.02%)
Amean    p95-Write             1106643.48 (  0.00%)    12104.00 ( 98.91%)
Amean    p99-Write             1569213.22 (  0.00%)    36343.38 ( 97.68%)
Amean    p50-Allocation          85159.82 (  0.00%)    79120.70 (  7.09%)
Amean    p95-Allocation         204222.58 (  0.00%)   129018.43 ( 36.82%)
Amean    p99-Allocation         278070.04 (  0.00%)   183354.43 ( 34.06%)
Amean    final-p50-Read       21266432.00 (  0.00%) 21921792.00 ( -3.08%)
Amean    final-p95-Read       24870912.00 (  0.00%) 26116096.00 ( -5.01%)
Amean    final-p99-Read       28147712.00 (  0.00%) 29523968.00 ( -4.89%)
Amean    final-p50-Write          1130.00 (  0.00%)      977.00 ( 13.54%)
Amean    final-p95-Write       1033216.00 (  0.00%)     2980.00 ( 99.71%)
Amean    final-p99-Write       1517568.00 (  0.00%)    32672.00 ( 97.85%)
Amean    final-p50-Allocation    86656.00 (  0.00%)    78464.00 (  9.45%)
Amean    final-p95-Allocation   211712.00 (  0.00%)   116608.00 ( 44.92%)
Amean    final-p99-Allocation   287232.00 (  0.00%)   168704.00 ( 41.27%)

The latencies are actually completely horrific in comparison to 4.4
(and 4.10-rc5 is worse than 4.9 according to historical data, for
reasons I haven't analysed yet). Still, the 95th percentile of write
latency (p95-Write) is dramatically reduced by the series, and
allocation latency is way down.

Direct reclaim activity is one fifth of what it was according to the
vmstat counters. Kswapd activity is higher, but this is not necessarily
surprising. Kswapd efficiency is unchanged at 99% (99% of pages scanned
were reclaimed), but direct reclaim efficiency went from 77% to 99%.

In the vanilla kernel, 627MB of data was written back from reclaim
context. With the series, no data was written back. With or without the
patch, pages are being immediately reclaimed after writeback completes.
However, with the patch, only 1/8th of the pages are reclaimed this way.

I expect you've done plenty of internal analysis, but FWIW, I can
confirm for some basic tests that exercise this area, and on one
machine, that it's looking good and roughly matches my expectations.

-- 
Mel Gorman
SUSE Labs