On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> I'm just going to fix the original regression in the shrinker
> algorithm by restoring the gradual accumulation behaviour, and this
> whole series of problems can be put to bed.

Something like this lightly smoke tested patch below. It may be
slightly more aggressive than the original code for really small
freeable values (i.e. < 100) but otherwise should be roughly
equivalent to the historic accumulation behaviour.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


mm: fix shrinker scan accumulation regression

From: Dave Chinner <dchinner@xxxxxxxxxx>

Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
in 4.16-rc1 broke the shrinker scan accumulation algorithm for small
freeable caches. The accumulation is active when there isn't enough
work to run a full batch scan - the shrinker is supposed to defer
that work until a future shrinker call. The deferred work is then fed
back into the work to do on the next call, and if the work is larger
than a batch it will run the scan.

This is an efficiency mechanism that prevents repeated small scans of
caches from consuming too much CPU. It also has the effect of
ensuring that caches with small numbers of freeable objects are
slowly scanned. While an individual shrinker scan may not result in
work to do, if the cache is queried enough times then the work will
accumulate and the cache will be scanned and freed. This protects
small and otherwise in-use caches from excessive scanning under light
memory pressure, but keeps cross-cache reclaim amounts fairly
balanced over time.

The change in the above commit broke all this with the way it
calculates the delta value. Instead of calculating the delta to keep
the shrinker's freeable:scanned ratio the same as the previous page
cache freeable:scanned pass, it calculates the delta from the reclaim
priority on a logarithmic scale and applies this to the freeable
count before anything else is done. This means that the resolution of
the delta calculation is (1 << priority), and so for low priority
reclaim the calculated delta does not go above zero unless there are
at least 4096 freeable objects. This completely defeats the
accumulation of work for caches with few freeable objects.

Old code (ignoring seeks scaling):

	delta ~= (pages_scanned * freeable) / pages_freeable
	Accumulation resolution: pages_scanned / pages_freeable

4.16 code:

	delta ~= freeable >> priority
	Accumulation resolution: (1 << priority)

IOWs, the old code would almost always result in delta being non-zero
when freeable was non-zero, and hence it would always accumulate scan
work even on the smallest of freeable caches regardless of the
reclaim pressure being applied. The new code won't accumulate or scan
the smallest of freeable caches until it reaches priority 1. This is
extreme memory pressure, just before the OOM killer is kicked.

We want to retain the priority mechanism to scale the work the
shrinker does, but we also want to ensure it accumulates
appropriately, too. In this case, offset the delta by ilog2(freeable)
so that there is a slow accumulation of work. Use this regardless of
the delta calculated so that we don't decrease the amount of work as
the priority increases past the point where delta is non-zero.

New code:

	delta ~= ilog2(freeable) + (freeable >> priority)
	Accumulation resolution: ilog2(freeable)
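To make the difference concrete, here's a quick stand-alone userspace
sketch (mine, not part of the patch) that prints the raw delta each
formula produces. It ignores the seeks scaling and the batch/deferral
handling in do_shrink_slab(), so the numbers won't line up exactly
with the tables below, but it shows how the 4.16 formula gets stuck
at zero while the ilog2() offset always leaves some work to
accumulate:

/* gcc -o delta delta.c */
#include <stdio.h>

/* userspace stand-in for the kernel's ilog2() */
static unsigned long ilog2_ul(unsigned long v)
{
	unsigned long r = 0;

	while (v >>= 1)
		r++;
	return r;
}

int main(void)
{
	unsigned long freeable[] = { 1, 10, 100, 1000, 10000 };
	int priority[] = { 12, 9, 6, 3, 1 };
	int i, j;

	for (i = 0; i < 5; i++) {
		for (j = 0; j < 5; j++) {
			unsigned long f = freeable[i];
			int p = priority[j];

			printf("freeable %5lu prio %2d: 4.16 delta %5lu, new delta %5lu\n",
				f, p,
				f >> p,			 /* 4.16: freeable >> priority */
				ilog2_ul(f) + (f >> p)); /* new: ilog2 offset */
		}
	}
	return 0;
}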
Typical delta calculations from different code (ignoring seek
scaling), keeping in mind that batch size is 128 by default and 1024
for superblock shrinkers.

freeable = 1

  ratio   4.15  priority  4.16  4.18   new
  1:100      1        12     0  batch    1
  1:32       1         9     0  batch    1
  1:12       1         6     0  batch    1
  1:6        1         3     0  batch    1
  1:1        1         1     1  batch    1

freeable = 10

  ratio   4.15  priority  4.16  4.18   new
  1:100      1        12     0  batch    3
  1:32       1         9     0  batch    3
  1:12       1         6     0  batch    3
  1:6        2         3     0  batch    3
  1:1       10         1    10  batch   10

freeable = 100

  ratio   4.15  priority  4.16  4.18   new
  1:100      1        12     0  batch    6
  1:32       3         9     0  batch    6
  1:12       6         6     1  batch    7
  1:6       16         3    12  batch   18
  1:1      100         1   100  batch  100

freeable = 1000

  ratio   4.15  priority  4.16  4.18             new
  1:100     10        12     0  batch              9
  1:32      32         9     1  batch             10
  1:12      60         6    16  batch             26
  1:6      160         3   120  batch            130
  1:1     1000         1  1000  max(1000,batch)  1000

freeable = 10000

  ratio   4.15  priority   4.16  4.18             new
  1:100    100        12      2  batch             16
  1:32     320         9     19  batch             35
  1:12     600         6    160  max(160,batch)   175
  1:6     1600         3   1250  1250            1265
  1:1    10000         1  10000  10000          10000

It's pretty clear why the 4.18 algorithm caused such a problem - it
massively changed the balance of reclaim when all that was actually
required was a small tweak to always accumulate a small delta for
caches with very small freeable counts.

Fixes: 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 mm/vmscan.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e979705bbf32..9cc58e9f1f54 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -479,7 +479,16 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	total_scan = nr;
 	if (shrinker->seeks) {
-		delta = freeable >> priority;
+		/*
+		 * Use a small non-zero offset for delta so that if the scan
+		 * priority is low we always accumulate some pressure on caches
+		 * that have few freeable objects in them. This allows light
+		 * memory pressure to turn over caches with few freeable objects
+		 * slowly without the need for memory pressure priority to wind
+		 * up to the point where (freeable >> priority) is non-zero.
+		 */
+		delta = ilog2(freeable);
+		delta += freeable >> priority;
 		delta *= 4;
 		do_div(delta, shrinker->seeks);
 	} else {
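As an aside, the accumulation behaviour the offset restores is easy
to see with another small userspace sketch (again mine, not kernel
code). Assuming a constant per-call delta of 6 (ilog2(100) under
light pressure) and the default batch size of 128, a full batch scan
fires roughly every 22 shrinker calls - the slow turnover of small
caches described above:

#include <stdio.h>

int main(void)
{
	unsigned long batch_size = 128;	/* SHRINK_BATCH default */
	unsigned long deferred = 0;	/* work carried between calls */
	unsigned long delta = 6;	/* ilog2(100) at low priority */
	int call;

	for (call = 1; call <= 100; call++) {
		deferred += delta;		/* accumulate this call's work */
		if (deferred >= batch_size) {	/* enough for a full batch? */
			printf("call %3d: scan %lu objects\n",
				call, batch_size);
			deferred -= batch_size;
		}
	}
	return 0;
}

With the 4.16 code at low priority the per-call delta is zero, so
deferred never grows and the scan never runs at all.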