Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues

On Tue, Feb 19, 2019 at 09:06:07PM -0500, Rik van Riel wrote:
> On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> > On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > > > first letter.
> > > > 
> > > > Please resend again with linux-fsdevel on the cc list, because
> > > > this
> > > > isn't a MM topic given the regressions from the shrinker patches
> > > > have all been on the filesystem side of the shrinkers....
> > > 
> > > It looks like there are two separate things going on here.
> > > 
> > > The first is an MM issue, one of potentially leaking memory
> > > by not scanning slabs with few items on them,
> > 
> > We don't leak memory. Slabs with very few freeable items on them
> > just don't get scanned when there is only light memory pressure.
> > That's /by design/ and it is behaviour we've tried hard over many
> > years to preserve. Once memory pressure ramps up, they'll be
> > scanned just like all the other slabs.
> 
> That may have been fine before cgroups, but when
> a system can have (tens of) thousands of slab
> caches, we DO want to scan slab caches with few
> freeable items in them.
> 
> The threshold for "few items" is 4096, not some
> actually tiny number. That can add up to a lot
> of memory if a system has hundreds of cgroups.

That doesn't sound right. The threshold is supposed to be low single
digits based on the amount of pressure on the page cache, and it's
accumulated by deferral until the batch threshold (128) is exceeded.
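
To illustrate, here's a minimal userspace model of how the old
do_shrink_slab() logic hung together (the structure is simplified
and paraphrased from the pre-4.16 code, not copied, and seeks is
hard-coded to DEFAULT_SEEKS == 2):

#include <stdio.h>

#define SHRINK_BATCH	128	/* the batch threshold referred to above */

/*
 * Old-style scan target: delta scales with the page cache scan
 * ratio, so it can legitimately be a low single digit number.
 * Sub-batch amounts are deferred to later calls, not dropped.
 */
static unsigned long shrink_one(unsigned long nr_scanned,
				unsigned long nr_eligible,
				unsigned long freeable,
				unsigned long *nr_deferred)
{
	unsigned long delta, total_scan, freed = 0;

	delta = (4 * nr_scanned) / 2;			/* seeks == 2 */
	delta = delta * freeable / (nr_eligible + 1);

	/* accumulate on top of work deferred from earlier calls */
	total_scan = *nr_deferred + delta;

	while (total_scan >= SHRINK_BATCH) {
		freed += SHRINK_BATCH;	/* stand-in for ->scan_objects() */
		total_scan -= SHRINK_BATCH;
	}

	*nr_deferred = total_scan;	/* carry the remainder forward */
	return freed;
}

int main(void)
{
	unsigned long deferred = 0;
	int i;

	/* light pressure, small cache: delta is ~6 per call, and it
	 * accumulates until the batch threshold is crossed */
	for (i = 0; i < 200; i++)
		shrink_one(32, 1000, 100, &deferred);
	printf("left deferred after 200 calls: %lu\n", deferred);
	return 0;
}

Run that and you'll see small caches still get scanned under light
pressure - it just takes a couple of dozen calls before enough work
accumulates to dispatch a batch.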

Ohhhhh. The penny just dropped - this whole sorry saga has been
triggered because people are chasing a regression nobody has
recognised as a regression because they don't actually understand
how the shrinker algorithms are /supposed/ to work.

And I'm betting that it's been caused by some other recent FB
shrinker change.....

Yup, there it is:

commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2
Author: Josef Bacik <jbacik@xxxxxx>
Date:   Wed Jan 31 16:16:26 2018 -0800

    mm: use sc->priority for slab shrink targets

....
    We don't need to know exactly how many pages each shrinker represents,
    it's objects are all the information we need.  Making this change allows
    us to place an appropriate amount of pressure on the shrinker pools for
    their relative size.
....

-       delta = (4 * nr_scanned) / shrinker->seeks;
-       delta *= freeable;
-       do_div(delta, nr_eligible + 1);
+       delta = freeable >> priority;
+       delta *= 4;
+       do_div(delta, shrinker->seeks);


So, prior to this change:

	delta ~= (4 * nr_scanned * freeable) / nr_eligible

IOWs, the ratio of nr_scanned:nr_eligible determined the resolution
of the scan, and that meant delta could (and did!) have values in the
single digit range.

The current code introduced by the above patch does:

	delta ~= (freeable >> priority) * 4

Which, as you state, has a threshold of freeable > 4096 to trigger
scanning under low memory pressure.
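
Put some numbers on that (assuming DEFAULT_SEEKS == 2 and the first
reclaim passes running at DEF_PRIORITY == 12, which is where light
memory pressure starts):

	/* old: nr_scanned = 32, nr_eligible = 1000, freeable = 100 */
	delta = (4 * 32) / 2;			/* = 64 */
	delta = 64 * 100 / (1000 + 1);		/* = 6: small, non-zero */

	/* new: same cache of 100 freeable objects */
	delta = (100 >> 12) * 4 / 2;		/* = 0: never scanned */

	/* new: smallest cache that produces any scan pressure */
	delta = (4096 >> 12) * 4 / 2;		/* = 2 */

So under the current code any cache with fewer than 4096 freeable
objects computes delta = 0 at low scan priorities and defers
nothing - it simply never gets scanned.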

So, that's the original regression that people are trying to fix
(root cause analysis FTW).  It was introduced in 4.16-rc1. The
attempts to fix this regression (i.e. the lack of low free object
shrinker scanning) were introduced into 4.18-rc1, which caused even
worse regressions and led us directly to this point.

Ok, now I see where the real problem people are chasing is, I'll go
write a patch to fix it.

> Roman's patch, which reclaimed small slabs extra
> aggressively, introduced issues, but reclaiming
> small slabs at the same pressure/object as large
> slabs seems like the desired behavior.

It's still broken. Both of your patches do the wrong thing because
they don't address the resolution and accumulation regression and
instead add another layer of heuristics over the top of the delta
calculation to hide the lack of resolution.

> > That's a cgroup referencing and teardown problem, not a memory
> > reclaim algorithm problem. To treat it as a memory reclaim problem
> > smears memcg internal implementation bogosities all over the
> > independent reclaim infrastructure. It violates the concepts of
> > isolation, modularity, independence, abstraction layering, etc.
> 
> You are overlooking the fact that an inode loaded
> into memory by one cgroup (which is getting torn
> down) may be in active use by processes in other
> cgroups.

No, I am not. I am fully aware of this problem (have been since memcg
day one because of the list_lru tracking issues Glauber and I had to
sort out when we first realised shared inodes could occur). Sharing
inodes across cgroups also causes "complexity" in things like cgroup
writeback control (which cgroup dirty list tracks and does writeback
of shared inodes?) and so on. Shared inodes across cgroups are
considered the exception rather than the rule, and they are treated
in many places with algorithms that assert "this is rare, if it's
common we're going to be in trouble"....

> > > The second is the filesystem (and maybe other) shrinker
> > > functions' behavior being somewhat fragile and depending
> > > on closely on current MM behavior, potentially up to
> > > and including MM bugs.
> > > 
> > > The lack of a contract between the MM and the shrinker
> > > callbacks is a recurring issue, and something we may
> > > want to discuss in a joint session.
> > > 
> > > Some reflections on the shrinker/MM interaction:
> > > - Since all memory (in a zone) could potentially be in
> > >   shrinker pools, shrinkers MUST eventually free some
> > >   memory.
> > 
> > Which they cannot guarantee because all the objects they track may
> > be in use. As such, shrinkers have never been asked to guarantee
> > that they can free memory - they've only ever been asked to scan a
> > number of objects and attempt to free those they can during the scan.
> 
> Shrinkers may not be able to free memory NOW, and that
> is ok, but shrinkers need to guarantee that they can
> free memory eventually.

If the memory the shrinker tracks is in use, they can't free
anything. Hence there is no guarantee a shrinker can free anything
from its cache now or in the future. i.e. it can return freeable =
0 as much as it wants, and the memory reclaim infrastructure just
has to deal with the fact it can't free any memory.

This is where page reclaim would trigger the OOM killer, but that
still won't guarantee a shrinker can free anything.......
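
To be concrete, this is a perfectly legal shrinker (the demo_*
naming is mine; the ->count_objects/->scan_objects interface and
SHRINK_STOP are the real ones from <linux/shrinker.h>):

#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t demo_cached;	/* objects in the cache */
static atomic_long_t demo_pinned;	/* objects currently in use */

/* Report only what could be freed; 0 tells reclaim to skip us. */
static unsigned long demo_count(struct shrinker *s,
				struct shrink_control *sc)
{
	long freeable = atomic_long_read(&demo_cached) -
			atomic_long_read(&demo_pinned);

	return freeable > 0 ? freeable : 0;
}

static unsigned long demo_scan(struct shrinker *s,
			       struct shrink_control *sc)
{
	/*
	 * Objects may have been pinned since ->count_objects() ran;
	 * SHRINK_STOP says "no progress possible right now, back off".
	 */
	return SHRINK_STOP;
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count,
	.scan_objects	= demo_scan,
	.seeks		= DEFAULT_SEEKS,
};

Register it with register_shrinker(&demo_shrinker) and it can
truthfully report "nothing freeable" forever. Reclaim has to cope
with that.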

> > > - The MM should be able to deal with shrinkers doing
> > >   nothing at this call, but having some work pending 
> > >   (eg. waiting on IO completion), without getting a false
> > >   OOM kill. How can we do this best?
> > 
> > By integrating shrinkers into the same feedback loops as page
> > reclaim. i.e. to allow individual shrinker instance state to be
> > visible to the backoff/congestion decisions that the main page
> > reclaim loops make.
> > 
> > i.e. the problem here is that shrinkers only feedback to the main
> > loop is "how many pages were freed" as a whole. They aren't seen as
> > individual reclaim instances like zones for page reclaim, they are
> > just a huge amorphous blob that "frees some pages". i.e. They sit off
> > to
> > the side and run their own game between main loop scans and have no
> > capability to run individual backoffs, schedule kswapd to do future
> > work, don't have watermarks to provide reclaim goals, can't
> > communicate progress to the main control algorithm, etc.
> > 
> > IOWs, the first step we need to take here is to get rid of
> > the shrink_slab() abstraction and make shrinkers a first class
> > reclaim citizen....
> 
> I completely agree with that. The main reclaim loop
> should be able to make decisions like "there is plenty
> of IO in flight already, I should wait for some to
> complete instead of starting more", which requires the
> kind of visibility you have outlined.
> 
> I guess we should find some whiteboard time at LSF/MM
> to work out the details, after we have a general discussion
> on this in one of the sessions.

I won't be at LSFMM. The location is absolutely awful in terms of
travel - ~6 days of travel time for a 3-day conference is just not
worthwhile.

> Given the need for things like lockless data structures
> in some subsystems, I imagine we would want to do a lot
> of the work here with callbacks, rather than standardized
> data structures.

Just another ops structure.... :P
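
Something like this is all it needs to be (an entirely hypothetical
sketch - none of these hooks exist today; it just extends the
current count/scan pair with the feedback channels the page reclaim
loops already have internally):

struct reclaim_instance;	/* hypothetical per-instance state */

struct reclaim_instance_ops {
	/* the existing shrinker pair */
	unsigned long	(*count_objects)(struct reclaim_instance *ri,
					 struct shrink_control *sc);
	unsigned long	(*scan_objects)(struct reclaim_instance *ri,
					struct shrink_control *sc);

	/* feedback the main loop currently only gets from page LRUs */
	bool		(*congested)(struct reclaim_instance *ri);
	unsigned long	(*work_pending)(struct reclaim_instance *ri);
	bool		(*below_watermark)(struct reclaim_instance *ri,
					   int watermark);
	void		(*kick_background)(struct reclaim_instance *ri);
};

With that, backoff, watermark targets and background work scheduling
all become per-instance decisions rather than something inferred
from an aggregate "pages freed" number.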

> > > - Related to the above: stalling in the shrinker code is
> > >   unpredictable, and can take an arbitrarily long amount
> > >   of time. Is there a better way we can make reclaimers
> > >   wait for in-flight work to be completed?
> > 
> > Look at it this way: what do you need to do to implement the main
> > zone reclaim loops as individual shrinker instances? Complex
> > shrinker implementations have to deal with all the same issues as
> > the page reclaim loops (including managing cross-cache dependencies
> > and balancing). If we can't answer this question, then we can't
> > answer the questions that are being asked.
> > 
> > So, at this point, I have to ask: if we need the same functionality
> > for both page reclaim and shrinkers, then why shouldn't the goal be
> > to make page reclaim just another set of opaque shrinker
> > implementations?
> 
> I suspect each LRU could be implemented as a shrinker
> today, with some combination of function pointers and
> data pointers (in case of LRUs, to the lruvec) as control
> data structures.
.....
> The logic of which cgroups we should reclaim memory from
> right now, and which we should skip for now, is already
> handled outside of the code that calls both the LRU and
> the slab shrinking code.
> 
> In short, I see no real obstacle to unifying the two.

Neither do I, except that it's a huge amount of work and there's no
guarantee we'll be able to make anything better than what we have now....
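
As a thought experiment, wrapping an LRU in the existing shrinker
API is about this much glue (hypothetical sketch: the lru_shrinker
wrapper and the lruvec_page_count()/lruvec_shrink_pages() helpers
are made up to show the shape, they are not existing functions):

/* Wrap an lruvec so the main loop sees it as just another shrinker. */
struct lru_shrinker {
	struct shrinker	shrinker;	/* embedded, for container_of() */
	struct lruvec	*lruvec;	/* the data pointer Rik mentions */
};

static unsigned long lru_count(struct shrinker *s,
			       struct shrink_control *sc)
{
	struct lru_shrinker *ls = container_of(s, struct lru_shrinker,
					       shrinker);

	/* e.g. sum of the eligible LRU lists for this node */
	return lruvec_page_count(ls->lruvec, sc->nid);
}

static unsigned long lru_scan(struct shrinker *s,
			      struct shrink_control *sc)
{
	struct lru_shrinker *ls = container_of(s, struct lru_shrinker,
					       shrinker);

	/* hand sc->nr_to_scan to the existing LRU scanning machinery */
	return lruvec_shrink_pages(ls->lruvec, sc->nr_to_scan);
}

The hard part isn't the glue, it's making the balancing and backoff
behave afterwards.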

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx



