Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues

Dave Chinner <dchinner@xxxxxxxxxx> · Wed, 20 Feb 2019 10:26:27 +1100

On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > first letter.
> > 
> > Please resend again with linux-fsdevel on the cc list, because this
> > isn't a MM topic given the regressions from the shrinker patches
> > have all been on the filesystem side of the shrinkers....
> 
> It looks like there are two separate things going on here.
> 
> The first are an MM issues, one of potentially leaking memory
> by not scanning slabs with few items on them,

We don't leak memory. Slabs with very few freeable items on them
just don't get scanned when there is only light memory pressure.
That's /by design/ and it is behaviour we've tried hard over many
years to preserve. Once memory pressure ramps up, they'll be
scanned just like all the other slabs.

e.g. commit 0b1fb40a3b12 ("mm: vmscan: shrink all slab objects if
tight on memory") makes this commentary:

    [....] That said, this
    patch shouldn't change the vmscan behaviour if the memory pressure is
    low, but if we are tight on memory, we will do our best by trying to
    reclaim all available objects, which sounds reasonable.

Which is essentially how we've tried to implement shrinker reclaim
for a long, long time (bugs notwithstanding).
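
To illustrate, here is a simplified userspace model of the scan target
calculation, loosely after do_shrink_slab(). The helper names are made up
for this sketch and the constants are the defaults as I recall them, so
treat it as an approximation and not the kernel code itself:

```c
/*
 * Userspace model only: how reclaim priority turns into a per-shrinker
 * scan count. Helper names are hypothetical; constants are assumed to
 * match the kernel defaults.
 */
#define DEF_PRIORITY   12   /* lightest pressure; 0 is the heaviest */
#define SHRINK_BATCH   128  /* default scan batch size */
#define DEFAULT_SEEKS  2

/* Number of objects we would like to scan at this priority. */
static unsigned long scan_target(unsigned long freeable, int priority,
                                 unsigned int seeks)
{
    unsigned long delta = freeable >> priority;

    delta *= 4;
    delta /= seeks;
    return delta;
}

/*
 * Number of objects actually scanned in one pass. Work is only done in
 * whole batches, unless the target covers the entire cache.
 */
static unsigned long scanned_now(unsigned long freeable, int priority,
                                 unsigned int seeks, unsigned long batch)
{
    unsigned long total, done = 0;

    if (freeable == 0)
        return 0;

    total = scan_target(freeable, priority, seeks);
    if (total > freeable)
        total = freeable;   /* simplification of the kernel's clamping */

    while (total >= batch || total >= freeable) {
        unsigned long nr = total < batch ? total : batch;

        done += nr;
        total -= nr;
    }
    return done;
}
```

In this model a cache with 50 freeable objects is not scanned at all at
DEF_PRIORITY, but is scanned in full once priority reaches 0.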

> and having
> such slabs stay around forever after the cgroup they were
> created for has disappeared,

That's a cgroup referencing and teardown problem, not a memory
reclaim algorithm problem. To treat it as a memory reclaim problem
smears memcg internal implementation bogosities all over the
independent reclaim infrastructure. It violates the concepts of
isolation, modularity, independence, abstraction layering, etc.

> and the other of various other
> bugs with shrinker invocation behavior (like the nr_deferred
> fixes you posted a patch for). I believe these are MM topics.

Except they interact directly with external shrinker behaviour. The
conditions of deferral and the problems it is solving are a direct
response to shrinker implementation constraints (e.g. GFP_NOFS
deadlock avoidance for filesystems). i.e. we can't talk about the
deferral algorithm without considering why work is deferred, how much
work should be deferred, when it may be safe/best to execute the
deferred work, etc.
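
A rough userspace sketch of the deferral idea, assuming a toy cache whose
scan callback refuses to run when the caller cannot enter the filesystem.
All toy_* names are hypothetical, and the real accounting (which defers
only the unscanned remainder and clamps it) is more involved:

```c
#include <stdbool.h>

#define SHRINK_STOP (~0UL)  /* "could not scan in this context" */

/* Hypothetical per-cache state for this sketch. */
struct toy_cache {
    unsigned long freeable;     /* objects that could be freed */
    unsigned long nr_deferred;  /* scan work we could not do earlier */
};

/* Scan callback: bails out if filesystem recursion is not allowed. */
static unsigned long toy_scan(struct toy_cache *c, unsigned long nr,
                              bool may_enter_fs)
{
    if (!may_enter_fs)
        return SHRINK_STOP;     /* GFP_NOFS-style context */
    if (nr > c->freeable)
        nr = c->freeable;
    c->freeable -= nr;
    return nr;
}

/*
 * Caller side: add previously deferred work to this request, and if the
 * callback cannot run, carry the whole amount forward to a later caller.
 */
static unsigned long toy_shrink(struct toy_cache *c, unsigned long want,
                                bool may_enter_fs)
{
    unsigned long total = want + c->nr_deferred;
    unsigned long freed;

    c->nr_deferred = 0;
    freed = toy_scan(c, total, may_enter_fs);
    if (freed == SHRINK_STOP) {
        c->nr_deferred = total; /* someone else does it later */
        return 0;
    }
    return freed;
}
```

Two restricted calls asking for 100 objects each followed by one
unrestricted call asking for 50 results in 250 objects being scanned by
the last caller, which is exactly why the "how much" and "when" questions
can't be separated from the shrinker constraints.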

This all comes back to the fact that modifying the shrinker
algorithms requires understanding what the shrinker implementations
do and the constraints they operate under. It is not a "purely mm"
discussion, and treating it as such results in regressions like the
ones we've recently seen.

> The second is the filesystem (and maybe other) shrinker
> functions' behavior being somewhat fragile and depending
> on closely on current MM behavior, potentially up to
> and including MM bugs.
> 
> The lack of a contract between the MM and the shrinker
> callbacks is a recurring issue, and something we may
> want to discuss in a joint session.
> 
> Some reflections on the shrinker/MM interaction:
> - Since all memory (in a zone) could potentially be in
>   shrinker pools, shrinkers MUST eventually free some
>   memory.

Which they cannot guarantee because all the objects they track may
be in use. As such, shrinkers have never been asked to guarantee
that they can free memory - they've only ever been asked to scan a
number of objects and attempt to free those it can during the scan.
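
As a minimal sketch of that contract (a toy object array standing in for
a real cache; the names are hypothetical and not any kernel API):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached object for this sketch. */
struct toy_obj {
    bool in_use;    /* still referenced, cannot be freed */
    bool freed;     /* already reclaimed */
};

/*
 * Scan up to nr_to_scan live objects and free the ones that are not in
 * use. The return value is the number freed, which can legitimately be
 * zero even though the full scan was performed.
 */
static unsigned long toy_scan_objects(struct toy_obj *objs, size_t nr_objs,
                                      unsigned long nr_to_scan)
{
    unsigned long freed = 0;

    for (size_t i = 0; i < nr_objs && nr_to_scan > 0; i++) {
        if (objs[i].freed)
            continue;           /* already gone, not counted as scanned */
        nr_to_scan--;
        if (objs[i].in_use)
            continue;           /* scanned, but cannot be freed */
        objs[i].freed = true;
        freed++;
    }
    return freed;
}
```

A cache where every object is in use does all the scan work it was asked
to do and still returns zero.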

> - Shrinkers should not block kswapd from making progress.
>   If kswapd got stuck in NFS inode writeback, and ended up
>   not being able to free clean pages to receive network
>   packets, that might cause a deadlock.

Same can happen if kswapd got stuck on dirty page writeback from
pageout(). i.e. pageout() can only run from kswapd and it issues IO,
which can then block in the IO submission path waiting for IO to
make progress, which may require substantial amounts of memory
allocation.

Yes, we can try to not block kswapd as much as possible just like
page reclaim does, but the fact is kswapd is the only context where
it is safe to do certain blocking operations to ensure memory
reclaim can actually make progress.

i.e. the rules for blocking kswapd need to be consistent across both
page reclaim and shrinker reclaim, and right now page reclaim can
and does block kswapd when it is necessary for forwards progress....

> - The MM should be able to deal with shrinkers doing
>   nothing at this call, but having some work pending 
>   (eg. waiting on IO completion), without getting a false
>   OOM kill. How can we do this best?

By integrating shrinkers into the same feedback loops as page
reclaim. i.e. to allow individual shrinker instance state to be
visible to the backoff/congestion decisions that the main page
reclaim loops make.

i.e. the problem here is that the shrinkers' only feedback to the main
loop is "how many pages were freed" as a whole. They aren't seen as
individual reclaim instances like zones for page reclaim, they are
just a huge amorphous blob that "frees some pages". i.e. They sit off
to the side and run their own game between main loop scans and have no
capability to run individual backoffs, schedule kswapd to do future
work, don't have watermarks to provide reclaim goals, can't
communicate progress to the main control algorithm, etc.

IOWs, the first step we need to take here is to get rid of
the shrink_slab() abstraction and make shrinkers a first class
reclaim citizen....
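
Purely as a strawman for what per-instance feedback could look like:
none of these structures, fields or names exist in the kernel, they are
assumptions made up for this sketch to show the kind of state the main
loop can't see today:

```c
/* Hypothetical per-shrinker state exposed to the main reclaim loop. */
struct toy_shrinker_state {
    unsigned long nr_objects;   /* objects currently cached */
    unsigned long low_wmark;    /* reclaim goal: done at or below this */
    unsigned long nr_pending;   /* objects waiting on IO completion */
    unsigned long nr_freed;     /* objects freed by the last pass */
};

/* Hypothetical hints the main loop could act on per instance. */
enum toy_hint {
    TOY_DONE,           /* goal reached, leave this cache alone */
    TOY_KEEP_SCANNING,  /* last pass made progress */
    TOY_BACK_OFF,       /* nothing freed, but work is in flight */
    TOY_NO_PROGRESS,    /* nothing freed and nothing pending */
};

static enum toy_hint toy_hint(const struct toy_shrinker_state *s)
{
    if (s->nr_objects <= s->low_wmark)
        return TOY_DONE;
    if (s->nr_freed > 0)
        return TOY_KEEP_SCANNING;
    if (s->nr_pending > 0)
        return TOY_BACK_OFF;    /* wait for IO instead of declaring OOM */
    return TOY_NO_PROGRESS;
}
```

The point being that "freed nothing but has work pending" and "freed
nothing and has nothing left" are different answers, and a single
"pages freed" count can't tell them apart.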

> - Related to the above: stalling in the shrinker code is
>   unpredictable, and can take an arbitrarily long amount
>   of time. Is there a better way we can make reclaimers
>   wait for in-flight work to be completed?

Look at it this way: what do you need to do to implement the main
zone reclaim loops as individual shrinker instances? Complex
shrinker implementations have to deal with all the same issues as
the page reclaim loops (including managing cross-cache dependencies
and balancing). If we can't answer this question, then we can't
answer the questions that are being asked.

So, at this point, I have to ask: if we need the same functionality
for both page reclaim and shrinkers, then why shouldn't the goal be
to make page reclaim just another set of opaque shrinker
implementations?

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx