Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues

On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > > first letter.
> > > 
> > > Please resend again with linux-fsdevel on the cc list, because
> > > this
> > > isn't a MM topic given the regressions from the shrinker patches
> > > have all been on the filesystem side of the shrinkers....
> > 
> > It looks like there are two separate things going on here.
> > 
> > The first are MM issues, one of potentially leaking memory
> > by not scanning slabs with few items on them,
> 
> We don't leak memory. Slabs with very few freeable items on them
> just don't get scanned when there is only light memory pressure.
> That's /by design/ and it is behaviour we've tried hard over many
> years to preserve. Once memory pressure ramps up, they'll be
> scanned just like all the other slabs.

That may have been fine before cgroups, but when
a system can have (tens of) thousands of slab
caches, we DO want to scan slab caches with few
freeable items in them.

The threshold for "few items" is 4096, not some
actually tiny number. That can add up to a lot
of memory if a system has hundreds of cgroups.
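
To put a number on that: below is a small standalone
illustration (not kernel code) of the scan target math
that do_shrink_slab() does. DEF_PRIORITY and DEFAULT_SEEKS
match mm/vmscan.c and include/linux/shrinker.h; the cache
sizes are made-up examples.

#include <stdio.h>

#define DEF_PRIORITY	12	/* lightest reclaim pressure */
#define DEFAULT_SEEKS	2

/* Roughly mirrors the delta calculation in do_shrink_slab(). */
static unsigned long scan_target(unsigned long freeable, int priority)
{
	unsigned long delta = freeable >> priority;

	delta *= 4;
	delta /= DEFAULT_SEEKS;
	return delta;
}

int main(void)
{
	unsigned long sizes[] = { 100, 1000, 4095, 4096, 100000 };
	unsigned int i;

	/*
	 * Below 4096 freeable objects the shift yields zero, so the
	 * cache gets no scan target at all under light pressure.
	 */
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("freeable=%6lu -> scan target %lu at priority %d\n",
		       sizes[i], scan_target(sizes[i], DEF_PRIORITY),
		       DEF_PRIORITY);
	return 0;
}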

Roman's patch, which reclaimed small slabs extra
aggressively, introduced issues, but reclaiming
small slabs at the same pressure per object as large
slabs seems like the desired behavior.

Waiting until "memory pressure ramps up" is very
much the wrong thing to do, since reclaim priority
is not likely to drop to a small number until the
system is under so much memory pressure that the
workloads on the system suffer noticeable slowdowns.

> > and having
> > such slabs stay around forever after the cgroup they were
> > created for has disappeared,
> 
> That's a cgroup referencing and teardown problem, not a memory
> reclaim algorithm problem. To treat it as a memory reclaim problem
> smears memcg internal implementation bogosities all over the
> independent reclaim infrastructure. It violates the concepts of
> isolation, modularity, independence, abstraction layering, etc.

You are overlooking the fact that an inode loaded
into memory by one cgroup (which is getting torn
down) may be in active use by processes in other
cgroups.

That may prevent us from tearing down all of a
cgroup's slab cache memory at cgroup destruction
time, which turns it into a reclaim problem.

> This all comes back to the fact that modifying the shrinker
> algorithms requires understanding what the shrinker implementations
> do and the constraints they operate under. It is not a "purely mm"
> discussion, and treating it as such results regressions like the
> ones we've recently seen.

That's fair; maybe both topics need to be discussed
in a shared MM/FS session, or even a plenary session.

> > The second is the filesystem (and maybe other) shrinker
> > functions' behavior being somewhat fragile and depending
> > on closely on current MM behavior, potentially up to
> > and including MM bugs.
> > 
> > The lack of a contract between the MM and the shrinker
> > callbacks is a recurring issue, and something we may
> > want to discuss in a joint session.
> > 
> > Some reflections on the shrinker/MM interaction:
> > - Since all memory (in a zone) could potentially be in
> >   shrinker pools, shrinkers MUST eventually free some
> >   memory.
> 
> Which they cannot guarantee because all the objects they track may
> be in use. As such, shrinkers have never been asked to guarantee
> that they can free memory - they've only ever been asked to scan a
> number of objects and attempt to free those it can during the scan.

Shrinkers may not be able to free memory NOW, and that
is ok, but shrinkers need to guarantee that they can
free memory eventually.

Without that guarantee, it will be unsafe to ever place
a majority of system memory under the control of shrinker
functions, if only because the subsystems with those shrinker
functions tend to rely on the VM being able to free pages
when the pageout code is called.

> > - Shrinkers should not block kswapd from making progress.
> >   If kswapd got stuck in NFS inode writeback, and ended up
> >   not being able to free clean pages to receive network
> >   packets, that might cause a deadlock.
> 
> Same can happen if kswapd got stuck on dirty page writeback from
> pageout(). i.e. pageout() can only run from kswapd and it issues IO,
> which can then block in the IO submission path waiting for IO to
> make progress, which may require substantial amounts of memory
> allocation.
> 
> Yes, we can try to not block kswapd as much as possible just like
> page reclaim does, but the fact is kswapd is the only context where
> it is safe to do certain blocking operations to ensure memory
> reclaim can actually make progress.
> 
> i.e. the rules for blocking kswapd need to be consistent across both
> page reclaim and shrinker reclaim, and right now page reclaim can
> and does block kswapd when it is necessary for forwards progress....

Agreed, the rules should be the same for both.

It would be good to come to some sort of agreement,
or even a wish list, on what they should be.

> > - The MM should be able to deal with shrinkers doing
> >   nothing at this call, but having some work pending 
> >   (eg. waiting on IO completion), without getting a false
> >   OOM kill. How can we do this best?
> 
> By integrating shrinkers into the same feedback loops as page
> reclaim. i.e. to allow individual shrinker instance state to be
> visible to the backoff/congestion decisions that the main page
> reclaim loops make.
> 
> i.e. the problem here is that shrinkers only feedback to the main
> loop is "how many pages were freed" as a whole. They aren't seen as
> individual reclaim instances like zones for page reclaim, they are
> just a huge amorphous blob that "frees some pages". i.e. They sit off
> to the side and run their own game between main loop scans and have no
> capability to run individual backoffs, schedule kswapd to do future
> work, don't have watermarks to provide reclaim goals, can't
> communicate progress to the main control algorithm, etc.
> 
> IOWs, the first step we need to take here is to get rid of
> the shrink_slab() abstraction and make shrinkers a first class
> reclaim citizen....

I completely agree with that. The main reclaim loop
should be able to make decisions like "there is plenty
of IO in flight already, I should wait for some to
complete instead of starting more", which requires the
kind of visibility you have outlined.

I guess we should find some whiteboard time at LSF/MM
to work out the details, after we have a general discussion
on this in one of the sessions.

Given the need for things like lockless data structures
in some subsystems, I imagine we would want to do a lot
of the work here with callbacks, rather than standardized
data structures.
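
As a purely hypothetical sketch (none of these callbacks
or fields exist in today's struct shrinker in
include/linux/shrinker.h, and every name is made up), the
kind of per-instance feedback and backoff hooks I have in
mind might look something like this, in kernel-style C:

/* Hypothetical only -- illustrative names, not current kernel API. */
struct shrink_feedback {
	unsigned long	freed;		/* objects actually freed this pass */
	unsigned long	deferred;	/* objects skipped (dirty, in use, ...) */
	bool		io_in_flight;	/* writeback submitted, not yet complete */
};

struct shrinker_ops {
	/* the existing count/scan pair, with explicit feedback */
	unsigned long	(*count_objects)(void *priv, int nid, int priority);
	unsigned long	(*scan_objects)(void *priv, int nid,
					unsigned long nr_to_scan,
					struct shrink_feedback *fb);
	/* let the main reclaim loop ask whether to wait or move on */
	bool		(*should_backoff)(void *priv);
};

This keeps each subsystem's own (possibly lockless) data
structures private behind the callbacks, while giving the
main loop enough information to schedule backoffs and
future work.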

> > - Related to the above: stalling in the shrinker code is
> >   unpredictable, and can take an arbitrarily long amount
> >   of time. Is there a better way we can make reclaimers
> >   wait for in-flight work to be completed?
> 
> Look at it this way: what do you need to do to implement the main
> zone reclaim loops as individual shrinker instances? Complex
> shrinker implementations have to deal with all the same issues as
> the page reclaim loops (including managing cross-cache dependencies
> and balancing). If we can't answer this question, then we can't
> answer the questions that are being asked.
> 
> So, at this point, I have to ask: if we need the same functionality
> for both page reclaim and shrinkers, then why shouldn't the goal be
> to make page reclaim just another set of opaque shrinker
> implementations?

I suspect each LRU could be implemented as a shrinker
today, with some combination of function pointers and
data pointers (in the case of LRUs, to the lruvec) as
control data structures.

Each shrinker would need some callbacks for things like
"lots of work is in flight already, wait instead of starting
more".

The magic of zone balancing could easily be hidden inside
the shrinker function for lruvecs. If a pgdat is balanced,
the shrinkers for each lruvec inside that pgdat could report
that no work is needed; if only one or two memory zones need
work, the shrinkers for those lruvecs would do the work, and
the shrinkers for the other lruvecs in the same pgdat would
return "no work needed".
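
As a very rough sketch, with every new name made up for
illustration (there is no shrinker embedded in the lruvec
today, pgdat_balanced_for() and lruvec_reclaimable_pages()
are not existing kernel functions, and the order/reclaim_idx
fields assume the merged control struct mentioned below):

static unsigned long lruvec_shrink_count(struct shrinker *sh,
					 struct shrink_control *sc)
{
	struct lruvec *lruvec = container_of(sh, struct lruvec, shrinker);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	/* Node already balanced: report that no work is needed. */
	if (pgdat_balanced_for(pgdat, sc->order, sc->reclaim_idx))
		return 0;

	/* Otherwise report how much this lruvec could usefully scan. */
	return lruvec_reclaimable_pages(lruvec);
}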

The scan_control and shrink_control structs would probably
need to be merged, which is no obstacle at all.
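
For reference, an illustrative merged structure might carry
something like the following; the field list is loosely
based on the v5.0-era struct scan_control (mm/vmscan.c) and
struct shrink_control (include/linux/shrinker.h), and is not
meant to be complete:

/* Illustrative only -- not a real kernel struct. */
struct reclaim_control {
	gfp_t			gfp_mask;	/* allocation context of the reclaimer */
	int			order;		/* allocation order being satisfied */
	int			priority;	/* DEF_PRIORITY..0, scales scan depth */
	int			nid;		/* NUMA node being reclaimed */
	struct mem_cgroup	*target_memcg;	/* memcg-targeted reclaim, if any */
	unsigned long		nr_to_reclaim;	/* goal for this reclaim pass */
	unsigned long		nr_scanned;	/* feedback: objects/pages examined */
	unsigned long		nr_reclaimed;	/* feedback: objects/pages freed */
};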

The logic of which cgroups we should reclaim memory from
right now, and which we should skip for now, is already
handled outside of the code that calls both the LRU and
the slab shrinking code.

In short, I see no real obstacle to unifying the two.

-- 
All Rights Reversed.
