On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote:
> Sorry, once more, now with fsdevel@ in cc, asked by Dave.
> --
>
> Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> with accumulation of dying memory cgroups. This is a serious problem:
> on most of our machines we've seen thousands of dying cgroups, and
> the corresponding memory footprint was measured in hundreds of megabytes.
> The problem was also independently discovered by other companies.
>
> The fixes were reverted due to an xfs regression investigated by Dave Chinner.

Context: it wasn't one regression that I investigated. We had multiple
bug reports with different regressions, and I saw evidence on my own
machines that something wasn't right because of the change in the IO
patterns in certain benchmarks. Some of the problems were caused by the
first patch, some were caused by the second patch.

This also affects ext4 (i.e. it's a general problem, not an XFS
problem), as has been reported a couple of times, including this one
overnight:

https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@xxxxxxx/

> Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
> which caused Rik van Riel to propose a patch [3], which aimed to fix the
> regression. The idea is to accumulate small memory pressure and apply it
> periodically, so that we don't overscan small shrinker lists. According
> to Jan Kara's data [4], Rik's patch partially fixed the regression,
> but not entirely.

Rik's patch was buggy and made an invalid assumption that a cache with a
small number of freeable objects is a "small cache", so any comparisons
made with it are essentially worthless. More details about the problems
with the patch and approach here:

https://lore.kernel.org/stable/20190131224905.GN31397@rh/

> The path forward isn't entirely clear now, and the status quo isn't acceptable
> due to the memcg leak bug. Dave and Michal's position is to focus on the dying
> memory cgroup case and apply some artificial memory pressure on the
> corresponding slabs (probably, during the cgroup deletion process). This
> approach can theoretically be less harmful for the subtle scanning balance,
> and not cause any regressions.

I outlined the dying memcg problem in patch[0] of the revert series:

https://lore.kernel.org/linux-mm/20190130041707.27750-1-david@xxxxxxxxxxxxx/

It basically documents the solution I proposed for dying memcg cleanup:

dgc> e.g. add a garbage collector via a background workqueue that sits on
dgc> the dying memcg calling something like:
dgc>
dgc> void drop_slab_memcg(struct mem_cgroup *dying_memcg)
dgc> {
dgc>         unsigned long freed;
dgc>
dgc>         do {
dgc>                 struct mem_cgroup *memcg = NULL;
dgc>
dgc>                 freed = 0;
dgc>                 memcg = mem_cgroup_iter(dying_memcg, NULL, NULL);
dgc>                 do {
dgc>                         freed += shrink_slab_memcg(GFP_KERNEL, 0, memcg, 0);
dgc>                 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
dgc>         } while (freed > 0);
dgc> }

This is a pretty trivial piece of code and doesn't require changing the
core memory reclaim code at all.
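
For illustration only, a rough sketch of how that reaper could be wired
up via a workqueue. The reap_work member in struct mem_cgroup and the
memcg_schedule_reap() helper are assumptions made for this sketch (as is
drop_slab_memcg() above) - none of this exists today:

        /*
         * Sketch only: assumes a reap_work member is added to
         * struct mem_cgroup and that drop_slab_memcg() from the
         * sketch above is available, e.g. in mm/memcontrol.c.
         */
        static void memcg_reap_workfn(struct work_struct *work)
        {
                struct mem_cgroup *memcg = container_of(work,
                                        struct mem_cgroup, reap_work);

                /* shake the dying memcg (and children) clean */
                drop_slab_memcg(memcg);
        }

        /* kicked off from the existing memcg offline/cleanup path */
        static void memcg_schedule_reap(struct mem_cgroup *memcg)
        {
                INIT_WORK(&memcg->reap_work, memcg_reap_workfn);
                queue_work(system_unbound_wq, &memcg->reap_work);
        }

i.e. the whole thing hangs off the existing async cleanup machinery
rather than off global memory reclaim.
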
> In my opinion, it's not necessarily true. Slab objects can be shared between
> cgroups, and often can't be reclaimed on cgroup removal without an impact on
> the rest of the system.

I've already pointed out that it is preferable for shared objects to
stay in cache rather than face expedited reclaim:

https://lore.kernel.org/linux-mm/20190131221904.GL4205@dastard/

dgc> However, the memcg reaper *doesn't need to be perfect* to solve the
dgc> "takes too long to clean up dying memcgs" problem. Even if it leaves
dgc> shared objects behind (which we want to do!), it still trims those
dgc> memcgs down to /just the shared objects still in use/. And given that
dgc> objects shared by memcgs are in the minority (according to past
dgc> discussions about the difficulties of accounting them correctly) I
dgc> think this is just fine.
dgc>
dgc> Besides, those remaining shared objects are the ones we want to
dgc> naturally age out under memory pressure, but otherwise the memcgs
dgc> will have been shaken clean of all other objects accounted to them.
dgc> i.e. the "dying memcg" memory footprint goes down massively and the
dgc> "long term buildup" of dying memcgs basically goes away.

This all seems like pretty desirable cross-memcg working set maintenance
behaviour to me...

> Applying constant artificial memory pressure precisely only
> on objects accounted to dying cgroups is challenging and will likely
> cause a quite significant overhead.

I don't know where you got that from - the above example is clearly a
once-off cleanup. And executing it via a workqueue in the async memcg
cleanup path (which already runs through multiple work queues to run and
wait for different stages of cleanup) is not complex or challenging. Nor
is it likely to add overhead, because it means we avoid the long term
shrinker scanning overhead that cleanup currently requires.

> Also, by "forgetting" of some slab objects
> under light or even moderate memory pressure, we're wasting memory, which can
> be used for something useful.

Cached memory is not "forgotten" or "wasted memory". If the scan count
for a shrinker is too small to be used, it is deferred to the next
shrinker invocation. This batching behaviour is intentionally done for
scan efficiency purposes.

Don't take my word for it - read the discussion that went along with
commit 0b1fb40a3b12 ("mm: vmscan: shrink all slab objects if tight on
memory"):

https://lore.kernel.org/lkml/20140115012541.ad302526.akpm@xxxxxxxxxxxxxxxxxxxx/

From Andrew:

akpm> Actually, the intent of batching is to limit the number of calls to
akpm> ->scan(). At least, that was the intent when I wrote it! This is a
akpm> good principle and we should keep doing it. If we're going to send the
akpm> CPU away to tread on a pile of cold cachelines, we should make sure
akpm> that it does a good amount of work while it's there.

IOWs, the "small scan" proposals defeat existing shrinker efficiency
optimisations. This change in behaviour is where the CPU usage
regressions in "small cache" scanning come from. As Andrew said: scan
batching is a good principle and we should keep doing it.
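
To make that batching principle concrete, here's a toy illustration of
the deferral behaviour - this is not the real do_shrink_slab() code,
just the shape of the algorithm it implements: work below the batch
size is carried over to the next pass instead of being dribbled out as
lots of tiny ->scan() calls.

        /*
         * Toy model of shrinker scan batching, not the real
         * do_shrink_slab() code.
         */
        struct toy_shrinker {
                unsigned long   nr_deferred;    /* work carried over */
                unsigned long   batch_size;     /* min work per ->scan() */
                unsigned long   (*scan)(unsigned long nr_to_scan);
        };

        static unsigned long toy_shrink(struct toy_shrinker *s,
                                        unsigned long delta)
        {
                unsigned long total_scan = s->nr_deferred + delta;
                unsigned long freed = 0;

                /* only push work to ->scan() in batch_size chunks... */
                while (total_scan >= s->batch_size) {
                        freed += s->scan(s->batch_size);
                        total_scan -= s->batch_size;
                }

                /* ...and defer the remainder to the next invocation */
                s->nr_deferred = total_scan;
                return freed;
        }

The "small scan" proposals effectively bypass that deferral and issue
the tiny scans immediately, which is exactly the cold-cacheline
behaviour Andrew was warning about.
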
> Dying cgroups are just making this problem more
> obvious because of their size.

This only looks like a problem for dying cgroups because they have
extremely poor life cycle management. Expediting dying memcg cache
cleanup is the way to fix this, and that does not require us to change
global memory reclaim behaviour.

> So, using "natural" memory pressure in a way, that all slabs objects are
> scanned periodically, seems to me as the best solution. The devil is in
> details, and how to do it without causing any regressions, is an open
> question now.
>
> Also, completely re-parenting slabs to parent cgroup (not only shrinker lists)
> is a potential option to consider.

That should be done once the memcg gc thread has shrunk the caches down
to just the shared objects (which we want to keep in cache!) that
reference the dying memcg. That will get rid of all the remaining
references and allow the memcg to be reclaimed completely.

> It will be nice to discuss the problem on LSF/MM, agree on general path and
> make a potential list of benchmarks, which can be used to prove the solution.

In reality, it comes down to this - should we:

a) add a small amount of code into the subsystem to perform expedited
   reaping of subsystem owned objects, and test against the known,
   specific reproducing workload; or

b) change the global memory reclaim algorithms in a way that affects
   every Linux machine and workload in some way, resulting in us having
   to revalidate and rebalance memory reclaim for a large number of
   common workloads across all filesystems and subsystems that use
   shrinkers, on a wide range of different storage hardware and on both
   headless and desktop machines.

And when we look at it this way, if we end up with option b) as the
preferred solution then we've well and truly jumped the shark.

The validation effort required for option b) is way out of proportion
with the small niche of machines and environments affected by the dying
memcg problem, and the risk of regressions for users outside these
memcg-heavy environments is extremely high (as has already been proven).

Cheers,

Dave
-- 
Dave Chinner
david@xxxxxxxxxxxxx