On Wed, Feb 20, 2019 at 03:33:32PM +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 09:06:07PM -0500, Rik van Riel wrote:
> > On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> > > On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > > > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > > > Sorry, resending with the fixed to/cc list. Please, ignore
> > > > > > the first letter.
> > > > >
> > > > > Please resend again with linux-fsdevel on the cc list, because
> > > > > this isn't a MM topic given the regressions from the shrinker
> > > > > patches have all been on the filesystem side of the
> > > > > shrinkers....
> > > >
> > > > It looks like there are two separate things going on here.
> > > >
> > > > The first is an MM issue, one of potentially leaking memory by
> > > > not scanning slabs with few items on them,
> > >
> > > We don't leak memory. Slabs with very few freeable items on them
> > > just don't get scanned when there is only light memory pressure.
> > > That's /by design/ and it is behaviour we've tried hard over many
> > > years to preserve. Once memory pressure ramps up, they'll be
> > > scanned just like all the other slabs.
> >
> > That may have been fine before cgroups, but when a system can have
> > (tens of) thousands of slab caches, we DO want to scan slab caches
> > with few freeable items in them.
> >
> > The threshold for "few items" is 4096, not some actually tiny
> > number. That can add up to a lot of memory if a system has
> > hundreds of cgroups.
>
> That doesn't sound right. The threshold is supposed to be low single
> digits based on the amount of pressure on the page cache, and it's
> accumulated by deferral until the batch threshold (128) is exceeded.
>
> Ohhhhh. The penny just dropped - this whole sorry saga has been
> triggered because people are chasing a regression nobody has
> recognised as a regression, because they don't actually understand
> how the shrinker algorithms are /supposed/ to work.
>
> And I'm betting that it's been caused by some other recent FB
> shrinker change.....
>
> Yup, there it is:
>
> commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2
> Author: Josef Bacik <jbacik@xxxxxx>
> Date:   Wed Jan 31 16:16:26 2018 -0800
>
>     mm: use sc->priority for slab shrink targets
>
>     ....
>     We don't need to know exactly how many pages each shrinker
>     represents, it's objects are all the information we need. Making
>     this change allows us to place an appropriate amount of pressure
>     on the shrinker pools for their relative size.
>     ....
>
> -        delta = (4 * nr_scanned) / shrinker->seeks;
> -        delta *= freeable;
> -        do_div(delta, nr_eligible + 1);
> +        delta = freeable >> priority;
> +        delta *= 4;
> +        do_div(delta, shrinker->seeks);
>
> So, prior to this change:
>
>         delta ~= (4 * nr_scanned * freeable) / nr_eligible
>
> IOWs, the ratio of nr_scanned:nr_eligible determined the resolution
> of the scan, and that meant delta could (and did!) have values in
> the single digit range.
>
> The current code introduced by the above patch does:
>
>         delta ~= (freeable >> priority) * 4
>
> Which, as you state, has a threshold of freeable > 4096 to trigger
> scanning under low memory pressure.
>
> So, that's the original regression that people are trying to fix
> (root cause analysis FTW). It was introduced in 4.16-rc1.
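To make the resolution difference concrete, here's a quick standalone
sketch of the two delta calculations quoted above (userspace C, not
the kernel code itself; DEF_PRIORITY == 12 and DEFAULT_SEEKS == 2
match the kernel defaults, while the sample workload numbers are
purely illustrative):

/*
 * Standalone comparison of the pre- and post-4.16 delta
 * calculations. Illustrative sketch only, not kernel code.
 */
#include <stdio.h>

#define DEF_PRIORITY  12        /* lowest reclaim pressure */
#define DEFAULT_SEEKS 2         /* default shrinker->seeks */

/* pre-4.16: resolution comes from the nr_scanned:nr_eligible ratio */
static unsigned long delta_old(unsigned long nr_scanned,
                               unsigned long nr_eligible,
                               unsigned long freeable)
{
        unsigned long delta = (4 * nr_scanned) / DEFAULT_SEEKS;

        delta *= freeable;
        return delta / (nr_eligible + 1);
}

/* 4.16+: resolution comes from the priority shift */
static unsigned long delta_new(int priority, unsigned long freeable)
{
        unsigned long delta = freeable >> priority;

        delta *= 4;
        return delta / DEFAULT_SEEKS;
}

int main(void)
{
        unsigned long freeable = 1000;  /* a small per-memcg cache */

        /* light pressure: 32 of 16384 eligible pages scanned */
        printf("old: %lu\n", delta_old(32, 16384, freeable));    /* -> 3 */
        printf("new: %lu\n", delta_new(DEF_PRIORITY, freeable)); /* -> 0 */
        return 0;
}

At priority 12 the new formula returns 0 for any freeable < 4096, so
under light pressure a small cache contributes nothing at all, while
the old formula still produced small but non-zero deltas that could
accumulate.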
> The attempts to fix this regression (i.e. the lack of low free
> object shrinker scanning) were introduced into 4.18-rc1, which
> caused even worse regressions and led us directly to this point.
>
> Ok, now I see where the real problem people are chasing is, I'll go
> write a patch to fix it.

Sounds good, I'll check whether it prevents the memcg leak. If it
works, we're fine.

> > Roman's patch, which reclaimed small slabs extra aggressively,
> > introduced issues, but reclaiming small slabs at the same
> > pressure/object as large slabs seems like the desired behavior.
>
> It's still broken. Both of your patches do the wrong thing because
> they don't address the resolution and accumulation regression, and
> instead add another layer of heuristics on top of the delta
> calculation to hide the lack of resolution.
>
> > > That's a cgroup referencing and teardown problem, not a memory
> > > reclaim algorithm problem. To treat it as a memory reclaim
> > > problem smears memcg internal implementation bogosities all over
> > > the independent reclaim infrastructure. It violates the concepts
> > > of isolation, modularity, independence, abstraction layering,
> > > etc.
> >
> > You are overlooking the fact that an inode loaded into memory by
> > one cgroup (which is getting torn down) may be in active use by
> > processes in other cgroups.
>
> No I am not. I am fully aware of this problem (have been since memcg
> day one, because of the list_lru tracking issues Glauber and I had
> to sort out when we first realised shared inodes could occur).
> Sharing inodes across cgroups also causes "complexity" in things
> like cgroup writeback control (which cgroup's dirty list tracks and
> does writeback of shared inodes?) and so on. Shared inodes across
> cgroups are considered the exception rather than the rule, and they
> are treated in many places with algorithms that assert "this is
> rare, if it's common we're going to be in trouble"....

No: even if sharing inodes across cgroups is considered bad practice
and may lead to sub-optimal results, it shouldn't trigger obvious
kernel issues like memory leaks. Otherwise it becomes a security
concern.

Also, in practice it's common to have a main workload and a couple of
supplementary processes (e.g. monitoring) in sibling cgroups which
share some inodes (e.g. logs).

Thanks!
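P.S. For anyone following along, a simplified, non-kernel model of
the accumulate-and-batch behaviour described above (SHRINK_BATCH is
128 in the kernel; the fixed per-call delta of 7 is an illustrative
assumption, and the real do_shrink_slab() is considerably more
involved):

/*
 * Sketch of the accumulate-and-batch logic: deltas too small to act
 * on are carried over in nr_deferred until a whole batch is
 * available.
 */
#include <stdio.h>

#define SHRINK_BATCH 128        /* minimum unit of scanning work */

static unsigned long nr_deferred;       /* per-shrinker carry-over */

static void shrink_slab_once(unsigned long delta)
{
        unsigned long total_scan = delta + nr_deferred;
        unsigned long scanned = 0;

        /* objects are only scanned in whole batches */
        while (total_scan - scanned >= SHRINK_BATCH) {
                /* ->scan_objects() would free SHRINK_BATCH objects here */
                scanned += SHRINK_BATCH;
        }

        /* the sub-batch remainder is deferred to the next call */
        nr_deferred = total_scan - scanned;
        printf("delta=%lu scanned=%lu deferred=%lu\n",
               delta, scanned, nr_deferred);
}

int main(void)
{
        /* single-digit deltas still fire a batch every ~19 calls ... */
        for (int i = 0; i < 40; i++)
                shrink_slab_once(7);

        /* ... but a delta of 0 defers nothing and never scans at all */
        return 0;
}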