Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 20 Feb 2019 16:50:31 +1100

On Wed, Feb 20, 2019 at 01:47:23PM +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote:
> > Sorry, once more, now with fsdevel@ in cc, asked by Dave.
> > --
> > 
> > Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> > with accumulating of dying memory cgroups. This is a serious problem:
> > on most of our machines we've seen thousands on dying cgroups, and
> > the corresponding memory footprint was measured in hundreds of megabytes.
> > The problem was also independently discovered by other companies.
> > 
> > The fixes were reverted due to xfs regression investigated by Dave Chinner.
> 
> Context: it wasn't one regression that I investigated. We had
> multiple bug reports with different regressions, and I saw evidence
> on my own machines that something wasn't right because of the change
> in the IO patterns in certain benchmarks. Some of the problems were
> caused by the first patch, some were caused by the second patch.
> 
> This also affects ext4 (i.e. it's a general problem, not an XFS
> problem) as has been reported a couple of times, including this one
> overnight:
> 
> https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@xxxxxxx/
> 
> > Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
> > which caused Rik van Riel to propose a patch [3], which aimed to fix the
> > regression. The idea is to accumulate small memory pressure and apply it
> > periodically, so that we don't overscan small shrinker lists. According
> > to Jan Kara's data [4], Rik's patch partially fixed the regression,
> > but not entirely.
> 
> Rik's patch was buggy and made an invalid assumptions about how a
> cache with a small number of freeable objects is a "small cache", so
> any comaprisons made with it are essentially worthless.
> 
> More details about the problems with the patch and approach here:
> 
> https://lore.kernel.org/stable/20190131224905.GN31397@rh/

So, long story short, the dying memcg problem is actually a
regression caused by previous shrinker changes, and the change in
4.18-rc1 was an attempt to fix the regression (which caused evenmore
widespread problems) and Rik's patch is another different attempt to
fix the original regression.

The original regression broke the small scan accumulation algorithm
in the shrinker, but i don't think that anyone actually understood
how this was supposed to work and so the attempts to fix the
regression haven't actually restored the original behaviour. The
problematic commit:

9092c71bb724 ("mm: use sc->priority for slab shrink targets")

which was included in 4.16-rc1.

This changed the delta calculation and so any cache with less than
4096 freeable objects would now end up with a zero delta count.
This means caches with few freeable objects had no scan pressure at
all and nothing would get accumulated for later scanning. Prior to
this change, such scans would result in single digit scan counts,
which would get deferred and acummulated until the overal delta +
deferred count went over the batch size and ti would scan the cache.

IOWs, the above commit prevented accumulation of light pressure on
caches and they'd only get scanned when extreme memory pressure
occurs.

The fix that went into 4.18-rc1 change this to make the minimum scan
pressure the batch size, so instead of 0 pressure, it went to having
extreme pressure on small caches. What wasn't used in a scan got
deferred, and so the shrinker would wind up and keep heavy pressure
on the cache even when there was only light memory pressure. IOWs,
instead of having a scan count in the single digits under light
memory pressure, those caches now had continual scan counts 1-2
orders of magnitude larger. i.e. way more agressive than in 4.15 and
oler kernels. hence it introduced a different, more severe set of
regressions than the one it was trying to fix.

IOWs, the dying memcg issue is irrelevant here. The real problem
that needs fixing is a shrinker regression that occurred in 4.16-rc1,
not 4.18-rc1.

I'm just going to fix the original regression in the shrinker
algorithm by restoring the gradual accumulation behaviour, and this
whole series of problems can be put to bed.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx