On Wed, Feb 20, 2019 at 01:47:23PM +1100, Dave Chinner wrote: > On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote: > > Sorry, once more, now with fsdevel@ in cc, asked by Dave. > > -- > > > > Recent reverts of memcg leak fixes [1, 2] reintroduced the problem > > with accumulating of dying memory cgroups. This is a serious problem: > > on most of our machines we've seen thousands on dying cgroups, and > > the corresponding memory footprint was measured in hundreds of megabytes. > > The problem was also independently discovered by other companies. > > > > The fixes were reverted due to xfs regression investigated by Dave Chinner. > > Context: it wasn't one regression that I investigated. We had > multiple bug reports with different regressions, and I saw evidence > on my own machines that something wasn't right because of the change > in the IO patterns in certain benchmarks. Some of the problems were > caused by the first patch, some were caused by the second patch. > > This also affects ext4 (i.e. it's a general problem, not an XFS > problem) as has been reported a couple of times, including this one > overnight: > > https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@xxxxxxx/ > > > Simultaneously we've seen a very small (0.18%) cpu regression on some hosts, > > which caused Rik van Riel to propose a patch [3], which aimed to fix the > > regression. The idea is to accumulate small memory pressure and apply it > > periodically, so that we don't overscan small shrinker lists. According > > to Jan Kara's data [4], Rik's patch partially fixed the regression, > > but not entirely. > > Rik's patch was buggy and made an invalid assumptions about how a > cache with a small number of freeable objects is a "small cache", so > any comaprisons made with it are essentially worthless. > > More details about the problems with the patch and approach here: > > https://lore.kernel.org/stable/20190131224905.GN31397@rh/ So, long story short, the dying memcg problem is actually a regression caused by previous shrinker changes, and the change in 4.18-rc1 was an attempt to fix the regression (which caused evenmore widespread problems) and Rik's patch is another different attempt to fix the original regression. The original regression broke the small scan accumulation algorithm in the shrinker, but i don't think that anyone actually understood how this was supposed to work and so the attempts to fix the regression haven't actually restored the original behaviour. The problematic commit: 9092c71bb724 ("mm: use sc->priority for slab shrink targets") which was included in 4.16-rc1. This changed the delta calculation and so any cache with less than 4096 freeable objects would now end up with a zero delta count. This means caches with few freeable objects had no scan pressure at all and nothing would get accumulated for later scanning. Prior to this change, such scans would result in single digit scan counts, which would get deferred and acummulated until the overal delta + deferred count went over the batch size and ti would scan the cache. IOWs, the above commit prevented accumulation of light pressure on caches and they'd only get scanned when extreme memory pressure occurs. The fix that went into 4.18-rc1 change this to make the minimum scan pressure the batch size, so instead of 0 pressure, it went to having extreme pressure on small caches. What wasn't used in a scan got deferred, and so the shrinker would wind up and keep heavy pressure on the cache even when there was only light memory pressure. IOWs, instead of having a scan count in the single digits under light memory pressure, those caches now had continual scan counts 1-2 orders of magnitude larger. i.e. way more agressive than in 4.15 and oler kernels. hence it introduced a different, more severe set of regressions than the one it was trying to fix. IOWs, the dying memcg issue is irrelevant here. The real problem that needs fixing is a shrinker regression that occurred in 4.16-rc1, not 4.18-rc1. I'm just going to fix the original regression in the shrinker algorithm by restoring the gradual accumulation behaviour, and this whole series of problems can be put to bed. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx