On Wed, Feb 20, 2019 at 03:33:32PM +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 09:06:07PM -0500, Rik van Riel wrote:
> > On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> > > On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > > > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > > > Sorry, resending with the fixed to/cc list. Please, ignore
> > > > > > the first letter.
> > > > >
> > > > > Please resend again with linux-fsdevel on the cc list, because
> > > > > this isn't a MM topic given the regressions from the shrinker
> > > > > patches have all been on the filesystem side of the
> > > > > shrinkers....
> > > >
> > > > It looks like there are two separate things going on here.
> > > >
> > > > The first is an MM issue, one of potentially leaking memory by
> > > > not scanning slabs with few items on them,
> > >
> > > We don't leak memory. Slabs with very few freeable items on them
> > > just don't get scanned when there is only light memory pressure.
> > > That's /by design/ and it is behaviour we've tried hard over many
> > > years to preserve. Once memory pressure ramps up, they'll be
> > > scanned just like all the other slabs.
> >
> > That may have been fine before cgroups, but when a system can have
> > (tens of) thousands of slab caches, we DO want to scan slab caches
> > with few freeable items in them.
> >
> > The threshold for "few items" is 4096, not some actually tiny
> > number. That can add up to a lot of memory if a system has
> > hundreds of cgroups.
>
> That doesn't sound right. The threshold is supposed to be low single
> digits based on the amount of pressure on the page cache, and it's
> accumulated by deferral until the batch threshold (128) is exceeded.
>
> Ohhhhh. The penny just dropped - this whole sorry saga has been
> triggered because people are chasing a regression nobody has
> recognised as a regression, because they don't actually understand
> how the shrinker algorithms are /supposed/ to work.
>
> And I'm betting that it's been caused by some other recent FB
> shrinker change.....
>
> Yup, there it is:
>
> commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2
> Author: Josef Bacik <jbacik@xxxxxx>
> Date:   Wed Jan 31 16:16:26 2018 -0800
>
>     mm: use sc->priority for slab shrink targets
>
>     ....
>     We don't need to know exactly how many pages each shrinker
>     represents, it's objects are all the information we need. Making
>     this change allows us to place an appropriate amount of pressure
>     on the shrinker pools for their relative size.
>     ....
>
> -        delta = (4 * nr_scanned) / shrinker->seeks;
> -        delta *= freeable;
> -        do_div(delta, nr_eligible + 1);
> +        delta = freeable >> priority;
> +        delta *= 4;
> +        do_div(delta, shrinker->seeks);
>
> So, prior to this change:
>
>         delta ~= (4 * nr_scanned * freeable) / nr_eligible
>
> IOWs, the ratio of nr_scanned:nr_eligible determined the resolution
> of the scan, and that meant delta could (and did!) have values in
> the single digit range.
>
> The current code introduced by the above patch does:
>
>         delta ~= (freeable >> priority) * 4
>
> Which, as you state, has a threshold of freeable > 4096 to trigger
> scanning under low memory pressure.
>
> So, that's the original regression that people are trying to fix
> (root cause analysis FTW). It was introduced in 4.16-rc1.
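To make the resolution difference concrete, here's a quick standalone
sketch of the two delta calculations quoted above (userspace C, not
the kernel code itself; DEF_PRIORITY == 12 and DEFAULT_SEEKS == 2
match the kernel defaults, while the sample workload numbers are
purely illustrative):

/*
 * Standalone comparison of the pre- and post-4.16 delta
 * calculations. Illustrative sketch only, not kernel code.
 */
#include <stdio.h>

#define DEF_PRIORITY  12        /* lowest reclaim pressure */
#define DEFAULT_SEEKS 2         /* default shrinker->seeks */

/* pre-4.16: resolution comes from the nr_scanned:nr_eligible ratio */
static unsigned long delta_old(unsigned long nr_scanned,
                               unsigned long nr_eligible,
                               unsigned long freeable)
{
        unsigned long delta = (4 * nr_scanned) / DEFAULT_SEEKS;

        delta *= freeable;
        return delta / (nr_eligible + 1);
}

/* 4.16+: resolution comes from the priority shift */
static unsigned long delta_new(int priority, unsigned long freeable)
{
        unsigned long delta = freeable >> priority;

        delta *= 4;
        return delta / DEFAULT_SEEKS;
}

int main(void)
{
        unsigned long freeable = 1000;  /* a small per-memcg cache */

        /* light pressure: 32 of 16384 eligible pages scanned */
        printf("old: %lu\n", delta_old(32, 16384, freeable));    /* -> 3 */
        printf("new: %lu\n", delta_new(DEF_PRIORITY, freeable)); /* -> 0 */
        return 0;
}

At priority 12 the new formula returns 0 for any freeable < 4096, so
under light pressure a small cache contributes nothing at all, while
the old formula still produced small but non-zero deltas that could
accumulate.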
> The attempts to fix this regression (i.e. the lack of low free
> object shrinker scanning) were introduced into 4.18-rc1, which
> caused even worse regressions and led us directly to this point.
>
> Ok, now I see where the real problem people are chasing is, I'll go
> write a patch to fix it.

Sounds good, I'll check whether it prevents the memcg leak. If it
works, we're fine.

> > Roman's patch, which reclaimed small slabs extra aggressively,
> > introduced issues, but reclaiming small slabs at the same
> > pressure/object as large slabs seems like the desired behavior.
>
> It's still broken. Both of your patches do the wrong thing because
> they don't address the resolution and accumulation regression, and
> instead add another layer of heuristics on top of the delta
> calculation to hide the lack of resolution.
>
> > > That's a cgroup referencing and teardown problem, not a memory
> > > reclaim algorithm problem. To treat it as a memory reclaim
> > > problem smears memcg internal implementation bogosities all over
> > > the independent reclaim infrastructure. It violates the concepts
> > > of isolation, modularity, independence, abstraction layering,
> > > etc.
> >
> > You are overlooking the fact that an inode loaded into memory by
> > one cgroup (which is getting torn down) may be in active use by
> > processes in other cgroups.
>
> No I am not. I am fully aware of this problem (have been since memcg
> day one, because of the list_lru tracking issues Glauber and I had
> to sort out when we first realised shared inodes could occur).
> Sharing inodes across cgroups also causes "complexity" in things
> like cgroup writeback control (which cgroup's dirty list tracks and
> does writeback of shared inodes?) and so on. Shared inodes across
> cgroups are considered the exception rather than the rule, and they
> are treated in many places with algorithms that assert "this is
> rare, if it's common we're going to be in trouble"....

No: even if sharing inodes across cgroups is considered bad practice
and may lead to sub-optimal results, it shouldn't trigger obvious
kernel issues like memory leaks. Otherwise it becomes a security
concern.

Also, in practice it's common to have a main workload and a couple of
supplementary processes (e.g. monitoring) in sibling cgroups which
share some inodes (e.g. logs).

Thanks!
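P.S. For anyone following along, a simplified, non-kernel model of
the accumulate-and-batch behaviour described above (SHRINK_BATCH is
128 in the kernel; the fixed per-call delta of 7 is an illustrative
assumption, and the real do_shrink_slab() is considerably more
involved):

/*
 * Sketch of the accumulate-and-batch logic: deltas too small to act
 * on are carried over in nr_deferred until a whole batch is
 * available.
 */
#include <stdio.h>

#define SHRINK_BATCH 128        /* minimum unit of scanning work */

static unsigned long nr_deferred;       /* per-shrinker carry-over */

static void shrink_slab_once(unsigned long delta)
{
        unsigned long total_scan = delta + nr_deferred;
        unsigned long scanned = 0;

        /* objects are only scanned in whole batches */
        while (total_scan - scanned >= SHRINK_BATCH) {
                /* ->scan_objects() would free SHRINK_BATCH objects here */
                scanned += SHRINK_BATCH;
        }

        /* the sub-batch remainder is deferred to the next call */
        nr_deferred = total_scan - scanned;
        printf("delta=%lu scanned=%lu deferred=%lu\n",
               delta, scanned, nr_deferred);
}

int main(void)
{
        /* single-digit deltas still fire a batch every ~19 calls ... */
        for (int i = 0; i < 40; i++)
                shrink_slab_once(7);

        /* ... but a delta of 0 defers nothing and never scans at all */
        return 0;
}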