On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > first letter.
> >
> > Please resend again with linux-fsdevel on the cc list, because this
> > isn't a MM topic given the regressions from the shrinker patches
> > have all been on the filesystem side of the shrinkers....
>
> It looks like there are two separate things going on here.
>
> The first are an MM issues, one of potentially leaking memory
> by not scanning slabs with few items on them,

We don't leak memory. Slabs with very few freeable items on them
just don't get scanned when there is only light memory pressure.
That's /by design/ and it is behaviour we've tried hard over many
years to preserve. Once memory pressure ramps up, they'll be
scanned just like all the other slabs.

e.g. commit 0b1fb40a3b12 ("mm: vmscan: shrink all slab objects if
tight on memory") makes this commentary:

    [....]
    That said, this patch shouldn't change the vmscan behaviour if the
    memory pressure is low, but if we are tight on memory, we will do
    our best by trying to reclaim all available objects, which sounds
    reasonable.

Which is essentially how we've tried to implement shrinker reclaim
for a long, long time (bugs notwithstanding).

> and having
> such slabs stay around forever after the cgroup they were
> created for has disappeared,

That's a cgroup referencing and teardown problem, not a memory
reclaim algorithm problem. To treat it as a memory reclaim problem
smears memcg internal implementation bogosities all over the
independent reclaim infrastructure. It violates the concepts of
isolation, modularity, independence, abstraction layering, etc.

> and the other of various other
> bugs with shrinker invocation behavior (like the nr_deferred
> fixes you posted a patch for). I believe these are MM topics.

Except they interact directly with external shrinker behaviour. The
conditions of deferral and the problems it is solving are a direct
response to shrinker implementation constraints (e.g. GFP_NOFS
deadlock avoidance for filesystems).

i.e. we can't talk about the deferral algorithm without considering
why work is deferred, how much work should be deferred, when it may
be safe/best to execute the deferred work, etc.

This all comes back to the fact that modifying the shrinker
algorithms requires understanding what the shrinker implementations
do and the constraints they operate under. It is not a "purely mm"
discussion, and treating it as such results in regressions like the
ones we've recently seen.

> The second is the filesystem (and maybe other) shrinker
> functions' behavior being somewhat fragile and depending
> on closely on current MM behavior, potentially up to
> and including MM bugs.
>
> The lack of a contract between the MM and the shrinker
> callbacks is a recurring issue, and something we may
> want to discuss in a joint session.
>
> Some reflections on the shrinker/MM interaction:
> - Since all memory (in a zone) could potentially be in
>   shrinker pools, shrinkers MUST eventually free some
>   memory.

Which they cannot guarantee because all the objects they track may
be in use. As such, shrinkers have never been asked to guarantee
that they can free memory - they've only ever been asked to scan a
number of objects and attempt to free those they can during the
scan.

> - Shrinkers should not block kswapd from making progress.
>   If kswapd got stuck in NFS inode writeback, and ended up
>   not being able to free clean pages to receive network
>   packets, that might cause a deadlock.

Same can happen if kswapd got stuck on dirty page writeback from
pageout(). i.e. pageout() can only run from kswapd and it issues IO,
which can then block in the IO submission path waiting for IO to
make progress, which may require substantial amounts of memory
allocation.

Yes, we can try to not block kswapd as much as possible just like
page reclaim does, but the fact is kswapd is the only context where
it is safe to do certain blocking operations to ensure memory
reclaim can actually make progress.

i.e. the rules for blocking kswapd need to be consistent across both
page reclaim and shrinker reclaim, and right now page reclaim can
and does block kswapd when it is necessary for forwards progress....

> - The MM should be able to deal with shrinkers doing
>   nothing at this call, but having some work pending
>   (eg. waiting on IO completion), without getting a false
>   OOM kill. How can we do this best?

By integrating shrinkers into the same feedback loops as page
reclaim. i.e. to allow individual shrinker instance state to be
visible to the backoff/congestion decisions that the main page
reclaim loops make.

i.e. the problem here is that the shrinkers' only feedback to the
main loop is "how many pages were freed" as a whole. They aren't
seen as individual reclaim instances like zones for page reclaim,
they are just a huge amorphous blob that "frees some pages". i.e.
they sit off to the side and run their own game between main loop
scans and have no capability to run individual backoffs, schedule
kswapd to do future work, don't have watermarks to provide reclaim
goals, can't communicate progress to the main control algorithm,
etc.

IOWs, the first step we need to take here is to get rid of the
shrink_slab() abstraction and make shrinkers a first class reclaim
citizen....

> - Related to the above: stalling in the shrinker code is
>   unpredictable, and can take an arbitrarily long amount
>   of time. Is there a better way we can make reclaimers
>   wait for in-flight work to be completed?

Look at it this way: what do you need to do to implement the main
zone reclaim loops as individual shrinker instances?
Complex shrinker implementations have to deal with all the same
issues as the page reclaim loops (including managing cross-cache
dependencies and balancing). If we can't answer this question, then
we can't answer the questions that are being asked.

So, at this point, I have to ask: if we need the same functionality
for both page reclaim and shrinkers, then why shouldn't the goal be
to make page reclaim just another set of opaque shrinker
implementations?

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx
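
For anyone following the thread who wants numbers for the "light memory
pressure" point made at the top, here is a minimal userspace sketch. It
assumes the scan-target arithmetic that do_shrink_slab() in mm/vmscan.c
used around v5.0, reproduced from memory, so treat the exact formula as
an assumption and check it against the tree you care about.
scan_target() is a hypothetical helper, not a kernel function, and the
real code also folds in deferred work, batching and an upper cap that
are left out here.

```c
#include <assert.h>
#include <stdint.h>

#define DEF_PRIORITY  12  /* lightest reclaim pressure (mm/vmscan.c) */
#define DEFAULT_SEEKS 2   /* default cost to recreate an object */

/*
 * Hypothetical model of how many objects one shrinker is asked to
 * scan in a single pass. Lower priority values mean heavier memory
 * pressure; priority 0 is the heaviest.
 */
static uint64_t scan_target(uint64_t freeable, int priority,
			    unsigned int seeks)
{
	uint64_t delta;

	assert(priority >= 0 && priority <= DEF_PRIORITY);

	/* Objects that need no IO to recreate are trimmed aggressively. */
	if (seeks == 0)
		return freeable / 2;

	/* The shift discards small caches entirely at light pressure. */
	delta = freeable >> priority;
	delta *= 4;
	delta /= seeks;
	return delta;
}
```

Under these assumptions a cache with 1000 freeable objects gets a
target of 0 at DEF_PRIORITY, so it is simply left alone, while a cache
with 1048576 objects gets 512. At priority 0 the same 1000-object cache
gets a target of 2000, i.e. more than everything on it.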