On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote:
> Sorry, once more, now with fsdevel@ in cc, asked by Dave.
> --
>
> Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> with accumulation of dying memory cgroups. This is a serious problem:
> on most of our machines we've seen thousands of dying cgroups, and
> the corresponding memory footprint was measured in hundreds of megabytes.
> The problem was also independently discovered by other companies.
>
> The fixes were reverted due to an xfs regression investigated by Dave Chinner.

Context: it wasn't one regression that I investigated. We had multiple
bug reports with different regressions, and I saw evidence on my own
machines that something wasn't right because of the change in the IO
patterns in certain benchmarks. Some of the problems were caused by the
first patch, some were caused by the second patch.

This also affects ext4 (i.e. it's a general problem, not an XFS
problem), as has been reported a couple of times, including this one
overnight:

https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@xxxxxxx/

> Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
> which caused Rik van Riel to propose a patch [3], which aimed to fix the
> regression. The idea is to accumulate small memory pressure and apply it
> periodically, so that we don't overscan small shrinker lists. According
> to Jan Kara's data [4], Rik's patch partially fixed the regression,
> but not entirely.

Rik's patch was buggy and made an invalid assumption that a cache with a
small number of freeable objects is a "small cache", so any comparisons
made with it are essentially worthless. More details about the problems
with the patch and approach here:

https://lore.kernel.org/stable/20190131224905.GN31397@rh/

> The path forward isn't entirely clear now, and the status quo isn't acceptable
> due to the memcg leak bug. Dave and Michal's position is to focus on the dying
> memory cgroup case and apply some artificial memory pressure on the
> corresponding slabs (probably, during the cgroup deletion process). This
> approach can theoretically be less harmful for the subtle scanning balance,
> and not cause any regressions.

I outlined the dying memcg problem in patch[0] of the revert series:

https://lore.kernel.org/linux-mm/20190130041707.27750-1-david@xxxxxxxxxxxxx/

It basically documents the solution I proposed for dying memcg cleanup:

dgc> e.g. add a garbage collector via a background workqueue that sits on
dgc> the dying memcg calling something like:
dgc>
dgc> void drop_slab_memcg(struct mem_cgroup *dying_memcg)
dgc> {
dgc>         unsigned long freed;
dgc>
dgc>         do {
dgc>                 struct mem_cgroup *memcg = NULL;
dgc>
dgc>                 freed = 0;
dgc>                 memcg = mem_cgroup_iter(dying_memcg, NULL, NULL);
dgc>                 do {
dgc>                         freed += shrink_slab_memcg(GFP_KERNEL, 0, memcg, 0);
dgc>                 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
dgc>         } while (freed > 0);
dgc> }

This is a pretty trivial piece of code and doesn't require changing the
core memory reclaim code at all.
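
For illustration only, a rough sketch of how that reaper could be wired
up via a workqueue. The reap_work member in struct mem_cgroup and the
memcg_schedule_reap() helper are assumptions made for this sketch (as is
drop_slab_memcg() above) - none of this exists today:

        /*
         * Sketch only: assumes a reap_work member is added to
         * struct mem_cgroup and that drop_slab_memcg() from the
         * sketch above is available, e.g. in mm/memcontrol.c.
         */
        static void memcg_reap_workfn(struct work_struct *work)
        {
                struct mem_cgroup *memcg = container_of(work,
                                        struct mem_cgroup, reap_work);

                /* shake the dying memcg (and children) clean */
                drop_slab_memcg(memcg);
        }

        /* kicked off from the existing memcg offline/cleanup path */
        static void memcg_schedule_reap(struct mem_cgroup *memcg)
        {
                INIT_WORK(&memcg->reap_work, memcg_reap_workfn);
                queue_work(system_unbound_wq, &memcg->reap_work);
        }

i.e. the whole thing hangs off the existing async cleanup machinery
rather than off global memory reclaim.
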
> In my opinion, it's not necessarily true. Slab objects can be shared between
> cgroups, and often can't be reclaimed on cgroup removal without an impact on
> the rest of the system.

I've already pointed out that it is preferable for shared objects to
stay in cache rather than face expedited reclaim:

https://lore.kernel.org/linux-mm/20190131221904.GL4205@dastard/

dgc> However, the memcg reaper *doesn't need to be perfect* to solve the
dgc> "takes too long to clean up dying memcgs" problem. Even if it leaves
dgc> shared objects behind (which we want to do!), it still trims those
dgc> memcgs down to /just the shared objects still in use/. And given that
dgc> objects shared by memcgs are in the minority (according to past
dgc> discussions about the difficulties of accounting them correctly) I
dgc> think this is just fine.
dgc>
dgc> Besides, those remaining shared objects are the ones we want to
dgc> naturally age out under memory pressure, but otherwise the memcgs
dgc> will have been shaken clean of all other objects accounted to them.
dgc> i.e. the "dying memcg" memory footprint goes down massively and the
dgc> "long term buildup" of dying memcgs basically goes away.

This all seems like pretty desirable cross-memcg working set maintenance
behaviour to me...

> Applying constant artificial memory pressure precisely only
> on objects accounted to dying cgroups is challenging and will likely
> cause a quite significant overhead.

I don't know where you got that from - the above example is clearly a
once-off cleanup. And executing it via a workqueue in the async memcg
cleanup path (which already runs through multiple work queues to run and
wait for different stages of cleanup) is not complex or challenging. Nor
is it likely to add overhead, because it means we avoid the long term
shrinker scanning overhead that cleanup currently requires.

> Also, by "forgetting" of some slab objects
> under light or even moderate memory pressure, we're wasting memory, which can
> be used for something useful.

Cached memory is not "forgotten" or "wasted memory". If the scan count
for a shrinker is too small to be used, it is deferred to the next
shrinker invocation. This batching behaviour is intentionally done for
scan efficiency purposes.

Don't take my word for it - read the discussion that went along with
commit 0b1fb40a3b12 ("mm: vmscan: shrink all slab objects if tight on
memory"):

https://lore.kernel.org/lkml/20140115012541.ad302526.akpm@xxxxxxxxxxxxxxxxxxxx/

From Andrew:

akpm> Actually, the intent of batching is to limit the number of calls to
akpm> ->scan(). At least, that was the intent when I wrote it! This is a
akpm> good principle and we should keep doing it. If we're going to send the
akpm> CPU away to tread on a pile of cold cachelines, we should make sure
akpm> that it does a good amount of work while it's there.

IOWs, the "small scan" proposals defeat existing shrinker efficiency
optimisations. This change in behaviour is where the CPU usage
regressions in "small cache" scanning come from. As Andrew said: scan
batching is a good principle and we should keep doing it.
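
To make that batching principle concrete, here's a toy illustration of
the deferral behaviour - this is not the real do_shrink_slab() code,
just the shape of the algorithm it implements: work below the batch
size is carried over to the next pass instead of being dribbled out as
lots of tiny ->scan() calls.

        /*
         * Toy model of shrinker scan batching, not the real
         * do_shrink_slab() code.
         */
        struct toy_shrinker {
                unsigned long   nr_deferred;    /* work carried over */
                unsigned long   batch_size;     /* min work per ->scan() */
                unsigned long   (*scan)(unsigned long nr_to_scan);
        };

        static unsigned long toy_shrink(struct toy_shrinker *s,
                                        unsigned long delta)
        {
                unsigned long total_scan = s->nr_deferred + delta;
                unsigned long freed = 0;

                /* only push work to ->scan() in batch_size chunks... */
                while (total_scan >= s->batch_size) {
                        freed += s->scan(s->batch_size);
                        total_scan -= s->batch_size;
                }

                /* ...and defer the remainder to the next invocation */
                s->nr_deferred = total_scan;
                return freed;
        }

The "small scan" proposals effectively bypass that deferral and issue
the tiny scans immediately, which is exactly the cold-cacheline
behaviour Andrew was warning about.
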
> Dying cgroups are just making this problem more
> obvious because of their size.

This only looks like a problem for dying cgroups because they have
extremely poor life cycle management. Expediting dying memcg cache
cleanup is the way to fix this, and that does not require us to change
global memory reclaim behaviour.

> So, using "natural" memory pressure in a way, that all slabs objects are
> scanned periodically, seems to me as the best solution. The devil is in
> details, and how to do it without causing any regressions, is an open
> question now.
>
> Also, completely re-parenting slabs to parent cgroup (not only shrinker lists)
> is a potential option to consider.

That should be done once the memcg gc thread has shrunk the caches down
to just the shared objects (which we want to keep in cache!) that
reference the dying memcg. That will get rid of all the remaining
references and allow the memcg to be reclaimed completely.

> It will be nice to discuss the problem on LSF/MM, agree on general path and
> make a potential list of benchmarks, which can be used to prove the solution.

In reality, it comes down to this - should we:

a) add a small amount of code into the subsystem to perform expedited
   reaping of subsystem owned objects, and test against the known,
   specific reproducing workload; or

b) change the global memory reclaim algorithms in a way that affects
   every Linux machine and workload in some way, resulting in us having
   to revalidate and rebalance memory reclaim for a large number of
   common workloads across all filesystems and subsystems that use
   shrinkers, on a wide range of different storage hardware and on both
   headless and desktop machines.

And when we look at it this way, if we end up with option b) as the
preferred solution then we've well and truly jumped the shark.

The validation effort required for option b) is way out of proportion
with the small niche of machines and environments affected by the dying
memcg problem, and the risk of regressions for users outside these
memcg-heavy environments is extremely high (as has already been proven).

Cheers,

Dave
-- 
Dave Chinner
david@xxxxxxxxxxxxx