On Thu, Dec 12, 2024 at 10:03 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Thu, 2024-12-12 at 09:51 -0800, Shakeel Butt wrote:
> >
> > The fundamental issue is that the exiting process (killed by oomd or
> > simple exit) has to allocate memory but the cgroup is at its limit
> > and the reclaim is very, very slow.
> >
> > I can see attacking this issue from multiple angles.
>
> Besides your proposed ideas, I suppose we could also limit
> the gfp_mask of an exiting reclaimer with e.g. __GFP_NORETRY,
> but I do not know how effective that would be, since a single
> pass through the memory reclaim code was still taking dozens
> of seconds when I traced the "stuck" workloads.

I know we already discussed this, but it'd be nice if we could let the
exiting task go ahead with the page fault and bypass the memory limits,
if the page fault is crucial for it to make forward progress (rough
sketch of what I have in mind at the end of this mail [*]). Not sure
how feasible that is, or how to decide which page faults are really
crucial though :)

For the pathological memory.zswap.writeback disabling case in
particular, another thing we could do here is make these incompressible
pages ineligible for further reclaim attempts, either by putting them
on a non-reclaim LRU, or by putting them on the zswap LRU to maintain
the total ordering of the LRUs. That way we can move on to other
sources (slab caches, for example) sooner, or fail earlier. That said,
it remains to be seen what will happen if these incompressible pages
are literally all that is left...

I'm biased toward this idea though, because it has other benefits.
Maybe I'm just looking for excuses to revive the project ;)
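
[*] A very rough sketch of the bypass idea, purely illustrative - the
helper name is made up, and the fatal-signal/PF_EXITING test is just my
guess at "this task is on its way out", not an existing upstream check:

	/* illustrative only - would live somewhere in mm/memcontrol.c */
	static bool task_should_bypass_limit(void)
	{
		/*
		 * A dying or exiting task is about to give its memory
		 * back anyway; stalling it in reclaim at the memcg
		 * limit only delays that.
		 */
		return fatal_signal_pending(current) ||
		       (current->flags & PF_EXITING);
	}

and then early in the charge path, before we enter reclaim:

		if (task_should_bypass_limit())
			goto force;	/* charge past the limit, skip reclaim */

The hard part is still the one mentioned above: telling apart the page
faults the exit path actually needs from the ones it does not.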