On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote: > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@xxxxxxxxxxx> wrote: > > > > A task already in exit can get stuck trying to allocate pages, if its > > cgroup is at the memory.max limit, the cgroup is using zswap, but > > zswap writeback is enabled, and the remaining memory in the cgroup is > > not compressible. > > > > This seems like an unlikely confluence of events, but it can happen > > quite easily if a cgroup is OOM killed due to exceeding its memory.max > > limit, and all the tasks in the cgroup are trying to exit simultaneously. > > > > When this happens, it can sometimes take hours for tasks to exit, > > as they are all trying to squeeze things into zswap to bring the group's > > memory consumption below memory.max. > > > > Allowing these exiting programs to push some memory from their own > > cgroup into swap allows them to quickly bring the cgroup's memory > > consumption below memory.max, and exit in seconds rather than hours. > > > > Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx> > > Thanks for sending a v2. > > I still think maybe this needs to be fixed on the memcg side, at least > by not making exiting tasks try really hard to reclaim memory to the > point where this becomes a problem. IIUC there could be other reasons > why reclaim may take too long, but maybe not as pathological as this > case to be fair. I will let the memcg maintainers chime in for this. > > If there's a fundamental reason why this cannot be fixed on the memcg > side, I don't object to this change. > > Nhat, any objections on your end? I think your fleet workloads were > the first users of this interface. Does this break their expectations? Yes, I don't think we can do this, unfortunately :( There can be a variety of reasons for why a user might want to prohibit disk swap for a certain cgroup, and we can't assume it's okay to make exceptions. There might also not *be* any disk swap to overflow into after Nhat's virtual swap patches. Presumably zram would still have the issue too. So I'm also inclined to think this needs a reclaim/memcg-side fix. We have a somewhat tumultous history of policy in that space: commit 7775face207922ea62a4e96b9cd45abfdc7b9840 Author: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Date: Tue Mar 5 15:46:47 2019 -0800 memcg: killed threads should not invoke memcg OOM killer allowed dying tasks to simply force all charges and move on. This turned out to be too aggressive; there were instances of exiting, uncontained memcg tasks causing global OOMs. This lead to that: commit a4ebf1b6ca1e011289677239a2a361fde4a88076 Author: Vasily Averin <vasily.averin@xxxxxxxxx> Date: Fri Nov 5 13:38:09 2021 -0700 memcg: prohibit unconditional exceeding the limit of dying tasks which reverted the bypass rather thoroughly. Now NO dying tasks, *not even OOM victims*, can force charges. I am not sure this is correct, either: If we return -ENOMEM to an OOM victim in a fault, the fault handler will re-trigger OOM, which will find the existing OOM victim and do nothing, then restart the fault. This is a memory deadlock. The page allocator gives OOM victims access to reserves for that reason. Actually, it looks even worse. For some reason we're not triggering OOM from dying tasks: ret = task_is_dying() || out_of_memory(&oc); Even though dying tasks are in no way privileged or allowed to exit expediently. Why shouldn't they trigger the OOM killer like anybody else trying to allocate memory? As it stands, it seems we have dying tasks getting trapped in an endless fault->reclaim cycle; with no access to the OOM killer and no access to reserves. Presumably this is what's going on here? I think we want something like this: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 53db98d2c4a1..be6b6e72bde5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_margin(memcg) >= (1 << order)) goto unlock; - /* - * A few threads which were not waiting at mutex_lock_killable() can - * fail to bail out. Therefore, check again after holding oom_lock. - */ - ret = task_is_dying() || out_of_memory(&oc); + ret = out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (unlikely(current->flags & PF_MEMALLOC)) goto force; + if (unlikely(tsk_is_oom_victim(current))) + goto force; + if (unlikely(task_in_memcg_oom(current))) goto nomem;