On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> >
> > A task already in exit can get stuck trying to allocate pages, if its
> > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > zswap writeback is enabled, and the remaining memory in the cgroup is
> > not compressible.
> >
> > This seems like an unlikely confluence of events, but it can happen
> > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> >
> > When this happens, it can sometimes take hours for tasks to exit,
> > as they are all trying to squeeze things into zswap to bring the group's
> > memory consumption below memory.max.
> >
> > Allowing these exiting programs to push some memory from their own
> > cgroup into swap allows them to quickly bring the cgroup's memory
> > consumption below memory.max, and exit in seconds rather than hours.
> >
> > Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
>
> Thanks for sending a v2.
>
> I still think maybe this needs to be fixed on the memcg side, at least
> by not making exiting tasks try really hard to reclaim memory to the
> point where this becomes a problem. IIUC there could be other reasons
> why reclaim may take too long, but maybe not as pathological as this
> case to be fair. I will let the memcg maintainers chime in for this.
>
> If there's a fundamental reason why this cannot be fixed on the memcg
> side, I don't object to this change.
>
> Nhat, any objections on your end? I think your fleet workloads were
> the first users of this interface. Does this break their expectations?
>

Let me give my personal take. This seems like a stopgap or a quick hack
to resolve a very specific situation happening in the real world. I am
ok with having this solution, but only temporarily.

The reason I think this is a short-term fix or a quick hack is that it
does not solve the fundamental issue here. The same situation can recur
if, let's say, the swap storage is slow, stuck or contended. A somewhat
similar situation arises when there is a lot of unreclaimable memory,
either through pinning or maybe mlock.

The fundamental issue is that the exiting process (killed by oomd or
simply exiting) has to allocate memory, but the cgroup is at its limit
and reclaim is very, very slow.

I can see attacking this issue from multiple angles: some mixture of
reusing the kernel's oom reaper and some buffer to allow the exiting
process to go over the limit. Let's brainstorm and explore this
direction.

In the meantime, I think we can have this stopgap solution.
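
To make the "buffer" idea a bit more concrete, here is a rough userspace
toy model, not kernel code and not a proposed patch. Every name in it
(toy_cgroup, toy_task, exit_buffer, toy_try_charge) is made up for
illustration, and the headroom size is an arbitrary assumption. It only
shows the shape of the check: a task that is already exiting is charged
against max plus a small headroom, while everyone else stays bound by
max.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model. None of these names exist in the kernel. */
struct toy_cgroup {
	size_t usage;       /* pages currently charged */
	size_t max;         /* models memory.max, in pages */
	size_t exit_buffer; /* hypothetical headroom for exiting tasks */
};

struct toy_task {
	bool exiting;       /* models a task that is already in exit */
};

/*
 * Returns true if nr_pages were charged, false if the caller would
 * have to reclaim first. Exiting tasks get max + exit_buffer as
 * their effective limit; all other tasks are held to max.
 */
static bool toy_try_charge(struct toy_cgroup *cg, const struct toy_task *t,
			   size_t nr_pages)
{
	size_t limit = cg->max;

	assert(nr_pages > 0);

	if (t->exiting)
		limit += cg->exit_buffer;

	/* Written this way to avoid overflow in usage + nr_pages. */
	if (nr_pages > limit || cg->usage > limit - nr_pages)
		return false;

	cg->usage += nr_pages;
	return true;
}
```

With usage == max, a normal task fails the charge and would go into
reclaim, while an exiting task can still charge up to exit_buffer pages
and make progress towards freeing its memory. How large such a buffer
should be, and how it would interact with the oom reaper, is exactly
what needs to be explored.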