On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is disabled, and the remaining memory in the cgroup is
> not compressible.
>
> This seems like an unlikely confluence of events, but it can happen
> quite easily if a cgroup is OOM killed due to exceeding its memory.max
> limit, and all the tasks in the cgroup are trying to exit simultaneously.
>
> When this happens, it can sometimes take hours for tasks to exit,
> as they are all trying to squeeze things into zswap to bring the group's
> memory consumption below memory.max.
>
> Allowing these exiting programs to push some memory from their own
> cgroup into swap allows them to quickly bring the cgroup's memory
> consumption below memory.max, and exit in seconds rather than hours.
>
> Loading this fix as a live patch on a system where a workload got stuck
> exiting allowed the workload to exit within a fraction of a second.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..03d77e93087e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>          if (!zswap_is_enabled())
>                  return true;
>
> +        /*
> +         * Always allow exiting tasks to push data to swap. A process in
> +         * the middle of exit cannot get OOM killed, but may need to push
> +         * uncompressible data to swap in order to get the cgroup memory
> +         * use below the limit, and make progress with the exit.
> +         */
> +        if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> +                return true;
> +

I have a few questions:

(a) If the task is being OOM killed, it should be able to charge memory
beyond memory.max, so why do we need to get the usage below the limit at
all? Looking at the other thread with Michal, it seems to be because we
have to go through reclaim before we reach the point of force charging
for dying tasks, and we spend too much time in reclaim. Is that correct?

If so, I wonder whether the real problem is that we check
mem_cgroup_zswap_writeback_enabled() too late in the process. Reclaim
ages the LRUs, isolates pages, unmaps them, and allocates swap entries,
only to find out in swap_writepage() that it cannot write them out.

Should we check for this in can_reclaim_anon_pages() instead? If zswap
writeback is disabled and we are already at the memcg limit (or the
zswap limit, for that matter), we should avoid scanning anon memory to
begin with (a rough sketch of what I mean is at the bottom of this
mail). The downside is that if we race with memory being freed we may
get some extra OOM kills, but I am not sure how common that case would
be.

(b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() here,
in case we are reclaiming from an ancestor and it is that ancestor's
limit that we hit?

(c) mem_cgroup_from_task() should be called in an RCU read section (or
we need something like rcu_access_pointer() if we are not actually
dereferencing the pointer). A sketch covering (b) and (c) is also at the
bottom of this mail.

>          for (; memcg; memcg = parent_mem_cgroup(memcg))
>                  if (!READ_ONCE(memcg->zswap_writeback))
>                          return false;
> --
> 2.47.0
>
>
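For (a), roughly what I have in mind in can_reclaim_anon_pages()
(completely untested sketch; mem_cgroup_at_limit() is a made-up
placeholder for whatever "usage is already at memory.max" check we would
want to expose to vmscan.c):

        static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
                                                  int nid,
                                                  struct scan_control *sc)
        {
                if (memcg == NULL) {
                        /* Non-memcg reclaim: is there space in any swap device? */
                        if (get_nr_swap_pages() > 0)
                                return true;
                } else if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
                        /*
                         * The memcg still has swap space, but if the only
                         * destination is zswap (writeback disabled) and the
                         * cgroup is already at its limit, storing more pages
                         * in zswap cannot bring usage down, so don't bother
                         * scanning anon memory.
                         *
                         * mem_cgroup_zswap_writeback_enabled() returns true
                         * when zswap is disabled entirely, so plain swap
                         * still works in that case. mem_cgroup_at_limit()
                         * is hypothetical, see above.
                         */
                        if (mem_cgroup_zswap_writeback_enabled(memcg) ||
                            !mem_cgroup_at_limit(memcg))
                                return true;
                }

                /* Can pages be reclaimed from this node via demotion instead? */
                return can_demote(nid, sc);
        }

This leaves out the zswap global limit case, and the racing-with-frees
concern mentioned above still applies.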
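For (b) and (c) combined, the check in the patch could look something
like the following (again untested, just to illustrate; whether
descendant matching is the right semantics here is exactly the question
in (b)):

        if (current->flags & PF_EXITING) {
                bool bypass;

                /*
                 * mem_cgroup_from_task() needs to be called under RCU, and
                 * mem_cgroup_is_descendant() mirrors what mm_match_cgroup()
                 * does: match if the task's memcg is memcg or one of its
                 * descendants.
                 */
                rcu_read_lock();
                bypass = mem_cgroup_is_descendant(mem_cgroup_from_task(current),
                                                  memcg);
                rcu_read_unlock();

                if (bypass)
                        return true;
        }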