On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote: > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@xxxxxxxxxxx> > wrote: > > > > +++ b/mm/memcontrol.c > > @@ -5371,6 +5371,15 @@ bool > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > > if (!zswap_is_enabled()) > > return true; > > > > + /* > > + * Always allow exiting tasks to push data to swap. A > > process in > > + * the middle of exit cannot get OOM killed, but may need > > to push > > + * uncompressible data to swap in order to get the cgroup > > memory > > + * use below the limit, and make progress with the exit. > > + */ > > + if ((current->flags & PF_EXITING) && memcg == > > mem_cgroup_from_task(current)) > > + return true; > > + > > I have a few questions: > (a) If the task is being OOM killed it should be able to charge > memory > beyond memory.max, so why do we need to get the usage down below the > limit? > If it is a kernel directed memcg OOM kill, that is true. However, if the exit comes from somewhere else, like a userspace oomd kill, we might not hit that code path. > Looking at the other thread with Michal, it looks like it's because > we > have to go into reclaim first before we get to the point of force > charging for dying tasks, and we spend too much time in reclaim. Is > that correct? > > If that's the case, I am wondering if the real problem is that we > check mem_cgroup_zswap_writeback_enabled() too late in the process. > Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap > entries, only to realize it cannot swap in swap_writepage(). > > Should we check for this in can_reclaim_anon_pages()? If zswap > writeback is disabled and we are already at the memcg limit (or zswap > limit for that matter), we should avoid scanning anon memory to begin > with. The problem is that if we race with memory being freed we may > have some extra OOM kills, but I am not sure how common this case > would be. However, we don't know until the attempted zswap write whether the memory is compressible, and whether doing a bunch of zswap writes will help us bring our memcg down below its memory.max limit. > > (b) Should we use mem_cgroup_is_descendant() or mm_match_memcg() in > case we are reclaiming from an ancestor and we hit the limit of that > ancestor? > I don't know if we need or want to reclaim from any other memcgs than those of the exiting process itself. A small blast radius seems like it could be desirable, but I'm open to other ideas :) > (c) mem_cgroup_from_task() should be called in an RCU read section > (or > we need something like rcu_access_point() if we are not dereferencing > the pointer). > I'll add this in v2. -- All Rights Reversed.