On Wed, Dec 11, 2024 at 9:20 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > >
> > > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > > > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > > > >
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > > >  	if (!zswap_is_enabled())
> > > > >  		return true;
> > > > >
> > > > > +	/*
> > > > > +	 * Always allow exiting tasks to push data to swap. A process in
> > > > > +	 * the middle of exit cannot get OOM killed, but may need to push
> > > > > +	 * uncompressible data to swap in order to get the cgroup memory
> > > > > +	 * use below the limit, and make progress with the exit.
> > > > > +	 */
> > > > > +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > > > +		return true;
> > > > > +
> > > >
> > > > I have a few questions:
> > > > (a) If the task is being OOM killed it should be able to charge memory
> > > > beyond memory.max, so why do we need to get the usage down below the
> > > > limit?
> > > >
> > > If it is a kernel directed memcg OOM kill, that is true.
> > >
> > > However, if the exit comes from somewhere else, like a userspace oomd
> > > kill, we might not hit that code path.
> >
> > Why do we treat dying tasks differently based on the source of the
> > kill?
> >
> Are you saying we should fail allocations for every dying task, and
> add a check for PF_EXITING in here?

I am asking, not really suggesting anything :)

Does it matter from the kernel perspective if the task is dying due to a
kernel OOM kill or a userspace SIGKILL?
> > > 	if (unlikely(task_in_memcg_oom(current)))
> > > 		goto nomem;
> > >
> > > However, we don't know until the attempted zswap write whether the
> > > memory is compressible, and whether doing a bunch of zswap writes
> > > will help us bring our memcg down below its memory.max limit.
> >
> > If we are at memory.max (or memory.zswap.max), we can't compress
> > pages into zswap anyway, regardless of their compressibility.
> >
> Wait, this is news to me.
>
> This seems like something we should fix, rather than live with, since
> compressing the data to a smaller size could bring us below memory.max.
>
> Is this "cannot compress when at memory.max" behavior intentional, or
> just a side effect of how things happen to be?
>
> Won't the allocations made from zswap_store ignore the memory.max
> limit because PF_MEMALLOC is set?

My bad, obj_cgroup_may_zswap() only checks the zswap limit, not
memory.max. Please ignore this.

The scenario I described where we scan the LRUs needlessly is if the
*zswap limit* is hit, and writeback is disabled. I am guessing this is
not the case you're running into.

So yeah, my only outstanding question is the one above about handling
userspace OOM kills differently. Thanks for bearing with me.