On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> >
> > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > > >
> > > > +++ b/mm/memcontrol.c
> > > > @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > >  	if (!zswap_is_enabled())
> > > >  		return true;
> > > >
> > > > +	/*
> > > > +	 * Always allow exiting tasks to push data to swap. A process in
> > > > +	 * the middle of exit cannot get OOM killed, but may need to push
> > > > +	 * uncompressible data to swap in order to get the cgroup memory
> > > > +	 * use below the limit, and make progress with the exit.
> > > > +	 */
> > > > +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > > +		return true;
> > > > +
> > >
> > > I have a few questions:
> > > (a) If the task is being OOM killed it should be able to charge memory
> > > beyond memory.max, so why do we need to get the usage down below the
> > > limit?
> > >
> > If it is a kernel directed memcg OOM kill, that is true.
> >
> > However, if the exit comes from somewhere else,
> > like a userspace oomd kill, we might not hit that
> > code path.
>
> Why do we treat dying tasks differently based on the source of the
> kill?
>
Are you saying we should fail allocations for every dying
task, and add a check for PF_EXITING in here?

	if (unlikely(task_in_memcg_oom(current)))
		goto nomem;

> > However, we don't know until the attempted zswap write
> > whether the memory is compressible, and whether doing
> > a bunch of zswap writes will help us bring our memcg
> > down below its memory.max limit.
>
> If we are at memory.max (or memory.zswap.max), we can't compress pages
> into zswap anyway, regardless of their compressibility.
>
Wait, this is news to me.

This seems like something we should fix, rather than
live with, since compressing the data to a smaller
size could bring us below memory.max.

Is this "cannot compress when at memory.max" behavior
intentional, or just a side effect of how things happen
to be?

Won't the allocations made from zswap_store ignore
the memory.max limit because PF_MEMALLOC is set?

> > >
> > > (b) Should we use mem_cgroup_is_descendant() or mm_match_memcg() in
> > > case we are reclaiming from an ancestor and we hit the limit of that
> > > ancestor?
> > >
> > I don't know if we need or want to reclaim from any
> > other memcgs than those of the exiting process itself.
> >
> > A small blast radius seems like it could be desirable,
> > but I'm open to other ideas :)
>
> The exiting process is part of all the ancestor cgroups by the
> hierarchy.
>
> If we have the following hierarchy:
> root
>  |
>  A
>  |
>  B
>
> Then a process in cgroup B could be getting OOM killed due to hitting
> the limit of A, not B. In which case, reclaiming from A helps us get
> below the limit. We can check if the cgroup is an ancestor and it hit
> its limit, but maybe that's an overkill.

Since we're dealing with a corner case anyway, I suppose
there's no harm using mm_match_cgroup, which also happens
to be cleaner than the code I have right now.

--
All Rights Reversed.
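
For reference, a minimal sketch of what the check in
mem_cgroup_zswap_writeback_enabled() could look like with mm_match_cgroup()
substituted for the direct memcg comparison discussed above. This is
illustrative only, not the final patch, and it assumes current->mm is still
valid at the point where this function is called during reclaim:

	/*
	 * Sketch: let an exiting task write back to swap for its own
	 * cgroup or any ancestor, so it can get usage below whichever
	 * limit was actually hit and make progress with the exit.
	 * mm_match_cgroup() matches against the whole hierarchy, unlike
	 * the "memcg == mem_cgroup_from_task(current)" comparison.
	 */
	if ((current->flags & PF_EXITING) && mm_match_cgroup(current->mm, memcg))
		return true;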