On Wed, Oct 05, 2022 at 11:10:37PM -0600, Yu Zhao wrote:
> On Wed, Oct 5, 2022 at 10:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > accessed (hot memory). The rationale is that this memory is likely to
> > > > > > be accessed soon, and hence reclaiming it will cause a refault.
> > > > > >
> > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > pages would get charged to the memcg of the process that was faulting
> > > > > > them in. It made sense to only consider accesses coming from processes
> > > > > > in the subtree of target_mem_cgroup. If a page was charged to memcg A
> > > > > > but only being accessed by a sibling memcg B, we would reclaim it if
> > > > > > memcg A is the reclaim target. memcg B can then fault it back in and
> > > > > > get charged for it appropriately.
> > > > > >
> > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > file pages, when swapbacked pages are refaulted they are charged to
> > > > > > the memcg that was originally charged for them during swapping out.
> > > > > > This means that if a swapbacked page is charged to memcg A but only
> > > > > > used by memcg B, and we reclaim it from memcg A, it would simply be
> > > > > > faulted back in and charged again to memcg A once memcg B accesses it.
> > > > > > In that sense, accesses from all memcgs matter equally when
> > > > > > considering if a swapbacked page/folio is a viable reclaim target.
> > > > > >
> > > > > > Modify folio_referenced() to always consider accesses from all memcgs
> > > > > > if the folio is swapbacked.
> > > > >
> > > > > It seems to me this change can potentially increase the number of
> > > > > zombie memcgs. Any risk assessment done on this?
> > > >
> > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > from a zombie memcg and swapping out would let us move the charge to
> > > > the parent?
> > >
> > > The scenario is quite straightforward: for a page charged to memcg A
> > > and also actively used by memcg B, if we don't ignore the access from
> > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> >
> > This patch changes the behavior of limit-induced reclaim. There is no
> > limit reclaim on A after it's been deleted. And parental/global
> > reclaim has always recognized outside references.
>
> We use memory.reclaim to scrape memcgs right before rmdir so that they
> are unlikely to stick around. Otherwise our job scheduler would see
> less available memory and become less eager to increase load. This in
> turn reduces the chance of global reclaim, and deleted memcgs would
> stick around even longer.

Thanks for the context.
It's not great that we have to design reclaim policy around this
implementation detail of past-EOF pins (memcgs kept pinned after rmdir
by residual charges). But such is life until we get rid of them.
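
For readers outside the thread, here is roughly where such a change
would land. This is a minimal sketch, assuming the mainline structure
of folio_referenced() in mm/rmap.c, where a non-NULL memcg argument
installs invalid_folio_referenced_vma as the rmap walk's invalid_vma
hook so that mappings outside the target memcg are skipped; it is an
illustration of the idea, not the posted patch:

	/*
	 * Sketch only, not the actual patch. folio_referenced()
	 * applies the memcg filter by installing an invalid_vma
	 * callback on the rmap walk, so references from tasks in
	 * other cgroups are not counted. Skipping that callback
	 * for swapbacked folios would count references from all
	 * memcgs for anon/shmem memory, while file folios keep
	 * the memcg-local check:
	 */
	if (memcg && !folio_test_swapbacked(folio))
		rwc.invalid_vma = invalid_folio_referenced_vma;

With that, a shmem page charged to memcg A but hot in memcg B would be
reported as referenced during A's limit reclaim and skipped, instead
of being swapped out only to be faulted straight back in on A's charge.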