Re: [PATCH v2] mm/vmscan: check references from all memcgs for swapbacked memory

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 6 Oct 2022 11:32:45 -0400

On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > >
> > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > for it appropriately.
> > > > > >
> > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > page/folio is a viable reclaim target.
> > > > > >
> > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > the folio is swapbacked.
> > > > >
> > > > > It seems to me this change can potentially increase the number of
> > > > > zombie memcgs. Any risk assessment done on this?
> > > >
> > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > from a zombie memcg and swapping out would let us move the charge to
> > > > the parent?
> > >
> > > The scenario is quite straightforward: for a page charged to memcg A
> > > and also actively used by memcg B, if we don't ignore the access from
> > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> >
> > This patch changes the behavior of limit-induced reclaim. There is no
> > limit reclaim on A after it's been deleted. And parental/global
> > reclaim has always recognized outside references.
> 
> Do you mind elaborating on the parental reclaim part?
> 
> I am looking at the code and it looks like memcg reclaim of a parent
> (limit-induced or proactive) will only consider references coming from
> its subtree, even when reclaiming from its dead children. It looks
> like as long as sc->target_mem_cgroup is set, we ignore outside
> references (relative to sc->target_mem_cgroup).

Yes, I was referring to outside of A.

As of today, any siblings of A can already pin its memory after it's
dead. I suppose your patch would add cousins to that list. It doesn't
seem like a categorial difference to me.

> If that is true, maybe we want to keep ignoring outside references for
> swap-backed pages if the folio is charged to a dead memcg? My
> understanding is that in this case we will uncharge the page from the
> dead memcg and charge the swapped entry to the parent, reducing the
> number of refs on the dead memcg. Without this check, this patch might
> prevent the charge from being moved to the parent in this case. WDYT?

I don't think it's worth it. Keeping the semantics simple and behavior
predictable is IMO more valuable.

It also wouldn't fix the scrape-before-rmdir issue Yu points out,
which I think is the more practical concern. In light of that, it
might be best to table the patch for now. (Until we have
reparent-on-delete for anon and file pages...)