Re: [PATCH v2] mm/vmscan: check references from all memcgs for swapbacked memory

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Thu, 6 Oct 2022 11:29:31 -0700

On Thu, Oct 6, 2022 at 8:32 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> > On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > > >
> > > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > > for it appropriately.
> > > > > > >
> > > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > > page/folio is a viable reclaim target.
> > > > > > >
> > > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > > the folio is swapbacked.
> > > > > >
> > > > > > It seems to me this change can potentially increase the number of
> > > > > > zombie memcgs. Any risk assessment done on this?
> > > > >
> > > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > > from a zombie memcg and swapping out would let us move the charge to
> > > > > the parent?
> > > >
> > > > The scenario is quite straightforward: for a page charged to memcg A
> > > > and also actively used by memcg B, if we don't ignore the access from
> > > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> > >
> > > This patch changes the behavior of limit-induced reclaim. There is no
> > > limit reclaim on A after it's been deleted. And parental/global
> > > reclaim has always recognized outside references.
> >
> > Do you mind elaborating on the parental reclaim part?
> >
> > I am looking at the code and it looks like memcg reclaim of a parent
> > (limit-induced or proactive) will only consider references coming from
> > its subtree, even when reclaiming from its dead children. It looks
> > like as long as sc->target_mem_cgroup is set, we ignore outside
> > references (relative to sc->target_mem_cgroup).
>
> Yes, I was referring to outside of A.
>
> As of today, any siblings of A can already pin its memory after it's
> dead. I suppose your patch would add cousins to that list. It doesn't
> seem like a categorial difference to me.
>
> > If that is true, maybe we want to keep ignoring outside references for
> > swap-backed pages if the folio is charged to a dead memcg? My
> > understanding is that in this case we will uncharge the page from the
> > dead memcg and charge the swapped entry to the parent, reducing the
> > number of refs on the dead memcg. Without this check, this patch might
> > prevent the charge from being moved to the parent in this case. WDYT?
>
> I don't think it's worth it. Keeping the semantics simple and behavior
> predictable is IMO more valuable.
>
> It also wouldn't fix the scrape-before-rmdir issue Yu points out,
> which I think is the more practical concern. In light of that, it
> might be best to table the patch for now. (Until we have
> reparent-on-delete for anon and file pages...)

If we add a mem_cgroup_online() check, we partially solve the problem.
Maybe scrape-before-rmdir will not reclaim those pages at once, but
the next time we try to reclaim from the dead memcg (global, limit,
proactive,..) we will reclaim the pages. So we will only be delaying
the freeing of those zombie memcgs.

We have had a variant of this behavior in our production for a while,
and we haven't been having a zombie problem, but I guess the way our
userspace is setup is smart enough to stop referencing shared memory
charged to a dead memcg. One could argue that in general, having
swap-backed pages charged to a dead memcg is less common compared to
file pages.

Anyway, I wanted to give some background about why this is a real
problem that we have been facing. If a tmpfs mount is used by multiple
memcgs (say A and B), and we try to reclaim fromA using
memory.reclaim, we would observe pages that are more heavily used by B
as cold and reclaim them. Afterwards, they get swapped back once B
accesses them again, and charged back to A. We observe increased
refaults and fallback on proactive reclaim, missing the actual cold
memory. With this behavior, reclaim chooses pages that are cold
relative to both A and B, we observe less refaults, and end up
actually trimming cold memory. I suspect the same can happen with
limit-induced reclaim, where memcg A keeps hits its limit, we may
reclaim pages used by B, they get refaulted back in, we go to reclaim
again,... thrashing.

So it is a real problem that we are trying to solve. Given that adding
an online check would partially solves the scrape-before-rmdir problem
(zombie memcgs might stick around a little bit longer, but they should
eventually go away eventually), and that this patch addresses a real
problem and would overall improve reclaim policy for living memcgs
(which is *hopefully* the majority), it would be great if you could
reconsider this. Please let me know if anything that I have said
doesn't make sense to you.