Re: [External] Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

Shakeel Butt <shakeelb@xxxxxxxxxx> · Fri, 12 Mar 2021 15:18:38 -0800

On Fri, Mar 12, 2021 at 3:07 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, Mar 12, 2021 at 02:42:45PM -0800, Shakeel Butt wrote:
> > Hi Johannes,
> >
> > On Fri, Mar 12, 2021 at 11:23 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > [...]
> > >
> > > Longer term we most likely need it there anyway. The issue you are
> > > describing in the cover letter - allocations pinning memcgs for a long
> > > time - it exists at a larger scale and is causing recurring problems
> > > in the real world: page cache doesn't get reclaimed for a long time,
> > > or is used by the second, third, fourth, ... instance of the same job
> > > that was restarted into a new cgroup every time. Unreclaimable dying
> > > cgroups pile up, waste memory, and make page reclaim very inefficient.
> > >
> >
> > For the scenario described above, do we really want to reparent the
> > page cache pages? Shouldn't we recharge the pages to the second,
> > third, fourth and so on, memcgs? My concern is that we will see a big
> > chunk of page cache pages charged to root and will only get reclaimed
> > on global pressure.
>
> Sorry, I'm proposing to reparent to the ancestor, not root. It's an
> optimization, not a change in user-visible behavior.
>
> As far as the user can tell, the pages already belong to the parent
> after deletion: they'll show up in the parent's stats, naturally, and
> they will get reclaimed as part of the parent being reclaimed.
>
> The dead cgroup doesn't even have its own limit anymore after
> .css_reset() has run. And we already physically reparent slab objects
> in memcg_reparent_objcgs() and memcg_drain_all_list_lrus().
>
> I'm just saying we should do the same thing for LRU pages.

I understand the proposal and I agree it makes total sense when a job
is recycling sub-job/sub-container.

I was talking about the (recycling of the) top level cgroups. Though
for that to be an issue, I suppose the file system has to be shared
between the jobs on the system. I was wondering if a page cache
reaches the root memcg on multiple reparenting, should the next access
cause that page to be charged to the accessor?