On Tue, Mar 30, 2021 at 2:10 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> [...]
>
> > The main concern I have with *just* reparenting LRU pages is that for
> > the long running systems, the root memcg will become a dumping ground.
> > In addition a job running multiple times on a machine will see
> > inconsistent memory usage if it re-accesses the file pages which were
> > reparented to the root memcg.
>
> I don't understand how Muchun's patches are supposed to *change* the
> behavior the way you are describing it. This IS today's behavior.
>
> We have hierarchical accounting, and a page that belongs to a leaf
> cgroup will automatically belong to all its parents.
>
> Further, if you delete a cgroup today, the abandoned cache will stay
> physically linked to that cgroup, but that zombie group no longer acts
> as a control structure: it imposes no limit and no protection; the
> pages will be reclaimed as if it WERE linked to the parent.
>
> For all intents and purposes, when you delete a cgroup today, its
> remaining pages ARE dumped onto the parent.
>
> The only difference is that today they pointlessly pin the leaf cgroup
> as a holding vessel - which is then round-robin'd from the parent
> during reclaim in order to pretend that all these child pages actually
> ARE linked to the parent's LRU list.
>
> Remember how we used to have every page physically linked to multiple
> lrus? The leaf cgroup and the root?
>
> All pages always belong to the (virtual) LRU list of all ancestor
> cgroups. The only thing Muchun changes is that they no longer pin a
> cgroup that has no semantic meaning anymore (because it's neither
> visible to the user nor exerts any control over the pages anymore).
>

Indeed you are right. Even if the physical representation of the tree
has changed, the logical picture remains the same.

[Subconsciously I was sad that we would lose the information about the
origin memcg of the page for debugging purposes, but then I thought
that if we really need it, we can just add that metadata to the
obj_cgroup object (a rough sketch of what I mean is below, after the
quoted text). So, never mind.]

> Maybe I'm missing something that either you or Roman can explain to
> me. But this series looks like a (rare) pure win.
>
> Whether you like the current semantics is a separate discussion IMO.
>
> > Please note that I do agree with the mentioned problem and we do see
> > this issue in our fleet. Internally we have a "memcg mount option"
> > feature which couples a file system with a memcg, and all file pages
> > allocated on that file system will be charged to that memcg. Multiple
> > instances (concurrent or subsequent) of the job will use that file
> > system (with a dedicated memcg) without leaving the zombies behind. I
> > am not pushing for this solution as it comes with its own intricacies
> > (e.g. if a memcg coupled with a file system has a limit, the oom
> > behavior would be awkward and therefore internally we don't put a
> > limit on such memcgs). Though I want this to be part of the
> > discussion.
>
> Right, you disconnect memory from the tasks that are allocating it,
> and so you can't assign culpability when you need to.
>
> OOM is one thing, but there are also CPU cycles and IO bandwidth
> consumed during reclaim.
>

We didn't really have any issue regarding CPU or IO, but that might be
due to our unique setup (i.e. no local disk).

> > I think the underlying reasons behind this issue are:
> >
> > 1) Filesystem shared by disjoint jobs.
> > 2) For job dedicated filesystems, the lifetime of the filesystem is
> >    different from the lifetime of the job.
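
To make the obj_cgroup remark above concrete, this is roughly what I
had in mind. It is only an illustrative sketch, not a proposal: the
extra field and objcg_record_origin() are hypothetical and do not
exist upstream; cgroup_ino() is an existing helper.

/*
 * Illustrative only: record where a charge originally came from so
 * that the origin survives later reparenting of the LRU pages.
 */
struct obj_cgroup {
	/* ... existing fields (refcnt, memcg pointer, etc.) ... */

	/* debugging aid: cgroup inode number of the charging memcg */
	u64 origin_cgrp_ino;
};

/* Hypothetical hook at charge time, before any reparenting happens. */
static void objcg_record_origin(struct obj_cgroup *objcg,
				struct mem_cgroup *memcg)
{
	objcg->origin_cgrp_ino = cgroup_ino(memcg->css.cgroup);
}

Since the origin is just an identifier rather than a reference, it
would not pin the offlined memcg the way the current per-page linkage
does.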
> There is also the case of deleting a cgroup just to recreate it right
> after for the same job. Many job managers do this on restart right
> now - like systemd, and what we're using in our fleet. This seems
> avoidable by recycling a group for another instance of the same job.

I was bundling the scenario you mentioned with (2), i.e. the
filesystem persists across multiple subsequent instances of the same
job.

> Sharing is a more difficult discussion. If you access a page that you
> share with another cgroup, it may or may not be subject to your own
> or your buddy's memory limits. The only limit it is guaranteed to be
> subjected to is that of your parent. So one thing I could imagine is,
> instead of having a separate cgroup outside the hierarchy, we would
> reparent live pages the second they are accessed from a foreign
> cgroup. And reparent them until you reach the first common ancestor.
>
> This way, when you mount a filesystem shared by two jobs, you can put
> them into a joint subtree, and the root level of this subtree
> captures all the memory (as well as the reclaim CPU and IO) used by
> the two jobs - the private portions and the shared portions - and
> doesn't make them the liability of jobs in the system that DON'T
> share the same fs.

I will give more thought to this idea and see where it goes. (A rough
sketch of how I understood the "first common ancestor" part is at the
end of this mail.)

> But again, this is a useful discussion to have, but I don't quite see
> why it's relevant to Muchun's patches. They're purely an
> optimization.
>
> So I'd like to clear that up first before going further.

I think we are on the same page, i.e. these patches change the
physical representation of the memcg tree but logically it remains the
same, and they fix the zombie memcg issue.
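
Just to check my understanding of the "reparent until you reach the
first common ancestor" idea, here is a rough sketch. It is not a
patch; only parent_mem_cgroup() and mem_cgroup_is_descendant() are
existing helpers, everything else (the function name, where it would
be called from on foreign access) is hypothetical.

/*
 * On foreign access, find the first ancestor of the page's current
 * memcg that also contains the accessing memcg; the page would then
 * be moved to that memcg's LRU.
 */
static struct mem_cgroup *first_common_ancestor(struct mem_cgroup *page_memcg,
						struct mem_cgroup *accessor)
{
	struct mem_cgroup *memcg = page_memcg;

	/*
	 * root_mem_cgroup is an ancestor of every memcg, so the walk
	 * terminates. mem_cgroup_is_descendant() treats a memcg as a
	 * descendant of itself, so a non-foreign access is a no-op.
	 */
	while (memcg && !mem_cgroup_is_descendant(accessor, memcg))
		memcg = parent_mem_cgroup(memcg);

	return memcg;
}

If both jobs sit under the joint subtree you describe, the walk stops
at (or below) that subtree's root, so the shared pages stay the
liability of the jobs that actually share the fs.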