Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs

On 4/25/23 14:53, Yosry Ahmed wrote:
On Tue, Apr 25, 2023 at 11:42 AM Waiman Long <longman@xxxxxxxxxx> wrote:
On 4/25/23 07:36, Yosry Ahmed wrote:
   +David Rientjes +Greg Thelen +Matthew Wilcox

On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@xxxxxxxxxx> wrote:
When a memcg is removed by userspace it gets offlined by the kernel.
Offline memcgs are hidden from user space, but they still live in the
kernel until their reference count drops to 0. New allocations cannot
be charged to offline memcgs, but existing allocations charged to
offline memcgs remain charged, and hold a reference to the memcg.

As such, an offline memcg can remain in the kernel indefinitely,
becoming a zombie memcg. The accumulation of a large number of zombie
memcgs leads to increased system overhead (mainly percpu data in struct
mem_cgroup). It also causes some kernel operations that scale with the
number of memcgs to become less efficient (e.g. reclaim).
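
To make the lifecycle above concrete, here is a toy Python model. It is
not kernel code: the class, its fields, and its method names are all
made up for illustration, and it assumes each charged page holds exactly
one reference on the memcg.

```python
class Memcg:
    """Toy stand-in for a memcg; every name here is illustrative only."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.charged_pages = 0  # assumption: one reference per charged page

    def charge(self, nr_pages=1):
        # New allocations cannot be charged to an offline memcg.
        if not self.online:
            raise RuntimeError("cannot charge an offline memcg")
        self.charged_pages += nr_pages

    def uncharge(self, nr_pages=1):
        if nr_pages > self.charged_pages:
            raise ValueError("uncharging more than is charged")
        self.charged_pages -= nr_pages

    def offline(self):
        # Userspace removed the cgroup; existing charges stay in place.
        self.online = False

    @property
    def is_zombie(self):
        # Offline, but still pinned by outstanding charges.
        return not self.online and self.charged_pages > 0

    @property
    def can_be_freed(self):
        return not self.online and self.charged_pages == 0
```

In this model, a memcg that is offlined with pages still charged stays a
zombie until the last of those pages is uncharged.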

There are currently out-of-tree solutions which attempt to
periodically clean up zombie memcgs by reclaiming from them. However,
that is not effective for non-reclaimable memory, which would be
better reparented or recharged to an online cgroup. There are also
proposed changes that would benefit from recharging for shared
resources like pinned pages, or DMA buffer pages.
I am very interested in attending this discussion; it's something that
I have been actively looking into -- specifically recharging pages of
offlined memcgs.

Suggested attendees:
Yosry Ahmed <yosryahmed@xxxxxxxxxx>
Yu Zhao <yuzhao@xxxxxxxxxx>
T.J. Mercier <tjmercier@xxxxxxxxxx>
Tejun Heo <tj@xxxxxxxxxx>
Shakeel Butt <shakeelb@xxxxxxxxxx>
Muchun Song <muchun.song@xxxxxxxxx>
Johannes Weiner <hannes@xxxxxxxxxxx>
Roman Gushchin <roman.gushchin@xxxxxxxxx>
Alistair Popple <apopple@xxxxxxxxxx>
Jason Gunthorpe <jgg@xxxxxxxxxx>
Kalesh Singh <kaleshsingh@xxxxxxxxxx>
I was hoping I would bring a more complete idea to this thread, but
here is what I have so far.

The idea is to recharge the memory charged to memcgs when they are
offlined. I like to think of the options we have to deal with memory
charged to offline memcgs as a toolkit. This toolkit includes:

(a) Evict memory.

This is the simplest option: just evict the memory.

For file-backed pages, this writes dirty pages back to their backing
files, then uncharges and frees the page. The next access will read the
page again and the faulting process’s memcg will be charged.

For swap-backed pages (anon/shmem), this swaps them out. Swapping out
a page charged to an offline memcg uncharges the page and charges the
swap to its parent. The next access will swap in the page and the
parent will be charged. This is effectively deferred recharging to the
parent.

Pros:
- Simple.

Cons:
- Behavior is different for file-backed vs. swap-backed pages: for
swap-backed pages, the memory is recharged to the parent (aka
reparented), not charged to the "rightful" user.
- Next access will incur higher latency, especially if the pages are active.
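
The difference between the two cases in option (a) can be sketched with
a toy Python model. This is only an illustration of the behavior
described above, not kernel code: `Memcg`, `Page`, `evict()` and
`access()` are hypothetical names, and the root memcg (which has no
parent) is not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memcg:
    name: str
    parent: Optional["Memcg"] = None
    online: bool = True
    pages: int = 0  # resident pages charged to this memcg
    swap: int = 0   # swap entries charged to this memcg


@dataclass
class Page:
    kind: str                            # "file" or "anon"
    memcg: Optional[Memcg]               # charged memcg while resident
    swap_owner: Optional[Memcg] = None   # memcg charged for the swap entry


def evict(page: Page) -> None:
    """Evict a resident page and uncharge it from its memcg."""
    owner = page.memcg
    owner.pages -= 1
    page.memcg = None
    if page.kind == "anon":
        # Swap of an offline memcg is charged to its parent.
        target = owner if owner.online else owner.parent
        target.swap += 1
        page.swap_owner = target


def access(page: Page, faulting_memcg: Memcg) -> None:
    """Bring an evicted page back in and charge it."""
    if page.memcg is not None:
        return  # already resident, charge does not change
    if page.kind == "file":
        new_owner = faulting_memcg       # charged to whoever faults it in
    else:
        new_owner = page.swap_owner      # swap-in charges the swap owner
        new_owner.swap -= 1
        page.swap_owner = None
    new_owner.pages += 1
    page.memcg = new_owner
```

With A offline and a process in B touching both pages afterwards, the
file page ends up charged to B while the anon page ends up charged to
A's parent.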

(b) Direct recharge to the parent

This can be done for any page and should be simple as the pages are
already hierarchically charged to the parent.

Pros:
- Simple.

Cons:
- If a different memcg is using the memory, it will keep taxing the
parent indefinitely. The same not-the-"rightful"-user argument applies.
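
As a rough sketch of why option (b) should be simple: with hierarchical
charging, the parent's usage already includes the child's pages, so
reparenting only changes which memcg the pages are attributed to. The
Python below is a toy model with hypothetical names (`local`, `usage()`,
`reparent()`), not the kernel's data structures.

```python
class Memcg:
    """Toy memcg with hierarchical usage; names are illustrative only."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.local = 0  # pages charged directly to this memcg
        if parent is not None:
            parent.children.append(self)

    def usage(self):
        # Hierarchical usage: own pages plus everything charged below.
        return self.local + sum(c.usage() for c in self.children)


def reparent(child):
    """Move all of child's direct charges to its parent."""
    child.parent.local += child.local
    child.local = 0
```

In this model, the parent's `usage()` is the same before and after
`reparent(child)`; only the child's own count drops to zero, which is
what would let the child be freed.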
Muchun had actually posted a patch to do this last year. See

https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@xxxxxxxxxxxxx/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147

I am wondering if he is going to post an updated version of that or not.
Anyway, I am looking forward to learning about the result of this
discussion even though I am not a conference invitee.
There are a couple of problems that were brought up back then, mainly
that memory will be reparented to the root memcg eventually,
practically escaping accounting. Shared resources may end up being
eventually unaccounted. Ideally, we can come up with a scheme where
the memory is charged to the real user, instead of just to the parent.

Consider the case where processes in memcg A and B are both using
memory that is charged to memcg A. If memcg A goes offline, and we
reparent the memory, memcg B keeps using the memory for free, taxing
A's parent, or the entire system if that's root.
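
The A/B scenario can be walked through with a small Python sketch. The
hierarchy (root -> P -> A, B), the page name, and the helper functions
are all made up for illustration; this is not how the kernel tracks
charges.

```python
# Hypothetical hierarchy: root -> P -> {A, B}
parent_of = {"A": "P", "B": "P", "P": "root", "root": None}
online = {"A", "B", "P", "root"}

# One page used by processes in both A and B, but charged only to A.
charged_to = {"shared_page": "A"}
users = {"shared_page": {"A", "B"}}


def offline_and_reparent(memcg):
    """Offline a memcg and move its charges to its parent."""
    online.discard(memcg)
    for page, owner in charged_to.items():
        if owner == memcg:
            charged_to[page] = parent_of[memcg]


def pages_charged(memcg):
    """List the pages currently charged directly to memcg."""
    return [p for p, owner in charged_to.items() if owner == memcg]
```

Offlining A moves the charge to P, and offlining P afterwards moves it
to root. B remains a user of the page throughout, yet
`pages_charged("B")` stays empty at every step.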

Also, if there is a kernel bug and a page is being pinned
unnecessarily, those pages will never be reclaimed and will stick
around and eventually be reparented to the root memcg. If being
reparented to the root memcg is a legitimate action, you can't simply
tell apart if pages are sticking around just because they are being
used by someone or if there is a kernel bug.

This is certainly a valid concern. We are currently doing reparenting
for slab objects. However, physical pages have a higher probability of
being shared by different tasks. I do hope that we can come to an
agreement soon on how best to address this issue.

Thanks,
Longman




