On Tue, Apr 25, 2023 at 11:42 AM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 4/25/23 07:36, Yosry Ahmed wrote:
> > +David Rientjes +Greg Thelen +Matthew Wilcox
> >
> > On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@xxxxxxxxxx> wrote:
> >>> When a memcg is removed by userspace it gets offlined by the kernel.
> >>> Offline memcgs are hidden from user space, but they still live in the
> >>> kernel until their reference count drops to 0. New allocations cannot
> >>> be charged to offline memcgs, but existing allocations charged to
> >>> offline memcgs remain charged, and hold a reference to the memcg.
> >>>
> >>> As such, an offline memcg can remain in the kernel indefinitely,
> >>> becoming a zombie memcg. The accumulation of a large number of zombie
> >>> memcgs leads to increased system overhead (mainly percpu data in struct
> >>> mem_cgroup). It also causes some kernel operations that scale with the
> >>> number of memcgs to become less efficient (e.g. reclaim).
> >>>
> >>> There are currently out-of-tree solutions which attempt to
> >>> periodically clean up zombie memcgs by reclaiming from them. However
> >>> that is not effective for non-reclaimable memory, which it would be
> >>> better to reparent or recharge to an online cgroup. There are also
> >>> proposed changes that would benefit from recharging for shared
> >>> resources like pinned pages, or DMA buffer pages.
> >> I am very interested in attending this discussion, it's something that
> >> I have been actively looking into -- specifically recharging pages of
> >> offlined memcgs.
> >>
> >>> Suggested attendees:
> >>> Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> >>> Yu Zhao <yuzhao@xxxxxxxxxx>
> >>> T.J. Mercier <tjmercier@xxxxxxxxxx>
> >>> Tejun Heo <tj@xxxxxxxxxx>
> >>> Shakeel Butt <shakeelb@xxxxxxxxxx>
> >>> Muchun Song <muchun.song@xxxxxxxxx>
> >>> Johannes Weiner <hannes@xxxxxxxxxxx>
> >>> Roman Gushchin <roman.gushchin@xxxxxxxxx>
> >>> Alistair Popple <apopple@xxxxxxxxxx>
> >>> Jason Gunthorpe <jgg@xxxxxxxxxx>
> >>> Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> > I was hoping I would bring a more complete idea to this thread, but
> > here is what I have so far.
> >
> > The idea is to recharge the memory charged to memcgs when they are
> > offlined. I like to think of the options we have to deal with memory
> > charged to offline memcgs as a toolkit. This toolkit includes:
> >
> > (a) Evict memory.
> >
> > This is the simplest option, just evict the memory.
> >
> > For file-backed pages, this writes them back to their backing files,
> > uncharging and freeing the page. The next access will read the page
> > again and the faulting process’s memcg will be charged.
> >
> > For swap-backed pages (anon/shmem), this swaps them out. Swapping out
> > a page charged to an offline memcg uncharges the page and charges the
> > swap to its parent. The next access will swap in the page and the
> > parent will be charged. This is effectively deferred recharging to the
> > parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - Behavior is different for file-backed vs. swap-backed pages, for
> > swap-backed pages, the memory is recharged to the parent (aka
> > reparented), not charged to the "rightful" user.
> > - Next access will incur higher latency, especially if the pages are active.
> >
> > (b) Direct recharge to the parent
> >
> > This can be done for any page and should be simple as the pages are
> > already hierarchically charged to the parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - If a different memcg is using the memory, it will keep taxing the
> > parent indefinitely. Same not the "rightful" user argument.
>
> Muchun had actually posted a patch to do this last year. See
>
> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@xxxxxxxxxxxxx/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>
> I am wondering if he is going to post an updated version of that or not.
> Anyway, I am looking forward to learning about the result of this
> discussion even though I am not a conference invitee.

There are a couple of problems that were brought up back then, mainly
that memory will eventually be reparented to the root memcg,
practically escaping accounting. Shared resources may eventually end up
unaccounted. Ideally, we can come up with a scheme where the memory is
charged to the real user, instead of just to the parent.

Consider the case where processes in memcg A and B are both using
memory that is charged to memcg A. If memcg A goes offline and we
reparent the memory, memcg B keeps using the memory for free, taxing
A's parent, or the entire system if that's root.

Also, if there is a kernel bug and pages are being pinned
unnecessarily, those pages will never be reclaimed; they will stick
around and eventually be reparented to the root memcg. If being
reparented to the root memcg is a legitimate action, you can't easily
tell whether pages are sticking around because someone is still using
them or because of a kernel bug.

>
> Thanks,
> Longman
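
To make the memcg A / memcg B scenario above concrete, here is a toy
userspace model of option (b), reparenting on offline. It is only a
sketch: the Memcg class and the charge(), usage() and
offline_and_reparent() helpers are hypothetical names invented for this
illustration, not kernel APIs, and the page counts are made up.

```python
class Memcg:
    """Toy stand-in for a memory cgroup (hypothetical, not struct mem_cgroup)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.online = True
        self.local = 0  # pages charged directly to this memcg

    def ancestors(self):
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def charge(memcg, pages):
    """New allocations cannot be charged to an offline memcg."""
    if not memcg.online:
        raise ValueError("cannot charge an offline memcg")
    memcg.local += pages


def usage(memcg, all_memcgs):
    """Hierarchical usage: local pages plus those of every descendant."""
    return sum(m.local for m in all_memcgs
               if m is memcg or memcg in m.ancestors())


def offline_and_reparent(memcg):
    """Option (b): hand every remaining charge to the parent."""
    memcg.online = False
    memcg.parent.local += memcg.local
    memcg.local = 0


def demo():
    root = Memcg("root")
    job = Memcg("job", parent=root)
    a = Memcg("A", parent=job)
    b = Memcg("B", parent=root)   # B's processes also use A's pages
    groups = [root, job, a, b]

    charge(a, 100)                # shared pages, charged to A only
    before = usage(job, groups)   # job already pays hierarchically

    offline_and_reparent(a)       # A is removed; charges move to job
    groups.remove(a)
    after_a = (job.local, b.local)

    offline_and_reparent(job)     # job is removed; charges reach root
    groups.remove(job)
    return before, after_a, root.local, b.local
```

In this model B's local charge never changes, even though B is the only
remaining user of the pages, and once "job" goes away the charge sits
on root, which is the "escaping accounting" concern. A scheme that
recharges to the real user would have to move the pages to B instead.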