On Fri, Dec 22, 2023 at 7:40 AM Henry Huang <henry.hj@xxxxxxxxxxxx> wrote:
>
> > - are pages ever shared between different memcg hierarchies? You
> > mentioned sharing between processes in A and A/B, but I'm wondering
> > if there is sharing between two different memcg hierarchies where root
> > is the only common ancestor?
>
> Yes, there is a another really common case:
> If docker graph driver is overlayfs, different docker containers use the
> same image, or share same low layers, would share file cache of public bin or
> lib(i.e libc.so).

Does this present a problem with setting memcg limits or OOMs? It
seems like deterministically charging shared pages would be highly
desirable. Mina Almasry previously proposed a memcg= mount option to
implement deterministic charging[1], but it wasn't a generic sharing
mechanism. Nonetheless, the problem remains, and it would be
interesting to learn if this presents any issues for you.

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@xxxxxxxxxx/

>
> > - do you anticipate a shorter scan period at some point? Proactively
> > reclaiming all memory colder than one hour is a long time :) Are you
> > concerned at all about the cost of doing your current idle bit
> > harvesting approach becoming too expensive if you significantly reduce
> > the scan period?
>
> We don't want the owner of the application to feel a significant
> performance downgrade when using swap. There is a high risk to reclaim pages
> which idle age are less than 1 hour. We have internal test and
> data analysis to support it.
>
> We disabled global swappiness and memcg swapinness.
> Only proactive reclaim can swap anon pages.
>
> What's more, we see that mglru has a more efficient way to scan pte access bit.
> We perferred to use mglru scan help us scan and select idle pages.

I'm working on a kernel driver/per-memcg interface to perform aging
with MGLRU, including configuration for the MGLRU page scanning
optimizations.
I suspect scanning the PTE accessed bits for pages charged to a
foreign memcg ad-hoc has some performance implications, and the more
general solution is to charge in a predetermined way, which makes the
scanning on behalf of the foreign memcg a bit cleaner. This is
possible nonetheless, but a bit hacky. Let me know if you have any
ideas.

>
> > - is proactive reclaim being driven by writing to memory.reclaim, by
> > enforcing a smaller memory.high, or something else?
>
> Because all pages info and idle age are stored in userspace, kernel can't get
> these information directly. We have a private patch include a new reclaim interface
> to support reclaim pages with specific pfns.

Thanks for sharing! It's been enlightening to learn about different
prod environments.
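
For reference on the memory.reclaim option raised in the quoted
question, here is a minimal Python sketch of driving proactive reclaim
from userspace through the cgroup v2 memory.reclaim file. The helper
name proactive_reclaim and the cgroup path in the usage note are
illustrative assumptions, not something from this thread, and the
sketch does not reproduce the private pfn-based interface described
above. It relies only on the documented behaviour that writing a byte
count requests reclaim from that cgroup and that the write fails with
EAGAIN when fewer bytes were reclaimed than requested.

```python
import errno
import os


def proactive_reclaim(cgroup_dir, nbytes):
    """Ask the kernel to reclaim up to `nbytes` from one cgroup.

    `cgroup_dir` is a cgroup v2 directory (hypothetical example:
    /sys/fs/cgroup/workload). Returns True if the write was accepted,
    False if the kernel reported EAGAIN, i.e. it reclaimed less than
    the requested amount.
    """
    path = os.path.join(cgroup_dir, "memory.reclaim")
    # Open without O_CREAT: the control file must already exist.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, str(nbytes).encode("ascii"))
    except OSError as exc:
        if exc.errno == errno.EAGAIN:
            return False
        raise
    finally:
        os.close(fd)
    return True
```

A caller with write access to the cgroup could run, for example,
proactive_reclaim("/sys/fs/cgroup/workload", 64 << 20) to request 64 MiB
of reclaim. A False return means the request was only partially
satisfied. Note that this interface selects pages by the kernel's own
LRU ordering, so it cannot target specific pfns chosen by a userspace
idle-age tracker.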