On Thu, May 23, 2024 at 05:34:27PM GMT, Matthew Wilcox wrote:
> On Thu, May 23, 2024 at 08:41:25AM -0700, Shakeel Butt wrote:
> > On Thu, May 23, 2024 at 02:31:05PM +0100, Matthew Wilcox wrote:
> > > On Tue, May 21, 2024 at 12:29:39PM -0700, Shakeel Butt wrote:
> > > > On Tue, May 21, 2024 at 03:44:21PM +0100, Matthew Wilcox wrote:
> > > > > The memcg should not be attached to the individual pages that make up a
> > > > > vmalloc allocation. Rather, it should be managed by the vmalloc
> > > > > allocation itself. I don't have the knowledge to poke around inside
> > > > > vmalloc right now, but maybe somebody else could take that on.
> > > >
> > > > Are you concerned about accessing just memcg or any field of the
> > > > sub-page? There are drivers accessing fields of pages allocated through
> > > > vmalloc. Some details at 3b8000ae185c ("mm/vmalloc: huge vmalloc backing
> > > > pages should be split rather than compound").
> > >
> > > Thanks for the pointer, and fb_deferred_io_fault() is already on my
> > > hitlist for abusing struct page.
> > >
> > > My primary concern is that we should track the entire allocation as a
> > > single object rather than tracking each page individually. That means
> > > assigning the vmalloc allocation to a memcg rather than assigning each
> > > page to a memcg. It's a lot less overhead to increment the counter once
> > > per allocation rather than once per page in the allocation!
> > >
> > > But secondarily, yes, pages allocated by vmalloc probably don't need
> > > any per-page state, other than tracking the vmalloc allocation they're
> > > assigned to. We'll see how that theory turns out.
> >
> > I think the tricky part would be vmalloc having pages spanning multiple
> > nodes which is not an issue for MEMCG_VMALLOC stat but the vmap based
> > kernel stack (CONFIG_VMAP_STACK) metric NR_KERNEL_STACK_KB cares about
> > that information.
>
> Yes, we'll have to handle mod_lruvec_page_state() differently since that
> stat is tracked per node. Or we could stop tracking that stat per node.
> Is it useful to track it per node? Why is it useful to track kernel
> stacks per node, but not track vmalloc allocations per node?

This is a good question and other than that there are user visible APIs
(per numa meminfo & memory.numa_stat), I don't have a good answer.
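
To make the overhead argument quoted above concrete, here is a rough
userspace sketch of the two accounting schemes. It is only a toy model,
not kernel code: struct toy_memcg, struct toy_vm_area,
charge_per_page() and charge_per_allocation() are hypothetical names
made up for illustration, and the real memcg and vmalloc data
structures differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg counter; not struct mem_cgroup. */
struct toy_memcg {
	long vmalloc_pages;	/* pages currently charged */
	unsigned long updates;	/* number of counter updates performed */
};

/* Hypothetical stand-in for one vmalloc allocation. */
struct toy_vm_area {
	size_t nr_pages;
	struct toy_memcg *memcg;	/* owner recorded once, on the allocation */
};

/* Per-page scheme: every page in the allocation updates the counter. */
void charge_per_page(struct toy_memcg *cg, size_t nr_pages)
{
	assert(cg != NULL);
	for (size_t i = 0; i < nr_pages; i++) {
		cg->vmalloc_pages += 1;
		cg->updates++;
	}
}

/* Per-allocation scheme: one counter update covers the whole area. */
void charge_per_allocation(struct toy_vm_area *area, struct toy_memcg *cg,
			   size_t nr_pages)
{
	assert(area != NULL && cg != NULL);
	area->nr_pages = nr_pages;
	area->memcg = cg;
	cg->vmalloc_pages += (long)nr_pages;
	cg->updates++;
}
```

Both functions leave the same page total in the counter; what differs
is the number of updates (nr_pages versus one). The sketch deliberately
ignores the per-node question raised above: a single per-allocation
update has no natural node to attribute to when the pages span several
nodes.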