On Fri, May 17, 2024 at 11:29:57AM +0800, Kairui Song wrote:
> Mateusz Guzik <mjguzik@xxxxxxxxx> wrote on Thu, May 16, 2024 at 23:14:
> >
> > A part of The Real Solution(tm) would make counter allocations scale
> > (including mcid, not just rss) or dodge them (while maintaining the
> > per-cpu distribution, see below for one idea), but that boils down to
> > balancing scalability versus total memory usage. It is trivial to just
> > slap together a per-cpu cache of these allocations and have the problem
> > go away for benchmarking purposes, while probably being too memory
> > hungry for actual usage.
> >
> > I was pondering an allocator with caches per some number of cores (say
> > 4 or 8). Microbenchmarks aside, I suspect real workloads would not
> > suffer from contention at this kind of granularity. This would
> > trivially reduce memory usage compared to per-cpu caching. I suspect
> > things like mm_struct, task_struct, task stacks and similar would be
> > fine with it.
> >
> > Suppose mm_struct is allocated from a more coarse-grained allocator
> > than per-cpu. The total number of cached objects would be lower than
> > it is now. That would also mean these allocated but not currently used
> > mms could hold on to other stuff, for example per-cpu rss and mcid
> > counters. Then should someone fork or exit, alloc/free_percpu would be
> > avoided in most cases. This would scale better and be faster
> > single-threaded than the current state.
>
> And what is the issue with using only one CPU cache, and flushing on mm
> switch? There are no allocations after boot, and the total (and fixed)
> memory usage is just a few unsigned longs per CPU, which should be even
> lower than the old RSS cache solution (4 unsigned longs per task). And
> it scaled very well with every kind of microbenchmark or workload I've
> tested.
>
> Unless the workload keeps doing something like "alloc one page then
> switch to another mm", I think the performance will be horrible already
> due to cache invalidations and many switch_*()s; RSS isn't really a
> concern there.

I only skimmed through your patchset. I do think it has a legitimate
approach, but personally I would not do it like that due to the extra
work on context switches. However, I have 0 say about this, so you will
need to prod the mm overlords to get this moving forward.

Maybe I was not clear enough in my opening e-mail, so I'm going to
reiterate some bits: there are scalability problems in execve even with
your patchset or the one which uses atomics. One of them concerns
another bit which allocates per-cpu memory (the mcid thing). I note
that sorting it out would possibly also take care of the rss problem; I
outlined an example approach above.
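
To make the approach I outlined above a bit less hand-wavy, here is a
rough sketch of the kind of cache I have in mind. This is pure
illustration, not working kernel code: NR_CACHE_CPUS, MAX_CACHED_MMS
and the mm_alloc_slowpath/mm_free_slowpath/mm_counters_zero helpers are
all made up, and locking vs cpu hotplug is hand-waved.

/*
 * Rough sketch, not actual kernel code: one bucket per group of
 * NR_CACHE_CPUS cpus. mm_structs parked in a bucket keep their
 * per-cpu rss/mcid counters allocated, so the common fork/exit
 * path avoids alloc_percpu()/free_percpu() entirely.
 */
#define NR_CACHE_CPUS	8	/* cpus sharing one bucket; tunable */
#define MAX_CACHED_MMS	16	/* per-bucket cap on parked objects */

struct mm_cache_bucket {
	spinlock_t	lock;	/* contended by at most NR_CACHE_CPUS cpus */
	int		nr;
	struct mm_struct *mms[MAX_CACHED_MMS];
};

/* allocated at boot, one bucket per cpu group */
static struct mm_cache_bucket *mm_cache_buckets;

static inline struct mm_cache_bucket *this_bucket(void)
{
	return &mm_cache_buckets[raw_smp_processor_id() / NR_CACHE_CPUS];
}

/* fork path: reuse a parked mm, counters and all, if one is available */
struct mm_struct *mm_cache_alloc(void)
{
	struct mm_cache_bucket *b = this_bucket();
	struct mm_struct *mm = NULL;

	spin_lock(&b->lock);
	if (b->nr > 0)
		mm = b->mms[--b->nr];
	spin_unlock(&b->lock);

	if (!mm)
		return mm_alloc_slowpath();	/* kmem cache + alloc_percpu */
	/* counters were zeroed when the object was parked */
	return mm;
}

/* exit path: park the object instead of releasing the percpu bits */
void mm_cache_free(struct mm_struct *mm)
{
	struct mm_cache_bucket *b = this_bucket();

	mm_counters_zero(mm);	/* made-up helper: reset rss/mcid counters */
	spin_lock(&b->lock);
	if (b->nr < MAX_CACHED_MMS) {
		b->mms[b->nr++] = mm;
		mm = NULL;
	}
	spin_unlock(&b->lock);

	if (mm)
		mm_free_slowpath(mm);	/* free_percpu + kmem cache free */
}

The point is that each bucket lock is only ever contended by a handful
of cpus, while the number of parked objects (and thus of percpu counter
allocations kept alive) is bounded by buckets * MAX_CACHED_MMS rather
than growing with the cpu count times the size of a per-cpu cache.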