To start I figured I'm going to bench about as friendly case as it gets -- statically linked *separate* binaries all doing execve in a loop. I borrowed the bench from here: http://apollo.backplane.com/DFlyMisc/doexec.c $ cc -static -O2 -o static-doexec doexec.c $ ./static-doexec $(nproc) It prints a result every second. My test box is temporarily only 26 cores and even at this scale I run into massive lock contention stemming from back-to-back calls to percpu_counter_init (and _destroy later). While not a panacea, one simple thing to do here is to batch these ops. Since the term "batching" is already used in the file, I decided to refer to it as "grouping" instead. Even if this code could be patched to dodge these counters, I would argue a high-traffic alloc/free consumer is only a matter of time so it makes sense to facilitate it. With the fix I get an ok win, to quote from the commit: > Even at a very modest scale of 26 cores (ops/s): > before: 133543.63 > after: 186061.81 (+39%) While with the patch these allocations remain a significant problem, the primary bottleneck shifts to: __pv_queued_spin_lock_slowpath+1 _raw_spin_lock_irqsave+57 folio_lruvec_lock_irqsave+91 release_pages+590 tlb_batch_pages_flush+61 tlb_finish_mmu+101 exit_mmap+327 __mmput+61 begin_new_exec+1245 load_elf_binary+712 bprm_execve+644 do_execveat_common.isra.0+429 __x64_sys_execve+50 do_syscall_64+46 entry_SYSCALL_64_after_hwframe+110 I intend to do more work on the area to mostly sort it out, but I would not mind if someone else took the hammer to folio. :) With this out of the way I'll be looking at some form of caching to eliminate these allocs as a problem. v3: - fix !CONFIG_SMP build - drop the backtrace from fork commit message v2: - force bigger alignment on alloc - rename "counters" to "nr_counters" and pass prior to lock key - drop {}'s for single-statement loops Mateusz Guzik (2): pcpcntr: add group allocation/free fork: group allocation of per-cpu counters for mm struct include/linux/percpu_counter.h | 39 ++++++++++++++++++---- kernel/fork.c | 14 ++------ lib/percpu_counter.c | 61 +++++++++++++++++++++++----------- 3 files changed, 77 insertions(+), 37 deletions(-) -- 2.41.0