On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> To start I figured I'm going to bench about as friendly a case as it
> gets -- statically linked *separate* binaries all doing execve in a
> loop.
>
> I borrowed the bench found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second (warning: the first line is garbage).
>
> My test box temporarily has only 26 cores, and even at this scale I
> run into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these
> ops. Since the term "batching" is already used in the file, I decided
> to refer to it as "grouping" instead.
>
> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time, so
> it makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
>> Even at a very modest scale of 26 cores (ops/s):
>> before: 133543.63
>> after:  186061.81 (+39%)
>

So to sum up, a v3 of the patchset is queued up here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-next

For the interested: I temporarily got my hands on something exceeding
the hand-watch scale benched above -- a 192-way AMD EPYC 7R13 box (2
sockets x 48 cores x 2 threads). A 6.5 kernel + the patchset only gets
south of 140k execs/s when running:

./static-doexec 192

According to perf top:

  51.04%  [kernel]  [k] osq_lock
   6.82%  [kernel]  [k] __raw_callee_save___kvm_vcpu_is_preempted
   2.98%  [kernel]  [k] _atomic_dec_and_lock_irqsave
   1.62%  [kernel]  [k] rcu_cblist_dequeue
   1.54%  [kernel]  [k] refcount_dec_not_one
   1.51%  [kernel]  [k] __mod_lruvec_page_state
   1.46%  [kernel]  [k] put_cred_rcu
   1.34%  [kernel]  [k] native_queued_spin_lock_slowpath
   0.94%  [kernel]  [k] srso_alias_safe_ret
   0.81%  [kernel]  [k] memset_orig
   0.77%  [kernel]  [k] unmap_page_range
   0.73%  [kernel]  [k] _compound_head
   0.72%  [kernel]  [k] kmem_cache_free

Then bpftrace -e 'kprobe:osq_lock { @[kstack()] = count(); }' shows:

@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu_gfp+18
    __percpu_counter_init_many+43
    mm_init+727
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 637370
@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu+21
    mm_init+577
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 638036

That is, per-cpu allocation is still on top at this scale. But more
importantly, there are *TWO* unrelated back-to-back per-cpu allocs --
one by the rss counters and one by mm_alloc_cid.

That is to say, per-cpu alloc scalability definitely needs to get
fixed; I'll ponder it.

-- 
Mateusz Guzik <mjguzik gmail.com>
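
PS: for anyone who wants to eyeball the shape of the bench without
fetching doexec.c, here is a minimal stand-in I put together (my
simplification, not the linked code: it execs /bin/true instead of
separate static binaries, so it is less "friendly" than the real
thing, but the kernel-side pattern -- fork + execve in a tight loop
per CPU -- is the same):

	/* cc -static -O2 -o exec-loop exec-loop.c && ./exec-loop $(nproc) */
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	int main(int argc, char **argv)
	{
		int i, nproc = argc > 1 ? atoi(argv[1]) : 1;
		long prev = 0, sum;
		/* one counter per worker, a cache line apart */
		long *cnt = mmap(NULL, nproc * 64, PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (cnt == MAP_FAILED)
			return 1;

		for (i = 0; i < nproc; i++) {
			if (fork() == 0) {
				for (;;) {
					pid_t pid = fork();
					if (pid == 0) {
						execl("/bin/true", "true",
						      (char *)NULL);
						_exit(1);
					}
					waitpid(pid, NULL, 0);
					cnt[i * 8]++;
				}
			}
		}

		for (;;) {	/* print execs/s once a second */
			sleep(1);
			for (sum = 0, i = 0; i < nproc; i++)
				sum += cnt[i * 8];
			printf("%ld execs/s\n", sum - prev);
			prev = sum;
		}
	}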
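The "grouping" change itself boils down to the shape below. I'm
approximating the interface from the series here (see the git link
above for the real thing), but the names match what shows up in the
stack traces. Before, every counter in the rss_stat array paid for its
own trip through the pcpu_alloc mutex and its own per-cpu area:

	int i;

	for (i = 0; i < NR_MM_COUNTERS; i++)
		if (percpu_counter_init(&mm->rss_stat[i], 0,
					GFP_KERNEL_ACCOUNT))
			goto fail;

With grouping, one call does a single per-cpu allocation covering the
whole array:

	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
				     NR_MM_COUNTERS))
		goto fail;

	/* teardown is likewise one call instead of a loop */
	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);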