Hello, On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote: > To start I figured I'm going to bench about as friendly case as it gets > -- statically linked *separate* binaries all doing execve in a loop. > > I borrowed the bench from found here: > http://apollo.backplane.com/DFlyMisc/doexec.c > > $ cc -static -O2 -o static-doexec doexec.c > $ ./static-doexec $(nproc) > > It prints a result every second (warning: first line is garbage). > > My test box is temporarily only 26 cores and even at this scale I run > into massive lock contention stemming from back-to-back calls to > percpu_counter_init (and _destroy later). > > While not a panacea, one simple thing to do here is to batch these ops. > Since the term "batching" is already used in the file, I decided to > refer to it as "grouping" instead. > Unfortunately it's taken me longer to get back to and I'm actually not super happy with the results, I wrote a batch percpu allocation api. It's better than the starting place, but I'm falling short on the free path. I am/was also wrestling with the lifetime ideas (should these lifetimes be percpus problem or call site bound like you've done). What I like about this approach is you group the percpu_counter lock which batching percpu allocations wouldn't be able to solve no matter how well I do. I'll review this more closely today. > Even if this code could be patched to dodge these counters, I would > argue a high-traffic alloc/free consumer is only a matter of time so it > makes sense to facilitate it. > > With the fix I get an ok win, to quote from the commit: > > Even at a very modest scale of 26 cores (ops/s): > > before: 133543.63 > > after: 186061.81 (+39%) > > > While with the patch these allocations remain a significant problem, > > the primary bottleneck shifts to: > > > > __pv_queued_spin_lock_slowpath+1 > > _raw_spin_lock_irqsave+57 > > folio_lruvec_lock_irqsave+91 > > release_pages+590 > > tlb_batch_pages_flush+61 > > tlb_finish_mmu+101 > > exit_mmap+327 > > __mmput+61 > > begin_new_exec+1245 > > load_elf_binary+712 > > bprm_execve+644 > > do_execveat_common.isra.0+429 > > __x64_sys_execve+50 > > do_syscall_64+46 > > entry_SYSCALL_64_after_hwframe+110 > > I intend to do more work on the area to mostly sort it out, but I would > not mind if someone else took the hammer to folio. :) > > With this out of the way I'll be looking at some form of caching to > eliminate these allocs as a problem. > I'm not against caching, this is just my first thought. Caching will have an impact on the backing pages of percpu. All it takes is 1 allocation on a page for the current allocator to pin n pages of memory. A few years ago percpu depopulation was implemented so that limits the amount of resident backing pages. Maybe the right thing to do is preallocate pools of common sized allocations so that way they can be recycled such that we don't have to think too hard about fragmentation that can occur if we populate these pools over time? Also as you've pointed out, it wasn't just the percpu allocation being the bottleneck, but percpu_counter's global lock too for hotplug support. I'm hazarding a guess most use cases of percpu might have additional locking requirements too such as percpu_counter. Thanks, Dennis > Thoughts? > > Mateusz Guzik (2): > pcpcntr: add group allocation/free > fork: group allocation of per-cpu counters for mm struct > > include/linux/percpu_counter.h | 19 ++++++++--- > kernel/fork.c | 13 ++------ > lib/percpu_counter.c | 61 ++++++++++++++++++++++++---------- > 3 files changed, 60 insertions(+), 33 deletions(-) > > -- > 2.39.2 >