I have optimized and profiled my program to the point where it seems to spend more time in the kernel than in userspace (likely not true, but I'll explain). I spawn many threads (6 at minimum, more depending on flags). Here's one run; as you can see, sys is more than half of the wall-clock (real) time. Is the kernel running on multiple cores simultaneously to give my program pages?

    real 0m0.954s
    user 0m6.442s
    sys  0m0.607s

The test below uses `-test-flags`, which gives me these numbers; sys is 51% of the real time:

    real 0m0.733s
    user 0m3.476s
    sys  0m0.378s

`perf record -F 5000 ./myapp -test-flags` shows that 61% of the app is in my biggest function and 6% is in `clear_page_rep`.

When I record cache misses with `perf record -F 5000 --call-graph=fp -e cache-misses ./myapp -test-flags` I see:

- `clear_page_rep`: 40%
- `clear_huge_page`: 1.2%
- my big function: 8% self, 25.5% total
- the rest is mostly `asm_exc_page_fault` (12%) and `asm_sysvec_apic_timer_interrupt` (2.7%)

That's about 56% (of all misses and waiting) in the kernel.

I believe that if I can reduce the work done in the kernel and have pages ready before I fault on them, I'll have fewer cache misses in my large function and the program could be significantly faster.

I measured how long my large function takes single-threaded compared to multithreaded: multithreaded it is between 1.5x and 2x slower. I spawn one thread per core (I'm testing on a Zen 2; it has 6 cores with 12 threads, and spawning more than 6 threads slows the program down). Each thread uses less than 100 MB.

Is there an API I should look into? What can I do here?
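For context, the kind of change I have in mind is pre-faulting each thread's working buffer up front, so the kernel's page clearing is paid before the hot function runs instead of inside it. A minimal sketch of what I mean, assuming the per-thread data lives in one anonymous mapping (the helper name and the touch-one-byte-per-page loop are just my illustration, not something I've benchmarked):

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Map a per-thread arena and fault every page in up front, so the page
 * clearing (clear_page_rep) happens here rather than inside the hot loop.
 * MADV_HUGEPAGE is advisory; passing MAP_POPULATE to mmap would be an
 * alternative way to pre-fault. */
static void *prefaulted_arena(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    madvise(p, bytes, MADV_HUGEPAGE);              /* best effort, ignore failure */

    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < bytes; off += (size_t)page)
        ((volatile char *)p)[off] = 0;             /* touch one byte per page */
    return p;
}

int main(void)
{
    size_t bytes = (size_t)100 << 20;              /* ~100 MB, the per-thread bound above */
    void *arena = prefaulted_arena(bytes);
    /* ... hand `arena` to a worker thread instead of allocating inside it ... */
    munmap(arena, bytes);
    return 0;
}
```

Is something along these lines what I should be doing, or is there a better API for this?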