I have optimized and profiled my program to the point where it seems to spend more time in the kernel than in userspace (likely not true, but I'll explain). I spawn many threads (6 at minimum, more depending on flags). Here's one run; as you can see, sys is more than half of the wall-clock (real) time. Is the kernel running on multiple cores simultaneously to give my program pages?

    real 0m0.954s
    user 0m6.442s
    sys  0m0.607s

The test below uses `-test-flags`, which gives me these numbers; sys is 51% of the real time:

    real 0m0.733s
    user 0m3.476s
    sys  0m0.378s

`perf record -F 5000 ./myapp -test-flags` shows that 61% of the app is in my biggest function and 6% is in `clear_page_rep`.

When I record cache misses with `perf record -F 5000 --call-graph=fp -e cache-misses ./myapp -test-flags` I see:

- `clear_page_rep`: 40%
- `clear_huge_page`: 1.2%
- my big function: 8% self, 25.5% total
- the rest is mostly `asm_exc_page_fault` (12%) and `asm_sysvec_apic_timer_interrupt` (2.7%)

That's about 56% (of all misses and waiting) in the kernel.

I believe that if I can reduce the work done in the kernel and have pages ready before I fault on them, I'll have fewer cache misses in my large function and the program could be significantly faster.

I measured how long my large function takes single-threaded compared to multithreaded: multithreaded it is between 1.5x and 2x slower. I spawn one thread per core (I'm testing on a Zen 2; it has 6 cores with 12 threads, and spawning more than 6 threads slows the program down). Each thread uses less than 100 MB.

Is there an API I should look into? What can I do here?
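For context, the kind of change I have in mind is pre-faulting each thread's working buffer up front, so the kernel's page clearing is paid before the hot function runs instead of inside it. A minimal sketch of what I mean, assuming the per-thread data lives in one anonymous mapping (the helper name and the touch-one-byte-per-page loop are just my illustration, not something I've benchmarked):

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Map a per-thread arena and fault every page in up front, so the page
 * clearing (clear_page_rep) happens here rather than inside the hot loop.
 * MADV_HUGEPAGE is advisory; passing MAP_POPULATE to mmap would be an
 * alternative way to pre-fault. */
static void *prefaulted_arena(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    madvise(p, bytes, MADV_HUGEPAGE);              /* best effort, ignore failure */

    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < bytes; off += (size_t)page)
        ((volatile char *)p)[off] = 0;             /* touch one byte per page */
    return p;
}

int main(void)
{
    size_t bytes = (size_t)100 << 20;              /* ~100 MB, the per-thread bound above */
    void *arena = prefaulted_arena(bytes);
    /* ... hand `arena` to a worker thread instead of allocating inside it ... */
    munmap(arena, bytes);
    return 0;
}
```

Is something along these lines what I should be doing, or is there a better API for this?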