On 4/3/23 04:31, Levo D wrote:
> I optimized and profiled my program to the point where it seems like it's
> spending more time in the kernel than in userspace (likely not true, but
> I'll explain).
>
> Here's one run. I spawn many threads (6 at minimum, more depending on
> flags). As you can see, more than half of the total time is in sys. Is the
> kernel running on multiple cores simultaneously to give my program pages?
>
> real 0m0.954s
> user 0m6.442s
> sys  0m0.607s
>
> The test below uses -test-flags, which gets me these numbers; sys is 51%
> of the total time:
>
> real 0m0.733s
> user 0m3.476s
> sys  0m0.378s
>
> perf record -F 5000 ./myapp -test-flags shows me that 61% of the app is
> in my biggest function and 6% is in `clear_page_rep`. When I record cache
> misses using `perf record -F 5000 --call-graph=fp -e cache-misses ./myapp
> -test-flags`, I can see that:
>
> clear_page_rep takes 40%
> clear_huge_page takes 1.2%
> My big function is 8% self, 25.5% total. The remainder is mostly
> asm_exc_page_fault (12%) and asm_sysvec_apic_timer_interrupt (2.7%).
> That's about 56% (of all misses and waiting) in the kernel.
>
> I believe if I can reduce work being done in the kernel

That's not possible; the kernel must clear (zero) pages before giving them
to a process, for security reasons.

> and have pages be ready before I fault

That is possible if you use the MAP_POPULATE flag of mmap(). Or just write
once to each page before starting your large function, to pre-fault it
(first sketch at the end of this mail).

In that case it may also make sense not to measure the runtime of your
whole program, but only the part between initialization (including the
pre-faulting) and cleanup (second sketch below). The whole runtime already
seems too short to profit from further optimization if init/cleanup is
included each time.

If the runtime of the "large function" is what matters because it would run
many times in practice, it could also make sense to keep the initialized
process running and reuse the allocated memory, instead of repeatedly
starting new processes and paying the free and reallocation costs each time
(third sketch below).

> I'll have fewer cache misses in my large function and I could be
> significantly faster. I measured how long my large function takes
> single-threaded compared to multithreaded. Multithreaded it is at minimum
> 1.5x to 2x slower. I spawn 1 thread per core (I'm testing on a Zen 2; it
> has 6 cores with 12 threads, and spawning more than 6 threads slows the
> program down). Each thread is using <100 MB.

This seems to be more about how hyperthreading (SMT) doesn't always result
in speedups, so it's about the CPU vs. the workload rather than the kernel.

> Is there an API I should look into? What can I do here?
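
First sketch: a minimal example of pre-faulting, assuming Linux and an
anonymous mmap() allocation you control; the size and variable names are
made up for illustration, not taken from your program.

/* Pre-fault anonymous memory so the page faults and page zeroing
   (clear_page_rep) happen up front, not inside the hot function. */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 100UL << 20;  /* ~100 MB, matching your per-thread usage */

    /* MAP_POPULATE asks the kernel to fault in all pages right now. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                              -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Portable alternative: touch one byte per page. Redundant after
       MAP_POPULATE; shown for allocators that don't expose such a flag. */
    long page = sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < len; i += (size_t)page)
        buf[i] = 0;

    /* ... run the large function on buf here ... */

    munmap(buf, len);
    return 0;
}

With the buffer populated up front, the clear_page_rep time should move out
of the hot section; whether the cache misses shrink as well depends on your
access pattern.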
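
Second sketch: timing only the hot part. This is a hedged sketch;
my_large_function() is a hypothetical stand-in for your code.

/* Time only the hot section, excluding init/pre-fault and cleanup. */
#include <stdio.h>
#include <time.h>

static void my_large_function(void)
{
    /* placeholder for your real work */
}

int main(void)
{
    struct timespec t0, t1;

    /* ... allocate and pre-fault here ... */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    my_large_function();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("large function: %.3f s\n", secs);

    /* ... cleanup here ... */
    return 0;
}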
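
Third sketch: a toy version of the reuse idea, assuming a long-lived
process that handles many inputs; process_one_input() and the run count are
hypothetical.

/* Fault and zero the pages once, then reuse the same mapping for every
   run instead of paying the mmap/fault/munmap cost per execution. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void process_one_input(unsigned char *buf, size_t len, int run)
{
    memset(buf, (unsigned char)run, len);  /* placeholder workload */
}

int main(void)
{
    size_t len = 100UL << 20;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                              -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Pages are faulted and zeroed exactly once; each later run reuses
       the same already-mapped memory. */
    for (int run = 0; run < 10; run++)
        process_one_input(buf, len, run);

    munmap(buf, len);
    return 0;
}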