On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> dump_user_range() supports sparse core dumps by skipping anonymous pages
> which have not been modified.  If get_dump_page() returns NULL, the page
> is skipped rather than written to the core dump with dump_emit_page().
>
> Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> so when dumping a very large sparse region, the core dump becomes
> effectively uninterruptible.  This can be observed with the following
> test program:
>
>     #include <stdlib.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
>
>     int main(void) {
>         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>         printf("%p %m\n", mem);
>         if (mem != MAP_FAILED) {
>             mem[0] = 1;
>         }
>         abort();
>     }
>
> The program allocates 1 TiB of anonymous memory, touches one page of it,
> and aborts.  During the core dump, SIGKILL has no effect.  It takes
> about 30 seconds to finish the dump, burning 100% CPU.

While the patch makes sense to me, this should not be taking anywhere near
this much time, and plausibly, once the slowness is unscrewed, this will
stop being a factor. So I had a look with a profiler:

-   99.89%     0.00%  a.out
      entry_SYSCALL_64_after_hwframe
      do_syscall_64
      syscall_exit_to_user_mode
      arch_do_signal_or_restart
    - get_signal
       - 99.89% do_coredump
          - 99.88% elf_core_dump
             - dump_user_range
                - 98.12% get_dump_page
                   - 64.19% __get_user_pages
                      - 40.92% gup_vma_lookup
                         - find_vma
                            - mt_find
                                 4.21% __rcu_read_lock
                                 1.33% __rcu_read_unlock
                      - 3.14% check_vma_flags
                           0.68% vma_is_secretmem
                        0.61% __cond_resched
                        0.60% vma_pgtable_walk_end
                        0.59% vma_pgtable_walk_begin
                        0.58% no_page_table
                   - 15.13% down_read_killable
                        0.69% __cond_resched
                     13.84% up_read
                        0.58% __cond_resched

Almost 29% of the time is spent relocking the mmap semaphore in
__get_user_pages. This most likely can operate locklessly in the fast
path. Even if somehow not, chances are the lock can be held across
multiple calls.

mt_find spends most of its time issuing a rep stos of 48 bytes (it would
be faster to rep mov 6 times instead). This is the compiler being nasty;
I'll maybe look into it.

However, I strongly suspect the current iteration method is just slow due
to repeated mt_find calls, and The Right Approach(tm) would make this
entire thing finish within milliseconds by iterating the maple tree
instead. But then the mm folks would have to be consulted on how to
approach this, and it may be time-consuming to implement.

Sorting out the relocking should be an easily achievable & measurable win
(no interest on my end, though).
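
For reference, a sketch of the kind of change being discussed, i.e. also
checking dump_interrupted() on the skip path of dump_user_range() rather
than only inside dump_emit_page(). The loop below is paraphrased from
fs/coredump.c and is not necessarily the actual diff:

    /*
     * Paraphrased dump_user_range() loop; the dump_interrupted() check
     * on the skip branch is the behavior the patch argues for, not a
     * verbatim copy of it.
     */
    for (addr = start; addr < start + len; addr += PAGE_SIZE) {
            struct page *page = get_dump_page(addr);

            if (page) {
                    /* dump_emit_page() already consults dump_interrupted() */
                    int stop = !dump_emit_page(cprm, page);
                    put_page(page);
                    if (stop)
                            return 0;
            } else {
                    dump_skip(cprm, PAGE_SIZE);
                    /*
                     * Added check: without it, a huge sparse region never
                     * reaches dump_emit_page(), so SIGKILL is ignored until
                     * the whole region has been skipped.
                     */
                    if (dump_interrupted())
                            return 0;
            }
    }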