On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > which have not been modified. If get_dump_page() returns NULL, the page
> > is skipped rather than written to the core dump with dump_emit_page().
> >
> > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > so when dumping a very large sparse region, the core dump becomes
> > effectively uninterruptible. This can be observed with the following
> > test program:
> >
> >     #include <stdlib.h>
> >     #include <stdio.h>
> >     #include <sys/mman.h>
> >
> >     int main(void) {
> >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> >         printf("%p %m\n", mem);
> >         if (mem != MAP_FAILED) {
> >             mem[0] = 1;
> >         }
> >         abort();
> >     }
> >
> > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > and aborts. During the core dump, SIGKILL has no effect. It takes
> > about 30 seconds to finish the dump, burning 100% CPU.
>
> While the patch makes sense to me, this should not be taking anywhere
> near this much time, and plausibly after unscrewing it it will stop
> being a factor.
>
> So I had a look with a profiler:
>
>     - 99.89%  0.00%  a.out
>           entry_SYSCALL_64_after_hwframe
>           do_syscall_64
>           syscall_exit_to_user_mode
>           arch_do_signal_or_restart
>         - get_signal
>            - 99.89% do_coredump
>               - 99.88% elf_core_dump
>                  - dump_user_range
>                     - 98.12% get_dump_page
>                        - 64.19% __get_user_pages
>                           - 40.92% gup_vma_lookup
>                              - find_vma
>                                 - mt_find
>                                      4.21% __rcu_read_lock
>                                      1.33% __rcu_read_unlock
>                           - 3.14% check_vma_flags
>                                0.68% vma_is_secretmem
>                             0.61% __cond_resched
>                             0.60% vma_pgtable_walk_end
>                             0.59% vma_pgtable_walk_begin
>                             0.58% no_page_table
>                        - 15.13% down_read_killable
>                             0.69% __cond_resched
>                          13.84% up_read
>                          0.58% __cond_resched
>
> Almost 29% of the time is spent relocking the mmap semaphore in
> __get_user_pages. This most likely can operate locklessly in the fast
> path. Even if somehow not, chances are the lock can be held across
> multiple calls.
>
> mt_find spends most of its time issuing a rep stos of 48 bytes (it
> would be faster to rep mov 6 times instead). This is the compiler being
> nasty; I'll maybe look into it.
>
> However, I strongly suspect the current iteration method is just slow
> due to repeated mt_find calls, and The Right Approach(tm) would make
> this entire thing finish within milliseconds by iterating the maple
> tree instead. But then the mm folk would have to be consulted on how to
> approach this, and it may be time-consuming to implement.
>
> Sorting out the relocking should be an easily achievable & measurable
> win (no interest on my end).

As much as I agree the code is dumb, doing what you suggest with
mmap_sem isn't going to be easy. You cannot call dump_emit_page() with
mmap_sem held, as that would cause lock inversion between mmap_sem and
whatever filesystem locks we have to take. So the fix would have to
involve processing larger batches of address space at once (which
should also somewhat amortize the __get_user_pages() setup costs). Not
that hard to do, but I wanted to spell it out in case someone wants to
pick up this todo item :)

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
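
For illustration only, here is a rough sketch of the batching idea Jan
describes: look up a batch of pages under a single mmap_read_lock(),
drop the lock, and only then write or skip them, so dump_emit_page()
never takes filesystem locks under mmap_lock. get_dump_page_locked()
is a hypothetical helper standing in for a variant of get_dump_page()
that expects the caller to already hold mmap_lock; the function name
and DUMP_BATCH constant are made up, and none of this has been
compiled or tested:

    /*
     * Sketch: one mmap_read_lock()/unlock() cycle per batch instead of
     * per page, with the actual dump writes done after unlocking.
     */
    #define DUMP_BATCH	64

    static int dump_user_range_batched(struct coredump_params *cprm,
    				       unsigned long start, unsigned long len)
    {
    	struct mm_struct *mm = current->mm;
    	struct page *pages[DUMP_BATCH];
    	unsigned long addr = start, end = start + len;

    	while (addr < end) {
    		unsigned long nr = min_t(unsigned long, DUMP_BATCH,
    					 (end - addr) >> PAGE_SHIFT);
    		unsigned long i;

    		if (mmap_read_lock_killable(mm))
    			return 0;
    		for (i = 0; i < nr; i++)
    			/* Hypothetical: get_dump_page() variant for callers
    			 * that already hold mmap_lock. */
    			pages[i] = get_dump_page_locked(mm, addr + i * PAGE_SIZE);
    		mmap_read_unlock(mm);

    		/* Write or skip each page with mmap_lock dropped. */
    		for (i = 0; i < nr; i++) {
    			int stop;

    			if (pages[i]) {
    				stop = !dump_emit_page(cprm, pages[i]);
    				put_page(pages[i]);
    			} else {
    				stop = !dump_skip(cprm, PAGE_SIZE);
    			}
    			if (stop || dump_interrupted()) {
    				/* Release the rest of this batch. */
    				while (++i < nr) {
    					if (pages[i])
    						put_page(pages[i]);
    				}
    				return 0;
    			}
    		}
    		addr += nr * PAGE_SIZE;
    	}
    	return 1;
    }

The batch size is arbitrary; the point is only that the lock/unlock and
lookup setup costs are paid once per DUMP_BATCH pages, and that both
dump_interrupted() and SIGKILL get a chance to stop the dump on every
page, including the skipped ones.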