On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@xxxxxxx> wrote:
>
> On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> > On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > > which have not been modified. If get_dump_page() returns NULL, the page
> > > is skipped rather than written to the core dump with dump_emit_page().
> > >
> > > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > > so when dumping a very large sparse region, the core dump becomes
> > > effectively uninterruptible. This can be observed with the following
> > > test program:
> > >
> > >     #include <stdlib.h>
> > >     #include <stdio.h>
> > >     #include <sys/mman.h>
> > >
> > >     int main(void) {
> > >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> > >                          MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> > >         printf("%p %m\n", mem);
> > >         if (mem != MAP_FAILED) {
> > >             mem[0] = 1;
> > >         }
> > >         abort();
> > >     }
> > >
> > > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > > and aborts. During the core dump, SIGKILL has no effect. It takes
> > > about 30 seconds to finish the dump, burning 100% CPU.
> > >
> >
> > While the patch makes sense to me, this should not be taking anywhere
> > near this much time and plausibly after unscrewing it will stop being a
> > factor.
> >
> > So I had a look with a profiler:
> > - 99.89% 0.00% a.out
> >      entry_SYSCALL_64_after_hwframe
> >      do_syscall_64
> >      syscall_exit_to_user_mode
> >      arch_do_signal_or_restart
> >      - get_signal
> >         - 99.89% do_coredump
> >            - 99.88% elf_core_dump
> >               - dump_user_range
> >                  - 98.12% get_dump_page
> >                     - 64.19% __get_user_pages
> >                        - 40.92% gup_vma_lookup
> >                           - find_vma
> >                              - mt_find
> >                                   4.21% __rcu_read_lock
> >                                   1.33% __rcu_read_unlock
> >                        - 3.14% check_vma_flags
> >                             0.68% vma_is_secretmem
> >                          0.61% __cond_resched
> >                          0.60% vma_pgtable_walk_end
> >                          0.59% vma_pgtable_walk_begin
> >                          0.58% no_page_table
> >                     - 15.13% down_read_killable
> >                          0.69% __cond_resched
> >                       13.84% up_read
> >                       0.58% __cond_resched
> >
> >
> > Almost 29% of time is spent relocking the mmap semaphore in
> > __get_user_pages. This most likely can operate locklessly in the fast
> > path. Even if somehow not, chances are the lock can be held across
> > multiple calls.
> >
> > mt_find spends most of its time issuing a rep stos of 48 bytes (would
> > be faster to rep mov 6 times instead). This is the compiler being nasty,
> > I'll maybe look into it.
> >
> > However, I strongly suspect the current iteration method is just slow
> > due to repeated mt_find calls and The Right Approach(tm) would make this
> > entire thing finish within milliseconds by iterating the maple tree
> > instead, but then the mm folk would have to be consulted on how to
> > approach this and it may be time-consuming to implement.
> >
> > Sorting out relocking should be an easily achievable & measurable win
> > (no interest on my end).
>
> As much as I agree the code is dumb, doing what you suggest with mmap_sem
> isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> as that will cause lock inversion between mmap_sem and whatever filesystem
> locks we have to take. So the fix would have to involve processing larger
> batches of address space at once (which should also somewhat amortize the
> __get_user_pages() setup costs).
> Not that hard to do but I wanted to spell
> it out in case someone wants to pick up this todo item :)
>

Is the lock really needed to begin with?

Suppose it is. In this context there are next to no pages found, but there
are a gazillion relocks as the entire VA is being walked.

A bare-minimum patch which would already significantly help would start with
the lock held and only relock when there is a page to dump; that should be
very easy to add (rough sketch below).

I however vote for someone mm-savvy to point out an easy way (if any) to
just iterate the pages which are there instead.

-- 
Mateusz Guzik <mjguzik gmail.com>
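
For illustration only, a rough sketch of that bare-minimum idea, patterned
on the general shape of dump_user_range() in fs/coredump.c and folding in
Tavian's dump_interrupted() check on the skip path. This is a sketch under
assumptions, not an actual patch: get_dump_page_locked() is a hypothetical
name for a variant of get_dump_page() that expects the caller to already
hold mmap_lock (the stock helper takes and drops the lock for every page),
and other details of the real function are glossed over.

    /*
     * Rough sketch only, not a real patch: the shape of dump_user_range()
     * with mmap_lock held across the walk.  get_dump_page_locked() is a
     * hypothetical variant of get_dump_page() that expects mmap_lock to be
     * held by the caller instead of taking and dropping it per page.
     */
    static int dump_user_range_sketch(struct coredump_params *cprm,
                                      unsigned long start, unsigned long len)
    {
            struct mm_struct *mm = current->mm;
            unsigned long addr;

            mmap_read_lock(mm);
            for (addr = start; addr < start + len; addr += PAGE_SIZE) {
                    struct page *page = get_dump_page_locked(addr); /* hypothetical */

                    if (page) {
                            /*
                             * Writing the page may take filesystem locks, so
                             * drop mmap_lock first (Jan's lock inversion
                             * point) and retake it afterwards.  In the sparse
                             * case this path is rare, so the lock stays held
                             * for almost the entire walk.
                             */
                            mmap_read_unlock(mm);
                            if (!dump_emit_page(cprm, page)) {
                                    put_page(page);
                                    return 0;
                            }
                            put_page(page);
                            mmap_read_lock(mm);
                    } else {
                            /*
                             * Sparse hole: nothing to write, but still honor
                             * SIGKILL (the dump_interrupted() check from the
                             * patch under discussion).
                             */
                            if (dump_interrupted()) {
                                    mmap_read_unlock(mm);
                                    return 0;
                            }
                            dump_skip(cprm, PAGE_SIZE);
                    }
                    cond_resched();
            }
            mmap_read_unlock(mm);
            return 1;
    }

Jan's batching suggestion would go further: pull a batch of pages per
__get_user_pages() call before dropping the lock, which would also amortize
the per-call setup cost he mentions.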