On Thu, Jan 16, 2025 at 01:04:42PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> > > On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > > > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > > > which have not been modified. If get_dump_page() returns NULL, the page
> > > > is skipped rather than written to the core dump with dump_emit_page().
> > > >
> > > > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > > > so when dumping a very large sparse region, the core dump becomes
> > > > effectively uninterruptible. This can be observed with the following
> > > > test program:
> > > >
> > > >     #include <stdlib.h>
> > > >     #include <stdio.h>
> > > >     #include <sys/mman.h>
> > > >
> > > >     int main(void) {
> > > >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> > > >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> > > >         printf("%p %m\n", mem);
> > > >         if (mem != MAP_FAILED) {
> > > >             mem[0] = 1;
> > > >         }
> > > >         abort();
> > > >     }
> > > >
> > > > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > > > and aborts. During the core dump, SIGKILL has no effect. It takes
> > > > about 30 seconds to finish the dump, burning 100% CPU.
> > > >
> > >
> > > While the patch makes sense to me, this should not be taking anywhere
> > > near this much time and plausibly after unscrewing it will stop being a
> > > factor.
> > >
> > > So I had a look with a profiler:
> > >   - 99.89%  0.00%  a.out
> > >        entry_SYSCALL_64_after_hwframe
> > >        do_syscall_64
> > >        syscall_exit_to_user_mode
> > >        arch_do_signal_or_restart
> > >      - get_signal
> > >         - 99.89% do_coredump
> > >            - 99.88% elf_core_dump
> > >               - dump_user_range
> > >                  - 98.12% get_dump_page
> > >                     - 64.19% __get_user_pages
> > >                        - 40.92% gup_vma_lookup
> > >                           - find_vma
> > >                              - mt_find
> > >                                   4.21% __rcu_read_lock
> > >                                   1.33% __rcu_read_unlock
> > >                        - 3.14% check_vma_flags
> > >                             0.68% vma_is_secretmem
> > >                          0.61% __cond_resched
> > >                          0.60% vma_pgtable_walk_end
> > >                          0.59% vma_pgtable_walk_begin
> > >                          0.58% no_page_table
> > >                     - 15.13% down_read_killable
> > >                          0.69% __cond_resched
> > >                       13.84% up_read
> > >                          0.58% __cond_resched
> > >
> > > Almost 29% of time is spent relocking the mmap semaphore in
> > > __get_user_pages. This most likely can operate locklessly in the fast
> > > path. Even if somehow not, chances are the lock can be held across
> > > multiple calls.
> > >
> > > mt_find spends most of its time issuing a rep stos of 48 bytes (would
> > > be faster to rep mov 6 times instead). This is the compiler being nasty,
> > > I'll maybe look into it.
> > >
> > > However, I strongly suspect the current iteration method is just slow
> > > due to repeat mt_find calls and The Right Approach(tm) would make this
> > > entire thing finish within milliseconds by iterating the maple tree
> > > instead, but then the mm folk would have to be consulted on how to
> > > approach this and it may be time consuming to implement.
> > >
> > > Sorting out relocking should be an easily achievable & measurable win
> > > (no interest on my end).
> >
> > As much as I agree the code is dumb, doing what you suggest with mmap_sem
> > isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> > as that will cause lock inversion between mmap_sem and whatever filesystem
> > locks we have to take. So the fix would have to involve processing larger
> > batches of address space at once (which should also somewhat amortize the
> > __get_user_pages() setup costs). Not that hard to do but I wanted to spell
> > it out in case someone wants to pick up this todo item :)
> >
>
> Is the lock really needed to begin with?
>
> Suppose it is.
>
> In this context there are next to no pages found, but there is a
> gazillion relocks as the entire VA is being walked.

Do I understand correctly that all the relocks are to look up the VMA
associated with each address, one page at a time? That's especially
wasteful as dump_user_range() is called separately for each VMA, so it's
going to find the same VMA every time anyway.

> Bare minimum patch which would already significantly help would start
> with the lock held and only relock if there is a page to dump, should
> be very easy to add.

That seems like a good idea.
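Something like this is what I picture, I assume? Completely untested
sketch; it leans on a hypothetical get_dump_page_locked() that behaves
like get_dump_page() but expects mmap_lock to already be held for read
(no such helper exists today):

int dump_user_range(struct coredump_params *cprm, unsigned long start,
                    unsigned long len)
{
        struct mm_struct *mm = current->mm;
        unsigned long addr;

        if (mmap_read_lock_killable(mm))
                return 0;

        for (addr = start; addr < start + len; addr += PAGE_SIZE) {
                /* Hypothetical: get_dump_page() minus the per-page lock/unlock. */
                struct page *page = get_dump_page_locked(addr);
                int stop;

                if (!page) {
                        /* Hole: keep the lock, just account for the skipped page. */
                        dump_skip(cprm, PAGE_SIZE);
                        continue;
                }

                /*
                 * Drop mmap_lock before writing so it is never held across
                 * dump_emit_page() -> filesystem locks (the inversion Jan
                 * pointed out), then re-take it for the next iteration.
                 */
                mmap_read_unlock(mm);
                stop = !dump_emit_page(cprm, page);
                put_page(page);
                if (stop)
                        return 0;
                if (mmap_read_lock_killable(mm))
                        return 0;
        }

        mmap_read_unlock(mm);
        return 1;
}

The hole path never drops the lock, so the walk over 1 TiB of nothing
would happen under a single read lock, and we would only pay the
unlock/relock when there is actually a page to write out.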
> I however vote for someone mm-savvy to point out an easy way (if any)
> to just iterate pages which are there instead.

It seems like some of the <linux/pagewalk.h> APIs might be relevant? Not
sure which one has the right semantics. Can we just use folio_walk_start()?
I guess the main complexity is every time we find a page, we have to stop
the walk, unlock mmap_sem, call dump_emit_page(), and restart the walk
from the next address. Maybe an mm expert can weigh in. (I've put a rough
sketch of what I'm imagining at the bottom of this mail.)

> --
> Mateusz Guzik <mjguzik gmail.com>
>
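For concreteness, here is roughly the loop I was imagining for the
folio_walk_start() route. Completely untested and quite possibly misusing
the API: I'm assuming folio_walk_start() wants mmap_lock held and returns
with the page table lock held (so we take our own folio reference before
folio_walk_end() drops it), that leaving FW_ZEROPAGE off makes untouched
and zero pages look like holes the way FOLL_DUMP does today, that the
caller can hand down the vma it already looked up, and that the vma stays
stable across the lock drops because every other thread is already parked
by the time we dump. The function name dump_vma_pages() is made up.

int dump_vma_pages(struct coredump_params *cprm, struct vm_area_struct *vma,
                   unsigned long start, unsigned long len)
{
        struct mm_struct *mm = vma->vm_mm;
        unsigned long addr;

        if (mmap_read_lock_killable(mm))
                return 0;

        for (addr = start; addr < start + len; addr += PAGE_SIZE) {
                struct folio_walk fw;
                struct folio *folio;
                struct page *page;
                int stop;

                /* No FW_ZEROPAGE: untouched pages should read as holes. */
                folio = folio_walk_start(&fw, vma, addr, 0);
                if (!folio) {
                        dump_skip(cprm, PAGE_SIZE);
                        continue;
                }

                /* Grab a reference before folio_walk_end() drops the PTL. */
                page = fw.page;
                folio_get(folio);
                folio_walk_end(&fw, vma);

                /* Same dance as before: never hold mmap_lock across the write. */
                mmap_read_unlock(mm);
                stop = !dump_emit_page(cprm, page);
                folio_put(folio);
                if (stop)
                        return 0;
                /* Assumes vma can't change under us while the mm is quiesced. */
                if (mmap_read_lock_killable(mm))
                        return 0;
        }

        mmap_read_unlock(mm);
        return 1;
}

This still visits every page-sized address rather than jumping straight to
the populated ranges, so it isn't quite "iterate pages which are there",
but it would at least drop the per-page GUP + VMA lookup + relock overhead
that dominates the profile.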