On Thu, Jan 16, 2025 at 01:04:42PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> > > On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > > > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > > > which have not been modified. If get_dump_page() returns NULL, the page
> > > > is skipped rather than written to the core dump with dump_emit_page().
> > > >
> > > > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > > > so when dumping a very large sparse region, the core dump becomes
> > > > effectively uninterruptible. This can be observed with the following
> > > > test program:
> > > >
> > > >     #include <stdlib.h>
> > > >     #include <stdio.h>
> > > >     #include <sys/mman.h>
> > > >
> > > >     int main(void) {
> > > >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> > > >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> > > >         printf("%p %m\n", mem);
> > > >         if (mem != MAP_FAILED) {
> > > >             mem[0] = 1;
> > > >         }
> > > >         abort();
> > > >     }
> > > >
> > > > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > > > and aborts. During the core dump, SIGKILL has no effect. It takes
> > > > about 30 seconds to finish the dump, burning 100% CPU.
> > > >
> > >
> > > While the patch makes sense to me, this should not be taking anywhere
> > > near this much time and plausibly after unscrewing it will stop being a
> > > factor.
> > >
> > > So I had a look with a profiler:
> > >   - 99.89%  0.00%  a.out
> > >        entry_SYSCALL_64_after_hwframe
> > >        do_syscall_64
> > >        syscall_exit_to_user_mode
> > >        arch_do_signal_or_restart
> > >      - get_signal
> > >         - 99.89% do_coredump
> > >            - 99.88% elf_core_dump
> > >               - dump_user_range
> > >                  - 98.12% get_dump_page
> > >                     - 64.19% __get_user_pages
> > >                        - 40.92% gup_vma_lookup
> > >                           - find_vma
> > >                              - mt_find
> > >                                   4.21% __rcu_read_lock
> > >                                   1.33% __rcu_read_unlock
> > >                        - 3.14% check_vma_flags
> > >                             0.68% vma_is_secretmem
> > >                          0.61% __cond_resched
> > >                          0.60% vma_pgtable_walk_end
> > >                          0.59% vma_pgtable_walk_begin
> > >                          0.58% no_page_table
> > >                     - 15.13% down_read_killable
> > >                          0.69% __cond_resched
> > >                       13.84% up_read
> > >                          0.58% __cond_resched
> > >
> > > Almost 29% of time is spent relocking the mmap semaphore in
> > > __get_user_pages. This most likely can operate locklessly in the fast
> > > path. Even if somehow not, chances are the lock can be held across
> > > multiple calls.
> > >
> > > mt_find spends most of its time issuing a rep stos of 48 bytes (would
> > > be faster to rep mov 6 times instead). This is the compiler being nasty,
> > > I'll maybe look into it.
> > >
> > > However, I strongly suspect the current iteration method is just slow
> > > due to repeat mt_find calls and The Right Approach(tm) would make this
> > > entire thing finish within milliseconds by iterating the maple tree
> > > instead, but then the mm folk would have to be consulted on how to
> > > approach this and it may be time consuming to implement.
> > >
> > > Sorting out relocking should be an easily achievable & measurable win
> > > (no interest on my end).
> >
> > As much as I agree the code is dumb, doing what you suggest with mmap_sem
> > isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> > as that will cause lock inversion between mmap_sem and whatever filesystem
> > locks we have to take. So the fix would have to involve processing larger
> > batches of address space at once (which should also somewhat amortize the
> > __get_user_pages() setup costs). Not that hard to do but I wanted to spell
> > it out in case someone wants to pick up this todo item :)
> >
>
> Is the lock really needed to begin with?
>
> Suppose it is.
>
> In this context there are next to no pages found, but there is a
> gazillion relocks as the entire VA is being walked.

Do I understand correctly that all the relocks are to look up the VMA
associated with each address, one page at a time? That's especially
wasteful as dump_user_range() is called separately for each VMA, so it's
going to find the same VMA every time anyway.

> Bare minimum patch which would already significantly help would start
> with the lock held and only relock if there is a page to dump, should
> be very easy to add.

That seems like a good idea.
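Something like this is what I picture, I assume? Completely untested
sketch; it leans on a hypothetical get_dump_page_locked() that behaves
like get_dump_page() but expects mmap_lock to already be held for read
(no such helper exists today):

int dump_user_range(struct coredump_params *cprm, unsigned long start,
                    unsigned long len)
{
        struct mm_struct *mm = current->mm;
        unsigned long addr;

        if (mmap_read_lock_killable(mm))
                return 0;

        for (addr = start; addr < start + len; addr += PAGE_SIZE) {
                /* Hypothetical: get_dump_page() minus the per-page lock/unlock. */
                struct page *page = get_dump_page_locked(addr);
                int stop;

                if (!page) {
                        /* Hole: keep the lock, just account for the skipped page. */
                        dump_skip(cprm, PAGE_SIZE);
                        continue;
                }

                /*
                 * Drop mmap_lock before writing so it is never held across
                 * dump_emit_page() -> filesystem locks (the inversion Jan
                 * pointed out), then re-take it for the next iteration.
                 */
                mmap_read_unlock(mm);
                stop = !dump_emit_page(cprm, page);
                put_page(page);
                if (stop)
                        return 0;
                if (mmap_read_lock_killable(mm))
                        return 0;
        }

        mmap_read_unlock(mm);
        return 1;
}

The hole path never drops the lock, so the walk over 1 TiB of nothing
would happen under a single read lock, and we would only pay the
unlock/relock when there is actually a page to write out.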
> I however vote for someone mm-savvy to point out an easy way (if any)
> to just iterate pages which are there instead.

It seems like some of the <linux/pagewalk.h> APIs might be relevant? Not
sure which one has the right semantics. Can we just use folio_walk_start()?
I guess the main complexity is every time we find a page, we have to stop
the walk, unlock mmap_sem, call dump_emit_page(), and restart the walk
from the next address. Maybe an mm expert can weigh in. (I've put a rough
sketch of what I'm imagining at the bottom of this mail.)

> --
> Mateusz Guzik <mjguzik gmail.com>
>
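For concreteness, here is roughly the loop I was imagining for the
folio_walk_start() route. Completely untested and quite possibly misusing
the API: I'm assuming folio_walk_start() wants mmap_lock held and returns
with the page table lock held (so we take our own folio reference before
folio_walk_end() drops it), that leaving FW_ZEROPAGE off makes untouched
and zero pages look like holes the way FOLL_DUMP does today, that the
caller can hand down the vma it already looked up, and that the vma stays
stable across the lock drops because every other thread is already parked
by the time we dump. The function name dump_vma_pages() is made up.

int dump_vma_pages(struct coredump_params *cprm, struct vm_area_struct *vma,
                   unsigned long start, unsigned long len)
{
        struct mm_struct *mm = vma->vm_mm;
        unsigned long addr;

        if (mmap_read_lock_killable(mm))
                return 0;

        for (addr = start; addr < start + len; addr += PAGE_SIZE) {
                struct folio_walk fw;
                struct folio *folio;
                struct page *page;
                int stop;

                /* No FW_ZEROPAGE: untouched pages should read as holes. */
                folio = folio_walk_start(&fw, vma, addr, 0);
                if (!folio) {
                        dump_skip(cprm, PAGE_SIZE);
                        continue;
                }

                /* Grab a reference before folio_walk_end() drops the PTL. */
                page = fw.page;
                folio_get(folio);
                folio_walk_end(&fw, vma);

                /* Same dance as before: never hold mmap_lock across the write. */
                mmap_read_unlock(mm);
                stop = !dump_emit_page(cprm, page);
                folio_put(folio);
                if (stop)
                        return 0;
                /* Assumes vma can't change under us while the mm is quiesced. */
                if (mmap_read_lock_killable(mm))
                        return 0;
        }

        mmap_read_unlock(mm);
        return 1;
}

This still visits every page-sized address rather than jumping straight to
the populated ranges, so it isn't quite "iterate pages which are there",
but it would at least drop the per-page GUP + VMA lookup + relock overhead
that dominates the profile.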