On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > which have not been modified. If get_dump_page() returns NULL, the page
> > is skipped rather than written to the core dump with dump_emit_page().
> >
> > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > so when dumping a very large sparse region, the core dump becomes
> > effectively uninterruptible. This can be observed with the following
> > test program:
> >
> >     #include <stdlib.h>
> >     #include <stdio.h>
> >     #include <sys/mman.h>
> >
> >     int main(void) {
> >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> >         printf("%p %m\n", mem);
> >         if (mem != MAP_FAILED) {
> >             mem[0] = 1;
> >         }
> >         abort();
> >     }
> >
> > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > and aborts. During the core dump, SIGKILL has no effect. It takes
> > about 30 seconds to finish the dump, burning 100% CPU.
>
> While the patch makes sense to me, this should not be taking anywhere
> near this much time, and plausibly after unscrewing it it will stop
> being a factor.
>
> So I had a look with a profiler:
>
>     - 99.89%  0.00%  a.out
>           entry_SYSCALL_64_after_hwframe
>           do_syscall_64
>           syscall_exit_to_user_mode
>           arch_do_signal_or_restart
>         - get_signal
>            - 99.89% do_coredump
>               - 99.88% elf_core_dump
>                  - dump_user_range
>                     - 98.12% get_dump_page
>                        - 64.19% __get_user_pages
>                           - 40.92% gup_vma_lookup
>                              - find_vma
>                                 - mt_find
>                                      4.21% __rcu_read_lock
>                                      1.33% __rcu_read_unlock
>                           - 3.14% check_vma_flags
>                                0.68% vma_is_secretmem
>                             0.61% __cond_resched
>                             0.60% vma_pgtable_walk_end
>                             0.59% vma_pgtable_walk_begin
>                             0.58% no_page_table
>                        - 15.13% down_read_killable
>                             0.69% __cond_resched
>                          13.84% up_read
>                          0.58% __cond_resched
>
> Almost 29% of the time is spent relocking the mmap semaphore in
> __get_user_pages. This most likely can operate locklessly in the fast
> path. Even if somehow not, chances are the lock can be held across
> multiple calls.
>
> mt_find spends most of its time issuing a rep stos of 48 bytes (it
> would be faster to rep mov 6 times instead). This is the compiler being
> nasty; I'll maybe look into it.
>
> However, I strongly suspect the current iteration method is just slow
> due to repeated mt_find calls, and The Right Approach(tm) would make
> this entire thing finish within milliseconds by iterating the maple
> tree instead. But then the mm folk would have to be consulted on how to
> approach this, and it may be time-consuming to implement.
>
> Sorting out the relocking should be an easily achievable & measurable
> win (no interest on my end).

As much as I agree the code is dumb, doing what you suggest with
mmap_sem isn't going to be easy. You cannot call dump_emit_page() with
mmap_sem held, as that would cause lock inversion between mmap_sem and
whatever filesystem locks we have to take. So the fix would have to
involve processing larger batches of address space at once (which
should also somewhat amortize the __get_user_pages() setup costs). Not
that hard to do, but I wanted to spell it out in case someone wants to
pick up this todo item :)

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
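
For illustration only, here is a rough sketch of the batching idea Jan
describes: look up a batch of pages under a single mmap_read_lock(),
drop the lock, and only then write or skip them, so dump_emit_page()
never takes filesystem locks under mmap_lock. get_dump_page_locked()
is a hypothetical helper standing in for a variant of get_dump_page()
that expects the caller to already hold mmap_lock; the function name
and DUMP_BATCH constant are made up, and none of this has been
compiled or tested:

    /*
     * Sketch: one mmap_read_lock()/unlock() cycle per batch instead of
     * per page, with the actual dump writes done after unlocking.
     */
    #define DUMP_BATCH	64

    static int dump_user_range_batched(struct coredump_params *cprm,
    				       unsigned long start, unsigned long len)
    {
    	struct mm_struct *mm = current->mm;
    	struct page *pages[DUMP_BATCH];
    	unsigned long addr = start, end = start + len;

    	while (addr < end) {
    		unsigned long nr = min_t(unsigned long, DUMP_BATCH,
    					 (end - addr) >> PAGE_SHIFT);
    		unsigned long i;

    		if (mmap_read_lock_killable(mm))
    			return 0;
    		for (i = 0; i < nr; i++)
    			/* Hypothetical: get_dump_page() variant for callers
    			 * that already hold mmap_lock. */
    			pages[i] = get_dump_page_locked(mm, addr + i * PAGE_SIZE);
    		mmap_read_unlock(mm);

    		/* Write or skip each page with mmap_lock dropped. */
    		for (i = 0; i < nr; i++) {
    			int stop;

    			if (pages[i]) {
    				stop = !dump_emit_page(cprm, pages[i]);
    				put_page(pages[i]);
    			} else {
    				stop = !dump_skip(cprm, PAGE_SIZE);
    			}
    			if (stop || dump_interrupted()) {
    				/* Release the rest of this batch. */
    				while (++i < nr) {
    					if (pages[i])
    						put_page(pages[i]);
    				}
    				return 0;
    			}
    		}
    		addr += nr * PAGE_SIZE;
    	}
    	return 1;
    }

The batch size is arbitrary; the point is only that the lock/unlock and
lookup setup costs are paid once per DUMP_BATCH pages, and that both
dump_interrupted() and SIGKILL get a chance to stop the dump on every
page, including the skipped ones.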