On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> dump_user_range() supports sparse core dumps by skipping anonymous pages
> which have not been modified.  If get_dump_page() returns NULL, the page
> is skipped rather than written to the core dump with dump_emit_page().
>
> Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> so when dumping a very large sparse region, the core dump becomes
> effectively uninterruptible.  This can be observed with the following
> test program:
>
>     #include <stdlib.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
>
>     int main(void) {
>         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>         printf("%p %m\n", mem);
>         if (mem != MAP_FAILED) {
>             mem[0] = 1;
>         }
>         abort();
>     }
>
> The program allocates 1 TiB of anonymous memory, touches one page of it,
> and aborts.  During the core dump, SIGKILL has no effect.  It takes
> about 30 seconds to finish the dump, burning 100% CPU.

While the patch makes sense to me, this should not be taking anywhere near
this much time, and plausibly, once the slowness is unscrewed, this will
stop being a factor. So I had a look with a profiler:

-   99.89%     0.00%  a.out
      entry_SYSCALL_64_after_hwframe
      do_syscall_64
      syscall_exit_to_user_mode
      arch_do_signal_or_restart
    - get_signal
       - 99.89% do_coredump
          - 99.88% elf_core_dump
             - dump_user_range
                - 98.12% get_dump_page
                   - 64.19% __get_user_pages
                      - 40.92% gup_vma_lookup
                         - find_vma
                            - mt_find
                                 4.21% __rcu_read_lock
                                 1.33% __rcu_read_unlock
                      - 3.14% check_vma_flags
                           0.68% vma_is_secretmem
                        0.61% __cond_resched
                        0.60% vma_pgtable_walk_end
                        0.59% vma_pgtable_walk_begin
                        0.58% no_page_table
                   - 15.13% down_read_killable
                        0.69% __cond_resched
                     13.84% up_read
                        0.58% __cond_resched

Almost 29% of the time is spent relocking the mmap semaphore in
__get_user_pages. This most likely can operate locklessly in the fast
path. Even if somehow not, chances are the lock can be held across
multiple calls.

mt_find spends most of its time issuing a rep stos of 48 bytes (it would
be faster to rep mov 6 times instead). This is the compiler being nasty;
I'll maybe look into it.

However, I strongly suspect the current iteration method is just slow due
to repeated mt_find calls, and The Right Approach(tm) would make this
entire thing finish within milliseconds by iterating the maple tree
instead. But then the mm folks would have to be consulted on how to
approach this, and it may be time-consuming to implement.

Sorting out the relocking should be an easily achievable & measurable win
(no interest on my end, though).
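
For reference, a sketch of the kind of change being discussed, i.e. also
checking dump_interrupted() on the skip path of dump_user_range() rather
than only inside dump_emit_page(). The loop below is paraphrased from
fs/coredump.c and is not necessarily the actual diff:

    /*
     * Paraphrased dump_user_range() loop; the dump_interrupted() check
     * on the skip branch is the behavior the patch argues for, not a
     * verbatim copy of it.
     */
    for (addr = start; addr < start + len; addr += PAGE_SIZE) {
            struct page *page = get_dump_page(addr);

            if (page) {
                    /* dump_emit_page() already consults dump_interrupted() */
                    int stop = !dump_emit_page(cprm, page);
                    put_page(page);
                    if (stop)
                            return 0;
            } else {
                    dump_skip(cprm, PAGE_SIZE);
                    /*
                     * Added check: without it, a huge sparse region never
                     * reaches dump_emit_page(), so SIGKILL is ignored until
                     * the whole region has been skipped.
                     */
                    if (dump_interrupted())
                            return 0;
            }
    }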