Re: [RESEND PATCH] fs: avoid mmap sem relocks when coredumping with many missing pages

On Sun 19-01-25 11:32:05, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
> 
> Borrowing a test case from Tavian Barnes:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> 
> int main(void) {
>     char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>             MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>     printf("%p %m\n", mem);
>     if (mem != MAP_FAILED) {
>             mem[0] = 1;
>     }
>     abort();
> }
> 
> That's 1TB of almost completely not-populated area.
> 
> On my test box it takes 13-14 seconds to dump.
> 
> The profile shows:
> -   99.89%     0.00%  a.out
>      entry_SYSCALL_64_after_hwframe
>      do_syscall_64
>      syscall_exit_to_user_mode
>      arch_do_signal_or_restart
>    - get_signal
>       - 99.89% do_coredump
>          - 99.88% elf_core_dump
>             - dump_user_range
>                - 98.12% get_dump_page
>                   - 64.19% __get_user_pages
>                      - 40.92% gup_vma_lookup
>                         - find_vma
>                            - mt_find
>                                 4.21% __rcu_read_lock
>                                 1.33% __rcu_read_unlock
>                      - 3.14% check_vma_flags
>                           0.68% vma_is_secretmem
>                        0.61% __cond_resched
>                        0.60% vma_pgtable_walk_end
>                        0.59% vma_pgtable_walk_begin
>                        0.58% no_page_table
>                   - 15.13% down_read_killable
>                        0.69% __cond_resched
>                     13.84% up_read
>                  0.58% __cond_resched
> 
> Almost 29% of the time is spent relocking the mmap semaphore between
> calls to get_dump_page() which find nothing.
> 
> Whacking that results in times of 10 seconds (down from 13-14).
> 
> While here make the thing killable.
> 
> The real problem is the page-sized iteration and the real fix would
> patch it up instead. It is left as an exercise for the mm-familiar
> reader.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

BTW: I don't see how we could fundamentally move away from page-sized
iteration, because core dumping is "by definition" walking page tables and
gathering the pages there. But it could certainly be much more efficient if
implemented properly (e.g. in the example above we'd see that most of the
PGD-level tables are not even allocated, so we could skip 1GB ranges of
address space in one step).

								Honza

> ---
> 
> Minimally tested, very plausible I missed something.
> 
> Sent again because the previous copy had myself in To: -- I failed to
> fix up the one-liner suggested by lore.kernel.org. It seems the
> original got lost.
> 
>  arch/arm64/kernel/elfcore.c |  3 ++-
>  fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
>  include/linux/mm.h          |  2 +-
>  mm/gup.c                    |  5 ++---
>  4 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
> index 2e94d20c4ac7..b735f4c2fe5e 100644
> --- a/arch/arm64/kernel/elfcore.c
> +++ b/arch/arm64/kernel/elfcore.c
> @@ -27,9 +27,10 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
>  	int ret = 1;
>  	unsigned long addr;
>  	void *tags = NULL;
> +	int locked = 0;
>  
>  	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
> -		struct page *page = get_dump_page(addr);
> +		struct page *page = get_dump_page(addr, &locked);
>  
>  		/*
>  		 * get_dump_page() returns NULL when encountering an empty
> diff --git a/fs/coredump.c b/fs/coredump.c
> index d48edb37bc35..84cf76f0d5b6 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -925,14 +925,23 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
>  {
>  	unsigned long addr;
>  	struct page *dump_page;
> +	int locked, ret;
>  
>  	dump_page = dump_page_alloc();
>  	if (!dump_page)
>  		return 0;
>  
> +	ret = 0;
> +	locked = 0;
>  	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
>  		struct page *page;
>  
> +		if (!locked) {
> +			if (mmap_read_lock_killable(current->mm))
> +				goto out;
> +			locked = 1;
> +		}
> +
>  		/*
>  		 * To avoid having to allocate page tables for virtual address
>  		 * ranges that have never been used yet, and also to make it
> @@ -940,21 +949,38 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
>  		 * NULL when encountering an empty page table entry that would
>  		 * otherwise have been filled with the zero page.
>  		 */
> -		page = get_dump_page(addr);
> +		page = get_dump_page(addr, &locked);
>  		if (page) {
> +			if (locked) {
> +				mmap_read_unlock(current->mm);
> +				locked = 0;
> +			}
>  			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));
>  			put_page(page);
> -			if (stop) {
> -				dump_page_free(dump_page);
> -				return 0;
> -			}
> +			if (stop)
> +				goto out;
>  		} else {
>  			dump_skip(cprm, PAGE_SIZE);
>  		}
> +
> +		if (dump_interrupted())
> +			goto out;
> +
> +		if (!need_resched())
> +			continue;
> +		if (locked) {
> +			mmap_read_unlock(current->mm);
> +			locked = 0;
> +		}
>  		cond_resched();
>  	}
> +	ret = 1;
> +out:
> +	if (locked)
> +		mmap_read_unlock(current->mm);
> +
>  	dump_page_free(dump_page);
> -	return 1;
> +	return ret;
>  }
>  #endif
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 75c9b4f46897..7df0d9200d8c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2633,7 +2633,7 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
>  			struct task_struct *task, bool bypass_rlim);
>  
>  struct kvec;
> -struct page *get_dump_page(unsigned long addr);
> +struct page *get_dump_page(unsigned long addr, int *locked);
>  
>  bool folio_mark_dirty(struct folio *folio);
>  bool folio_mark_dirty_lock(struct folio *folio);
> diff --git a/mm/gup.c b/mm/gup.c
> index 2304175636df..f3be2aa43543 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2266,13 +2266,12 @@ EXPORT_SYMBOL(fault_in_readable);
>   * Called without mmap_lock (takes and releases the mmap_lock by itself).
>   */
>  #ifdef CONFIG_ELF_CORE
> -struct page *get_dump_page(unsigned long addr)
> +struct page *get_dump_page(unsigned long addr, int *locked)
>  {
>  	struct page *page;
> -	int locked = 0;
>  	int ret;
>  
> -	ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
> +	ret = __get_user_pages_locked(current->mm, addr, 1, &page, locked,
>  				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>  	return (ret == 1) ? page : NULL;
>  }
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



