Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
directly modify the entries of VMAs in the new maple tree, which is
more efficient than building the new tree by inserting each VMA.
dup_mmap() is used by fork(), so this patch optimizes fork(). The
optimization effect is proportional to the number of VMAs.

With this method, the optimization in "maple_tree: add a fast path case
in mas_wr_slot_store()" [1] no longer takes effect here, although it
remains a worthwhile maple tree optimization in its own right.

The unixbench test suite [2] includes a 'spawn' benchmark that
exercises fork(). 'spawn' only has 23 VMAs by default, so I tweaked the
benchmark code a bit to control the number of VMAs with mmap(), which
makes it possible to measure performance under different VMA counts.
Code like the following was inserted into 'spawn'; alternating the
protection flags prevents adjacent mappings from merging, so each
mmap() adds one VMA:

	for (int i = 0; i < 200; ++i) {
		size_t size = 10 * getpagesize();
		void *addr;

		if (i & 1) {
			addr = mmap(NULL, size, PROT_READ,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		} else {
			addr = mmap(NULL, size, PROT_WRITE,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		}
		if (addr == MAP_FAILED)
			...
	}

(A self-contained version of this test is sketched after the patch.)

Based on next-20230721, I ran 'spawn' with 23, 223, and 4023 VMAs, four
30-second runs per configuration. The numbers below are the number of
successful fork() calls in 30 seconds (average of the best 3 of the 4
runs). As a comparison, I also reverted [1] on top of next-20230725 and
ran the same test. To ensure reliable results, all tests were run on a
physical machine. Each percentage line shows the improvement of the row
below it over the row above it.

                  23VMAs     223VMAs    4023VMAs
revert [1]:    159104.00    73316.33     6787.00
                  +0.77%      +0.42%      +0.28%
next-20230721: 160321.67    73624.67     6806.33
                  +2.77%     +15.42%     +29.86%
apply this:    164751.67    84980.33     8838.67

As the table shows, the performance improvement is proportional to the
number of VMAs: about 3% with 23 VMAs, about 15% with 223 VMAs, and
about 30% with 4023 VMAs.

[1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@xxxxxxxxxxxxx/
[2] https://github.com/kdlucas/byte-unixbench/tree/master

Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
---
 kernel/fork.c | 35 +++++++++++++++++++++++++++--------
 mm/mmap.c     | 14 ++++++++++++--
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f81149739eb9..ef80025b62d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	int retval;
 	unsigned long charge = 0;
 	LIST_HEAD(uf);
-	VMA_ITERATOR(old_vmi, oldmm, 0);
 	VMA_ITERATOR(vmi, mm, 0);
 
 	uprobe_start_dup_mmap();
@@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		goto out;
 	khugepaged_fork(mm, oldmm);
 
-	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
-	if (retval)
+	/* Use __mt_dup() to efficiently build an identical maple tree. */
+	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
+	if (unlikely(retval))
 		goto out;
 
 	mt_clear_in_rcu(vmi.mas.tree);
-	for_each_vma(old_vmi, mpnt) {
+	for_each_vma(vmi, mpnt) {
 		struct file *file;
 
 		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+
+			/*
+			 * Since the new tree is exactly the same as the old one,
+			 * we need to remove the unneeded VMAs.
+			 */
+			mas_store(&vmi.mas, NULL);
+
+			/*
+			 * Even removing an entry may require memory allocation,
+			 * and if removal fails, we use XA_ZERO_ENTRY to mark
+			 * from which VMA it failed. The case of encountering
+			 * XA_ZERO_ENTRY will be handled in exit_mmap().
+			 */
+			if (unlikely(mas_is_err(&vmi.mas))) {
+				retval = xa_err(vmi.mas.node);
+				mas_reset(&vmi.mas);
+				if (mas_find(&vmi.mas, ULONG_MAX))
+					mas_replace_entry(&vmi.mas,
+							  XA_ZERO_ENTRY);
+				goto loop_out;
+			}
+
 			continue;
 		}
 		charge = 0;
@@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		hugetlb_dup_vma_private(tmp);
 
 		/* Link the vma into the MT */
-		if (vma_iter_bulk_store(&vmi, tmp))
-			goto fail_nomem_vmi_store;
+		mas_replace_entry(&vmi.mas, tmp);
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	uprobe_end_dup_mmap();
 	return retval;
 
-fail_nomem_vmi_store:
-	unlink_anon_vmas(tmp);
 fail_nomem_anon_vma_fork:
 	mpol_put(vma_policy(tmp));
 fail_nomem_policy:
diff --git a/mm/mmap.c b/mm/mmap.c
index bc91d91261ab..5bfba2fb0e39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
 	arch_exit_mmap(mm);
 
 	vma = mas_find(&mas, ULONG_MAX);
-	if (!vma) {
+	/*
+	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
+	 * xa_is_zero(vma) may be true.
+	 */
+	if (!vma || xa_is_zero(vma)) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_read_unlock(mm);
 		return;
@@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
 		remove_vma(vma, true);
 		count++;
 		cond_resched();
-	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+		vma = mas_find(&mas, ULONG_MAX);
+		/*
+		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
+		 * do not need to be removed. Can happen if dup_mmap() fails to
+		 * remove a VMA marked VM_DONTCOPY.
+		 */
+	} while (vma != NULL && !xa_is_zero(vma));
 
 	BUG_ON(count != mm->map_count);
 
--
2.20.1
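
For convenience, here is the self-contained approximation of the
modified 'spawn' test referenced in the commit message. It is only a
sketch, not the actual unixbench code: the file name, the nvma and
duration constants, and the exit-immediately child are illustrative
assumptions; the real 'spawn' benchmark measures loops-per-second
through its own harness.

	/* spawn_vmas.c -- build with: gcc -O2 spawn_vmas.c -o spawn_vmas */
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		const int nvma = 200;       /* extra VMAs to add before forking */
		const time_t duration = 30; /* seconds, matching the test above */
		long count = 0;
		time_t end;

		/*
		 * Alternate PROT_READ/PROT_WRITE so adjacent anonymous
		 * mappings cannot merge and each mmap() adds one VMA.
		 */
		for (int i = 0; i < nvma; ++i) {
			size_t size = 10 * getpagesize();
			int prot = (i & 1) ? PROT_READ : PROT_WRITE;
			void *addr = mmap(NULL, size, prot,
					  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

			if (addr == MAP_FAILED) {
				perror("mmap");
				return EXIT_FAILURE;
			}
		}

		/* Count successful fork()s in a fixed time window. */
		end = time(NULL) + duration;
		while (time(NULL) < end) {
			pid_t pid = fork();

			if (pid == 0)
				_exit(0); /* child exits immediately */
			if (pid > 0) {
				waitpid(pid, NULL, 0);
				count++;
			}
		}
		printf("forks in %lds: %ld\n", (long)duration, count);
		return 0;
	}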