On 6/5/23 22:03, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230605 03:59]:
>>
>> On 2023/6/5 14:18, Yin, Fengwei wrote:
>>>
>>> On 6/5/2023 12:41 PM, Yin Fengwei wrote:
>>>> Hi Peng,
>>>>
>>>> On 6/5/23 11:28, Peng Zhang wrote:
>>>>>
>>>>> On 2023/6/2 16:10, Yin, Fengwei wrote:
>>>>>> Hi Liam,
>>>>>>
>>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>>>>> Initial work on preallocations showed no regression in performance
>>>>>>> during testing, but recently some users (both on [1] and off [android]
>>>>>>> list) have reported that preallocating the worst-case number of nodes
>>>>>>> has caused some slowdown.  This patch set addresses the number of
>>>>>>> allocations in a few ways.
>>>>>>>
>>>>>>> Most munmap() operations will remove a single VMA, so leverage the
>>>>>>> fact that the maple tree can place a single pointer at range 0 - 0
>>>>>>> without allocating.  This is done by changing the index in the
>>>>>>> 'sidetree'.
>>>>>>>
>>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>>>>> intelligent guess of the node count can be made.
>>>>>>>
>>>>>>> Patches are in the following order:
>>>>>>> 0001-0002: Testing framework for benchmarking some operations
>>>>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>>>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>>>>>> 0006-0013: mas_preallocate() calculation change
>>>>>>> 0014:      Change the vma iterator order
>>>>>> I ran the AIM:page_test benchmark on an Ice Lake 48C/96T + 192G RAM
>>>>>> platform with this patchset.
>>>>>>
>>>>>> The result shows a small improvement:
>>>>>> Base (next-20230602):
>>>>>>   503880
>>>>>> Base with this patchset:
>>>>>>   519501
>>>>>>
>>>>>> But both are far from the non-regression result (commit 7be1c1a3c7b1):
>>>>>>   718080
>>>>>>
>>>>>> Some other information I collected:
>>>>>> With Base, mas_alloc_nodes() is always hit with request: 7.
>>>>>> With this patchset, the requests are 1 or 5.
>>>>>>
>>>>>> I suppose this is the reason for the improvement from 503880 to 519501.
>>>>>>
>>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags() never
>>>>>> triggered a mas_alloc_nodes() call.  Thanks.
>>>>> Hi Fengwei,
>>>>>
>>>>> I think it may be related to the inaccurate number of nodes allocated
>>>>> in the pre-allocation.  I slightly modified the pre-allocation in this
>>>>> patchset, but I don't know if it works.  It would be great if you could
>>>>> help test it and help pinpoint the cause.  Below is the diff, which can
>>>>> be applied on top of this patchset.
>>>> I tried the patch.  It eliminates the calls to mas_alloc_nodes() during
>>>> the test, but the benchmark result only improved a little bit:
>>>>   529040
>>>>
>>>> It's still much less than the non-regression result.  I will also
>>>> double-check the non-regression result.
>>> Just noticed that commit f5715584af95 gives validate_mm() two
>>> implementations selected by CONFIG_DEBUG_VM instead of
>>> CONFIG_DEBUG_VM_MAPLE_TREE.  I have CONFIG_DEBUG_VM but not
>>> CONFIG_DEBUG_VM_MAPLE_TREE defined.  So it's not an apples-to-apples
>>> comparison.
>
> You mean "mm: update validate_mm() to use vma iterator" here I guess.  I
> have it as a different commit id in my branch.

Yes. Commit f5715584af95 was from next-20230602. I will include the subject
of the patch in the future to avoid such confusion.

>
> I 'restored' some of the checking because I was able to work around not
> having the mt_dump() definition with the vma iterator.
> I'm now wondering how widely CONFIG_DEBUG_VM is used and whether I
> should not have added these extra checks.

No worries here. The problem is in my local config. I enabled CONFIG_DEBUG_VM
to capture VM asserts/warnings. It should be disabled for performance testing;
LKP does disable it for performance checks.

Regards
Yin, Fengwei

>
>>>
>>> I disabled CONFIG_DEBUG_VM, re-ran the test and got:
>>> Before the preallocation change (7be1c1a3c7b1):
>>>   770100
>>> After the preallocation change (28c5609fb236):
>>>   680000
>>> With Liam's fix:
>>>   702100
>>> Plus Peng's fix:
>>>   725900
>> Thank you for your test.  Now it seems that the performance
>> regression is not so big.
>
> We are also too strict on the reset during mas_store_prealloc() when
> checking for a spanning write.  I have a fix for this for v2 of the
> patch set, although I suspect it will not make a huge difference.
>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>> Thanks,
>>>>> Peng
>>>>>
>>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>>> index 5ea211c3f186..e67bf2744384 100644
>>>>> --- a/lib/maple_tree.c
>>>>> +++ b/lib/maple_tree.c
>>>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>                  goto ask_now;
>>>>>          }
>>>>>
>>>>> -        /* New root needs a singe node */
>>>>> -        if (unlikely(mte_is_root(mas->node)))
>>>>> -                goto ask_now;
>
> Why did you drop this?  If we are creating a new root we will only need
> one node.
>
>>>>> +        if ((node_size == wr_mas.node_end + 1 &&
>>>>> +             mas->offset == wr_mas.node_end) ||
>>>>> +            (node_size == wr_mas.node_end &&
>>>>> +             wr_mas.offset_end - mas->offset == 1))
>>>>> +                return 0;
>
> I will add this to v2 as well, or something similar.
>
>>>>>
>>>>>          /* Potential spanning rebalance collapsing a node, use worst-case */
>>>>>          if (node_size - 1 <= mt_min_slots[wr_mas.type])
>>>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>          if (likely(!mas_is_err(mas)))
>>>>>                  return 0;
>>>>>
>>>>> -        mas_set_alloc_req(mas, 0);
>
> Why did you drop this?  It seems like a worthwhile cleanup on failure.
>
>>>>>          ret = xa_err(mas->node);
>>>>>          mas_reset(mas);
>>>>>          mas_destroy(mas);
>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yin, Fengwei
>>>>>>
>>>>>>> [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@xxxxxxxxx/
>>>>>>>
>>>>>>> Liam R. Howlett (14):
>>>>>>>   maple_tree: Add benchmarking for mas_for_each
>>>>>>>   maple_tree: Add benchmarking for mas_prev()
>>>>>>>   mm: Move unmap_vmas() declaration to internal header
>>>>>>>   mm: Change do_vmi_align_munmap() side tree index
>>>>>>>   mm: Remove prev check from do_vmi_align_munmap()
>>>>>>>   maple_tree: Introduce __mas_set_range()
>>>>>>>   mm: Remove re-walk from mmap_region()
>>>>>>>   maple_tree: Re-introduce entry to mas_preallocate() arguments
>>>>>>>   mm: Use vma_iter_clear_gfp() in nommu
>>>>>>>   mm: Set up vma iterator for vma_iter_prealloc() calls
>>>>>>>   maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>>>>>>   maple_tree: Update mas_preallocate() testing
>>>>>>>   maple_tree: Refine mas_preallocate() node calculations
>>>>>>>   mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>>>>>
>>>>>>>  fs/exec.c                        |   1 +
>>>>>>>  include/linux/maple_tree.h       |  23 ++++-
>>>>>>>  include/linux/mm.h               |   4 -
>>>>>>>  lib/maple_tree.c                 |  78 ++++++++++----
>>>>>>>  lib/test_maple_tree.c            |  74 +++++++++++++
>>>>>>>  mm/internal.h                    |  40 ++++++--
>>>>>>>  mm/memory.c                      |  16 ++-
>>>>>>>  mm/mmap.c                        | 171 ++++++++++++++++---------------
>>>>>>>  mm/nommu.c                       |  45 ++++----
>>>>>>>  tools/testing/radix-tree/maple.c |  59 ++++++-----
>>>>>>>  10 files changed, 331 insertions(+), 180 deletions(-)
>>>>>>>
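
For anyone following along without the series applied, here is a minimal
sketch of the two ideas discussed above. It is not code from the patch set:
side_tree, stash_single_entry() and store_range_prealloc() are made-up names
for illustration, and mas_preallocate() is shown with the three-argument form
this series re-introduces.

#include <linux/maple_tree.h>
#include <linux/gfp.h>

/* Hypothetical stand-in for the side tree that collects detached VMAs. */
static DEFINE_MTREE(side_tree);

/*
 * Single-VMA case: storing one pointer at range 0 - 0 in an empty maple
 * tree places it directly in the root slot, so no maple node has to be
 * allocated for the common munmap() path.
 */
static int stash_single_entry(void *entry)
{
        return mtree_store(&side_tree, 0, entry, GFP_KERNEL);
}

/*
 * Preallocate-then-store pattern: passing the new entry to mas_preallocate()
 * lets the tree walk to the target range and estimate how many nodes this
 * particular write needs instead of reserving the worst case.  This sketch
 * holds the tree's own spinlock across both calls, so it uses a non-sleeping
 * GFP; mm instead serializes writers with the mmap write lock and can use
 * GFP_KERNEL.
 */
static int store_range_prealloc(struct maple_tree *mt, unsigned long first,
                                unsigned long last, void *entry)
{
        MA_STATE(mas, mt, first, last);
        int ret;

        mtree_lock(mt);
        ret = mas_preallocate(&mas, entry, GFP_NOWAIT | __GFP_NOWARN);
        if (!ret)
                mas_store_prealloc(&mas, entry);  /* consumes the nodes */
        mtree_unlock(mt);

        return ret;
}

The second helper only shows the calling convention being tuned in this
thread; how many nodes the estimate actually reserves is exactly what the
mas_preallocate() changes above are about.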