On 11/24/2023 12:05 AM, Yin, Fengwei wrote:
>
> On 11/20/2023 9:48 PM, kernel test robot wrote:
>>
>> hi, Fengwei,
>>
>> we noticed c8be038067 is the fix commit for
>> de74976eb65151a2f568e477fc2e0032df5b22b4 ("filemap: add filemap_map_folio_range()")
>>
>> and we captured numbers of improvement for this commit
>> (refer to the below
>> "In addition to that, the commit also has significant impact on the following tests"
>> part, which includes several examples)
>>
>> however, recently, we found a regression as the title mentions.
>
> I can reproduce the regression on an Ice Lake platform with 256G memory + 72C/144T.
>
>>
>> the extra information we want to share is that we also tested on mainline tip when
>> this bisect was done, and noticed the regression disappear:
>
> I can also reproduce this "regression disappears on latest mainline". And I found
> the related commit is:
>
> commit f5617ffeb450f84c57f7eba1a3524a29955d42b7
> Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Date:   Mon Jul 24 19:54:08 2023 +0100
>
>     mm: run the fault-around code under the VMA lock
>
> With this commit, map_pages() can be called twice: the first time with the VMA
> lock held, the second time with mmap_lock held (and the second call sets no PTE
> because of !pte_none()).
>
> With this commit reverted, the regression is mitigated to some extent.
>
>
> And the reason the "regression disappears on latest mainline" is:
>
> commit 12214eba1992642eee5813a9cc9f626e5b2d1815 (test6)
> Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Date:   Fri Oct 6 20:53:17 2023 +0100
>
>     mm: handle read faults under the VMA lock
>
>
> This commit eliminates the second call of map_pages(), and the regression is
> mitigated to some extent.
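To make the double map_pages() call concrete, here is simplified pseudocode of the do_read_fault() path as I read it after f5617ffeb450; this is a sketch from memory, not the literal kernel source:

```
/*
 * Simplified pseudocode of do_read_fault() after f5617ffeb450
 * (not the literal kernel code):
 */
do_read_fault(vmf):
    if (should_fault_around(vmf)):
        ret = do_fault_around(vmf)     /* -> ->map_pages(), 1st call,
                                          possibly under the VMA lock */
        if (ret)
            return ret
    if (vmf->flags & FAULT_FLAG_VMA_LOCK):
        vma_end_read(vmf->vma)
        return VM_FAULT_RETRY          /* the retry takes mmap_lock and
                                          re-enters this path, so
                                          ->map_pages() runs a 2nd time;
                                          it finds !pte_none() PTEs and
                                          sets nothing */
    ret = __do_fault(vmf)
    ...
```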
>
> We may still need to move the following code block before the fault-around:
>
> 	ret = vmf_can_call_fault(vmf);
> 	if (ret)
> 		return ret;
>
>>
>> # extra tests on head commit of linus/master
>> # good: [9bacdd8996c77c42ca004440be610692275ff9d0] Merge tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
>>
>> the data is even better than the parent:
>>
>>   "vm-scalability.throughput": [
>>     54122867,
>>     58465096,
>>     55888729,
>>     56960948,
>>     56385702,
>>     56392203
>>   ],
>>
>> and we also reverted c8be038067 on mainline tip, but found no big diff:
>>
>> # extra tests on revert first bad commit
>> # good: [f13a82be4c3252dabd1328a437c309d84ec7a872] Revert "filemap: add filemap_map_order0_folio() to handle order0 folio"
>>
>>   "vm-scalability.throughput": [
>>     56434337,
>>     56199754,
>>     56214041,
>>     55308070,
>>     55401115,
>>     55709753
>>   ],
>>
>>
>> commit:
>>   578d7699e5 ("proc: nommu: /proc/<pid>/maps: release mmap read lock")
>>   c8be038067 ("filemap: add filemap_map_order0_folio() to handle order0 folio")
>>
>> 578d7699e5c2add8 c8be03806738c86521dbf1e0503
>> ---------------- ---------------------------
>>          %stddev     %change         %stddev
>>              \          |                \
>>     146.95 ±  8%     -83.0%      24.99 ±  3%  vm-scalability.free_time
>>     233050           -28.1%     167484        vm-scalability.median
>>     590.30 ± 12%    -590.2        0.06 ± 45%  vm-scalability.stddev%
>>   51589606           -27.3%   37516397        vm-scalability.throughput
>
> I found very interesting behavior:
>
> 1. I am sure filemap_map_order0_folio() is faster than filemap_map_folio_range()
>    if the folio has order 0 (always true for shmem for now).
>
> 2. If I use an ebpf trace tool to count how many times filemap_map_order0_folio()
>    is called while vm-scalability is running, the test result becomes better. In
>    general, adding tracing overhead should make the result worse.
>
>    It looks like a slower filemap_map_order0_folio() can produce a better
>    vm-scalability result. I did another test, adding a 2us delay in
>    filemap_map_order0_folio(), and the vm-scalability result improved a little.
>
> 3.
> Using perf with 578d7699e5 and c8be038067:
>
> for do_read_fault() with 578d7699e5:
> - 48.58%  0.04%  usemem  [kernel.vmlinux]  [k] do_read_fault
>    - 48.54% do_read_fault
>       - 44.34% filemap_map_pages
>           19.45% filemap_map_folio_range
>            3.22% next_uptodate_folio
>          + 1.72% folio_wake_bit
>       + 3.29% __do_fault
>         0.65% folio_wake_bit
>
> for do_read_fault() with c8be038067:
> - 72.98%  0.09%  usemem  [kernel.vmlinux]  [k] do_read_fault
>    - 72.89% do_read_fault
>       - 52.70% filemap_map_pages
>           32.10% next_uptodate_folio    <----- much higher than 578d7699e5
>          + 12.35% folio_wake_bit
>          + 1.53% set_pte_range
>       + 12.43% __do_fault
>       + 6.36% folio_wake_bit            <----- higher than 578d7699e5
>       + 0.97% finish_fault
>
> My theory is that the faster filemap_map_order0_folio() brings more contention
> in next_uptodate_folio().
>
> 4. I finally located the part of the code that brought the higher contention in
>    next_uptodate_folio(). It's related to the following change:
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 582f5317ff71..056a2d2e2428 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3481,7 +3481,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct file *file = vma->vm_file;
>  	struct page *page = folio_page(folio, start);
> -	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
>  	unsigned int count = 0;
>  	pte_t *old_ptep = vmf->pte;
>
> @@ -3489,9 +3488,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>  		if (PageHWPoison(page + count))
>  			goto skip;
>
> -		if (mmap_miss > 0)
> -			mmap_miss--;
> -
>  		/*
>  		 * NOTE: If there're PTE markers, we'll leave them to be
>  		 * handled in the specific fault path, and it'll prohibit the
> @@ -3525,7 +3521,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>  	}
>
>  	vmf->pte = old_ptep;
> -	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>
>  	return ret;
>  }
>
> If I apply the above change to 578d7699e5, next_uptodate_folio() overhead rises.
> The perf command got:
>
> - 68.83%  0.08%  usemem  [kernel.vmlinux]  [k] do_read_fault
>    - 68.75% do_read_fault
>       - 49.34% filemap_map_pages
>           29.71% next_uptodate_folio
>          + 11.82% folio_wake_bit
>            2.34% filemap_map_folio_range
>       - 11.97% __do_fault
>          + 11.93% shmem_fault
>       + 6.12% folio_wake_bit
>       + 0.92% finish_fault
>
> And vm-scalability dropped to the same level as latest mainline.
>
> IMHO, we should slowdown filemap_map_order0_folio() because other benchmark can get benefit

Sorry for the typo. I meant "we should not slow down filemap_map_order0_folio()...".

> from faster filemap_map_order0_folio(). Thanks.
>
>
> Regards
> Yin, Fengwei
>