On 11/20/2023 9:48 PM, kernel test robot wrote:
>
> hi, Fengwei,
>
> we noticed c8be038067 is the fix commit for
> de74976eb65151a2f568e477fc2e0032df5b22b4 ("filemap: add filemap_map_folio_range()")
>
> and we captured numbers of improvement for this commit
> (refer to below
> "In addition to that, the commit also has significant impact on the following tests"
> part which includes several examples)
>
> however, recently, we found a regression as title mentioned.

I can reproduce the regression on an Ice Lake platform with 256G memory +
72C/144T.

> the extra information we want to share is we also tested on mainline tip when
> this bisect done, and noticed the regression disappear:

I can also reproduce this "regression disappears on latest mainline", and I
found the related commit is:

commit f5617ffeb450f84c57f7eba1a3524a29955d42b7
Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Date:   Mon Jul 24 19:54:08 2023 +0100

    mm: run the fault-around code under the VMA lock

With this commit, map_pages() can be called twice: the first time with the
VMA lock held, and the second time with mmap_lock (even though no PTE is set
because of !pte_none()). With this commit reverted, the regression is
restored to some level.

And the reason the "regression disappears on latest mainline" is related to
commit:

commit 12214eba1992642eee5813a9cc9f626e5b2d1815 (test6)
Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Date:   Fri Oct 6 20:53:17 2023 +0100

    mm: handle read faults under the VMA lock

This commit eliminates the second call of map_pages(), and the regression is
restored to some level.
We may still need to move the following code block before fault-around:

	ret = vmf_can_call_fault(vmf);
	if (ret)
		return ret;

> # extra tests on head commit of linus/master
> # good: [9bacdd8996c77c42ca004440be610692275ff9d0] Merge tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
>
> the data is even better than parent:
>
>   "vm-scalability.throughput": [
>     54122867,
>     58465096,
>     55888729,
>     56960948,
>     56385702,
>     56392203
>   ],
>
> and we also reverted c8be038067 on mainline tip, but found no big diff:
>
> # extra tests on revert first bad commit
> # good: [f13a82be4c3252dabd1328a437c309d84ec7a872] Revert "filemap: add filemap_map_order0_folio() to handle order0 folio"
>
>   "vm-scalability.throughput": [
>     56434337,
>     56199754,
>     56214041,
>     55308070,
>     55401115,
>     55709753
>   ],
>
>
> commit:
>   578d7699e5 ("proc: nommu: /proc/<pid>/maps: release mmap read lock")
>   c8be038067 ("filemap: add filemap_map_order0_folio() to handle order0 folio")
>
> 578d7699e5c2add8 c8be03806738c86521dbf1e0503
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>     146.95 ±  8%     -83.0%      24.99 ±  3%  vm-scalability.free_time
>     233050           -28.1%     167484        vm-scalability.median
>     590.30 ± 12%    -590.2        0.06 ± 45%  vm-scalability.stddev%
>   51589606           -27.3%   37516397        vm-scalability.throughput

I found very interesting behavior:

1. I am sure filemap_map_order0_folio() is faster than
   filemap_map_folio_range() when the folio has order 0 (true for shmem for
   now).

2. If I use the ebpf_trace tool to count how many times
   filemap_map_order0_folio() is called while vm-scalability is running, the
   test result becomes better. In general, the test result should become
   worse. It looks like a slower filemap_map_order0_folio() can produce a
   better vm-scalability result.

   I did another test by adding a 2us delay in filemap_map_order0_folio(),
   and the vm-scalability result improved a little bit.

3.
   Using perf with 578d7699e5 and c8be038067:

   for do_read_fault() with 578d7699e5:

   - 48.58%  0.04%  usemem  [kernel.vmlinux]  [k] do_read_fault
      - 48.54% do_read_fault
         - 44.34% filemap_map_pages
              19.45% filemap_map_folio_range
               3.22% next_uptodate_folio
            + 1.72% folio_wake_bit
         + 3.29% __do_fault
           0.65% folio_wake_bit

   for do_read_fault() with c8be038067:

   - 72.98%  0.09%  usemem  [kernel.vmlinux]  [k] do_read_fault
      - 72.89% do_read_fault
         - 52.70% filemap_map_pages
              32.10% next_uptodate_folio    <----- much higher than 578d7699e5
            + 12.35% folio_wake_bit
            + 1.53% set_pte_range
         + 12.43% __do_fault
         + 6.36% folio_wake_bit             <----- higher than 578d7699e5
         + 0.97% finish_fault

   My theory is that a faster filemap_map_order0_folio() brings more
   contention in next_uptodate_folio().

4. I finally located which part of the code brought the higher contention in
   next_uptodate_folio(). It's related to the following change:

diff --git a/mm/filemap.c b/mm/filemap.c
index 582f5317ff71..056a2d2e2428 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3481,7 +3481,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *file = vma->vm_file;
 	struct page *page = folio_page(folio, start);
-	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	unsigned int count = 0;
 	pte_t *old_ptep = vmf->pte;
 
@@ -3489,9 +3488,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		if (PageHWPoison(page + count))
 			goto skip;
 
-		if (mmap_miss > 0)
-			mmap_miss--;
-
 		/*
 		 * NOTE: If there're PTE markers, we'll leave them to be
 		 * handled in the specific fault path, and it'll prohibit the
@@ -3525,7 +3521,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	}
 
 	vmf->pte = old_ptep;
-	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
 	return ret;
 }

If I apply the above change to 578d7699e5, next_uptodate_folio() rises.
The perf command got:

   - 68.83%  0.08%  usemem  [kernel.vmlinux]  [k] do_read_fault
      - 68.75% do_read_fault
         - 49.34% filemap_map_pages
              29.71% next_uptodate_folio
            + 11.82% folio_wake_bit
            + 2.34% filemap_map_folio_range
         - 11.97% __do_fault
            + 11.93% shmem_fault
         + 6.12% folio_wake_bit
         + 0.92% finish_fault

And vm-scalability dropped to the same level as latest mainline.

IMHO, we should not slow down filemap_map_order0_folio(), because other
benchmarks can benefit from a faster filemap_map_order0_folio().

Thanks.


Regards
Yin, Fengwei