This series brings batched rmap removal to try_to_unmap_one(). Removing
the rmap for a batch of pages is expected to perform better than
removing the rmap one page at a time.

The series restructures try_to_unmap_one() from:

  loop:
        clear and update PTE
        unmap one page
        goto loop

to:

  loop:
        clear and update PTE
        goto loop
  unmap the range of folio in one call

This is one step toward always mapping/unmapping the entire folio in
one call, which can simplify folio mapcount handling by avoiding
per-page map/unmap accounting.

The changes are organized as follows:

Patch 1/2 move the hugetlb and normal page unmap to dedicated functions
to make the try_to_unmap_one() logic clearer and easier to extend with
batched rmap removal. To make code review easier, there is no
functional change.

Patch 3 cleans up try_to_unmap_one_page() and removes some duplicated
function calls.

Patch 4 adds folio_remove_rmap_range(), which removes the rmap for a
range of pages in one call.

Patch 5 makes try_to_unmap_one() remove the rmap in batches.

Functional testing was done with the v3 patchset in a qemu guest with
4G of memory:
  - kernel mm selftests to trigger vmscan() and finally hit
    try_to_unmap_one().
  - Injecting hwpoison into a hugetlb page to trigger
    try_to_unmap_one() for hugetlb.
  - 8 hours of stress testing: Firefox + kernel mm selftests + kernel
    build.

To demonstrate the performance gain, MADV_PAGEOUT was changed not to
split large folios for page cache, and a micro benchmark was created,
mainly as follows (a self-contained version of one instance is sketched
at the end of this letter):

  #define FILESIZE (2 * 1024 * 1024)

  char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE, fd, 0);

  count = 0;
  while (1) {
        unsigned long i;

        for (i = 0; i < FILESIZE; i += pgsize) {
                cc = *(volatile char *)(c + i);
        }
        madvise(c, FILESIZE, MADV_PAGEOUT);
        count++;
  }
  munmap(c, FILESIZE);

The benchmark was run with 96 instances on 96 files on an xfs file
system for 1 second. The test platform was Ice Lake with 48C/96T +
192G memory. The test result (total loop count) shows around a 7%
(58865 -> 63247) improvement with this patch series. perf shows the
following:

Without this series:

   18.26%--try_to_unmap_one
           |--10.71%--page_remove_rmap
           |            --9.81%--__mod_lruvec_page_state
           |                      |--1.36%--__mod_memcg_lruvec_state
           |                      |            --0.80%--cgroup_rstat_updated
           |                       --0.67%--__mod_lruvec_state
           |                                   --0.59%--__mod_node_page_state
           |--5.41%--ptep_clear_flush
           |            --4.64%--flush_tlb_mm_range
           |                       --3.88%--flush_tlb_func
           |                                  --3.56%--native_flush_tlb_one_user
           |--0.75%--percpu_counter_add_batch
            --0.53%--PageHeadHuge

With this series:

    9.87%--try_to_unmap_one
           |--7.14%--try_to_unmap_one_page.constprop.0.isra.0
           |           |--5.21%--ptep_clear_flush
           |           |            --4.36%--flush_tlb_mm_range
           |           |                       --3.54%--flush_tlb_func
           |           |                                  --3.17%--native_flush_tlb_one_user
           |            --0.82%--percpu_counter_add_batch
           |--1.18%--folio_remove_rmap_and_update_count.part.0
           |            --1.11%--folio_remove_rmap_range
           |                       --0.53%--__mod_lruvec_page_state
            --0.57%--PageHeadHuge

As expected, the cost of __mod_lruvec_page_state is reduced
significantly with the batched folio_remove_rmap_range(). Presumably
the page reclaim path gets the same benefit as well.

This series is based on next-20230310.
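For reference, a self-contained build of one benchmark instance could
look like the sketch below. Only the inner loop comes from the fragment
above; the file path argument, the O_RDWR open, taking pgsize from
sysconf(), and the 1-second SIGALRM cutoff are assumptions, and the
harness that launches 96 instances on 96 files is not shown.

  /*
   * Hypothetical standalone build of one benchmark instance. The target
   * file is expected to already be at least FILESIZE bytes long.
   *
   * Build: gcc -O2 -o pageout_bench pageout_bench.c
   */
  #include <fcntl.h>
  #include <signal.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef MADV_PAGEOUT
  #define MADV_PAGEOUT 21         /* added in Linux 5.4 */
  #endif

  #define FILESIZE (2 * 1024 * 1024)

  static volatile sig_atomic_t stop;

  static void on_alarm(int sig)
  {
          (void)sig;
          stop = 1;
  }

  int main(int argc, char **argv)
  {
          const char *path = argc > 1 ? argv[1] : "testfile"; /* assumed */
          long pgsize = sysconf(_SC_PAGESIZE);
          unsigned long count = 0;
          volatile char cc;
          char *c;
          int fd;

          fd = open(path, O_RDWR);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          c = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
          if (c == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          signal(SIGALRM, on_alarm);
          alarm(1);               /* run for roughly 1 second */

          while (!stop) {
                  unsigned long i;

                  /* Touch every page so the whole range is mapped ... */
                  for (i = 0; i < FILESIZE; i += pgsize)
                          cc = *(volatile char *)(c + i);

                  /* ... then ask the kernel to reclaim it, which ends up
                   * in try_to_unmap_one() for each mapped folio. */
                  madvise(c, FILESIZE, MADV_PAGEOUT);
                  count++;
          }

          munmap(c, FILESIZE);
          close(fd);
          printf("%lu loops\n", count);
          return 0;
  }

Each instance would be pointed at its own pre-created 2MB file on the
xfs mount, e.g. ./pageout_bench /mnt/xfs/file42, and the per-instance
loop counts summed to get the reported number.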
Changes from v3:
  - General
    - Rebase to next-20230310
    - Add performance testing results
  - Patch 1
    - Fixed incorrect comments as Mike Kravetz pointed out
    - Use huge_pte_dirty() as Mike Kravetz suggested
    - Use true instead of folio_test_hugetlb() in
      try_to_unmap_one_hugetlb() as it's a hugetlb page for sure, as
      Mike Kravetz suggested

Changes from v2:
  - General
    - Rebase the patch to next-20230303
    - Update the cover letter about the preparation to unmap the
      entire folio in one call
    - No code change compared to v2, but fix the patch-apply conflict
      caused by the wrong patch order in v2

Changes from v1:
  - General
    - Rebase the patch to next-20230228
  - Patch 1
    - Removed the "if (PageHWPoison(page) && !(flags & TTU_HWPOISON))"
      check as suggested by Mike Kravetz and HORIGUCHI NAOYA
    - Removed the mlock_drain_local() as suggested by Mike Kravetz
    - Removed the comments about the mm counter change as suggested by
      Mike Kravetz

Yin Fengwei (5):
  rmap: move hugetlb try_to_unmap to dedicated function
  rmap: move page unmap operation to dedicated function
  rmap: cleanup exit path of try_to_unmap_one_page()
  rmap: add folio_remove_rmap_range()
  try_to_unmap_one: batched remove rmap, update folio refcount

 include/linux/rmap.h |   5 +
 mm/page_vma_mapped.c |  30 +++
 mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
 3 files changed, 398 insertions(+), 260 deletions(-)

-- 
2.30.2