[PATCH v4 0/5] batched remove rmap in try_to_unmap_one()

This series brings batched rmap removal to try_to_unmap_one().
Removing the rmap for a whole range at once is expected to perform
better than removing the rmap page by page.

This series restructures try_to_unmap_one() from:
  loop:
     clear and update PTE
     unmap one page
     goto loop
to:
  loop:
     clear and update PTE
     goto loop
  unmap the range of the folio in one call
This is one step toward always mapping/unmapping an entire folio in
one call, which can simplify folio mapcount handling by avoiding
per-page map/unmap accounting.


The changes are organized as follows:

Patch1/2 move the hugetlb and normal page unmap paths to dedicated
functions, to make the try_to_unmap_one() logic clearer and easier
to extend with batched rmap removal. To ease code review, there is
no functional change.

Patch3 cleans up try_to_unmap_one_page() and removes some duplicated
function calls.

Patch4 adds folio_remove_rmap_range(), which removes the rmap for a
range of pages in one batch.

Patch5 makes try_to_unmap_one() use the batched rmap removal.
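
For reference, the shape of the interface added by Patch4 is sketched
below; the exact parameter names and types are assumptions based on
the description above, not necessarily the final form in the patch:

        /*
         * Remove the rmap entries for nr_pages contiguous pages of a
         * folio, starting at page, mapped into vma, in one batch
         * instead of calling page_remove_rmap() once per page.
         */
        void folio_remove_rmap_range(struct folio *folio, struct page *page,
                                     unsigned int nr_pages,
                                     struct vm_area_struct *vma);

Batching lets the lruvec/memcg statistics be updated once per range
instead of once per page, which is where the profile below shows the
savings.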

Functional testing was done with the v3 patchset in a qemu guest
with 4G memory:
  - kernel mm selftests, to trigger vmscan() and eventually hit
    try_to_unmap_one().
  - hwpoison injection into a hugetlb page, to trigger the
    try_to_unmap_one() call against hugetlb.
  - 8 hours of stress testing: Firefox + kernel mm selftests +
    kernel build.

To demonstrate the performance gain, MADV_PAGEOUT was changed not to
split large folios for page cache, and a micro benchmark along the
following lines was used:

        #define FILESIZE (2 * 1024 * 1024)
        /* fd is an open file descriptor; pgsize is the page size */
        char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        volatile char cc;
        unsigned long count = 0;

        while (1) {
                unsigned long i;

                /* touch every page to fault the file in */
                for (i = 0; i < FILESIZE; i += pgsize)
                        cc = *(volatile char *)(c + i);
                /* then ask the kernel to reclaim the range */
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }
        munmap(c, FILESIZE);

It was run with 96 instances + 96 files on an xfs file system for
1 second. The test platform was IceLake with 48C/96T + 192G memory.

The test result (loop count) shows around a 7% (58865 -> 63247)
improvement with this patch series. perf shows the following:

Without this series:
18.26%--try_to_unmap_one
        |          
        |--10.71%--page_remove_rmap
        |          |          
        |           --9.81%--__mod_lruvec_page_state
        |                     |          
        |                     |--1.36%--__mod_memcg_lruvec_state
        |                     |          |          
        |                     |           --0.80%--cgroup_rstat_updated
        |                     |          
        |                      --0.67%--__mod_lruvec_state
        |                                |          
        |                                 --0.59%--__mod_node_page_state
        |          
        |--5.41%--ptep_clear_flush
        |          |          
        |           --4.64%--flush_tlb_mm_range
        |                     |          
        |                      --3.88%--flush_tlb_func
        |                                |          
        |                                 --3.56%--native_flush_tlb_one_user
        |          
        |--0.75%--percpu_counter_add_batch
        |          
         --0.53%--PageHeadHuge

With this series:
9.87%--try_to_unmap_one
        |          
        |--7.14%--try_to_unmap_one_page.constprop.0.isra.0
        |          |          
        |          |--5.21%--ptep_clear_flush
        |          |          |          
        |          |           --4.36%--flush_tlb_mm_range
        |          |                     |          
        |          |                      --3.54%--flush_tlb_func
        |          |                                |          
        |          |                                 --3.17%--native_flush_tlb_one_user
        |          |          
        |           --0.82%--percpu_counter_add_batch
        |          
        |--1.18%--folio_remove_rmap_and_update_count.part.0
        |          |          
        |           --1.11%--folio_remove_rmap_range
        |                     |          
        |                      --0.53%--__mod_lruvec_page_state
        |          
         --0.57%--PageHeadHuge

As expected, the cost of __mod_lruvec_page_state is reduced
significantly with the batched folio_remove_rmap_range(). The page
reclaim path should see the same benefit.


This series is based on next-20230310.

Changes from v3:
  - General
    - Rebase to next-20230310
    - Add performance testing result

  - Patch1
    - Fixed incorrect comments, as Mike Kravetz pointed out
    - Use huge_pte_dirty(), as Mike Kravetz suggested
    - Use true instead of folio_test_hugetlb() in
      try_to_unmap_one_hugetlb(), as it is certainly a hugetlb
      page there, as Mike Kravetz suggested

Changes from v2:
  - General
    - Rebase the patch to next-20230303
    - Updated the cover letter about the preparation to unmap
      the entire folio in one call
    - No code change compared to v2, but fixed the patch-apply
      conflict caused by wrong patch order in v2

Changes from v1:
  - General
    - Rebase the patch to next-20230228

  - Patch1
    - Removed the if (PageHWPoison(page) && !(flags & TTU_HWPOISON))
      check, as suggested by Mike Kravetz and HORIGUCHI NAOYA
    - Removed the mlock_drain_local(), as suggested by Mike Kravetz
    - Removed the comments about the mm counter change, as suggested
      by Mike Kravetz

Yin Fengwei (5):
  rmap: move hugetlb try_to_unmap to dedicated function
  rmap: move page unmap operation to dedicated function
  rmap: cleanup exit path of try_to_unmap_one_page()
  rmap: add folio_remove_rmap_range()
  try_to_unmap_one: batched remove rmap, update folio refcount

 include/linux/rmap.h |   5 +
 mm/page_vma_mapped.c |  30 +++
 mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
 3 files changed, 398 insertions(+), 260 deletions(-)

-- 
2.30.2




