On 3/7/2023 5:12 AM, Andrew Morton wrote:
> On Mon, 6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>> This series is trying to bring batched rmap removing to
>> try_to_unmap_one(). It's expected that batched rmap removing
>> brings a performance gain over removing the rmap per page.
>>
>> ...
>>
>>  include/linux/rmap.h |   5 +
>>  mm/page_vma_mapped.c |  30 +++
>>  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
>>  3 files changed, 398 insertions(+), 260 deletions(-)
>
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?
>
I changed MADV_PAGEOUT not to split the large folio for page cache and
created a micro benchmark that mainly does the following:

        char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);

        count = 0;
        while (1) {
                unsigned long i;

                for (i = 0; i < FILESIZE; i += pgsize) {
                        cc = *(volatile char *)(c + i);
                }
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }
        munmap(c, FILESIZE);

I ran it with 96 instances + 96 files for 1 second. The test platform
was an Ice Lake machine with 48C/96T + 192G memory. The test result
(iteration count) shows a 10% improvement with this patch series.

perf shows the following:

Before the patch:
   --19.97%--try_to_unmap_one
             |
             |--12.35%--page_remove_rmap
             |          |
             |           --11.39%--__mod_lruvec_page_state
             |                     |
             |                     |--1.51%--__mod_memcg_lruvec_state
             |                     |          |
             |                     |           --0.91%--cgroup_rstat_updated
             |                     |
             |                      --0.70%--__mod_lruvec_state
             |                                |
             |                                 --0.63%--__mod_node_page_state
             |
             |--5.41%--ptep_clear_flush
             |          |
             |           --4.65%--flush_tlb_mm_range
             |                     |
             |                      --3.83%--flush_tlb_func
             |                                |
             |                                 --3.51%--native_flush_tlb_one_user
             |
             |--0.75%--percpu_counter_add_batch
             |
              --0.55%--PageHeadHuge

After the patch:
   --9.50%--try_to_unmap_one
            |
            |--6.94%--try_to_unmap_one_page.constprop.0.isra.0
            |          |
            |          |--5.07%--ptep_clear_flush
            |          |          |
            |          |           --4.25%--flush_tlb_mm_range
            |          |                     |
            |          |                      --3.44%--flush_tlb_func
            |          |                                |
            |          |                                 --3.05%--native_flush_tlb_one_user
            |          |
            |           --0.80%--percpu_counter_add_batch
            |
            |--1.22%--folio_remove_rmap_and_update_count.part.0
            |          |
            |           --1.16%--folio_remove_rmap_range
            |                     |
            |                      --0.62%--__mod_lruvec_page_state
            |
             --0.56%--PageHeadHuge

As expected, the cost of __mod_lruvec_page_state is reduced
significantly with the batched folio_remove_rmap_range. I believe the
same benefit applies to the page reclaim path as well.

Thanks.

Regards
Yin, Fengwei
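
For reference, below is a self-contained sketch of the micro benchmark
quoted above. The original snippet only shows the map/touch/MADV_PAGEOUT
loop; the file argument, the 1G FILESIZE, the O_RDWR open, and the
SIGALRM-based 1-second cutoff here are assumptions added to make it
compile and run, not details from the original report.

        /* pageout-bench.c: touch every page of a private file mapping,
         * then ask the kernel to page the whole range out again, and
         * count how many full passes complete before the alarm fires. */
        #include <fcntl.h>
        #include <signal.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef MADV_PAGEOUT
        #define MADV_PAGEOUT 21         /* from the Linux uapi headers */
        #endif

        #define FILESIZE (1UL << 30)    /* assumed: 1G per file */

        static volatile sig_atomic_t stop;

        static void alarm_handler(int sig)
        {
                stop = 1;
        }

        int main(int argc, char **argv)
        {
                long pgsize = sysconf(_SC_PAGESIZE);
                unsigned long count = 0;
                volatile char cc;
                char *c;
                int fd;

                if (argc < 2) {
                        fprintf(stderr, "usage: %s <file>\n", argv[0]);
                        return 1;
                }

                fd = open(argv[1], O_RDWR);     /* one pre-populated file per instance */
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
                if (c == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                signal(SIGALRM, alarm_handler);
                alarm(1);                       /* run for 1 second */

                while (!stop) {
                        unsigned long i;

                        /* Fault every page in, then push the range back out. */
                        for (i = 0; i < FILESIZE; i += pgsize)
                                cc = *(volatile char *)(c + i);
                        (void)cc;
                        madvise(c, FILESIZE, MADV_PAGEOUT);
                        count++;
                }

                printf("%lu iterations\n", count);
                munmap(c, FILESIZE);
                close(fd);
                return 0;
        }

One copy of this would be started per file (96 instances + 96 files in
the reported run), and the per-instance iteration counts summed to give
the "number count" compared before and after the series.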