Re: [PATCH 3/3] mm/lru_gen: Don't build multi-gen LRU page table walk code on architecture not supported

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 6/26/23 1:04 AM, Yu Zhao wrote:
> On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V
> <aneesh.kumar@xxxxxxxxxxxxx> wrote:
>>
>> Hi Yu Zhao,
>>
>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes:
>>
>>> Not all architecture supports hardware atomic updates of access bits. On
>>> such an arch, we don't use page table walk to classify pages into
>>> generations. Add a kernel config option and remove adding all the page
>>> table walk code on such architecture.
>>>
>>> No preformance change observed with mongodb ycsb test:
>>>
>>> Patch details        Throughput(Ops/sec)
>>> without patch         93278
>>> With patch            93400
>>>
>>> Without patch:
>>> $ size mm/vmscan.o
>>>    text    data     bss     dec     hex filename
>>>  112102   42721      40  154863   25cef mm/vmscan.o
>>>
>>> With patch
>>>
>>> $ size  mm/vmscan.o
>>>    text    data     bss     dec     hex filename
>>>  105430   41333      24  146787   23d63 mm/vmscan.o
>>>
>>
>> Any feedback on this patch? Can we look at merging this change?
> 
> Just want to make sure I fully understand the motivation: are there
> any other end goals besides reducing the footprint mentioned above?
> E.g., preparing for HCA, etc. (My current understanding is that HCA
> shouldn't care about it, since it's already runtime disabled if HCA
> doesn't want to use it.)
> 

My goal with this change was to remove all those dead code from getting complied
in for ppc64. 


> Also as explained offline, solely relying on folio_activate() in
> lru_gen_look_around() can cause a measure regression on powerpc,
> because
> 1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is
> virtually unlimited.
> 2. Once folio_activate() reaches that limit, it takes the LRU lock on
> top of the PTL, which can be shared by multiple page tables on
> powerpc.
> 
> In fact, I think we try the opposite direction first, before arriving
> at any conclusions, i.e.,
>     #define arch_has_hw_pte_young() radix_enabled()

The reason it is disabled on powerpc was that a reference bit update takes a pagefault
on powerpc irrespective of the translation mode. 

> on powerpc. This might benefit platforms with the A-bit but not HCA,
> e.g., POWER9. I just ran a quick test (memcached/memtier I previously
> shared with you) and it showed far less PTL contention in kswapd. I'm
> attaching the flamegraphs for you to analyze. Could you try some
> benchmarks with the above change on your end as well?
> 

The ptl lock is a valid concern even though i didn't observe the contention increasing with
the change. I will rerun the test to verify. We have possibly two options here

1) Delay the lruvec->nr_pages update until the sort phase. But as you explained earlier, that
can impact should_run_aging().


2) Add another batching mechanism similar to pglist_data->mm_walk which can be used on architecture
that don't support hardware update of access/reference bit to be used only by lru_gen_look_around()

-aneesh





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux