On 6/26/23 1:04 AM, Yu Zhao wrote: > On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V > <aneesh.kumar@xxxxxxxxxxxxx> wrote: >> >> Hi Yu Zhao, >> >> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >> >>> Not all architecture supports hardware atomic updates of access bits. On >>> such an arch, we don't use page table walk to classify pages into >>> generations. Add a kernel config option and remove adding all the page >>> table walk code on such architecture. >>> >>> No preformance change observed with mongodb ycsb test: >>> >>> Patch details Throughput(Ops/sec) >>> without patch 93278 >>> With patch 93400 >>> >>> Without patch: >>> $ size mm/vmscan.o >>> text data bss dec hex filename >>> 112102 42721 40 154863 25cef mm/vmscan.o >>> >>> With patch >>> >>> $ size mm/vmscan.o >>> text data bss dec hex filename >>> 105430 41333 24 146787 23d63 mm/vmscan.o >>> >> >> Any feedback on this patch? Can we look at merging this change? > > Just want to make sure I fully understand the motivation: are there > any other end goals besides reducing the footprint mentioned above? > E.g., preparing for HCA, etc. (My current understanding is that HCA > shouldn't care about it, since it's already runtime disabled if HCA > doesn't want to use it.) > My goal with this change was to remove all those dead code from getting complied in for ppc64. > Also as explained offline, solely relying on folio_activate() in > lru_gen_look_around() can cause a measure regression on powerpc, > because > 1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is > virtually unlimited. > 2. Once folio_activate() reaches that limit, it takes the LRU lock on > top of the PTL, which can be shared by multiple page tables on > powerpc. > > In fact, I think we try the opposite direction first, before arriving > at any conclusions, i.e., > #define arch_has_hw_pte_young() radix_enabled() The reason it is disabled on powerpc was that a reference bit update takes a pagefault on powerpc irrespective of the translation mode. > on powerpc. This might benefit platforms with the A-bit but not HCA, > e.g., POWER9. I just ran a quick test (memcached/memtier I previously > shared with you) and it showed far less PTL contention in kswapd. I'm > attaching the flamegraphs for you to analyze. Could you try some > benchmarks with the above change on your end as well? > The ptl lock is a valid concern even though i didn't observe the contention increasing with the change. I will rerun the test to verify. We have possibly two options here 1) Delay the lruvec->nr_pages update until the sort phase. But as you explained earlier, that can impact should_run_aging(). 2) Add another batching mechanism similar to pglist_data->mm_walk which can be used on architecture that don't support hardware update of access/reference bit to be used only by lru_gen_look_around() -aneesh