On Mon, Jun 26, 2023 at 4:52 AM Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> wrote: > > On 6/26/23 1:04 AM, Yu Zhao wrote: > > On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V > > <aneesh.kumar@xxxxxxxxxxxxx> wrote: > >> > >> Hi Yu Zhao, > >> > >> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: > >> > >>> Not all architecture supports hardware atomic updates of access bits. On > >>> such an arch, we don't use page table walk to classify pages into > >>> generations. Add a kernel config option and remove adding all the page > >>> table walk code on such architecture. > >>> > >>> No preformance change observed with mongodb ycsb test: > >>> > >>> Patch details Throughput(Ops/sec) > >>> without patch 93278 > >>> With patch 93400 > >>> > >>> Without patch: > >>> $ size mm/vmscan.o > >>> text data bss dec hex filename > >>> 112102 42721 40 154863 25cef mm/vmscan.o > >>> > >>> With patch > >>> > >>> $ size mm/vmscan.o > >>> text data bss dec hex filename > >>> 105430 41333 24 146787 23d63 mm/vmscan.o > >>> > >> > >> Any feedback on this patch? Can we look at merging this change? > > > > Just want to make sure I fully understand the motivation: are there > > any other end goals besides reducing the footprint mentioned above? > > E.g., preparing for HCA, etc. (My current understanding is that HCA > > shouldn't care about it, since it's already runtime disabled if HCA > > doesn't want to use it.) > > > > My goal with this change was to remove all those dead code from getting complied > in for ppc64. I see. But the first thing (lru_gen_add_folio()) you moved has nothing to do with this goal, because it's still compiled after the entire series. > > Also as explained offline, solely relying on folio_activate() in > > lru_gen_look_around() can cause a measure regression on powerpc, > > because > > 1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is > > virtually unlimited. > > 2. Once folio_activate() reaches that limit, it takes the LRU lock on > > top of the PTL, which can be shared by multiple page tables on > > powerpc. > > > > In fact, I think we try the opposite direction first, before arriving > > at any conclusions, i.e., > > #define arch_has_hw_pte_young() radix_enabled() > > The reason it is disabled on powerpc was that a reference bit update takes a pagefault > on powerpc irrespective of the translation mode. This is not true. >From "IBM POWER9 Processor User Manual": https://openpowerfoundation.org/resources/ibmpower9usermanual/ 4.10.14 Reference and Change Bits ... When performing HPT translation, the hardware performs the R and C bit updates nonatomically. ... The radix case is more complex, and I'll leave it to you to interpret what it means: >From "Power ISA Version 3.0 B": https://openpowerfoundation.org/specifications/isa/ 5.7.12 Reference and Change Recording ... For Radix Tree translation, the Reference and Change bits are set atomically. ... > > on powerpc. This might benefit platforms with the A-bit but not HCA, > > e.g., POWER9. I just ran a quick test (memcached/memtier I previously > > shared with you) and it showed far less PTL contention in kswapd. I'm > > attaching the flamegraphs for you to analyze. Could you try some > > benchmarks with the above change on your end as well? > > > > The ptl lock is a valid concern even though i didn't observe the contention increasing with > the change. I will rerun the test to verify. We have possibly two options here > > 1) Delay the lruvec->nr_pages update until the sort phase. But as you explained earlier, that > can impact should_run_aging(). > > > 2) Add another batching mechanism similar to pglist_data->mm_walk which can be used on architecture > that don't support hardware update of access/reference bit to be used only by lru_gen_look_around() Sounds good. Thanks.