On 6/26/23 10:34 PM, Yu Zhao wrote: > On Mon, Jun 26, 2023 at 4:52 AM Aneesh Kumar K V > <aneesh.kumar@xxxxxxxxxxxxx> wrote: >> >> On 6/26/23 1:04 AM, Yu Zhao wrote: >>> On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V >>> <aneesh.kumar@xxxxxxxxxxxxx> wrote: >>>> >>>> Hi Yu Zhao, >>>> >>>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >>>> >>>>> Not all architecture supports hardware atomic updates of access bits. On >>>>> such an arch, we don't use page table walk to classify pages into >>>>> generations. Add a kernel config option and remove adding all the page >>>>> table walk code on such architecture. >>>>> >>>>> No preformance change observed with mongodb ycsb test: >>>>> >>>>> Patch details Throughput(Ops/sec) >>>>> without patch 93278 >>>>> With patch 93400 >>>>> >>>>> Without patch: >>>>> $ size mm/vmscan.o >>>>> text data bss dec hex filename >>>>> 112102 42721 40 154863 25cef mm/vmscan.o >>>>> >>>>> With patch >>>>> >>>>> $ size mm/vmscan.o >>>>> text data bss dec hex filename >>>>> 105430 41333 24 146787 23d63 mm/vmscan.o >>>>> >>>> >>>> Any feedback on this patch? Can we look at merging this change? >>> >>> Just want to make sure I fully understand the motivation: are there >>> any other end goals besides reducing the footprint mentioned above? >>> E.g., preparing for HCA, etc. (My current understanding is that HCA >>> shouldn't care about it, since it's already runtime disabled if HCA >>> doesn't want to use it.) >>> >> >> My goal with this change was to remove all those dead code from getting complied >> in for ppc64. > > I see. But the first thing (lru_gen_add_folio()) you moved has nothing > to do with this goal, because it's still compiled after the entire > series. > Sure. will drop that change. >>> Also as explained offline, solely relying on folio_activate() in >>> lru_gen_look_around() can cause a measure regression on powerpc, >>> because >>> 1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is >>> virtually unlimited. >>> 2. Once folio_activate() reaches that limit, it takes the LRU lock on >>> top of the PTL, which can be shared by multiple page tables on >>> powerpc. >>> >>> In fact, I think we try the opposite direction first, before arriving >>> at any conclusions, i.e., >>> #define arch_has_hw_pte_young() radix_enabled() >> >> The reason it is disabled on powerpc was that a reference bit update takes a pagefault >> on powerpc irrespective of the translation mode. > > This is not true. > > From "IBM POWER9 Processor User Manual": > https://openpowerfoundation.org/resources/ibmpower9usermanual/ > > 4.10.14 Reference and Change Bits > ... > When performing HPT translation, the hardware performs the R and C > bit updates nonatomically. > ... > > The radix case is more complex, and I'll leave it to you to interpret > what it means: > > From "Power ISA Version 3.0 B": > https://openpowerfoundation.org/specifications/isa/ > > 5.7.12 Reference and Change Recording > ... > For Radix Tree translation, the Reference and Change bits are set atomically. > ... > it is atomic in that software use ldarx/stdcx to update these bits. Hardware/core won't update this directly even though Nest can update this directly without taking a fault. So for all purpose we can assume that on radix R/C bit is updated by page fault handler. Generic page table update sequence are slightly different with hash translation in that some page table field updates requires marking the page table entry invalid. >>> on powerpc. This might benefit platforms with the A-bit but not HCA, >>> e.g., POWER9. I just ran a quick test (memcached/memtier I previously >>> shared with you) and it showed far less PTL contention in kswapd. I'm >>> attaching the flamegraphs for you to analyze. Could you try some >>> benchmarks with the above change on your end as well? >>> >> >> The ptl lock is a valid concern even though i didn't observe the contention increasing with >> the change. I will rerun the test to verify. We have possibly two options here >> >> 1) Delay the lruvec->nr_pages update until the sort phase. But as you explained earlier, that >> can impact should_run_aging(). >> >> >> 2) Add another batching mechanism similar to pglist_data->mm_walk which can be used on architecture >> that don't support hardware update of access/reference bit to be used only by lru_gen_look_around() > > Sounds good. Thanks. I will go ahead working on the approach 2 I outlined above? -aneesh