On Tue, Jun 27, 2023 at 5:48 AM Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> wrote: > > On 6/26/23 10:34 PM, Yu Zhao wrote: > > On Mon, Jun 26, 2023 at 4:52 AM Aneesh Kumar K V > > <aneesh.kumar@xxxxxxxxxxxxx> wrote: > >> > >> On 6/26/23 1:04 AM, Yu Zhao wrote: > >>> On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V > >>> <aneesh.kumar@xxxxxxxxxxxxx> wrote: > >>>> > >>>> Hi Yu Zhao, > >>>> > >>>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: > >>>> > >>>>> Not all architecture supports hardware atomic updates of access bits. On > >>>>> such an arch, we don't use page table walk to classify pages into > >>>>> generations. Add a kernel config option and remove adding all the page > >>>>> table walk code on such architecture. > >>>>> > >>>>> No preformance change observed with mongodb ycsb test: > >>>>> > >>>>> Patch details Throughput(Ops/sec) > >>>>> without patch 93278 > >>>>> With patch 93400 > >>>>> > >>>>> Without patch: > >>>>> $ size mm/vmscan.o > >>>>> text data bss dec hex filename > >>>>> 112102 42721 40 154863 25cef mm/vmscan.o > >>>>> > >>>>> With patch > >>>>> > >>>>> $ size mm/vmscan.o > >>>>> text data bss dec hex filename > >>>>> 105430 41333 24 146787 23d63 mm/vmscan.o > >>>>> > >>>> > >>>> Any feedback on this patch? Can we look at merging this change? > >>> > >>> Just want to make sure I fully understand the motivation: are there > >>> any other end goals besides reducing the footprint mentioned above? > >>> E.g., preparing for HCA, etc. (My current understanding is that HCA > >>> shouldn't care about it, since it's already runtime disabled if HCA > >>> doesn't want to use it.) > >>> > >> > >> My goal with this change was to remove all those dead code from getting complied > >> in for ppc64. > > > > I see. But the first thing (lru_gen_add_folio()) you moved has nothing > > to do with this goal, because it's still compiled after the entire > > series. > > > > Sure. will drop that change. > > >>> Also as explained offline, solely relying on folio_activate() in > >>> lru_gen_look_around() can cause a measure regression on powerpc, > >>> because > >>> 1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is > >>> virtually unlimited. > >>> 2. Once folio_activate() reaches that limit, it takes the LRU lock on > >>> top of the PTL, which can be shared by multiple page tables on > >>> powerpc. > >>> > >>> In fact, I think we try the opposite direction first, before arriving > >>> at any conclusions, i.e., > >>> #define arch_has_hw_pte_young() radix_enabled() > >> > >> The reason it is disabled on powerpc was that a reference bit update takes a pagefault > >> on powerpc irrespective of the translation mode. > > > > This is not true. > > > > From "IBM POWER9 Processor User Manual": > > https://openpowerfoundation.org/resources/ibmpower9usermanual/ > > > > 4.10.14 Reference and Change Bits > > ... > > When performing HPT translation, the hardware performs the R and C > > bit updates nonatomically. > > ... > > > > The radix case is more complex, and I'll leave it to you to interpret > > what it means: > > > > From "Power ISA Version 3.0 B": > > https://openpowerfoundation.org/specifications/isa/ > > > > 5.7.12 Reference and Change Recording > > ... > > For Radix Tree translation, the Reference and Change bits are set atomically. > > ... > > > > it is atomic in that software use ldarx/stdcx to update these bits. Hardware/core won't > update this directly even though Nest can update this directly without taking a fault. So > for all purpose we can assume that on radix R/C bit is updated by page fault handler. Thanks. To me, it sounds like stating a function provided by h/w, not a requirement for s/w. (IMO, the latter would be something like "software must/should set the bits atomically.) But I'll take your word for it.