Re: [PATCH 3/3] mm/lru_gen: Don't build multi-gen LRU page table walk code on architecture not supported

Yu Zhao <yuzhao@xxxxxxxxxx> · Mon, 26 Jun 2023 11:04:51 -0600

On Mon, Jun 26, 2023 at 4:52 AM Aneesh Kumar K V
<aneesh.kumar@xxxxxxxxxxxxx> wrote:
>
> On 6/26/23 1:04 AM, Yu Zhao wrote:
> > On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V
> > <aneesh.kumar@xxxxxxxxxxxxx> wrote:
> >>
> >> Hi Yu Zhao,
> >>
> >> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes:
> >>
> >>> Not all architecture supports hardware atomic updates of access bits. On
> >>> such an arch, we don't use page table walk to classify pages into
> >>> generations. Add a kernel config option and remove adding all the page
> >>> table walk code on such architecture.
> >>>
> >>> No preformance change observed with mongodb ycsb test:
> >>>
> >>> Patch details        Throughput(Ops/sec)
> >>> without patch         93278
> >>> With patch            93400
> >>>
> >>> Without patch:
> >>> $ size mm/vmscan.o
> >>>    text    data     bss     dec     hex filename
> >>>  112102   42721      40  154863   25cef mm/vmscan.o
> >>>
> >>> With patch
> >>>
> >>> $ size  mm/vmscan.o
> >>>    text    data     bss     dec     hex filename
> >>>  105430   41333      24  146787   23d63 mm/vmscan.o
> >>>
> >>
> >> Any feedback on this patch? Can we look at merging this change?
> >
> > Just want to make sure I fully understand the motivation: are there
> > any other end goals besides reducing the footprint mentioned above?
> > E.g., preparing for HCA, etc. (My current understanding is that HCA
> > shouldn't care about it, since it's already runtime disabled if HCA
> > doesn't want to use it.)
> >
>
> My goal with this change was to remove all those dead code from getting complied
> in for ppc64.

I see. But the first thing (lru_gen_add_folio()) you moved has nothing
to do with this goal, because it's still compiled after the entire
series.

> > Also as explained offline, solely relying on folio_activate() in
> > lru_gen_look_around() can cause a measure regression on powerpc,
> > because
> > 1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is
> > virtually unlimited.
> > 2. Once folio_activate() reaches that limit, it takes the LRU lock on
> > top of the PTL, which can be shared by multiple page tables on
> > powerpc.
> >
> > In fact, I think we try the opposite direction first, before arriving
> > at any conclusions, i.e.,
> >     #define arch_has_hw_pte_young() radix_enabled()
>
> The reason it is disabled on powerpc was that a reference bit update takes a pagefault
> on powerpc irrespective of the translation mode.

This is not true.

>From "IBM POWER9 Processor User Manual":
https://openpowerfoundation.org/resources/ibmpower9usermanual/

  4.10.14 Reference and Change Bits
  ...
  When performing HPT translation, the hardware performs the R and C
bit updates nonatomically.
  ...

The radix case is more complex, and I'll leave it to you to interpret
what it means:

>From "Power ISA Version 3.0 B":
https://openpowerfoundation.org/specifications/isa/

  5.7.12 Reference and Change Recording
  ...
  For Radix Tree translation, the Reference and Change bits are set atomically.
  ...

> > on powerpc. This might benefit platforms with the A-bit but not HCA,
> > e.g., POWER9. I just ran a quick test (memcached/memtier I previously
> > shared with you) and it showed far less PTL contention in kswapd. I'm
> > attaching the flamegraphs for you to analyze. Could you try some
> > benchmarks with the above change on your end as well?
> >
>
> The ptl lock is a valid concern even though i didn't observe the contention increasing with
> the change. I will rerun the test to verify. We have possibly two options here
>
> 1) Delay the lruvec->nr_pages update until the sort phase. But as you explained earlier, that
> can impact should_run_aging().
>
>
> 2) Add another batching mechanism similar to pglist_data->mm_walk which can be used on architecture
> that don't support hardware update of access/reference bit to be used only by lru_gen_look_around()

Sounds good. Thanks.