Re: [PATCH 3/3] mm/lru_gen: Don't build multi-gen LRU page table walk code on architecture not supported

Yu Zhao <yuzhao@xxxxxxxxxx> · Sun, 25 Jun 2023 13:34:01 -0600

On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V
<aneesh.kumar@xxxxxxxxxxxxx> wrote:
>
> Hi Yu Zhao,
>
> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes:
>
> > Not all architecture supports hardware atomic updates of access bits. On
> > such an arch, we don't use page table walk to classify pages into
> > generations. Add a kernel config option and remove adding all the page
> > table walk code on such architecture.
> >
> > No preformance change observed with mongodb ycsb test:
> >
> > Patch details        Throughput(Ops/sec)
> > without patch         93278
> > With patch            93400
> >
> > Without patch:
> > $ size mm/vmscan.o
> >    text    data     bss     dec     hex filename
> >  112102   42721      40  154863   25cef mm/vmscan.o
> >
> > With patch
> >
> > $ size  mm/vmscan.o
> >    text    data     bss     dec     hex filename
> >  105430   41333      24  146787   23d63 mm/vmscan.o
> >
>
> Any feedback on this patch? Can we look at merging this change?

Just want to make sure I fully understand the motivation: are there
any other end goals besides reducing the footprint mentioned above?
E.g., preparing for HCA, etc. (My current understanding is that HCA
shouldn't care about it, since it's already runtime disabled if HCA
doesn't want to use it.)

Also as explained offline, solely relying on folio_activate() in
lru_gen_look_around() can cause a measure regression on powerpc,
because
1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is
virtually unlimited.
2. Once folio_activate() reaches that limit, it takes the LRU lock on
top of the PTL, which can be shared by multiple page tables on
powerpc.

In fact, I think we try the opposite direction first, before arriving
at any conclusions, i.e.,
    #define arch_has_hw_pte_young() radix_enabled()
on powerpc. This might benefit platforms with the A-bit but not HCA,
e.g., POWER9. I just ran a quick test (memcached/memtier I previously
shared with you) and it showed far less PTL contention in kswapd. I'm
attaching the flamegraphs for you to analyze. Could you try some
benchmarks with the above change on your end as well?

Thanks.
Attachment:
a.svg

Description: image/svg
Attachment:
b.svg

Description: image/svg