Re: [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order

"Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> · Fri, 14 Apr 2023 18:37:47 +0300

On Fri, Apr 14, 2023 at 03:38:35PM +0100, Ryan Roberts wrote:
> On 14/04/2023 15:09, Kirill A. Shutemov wrote:
> > On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
> >> For variable-order anonymous folios, we want to tune the order that we
> >> prefer to allocate based on the vma. Add the routines to manage that
> >> heuristic.
> >>
> >> TODO: Currently we always use the global maximum. Add per-vma logic!
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >> ---
> >>  include/linux/mm.h | 5 +++++
> >>  mm/memory.c        | 8 ++++++++
> >>  2 files changed, 13 insertions(+)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index cdb8c6031d0f..cc8d0b239116 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >>  }
> >>  #endif
> >>
> >> +/*
> >> + * TODO: Should this be set per-architecture?
> >> + */
> >> +#define ANON_FOLIO_ORDER_MAX	4
> >> +
> > 
> > I think it has to be derived from a size in bytes rather than directly
> > specifying a page order. For 4K pages, order 4 is 64K, and for 64K pages
> > it is 1M.
> > 
> 
> Yes I see where you are coming from. What's your feel for what a sensible upper
> bound in bytes is?
> 
> My difficulty is that I would like to be able to use this allocation mechanism
> to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
> that are mapped to physically contiguous memory, and the HW can use that hint to
> coalesce the TLB entries.
> 
> For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
> 16KB and 64KB pages, it's 2MB (order-7 and order-5 respectively). Do you think
> allocating 2MB pages here is going to lead to too much memory wastage?

I think it boils down to the specifics of the microarchitecture.

We can justify 2M PMD-mapped THP in many cases. But PMD-mapped THP not only
reduces TLB pressure (the contiguous bit does that too, I believe), it also
saves one memory access on the page table walk.

It may or may not matter for the processor. It has to be evaluated.

Maybe moving it to per-arch is the right way, with the default in generic
code being ilog2(SZ_64K >> PAGE_SHIFT) or something.
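
To make the arithmetic concrete, here is a rough userspace sketch of deriving
the order from a 64K byte budget by shifting with the page shift. It is not
kernel code: ilog2_ul() is a stand-in for the kernel's ilog2(), SZ_64K is
redefined locally, and anon_folio_order_max() is a hypothetical helper name
that takes the page shift as a parameter so several page sizes can be tried.
The clamp to order 0 for base pages of 64K or larger is also an assumption.

```c
/* Local stand-in for the kernel's SZ_64K (64 KiB in bytes). */
#define SZ_64K 0x00010000UL

/* Userspace stand-in for the kernel's ilog2(): floor(log2(n)) for n > 0. */
static inline unsigned int ilog2_ul(unsigned long n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * Hypothetical helper: the folio order whose size is 64K for the given
 * page shift. If the base page is already larger than 64K the shift
 * yields 0 pages, so fall back to order 0 (assumed behaviour).
 */
static inline unsigned int anon_folio_order_max(unsigned int page_shift)
{
	unsigned long nr_pages = SZ_64K >> page_shift;

	return nr_pages ? ilog2_ul(nr_pages) : 0;
}
```

With page shifts of 12, 14 and 16 (4K, 16K and 64K pages) this gives orders
4, 2 and 0, i.e. 64K in every case. An arch that wants the arm64 contiguous
size instead would have to override the default.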

-- 
  Kiryl Shutsemau / Kirill A. Shutemov