On Fri, Apr 14, 2023 at 03:38:35PM +0100, Ryan Roberts wrote:
> On 14/04/2023 15:09, Kirill A. Shutemov wrote:
> > On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
> >> For variable-order anonymous folios, we want to tune the order that we
> >> prefer to allocate based on the vma. Add the routines to manage that
> >> heuristic.
> >>
> >> TODO: Currently we always use the global maximum. Add per-vma logic!
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >> ---
> >> include/linux/mm.h | 5 +++++
> >> mm/memory.c | 8 ++++++++
> >> 2 files changed, 13 insertions(+)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index cdb8c6031d0f..cc8d0b239116 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >> }
> >> #endif
> >>
> >> +/*
> >> + * TODO: Should this be set per-architecture?
> >> + */
> >> +#define ANON_FOLIO_ORDER_MAX 4
> >> +
> >
> > I think it has to be derived from size in bytes, not directly specifies
> > page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.
> >
>
> Yes I see where you are coming from. What's your feel for what a sensible upper
> bound in bytes is?
>
> My difficulty is that I would like to be able to use this allocation mechanism
> to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
> that are mapped to physically contiguous memory, and the HW can use that hint to
> coalesce the TLB entries.
>
> For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
> 16KB and 64KB pages, its 2MB (order-7 and order-5 respectively). Do you think
> allocating 2MB pages here is going to lead to too much memory wastage?

I think it boils down to the specifics of the microarchitecture.

We can justify 2M PMD-mapped THP in many cases.
But PMD-mapped THP not only reduces TLB pressure (the contiguous bit does
that too, I believe), but also saves one more memory access on the page
table walk. That may or may not matter for a given processor. It has to be
evaluated.

Maybe moving it to per-arch is the right way, with the default in generic
code being ilog2(SZ_64K >> PAGE_SHIFT) or something.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
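
To make the size-versus-order arithmetic in this thread concrete, here is a
minimal user-space sketch. It is not kernel code: anon_folio_order_for_size()
and ilog2_ul() are hypothetical helper names, and SZ_64K and SZ_2M are defined
locally as stand-ins for the kernel constants of the same name. The page shift
is passed as a parameter so several base page sizes can be compared in one
build.

```c
/* Local stand-ins for the kernel's size constants (assumption: user space). */
#define SZ_64K 0x00010000UL
#define SZ_2M  0x00200000UL

/* Floor of log2 for a non-zero value; a simple stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Hypothetical helper: page order needed to cover 'size' bytes when the
 * base page size is (1 << page_shift). Returns 0 if one base page is
 * already at least 'size' bytes.
 */
static unsigned int anon_folio_order_for_size(unsigned long size,
					      unsigned int page_shift)
{
	unsigned long pages = size >> page_shift;

	return pages ? ilog2_ul(pages) : 0;
}
```

With these definitions, a 64K byte limit gives order 4 for 4K pages
(page_shift 12), order 2 for 16K pages and order 0 for 64K pages, while the
2M arm64 contiguous-bit sizes mentioned above come out as order 7 for 16K
pages and order 5 for 64K pages.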