On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: > > > > On 5/5/21 7:44 PM, Dan Williams wrote: > > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: > >> > >> Add a new align property for struct dev_pagemap which specifies that a > >> pagemap is composed of a set of compound pages of size @align, instead > >> of base pages. When these pages are initialised, most are initialised as > >> tail pages instead of order-0 pages. > >> > >> For certain ZONE_DEVICE users like device-dax which have a fixed page > >> size, this creates an opportunity to optimize GUP and GUP-fast walkers, > >> treating it the same way as THP or hugetlb pages. > >> > >> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx> > >> --- > >> include/linux/memremap.h | 13 +++++++++++++ > >> mm/memremap.c | 8 ++++++-- > >> mm/page_alloc.c | 24 +++++++++++++++++++++++- > >> 3 files changed, 42 insertions(+), 3 deletions(-) > >> > >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h > >> index b46f63dcaed3..bb28d82dda5e 100644 > >> --- a/include/linux/memremap.h > >> +++ b/include/linux/memremap.h > >> @@ -114,6 +114,7 @@ struct dev_pagemap { > >> struct completion done; > >> enum memory_type type; > >> unsigned int flags; > >> + unsigned long align; > > > > I think this wants some kernel-doc above to indicate that non-zero > > means "use compound pages with tail-page dedup" and zero / PAGE_SIZE > > means "use non-compound base pages". > > Got it. Are you thinking a kernel_doc on top of the variable above, or preferably on top > of pgmap_align()? I was thinking in dev_pagemap because this value is more than just a plain alignment its restructuring the layout and construction of the memmap(), but when I say it that way it seems much less like a vanilla align value compared to a construction / geometry mode setting. > > > The non-zero value must be > > PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. > > Hmm, maybe it should be an > > enum: > > > > enum devmap_geometry { > > DEVMAP_PTE, > > DEVMAP_PMD, > > DEVMAP_PUD, > > } > > > I suppose a converter between devmap_geometry and page_size would be needed too? And maybe > the whole dax/nvdimm align values change meanwhile (as a followup improvement)? I think it is ok for dax/nvdimm to continue to maintain their align value because it should be ok to have 4MB align if the device really wanted. However, when it goes to map that alignment with memremap_pages() it can pick a mode. For example, it's already the case that dax->align == 1GB is mapped with DEVMAP_PTE today, so they're already separate concepts that can stay separate. > > Although to be fair we only ever care about compound page size in this series (and > similarly dax/nvdimm @align properties). > > > ...because it's more than just an alignment it's a structural > > definition of how the memmap is laid out. > > > >> const struct dev_pagemap_ops *ops; > >> void *owner; > >> int nr_range; > >> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) > >> return NULL; > >> } > >> > >> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap) > >> +{ > >> + if (!pgmap || !pgmap->align) > >> + return PAGE_SIZE; > >> + return pgmap->align; > >> +} > >> + > >> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap) > >> +{ > >> + return PHYS_PFN(pgmap_align(pgmap)); > >> +} > >> + > >> #ifdef CONFIG_ZONE_DEVICE > >> bool pfn_zone_device_reserved(unsigned long pfn); > >> void *memremap_pages(struct dev_pagemap *pgmap, int nid); > >> diff --git a/mm/memremap.c b/mm/memremap.c > >> index 805d761740c4..d160853670c4 100644 > >> --- a/mm/memremap.c > >> +++ b/mm/memremap.c > >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, > >> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], > >> PHYS_PFN(range->start), > >> PHYS_PFN(range_len(range)), pgmap); > >> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > >> - - pfn_first(pgmap, range_id)); > >> + if (pgmap_align(pgmap) > PAGE_SIZE) > >> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > >> + - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap)); > >> + else > >> + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > >> + - pfn_first(pgmap, range_id)); > >> return 0; > >> > >> err_add_memory: > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >> index 58974067bbd4..3a77f9e43f3a 100644 > >> --- a/mm/page_alloc.c > >> +++ b/mm/page_alloc.c > >> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone, > >> unsigned long pfn, end_pfn = start_pfn + nr_pages; > >> struct pglist_data *pgdat = zone->zone_pgdat; > >> struct vmem_altmap *altmap = pgmap_altmap(pgmap); > >> + unsigned int pfn_align = pgmap_pfn_align(pgmap); > >> + unsigned int order_align = order_base_2(pfn_align); > >> unsigned long zone_idx = zone_idx(zone); > >> unsigned long start = jiffies; > >> int nid = pgdat->node_id; > >> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone, > >> nr_pages = end_pfn - start_pfn; > >> } > >> > >> - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > >> + for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) { > > > > pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>= > > PAGE_SHIFT" I missed somewhere? > > > @pfn_align is in pages too. It's pgmap_align() which is in bytes: > > +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap) > +{ > + return PHYS_PFN(pgmap_align(pgmap)); > +} Ah yup, my eyes glazed over that. I think this is another place that benefits from a more specific name than "align". "pfns_per_compound" "compound_pfns"?