On 5/5/21 11:20 PM, Dan Williams wrote: > On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: >> On 5/5/21 7:44 PM, Dan Williams wrote: >>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: >>>> >>>> Add a new align property for struct dev_pagemap which specifies that a >>>> pagemap is composed of a set of compound pages of size @align, instead >>>> of base pages. When these pages are initialised, most are initialised as >>>> tail pages instead of order-0 pages. >>>> >>>> For certain ZONE_DEVICE users like device-dax which have a fixed page >>>> size, this creates an opportunity to optimize GUP and GUP-fast walkers, >>>> treating it the same way as THP or hugetlb pages. >>>> >>>> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx> >>>> --- >>>> include/linux/memremap.h | 13 +++++++++++++ >>>> mm/memremap.c | 8 ++++++-- >>>> mm/page_alloc.c | 24 +++++++++++++++++++++++- >>>> 3 files changed, 42 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>> index b46f63dcaed3..bb28d82dda5e 100644 >>>> --- a/include/linux/memremap.h >>>> +++ b/include/linux/memremap.h >>>> @@ -114,6 +114,7 @@ struct dev_pagemap { >>>> struct completion done; >>>> enum memory_type type; >>>> unsigned int flags; >>>> + unsigned long align; >>> >>> I think this wants some kernel-doc above to indicate that non-zero >>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE >>> means "use non-compound base pages". >> >> Got it. Are you thinking a kernel_doc on top of the variable above, or preferably on top >> of pgmap_align()? > > I was thinking in dev_pagemap because this value is more than just a > plain alignment its restructuring the layout and construction of the > memmap(), but when I say it that way it seems much less like a vanilla > align value compared to a construction / geometry mode setting. > /me nods >> >>> The non-zero value must be >>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. >>> Hmm, maybe it should be an >>> enum: >>> >>> enum devmap_geometry { >>> DEVMAP_PTE, >>> DEVMAP_PMD, >>> DEVMAP_PUD, >>> } >>> >> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe >> the whole dax/nvdimm align values change meanwhile (as a followup improvement)? > > I think it is ok for dax/nvdimm to continue to maintain their align > value because it should be ok to have 4MB align if the device really > wanted. However, when it goes to map that alignment with > memremap_pages() it can pick a mode. For example, it's already the > case that dax->align == 1GB is mapped with DEVMAP_PTE today, so > they're already separate concepts that can stay separate. > Gotcha. >> >> Although to be fair we only ever care about compound page size in this series (and >> similarly dax/nvdimm @align properties). >> >>> ...because it's more than just an alignment it's a structural >>> definition of how the memmap is laid out. >>> Somehow I missed this other part of your response. >>>> const struct dev_pagemap_ops *ops; >>>> void *owner; >>>> int nr_range; >>>> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) >>>> return NULL; >>>> } >>>> >>>> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap) >>>> +{ >>>> + if (!pgmap || !pgmap->align) >>>> + return PAGE_SIZE; >>>> + return pgmap->align; >>>> +} >>>> + >>>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap) >>>> +{ >>>> + return PHYS_PFN(pgmap_align(pgmap)); >>>> +} >>>> + >>>> #ifdef CONFIG_ZONE_DEVICE >>>> bool pfn_zone_device_reserved(unsigned long pfn); >>>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); >>>> diff --git a/mm/memremap.c b/mm/memremap.c >>>> index 805d761740c4..d160853670c4 100644 >>>> --- a/mm/memremap.c >>>> +++ b/mm/memremap.c >>>> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, >>>> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], >>>> PHYS_PFN(range->start), >>>> PHYS_PFN(range_len(range)), pgmap); >>>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) >>>> - - pfn_first(pgmap, range_id)); >>>> + if (pgmap_align(pgmap) > PAGE_SIZE) >>>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) >>>> + - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap)); >>>> + else >>>> + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) >>>> + - pfn_first(pgmap, range_id)); >>>> return 0; >>>> >>>> err_add_memory: >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>> index 58974067bbd4..3a77f9e43f3a 100644 >>>> --- a/mm/page_alloc.c >>>> +++ b/mm/page_alloc.c >>>> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone, >>>> unsigned long pfn, end_pfn = start_pfn + nr_pages; >>>> struct pglist_data *pgdat = zone->zone_pgdat; >>>> struct vmem_altmap *altmap = pgmap_altmap(pgmap); >>>> + unsigned int pfn_align = pgmap_pfn_align(pgmap); >>>> + unsigned int order_align = order_base_2(pfn_align); >>>> unsigned long zone_idx = zone_idx(zone); >>>> unsigned long start = jiffies; >>>> int nid = pgdat->node_id; >>>> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone, >>>> nr_pages = end_pfn - start_pfn; >>>> } >>>> >>>> - for (pfn = start_pfn; pfn < end_pfn; pfn++) { >>>> + for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) { >>> >>> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>= >>> PAGE_SHIFT" I missed somewhere? >>> >> @pfn_align is in pages too. It's pgmap_align() which is in bytes: >> >> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap) >> +{ >> + return PHYS_PFN(pgmap_align(pgmap)); >> +} > > Ah yup, my eyes glazed over that. I think this is another place that > benefits from a more specific name than "align". "pfns_per_compound" > "compound_pfns"? > We are still describing a page, just not a base page. So perhaps @pfns_per_hpage ? I am fine with @pfns_per_compound or @compound_pfns as well.