On 7/15/21 2:08 AM, Dan Williams wrote: > On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: >> >> Add a new align property for struct dev_pagemap which specifies that a > > s/align/@geometry/ > Yeap, updated. >> pagemap is composed of a set of compound pages of size @align, > > s/@align/@geometry/ > Yeap, updated. >> instead of >> base pages. When a compound page geometry is requested, all but the first >> page are initialised as tail pages instead of order-0 pages. >> >> For certain ZONE_DEVICE users like device-dax which have a fixed page size, >> this creates an opportunity to optimize GUP and GUP-fast walkers, treating >> it the same way as THP or hugetlb pages. >> >> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx> >> --- >> include/linux/memremap.h | 17 +++++++++++++++++ >> mm/memremap.c | 8 ++++++-- >> mm/page_alloc.c | 34 +++++++++++++++++++++++++++++++++- >> 3 files changed, 56 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >> index 119f130ef8f1..e5ab6d4525c1 100644 >> --- a/include/linux/memremap.h >> +++ b/include/linux/memremap.h >> @@ -99,6 +99,10 @@ struct dev_pagemap_ops { >> * @done: completion for @internal_ref >> * @type: memory type: see MEMORY_* in memory_hotplug.h >> * @flags: PGMAP_* flags to specify defailed behavior >> + * @geometry: structural definition of how the vmemmap metadata is populated. >> + * A zero or PAGE_SIZE defaults to using base pages as the memmap metadata >> + * representation. A bigger value but also multiple of PAGE_SIZE will set >> + * up compound struct pages representative of the requested geometry size. >> * @ops: method table >> * @owner: an opaque pointer identifying the entity that manages this >> * instance. Used by various helpers to make sure that no >> @@ -114,6 +118,7 @@ struct dev_pagemap { >> struct completion done; >> enum memory_type type; >> unsigned int flags; >> + unsigned long geometry; >> const struct dev_pagemap_ops *ops; >> void *owner; >> int nr_range; >> @@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) >> return NULL; >> } >> >> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap) >> +{ >> + if (!pgmap || !pgmap->geometry) >> + return PAGE_SIZE; >> + return pgmap->geometry; >> +} >> + >> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap) >> +{ >> + return PHYS_PFN(pgmap_geometry(pgmap)); >> +} > > Are both needed? Maybe just have ->geometry natively be in nr_pages > units directly, because pgmap_pfn_geometry() makes it confusing > whether it's a geometry of the pfn or the geometry of the pgmap. > I use pgmap_geometry() largelly when we manipulate memmap in sparse-vmemmap code, as we deal with addresses/offsets/subsection-size. While using pgmap_pfn_geometry for code that deals with PFN initialization. For this patch I could remove the confusion. And actually maybe I can just store the pgmap_geometry() value in bytes locally in vmemmap_populate_compound_pages() and we can remove this extra helper. >> + >> #ifdef CONFIG_ZONE_DEVICE >> bool pfn_zone_device_reserved(unsigned long pfn); >> void *memremap_pages(struct dev_pagemap *pgmap, int nid); >> diff --git a/mm/memremap.c b/mm/memremap.c >> index 805d761740c4..ffcb924eb6a5 100644 >> --- a/mm/memremap.c >> +++ b/mm/memremap.c >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, >> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], >> PHYS_PFN(range->start), >> PHYS_PFN(range_len(range)), pgmap); >> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) >> - - pfn_first(pgmap, range_id)); >> + if (pgmap_geometry(pgmap) > PAGE_SIZE) > > This would become > > if (pgmap_geometry(pgmap) > 1) > >> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) >> + - pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap)); > > ...and this would be pgmap_geometry() > >> + else >> + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) >> + - pfn_first(pgmap, range_id)); >> return 0; >> Let me adjust accordingly. >> err_add_memory: >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 79f3b38afeca..188cb5f8c308 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn, >> } >> } >> >> +static void __ref memmap_init_compound(struct page *page, unsigned long pfn, > > I'd feel better if @page was renamed @head... more below: > Oh yeah -- definitely more readable. >> + unsigned long zone_idx, int nid, >> + struct dev_pagemap *pgmap, >> + unsigned long nr_pages) >> +{ >> + unsigned int order_align = order_base_2(nr_pages); >> + unsigned long i; >> + >> + __SetPageHead(page); >> + >> + for (i = 1; i < nr_pages; i++) { > > The switch of loop styles is jarring. I.e. the switch from > memmap_init_zone_device() that is using pfn, end_pfn, and a local > 'struct page *' variable to this helper using pfn + i and a mix of > helpers (__init_zone_device_page, prep_compound_tail) that have > different expectations of head page + tail_idx and current page. > > I.e. this reads more obviously correct to me, but maybe I'm just in > the wrong headspace: > > for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) { > struct page *page = pfn_to_page(pfn); > > __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); > prep_compound_tail(head, pfn - head_pfn); > Personally -- and I am dubious given I have been staring at this code -- I find that what I wrote a little better as it follows more what compound page initialization does. Like it's easier for me to read that I am initializing a number of tail pages and a head page (for a known geometry size). Additionally, it's unnecessary (and a tiny ineficient?) to keep doing pfn_to_page(pfn) provided ZONE_DEVICE requires SPARSEMEM_VMEMMAP and so your page pointers are all contiguous and so for any given PFN we can avoid having deref vmemmap vaddrs back and forth. Which is the second reason I pass a page, and iterate over its tails based on a head page pointer. But I was at too minds when writing this, so if the there's no added inefficiency I can rewrite like the above. >> + __init_zone_device_page(page + i, pfn + i, zone_idx, >> + nid, pgmap); >> + prep_compound_tail(page, i); >> + >> + /* >> + * The first and second tail pages need to >> + * initialized first, hence the head page is >> + * prepared last. > > I'd change this comment to say why rather than restate what can be > gleaned from the code. It's actually not clear to me why this order is > necessary. > So the first tail page stores mapcount_ptr and compound order, and the second tail page stores pincount_ptr. prep_compound_head() does this: set_compound_order(page, order); atomic_set(compound_mapcount_ptr(page), -1); if (hpage_pincount_available(page)) atomic_set(compound_pincount_ptr(page), 0); So we need those tail pages initialized first prior to initializing the head. I can expand the comment above to make it clear why we need first and second tail pages. >> + */ >> + if (i == 2) >> + prep_compound_head(page, order_align); >> + } >> +} >> + >> void __ref memmap_init_zone_device(struct zone *zone, >> unsigned long start_pfn, >> unsigned long nr_pages, >> @@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone, >> unsigned long pfn, end_pfn = start_pfn + nr_pages; >> struct pglist_data *pgdat = zone->zone_pgdat; >> struct vmem_altmap *altmap = pgmap_altmap(pgmap); >> + unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap); >> unsigned long zone_idx = zone_idx(zone); >> unsigned long start = jiffies; >> int nid = pgdat->node_id; >> @@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone, >> nr_pages = end_pfn - start_pfn; >> } >> >> - for (pfn = start_pfn; pfn < end_pfn; pfn++) { >> + for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) { >> struct page *page = pfn_to_page(pfn); >> >> __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); >> + >> + if (pfns_per_compound == 1) >> + continue; >> + >> + memmap_init_compound(page, pfn, zone_idx, nid, pgmap, >> + pfns_per_compound); > > I otherwise don't see anything broken with this patch, so feel free to include: > > Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx> > > ...on the resend with the fixups. > Thanks. I will wait whether you still want to retain the tag provided the implied changes fixing the failure you reported.