Joao Martins <joao.m.martins@xxxxxxxxxx> writes:

> On 07/04/2023 13:23, Aneesh Kumar K.V wrote:
>> commit c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>> added support for using optimized vmemmap for devdax devices.
>
> Not really. It added *compound pages* for devdax ZONE_DEVICE, which is what the
> commit describes. It is meant to represent the non-base-page metadata. Should
> fsdax support the equivalent, we can have GUP performance on the dax path and
> remove today's special zone-device case of the ref-per-base-page GUP-fast path.
>
> One was tied to the other when it was brought in, but in theory[*] we didn't
> require anything from arches to do this (contrary to vmemmap_populate()). The
> commit you should refer to is this instead:

Ok, the compound page dependency came via commit 6fd3620b3428 ("mm/page_alloc:
reuse tail struct pages for compound devmaps"). I have now reworked it so that
we do:

static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
	/*
	 * this depends on how vmemmap is populated by
	 * vmemmap_populate_compound_pages()
	 */
	return is_power_of_2(sizeof(struct page)) && !altmap ?
		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
#else
	return nr_pages;
#endif
}
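To spell out what that 2 * (PAGE_SIZE / sizeof(struct page)) buys us, here is
a userspace sketch of the arithmetic (it assumes 4K base pages and a 64-byte
struct page, both of which are configuration dependent):

#include <stdio.h>

#define PAGE_SIZE	4096UL	/* assumed 4K base pages */
#define STRUCT_PAGE_SZ	64UL	/* assumed sizeof(struct page) */

int main(void)
{
	/* a 2M (PMD-sized) devdax compound page spans 512 base pages */
	unsigned long nr_pages = 512;
	/*
	 * With the optimization only the head vmemmap page plus one tail
	 * vmemmap page are backed by unique memory, so only
	 * 2 * (PAGE_SIZE / sizeof(struct page)) struct pages need to be
	 * initialized; the remaining tail struct pages alias the second
	 * vmemmap page.
	 */
	unsigned long initialized = 2 * (PAGE_SIZE / STRUCT_PAGE_SZ);

	printf("initialize %lu of %lu struct pages per 2M compound page\n",
	       initialized, nr_pages);	/* prints 128 of 512 */
	return 0;
}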
> 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
>
> [*] or so I thought before next paragraph
>
>> But how vmemmap
>> mappings are created is architecture specific. For example, powerpc with hash
>> translation doesn't have vmemmap mappings in the init_mm page table; instead
>> they are bolted entries in the hardware page table.
>
> Very interesting, I didn't know this (no init_mm) was an option. And my
> apologies for regressing s390 :/ So, s390 does not support
> vmemmap_populate_basepages() then and always needs to go through the arch
> vmemmap_populate() to create the vmemmap entries.
>
>> vmemmap_populate_compound_pages() used by the vmemmap optimization code is not
>> aware of these architecture-specific mappings. Hence allow architectures to opt
>> in to this feature. I selected architectures supporting the
>> HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting this feature. I added
>> vmemmap_can_optimize() even though the page_vmemmap_nr(pgmap) > 1 check should
>> filter architectures not supporting this. IMHO that brings clarity to the code
>> where we are populating vmemmap.
>
> I am not sure the last two sentences are right. I would remove them; see the
> comments above and at the end below for why.

That is true because I had vmemmap_shift = 0 if the arch didn't support vmemmap
optimization. Based on your reply above, I have now moved that out and kept
vmemmap_can_optimize():

modified   mm/page_alloc.c
@@ -6846,8 +6846,16 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
 					      unsigned long nr_pages)
 {
+#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
+	/*
+	 * this depends on how vmemmap is populated by
+	 * vmemmap_populate_compound_pages()
+	 */
 	return is_power_of_2(sizeof(struct page)) && !altmap ?
 		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
+#else
+	return nr_pages;
+#endif
 }
 
 static void __ref memmap_init_compound(struct page *head,

modified   mm/sparse-vmemmap.c
@@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		     !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (is_power_of_2(sizeof(struct page)) &&
-	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
+	if (vmemmap_can_optimize(altmap, pgmap))
 		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
 	else
 		r = vmemmap_populate(start, end, nid, altmap);

....

>
>> Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>
> It should be:
>
> Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound
> devmaps")

Updated.

> But if what you included in your patch was what led to the crash, then your
> problem is not the vmemmap optimization, as the latter came after. If it's
> the hash I included above, it would reduce the affected trees to v5.19+.
>
>> Cc: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>> Reported-by: Tarun Sahu <tsahu@xxxxxxxxxxxxx>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
>> ---
>>  arch/arm64/Kconfig     |  1 +
>>  arch/loongarch/Kconfig |  1 +
>>  arch/s390/Kconfig      |  1 +
>>  arch/x86/Kconfig       |  1 +
>>  drivers/dax/device.c   |  3 ++-
>>  include/linux/mm.h     | 16 ++++++++++++++++
>>  mm/Kconfig             |  3 +++
>>  mm/sparse-vmemmap.c    |  3 +--
>>  8 files changed, 26 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 27b2592698b0..d3f5945f0aff 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -103,6 +103,7 @@ config ARM64
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>  	select ARCH_WANTS_NO_INSTR
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>  	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
>>  	select ARM_AMBA
>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>> index 9cc8b84f7eb0..ce5802066d0e 100644
>> --- a/arch/loongarch/Kconfig
>> +++ b/arch/loongarch/Kconfig
>> @@ -56,6 +56,7 @@ config LOONGARCH
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>  	select ARCH_WANTS_NO_INSTR
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>  	select BUILDTIME_TABLE_SORT
>>  	select COMMON_CLK
>>  	select CPU_PM
>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>> index 933771b0b07a..abffccd937b2 100644
>> --- a/arch/s390/Kconfig
>> +++ b/arch/s390/Kconfig
>> @@ -127,6 +127,7 @@ config S390
>>  	select ARCH_WANT_DEFAULT_BPF_JIT
>>  	select ARCH_WANT_IPC_PARSE_VERSION
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>  	select BUILDTIME_TABLE_SORT
>>  	select CLONE_BACKWARDS2
>>  	select DMA_OPS if PCI
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index a825bf031f49..e8d66d834b4f 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -127,6 +127,7 @@ config X86
>>  	select ARCH_WANT_HUGE_PMD_SHARE
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP	if X86_64
>>  	select ARCH_WANT_LD_ORPHAN_WARN
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP	if X86_64
>>  	select ARCH_WANTS_THP_SWAP	if X86_64
>>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>>  	select BUILDTIME_TABLE_SORT
>
> I would remove these and instead use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP,
> and have your other snippet be a second patch that drops the 'HUGETLB_' part.
>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index 5494d745ced5..05be8e79d64b 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -448,7 +448,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>  	}
>>
>>  	pgmap->type = MEMORY_DEVICE_GENERIC;
>> -	if (dev_dax->align > PAGE_SIZE)
>> +	if (dev_dax->align > PAGE_SIZE &&
>> +	    IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>>  		pgmap->vmemmap_shift =
>>  			order_base_2(dev_dax->align >> PAGE_SHIFT);
>
> vmemmap_shift relates to the compound page size we will use in
> memmap_init_zone_device(). Otherwise we are back to base struct pages for the
> PMD/PUD case. Instead, move the
>
> 	IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP)
>
> check further below, to __populate_section_memmap() ...
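To make the shift concrete: for a 2M-aligned devdax namespace on 4K base
pages, vmemmap_shift comes out as order_base_2(512) = 9, i.e. the compound
order. A quick userspace sketch of that arithmetic (order_base_2() is
reimplemented here purely for illustration):

#include <stdio.h>

#define PAGE_SHIFT	12	/* assumed 4K base pages */

/* userspace stand-in for the kernel's order_base_2() */
static unsigned long order_base_2(unsigned long n)
{
	unsigned long order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}

int main(void)
{
	unsigned long align = 2UL << 20;	/* 2M device alignment */

	/* 2M spans 512 base pages, so the compound order is 9 */
	printf("vmemmap_shift = %lu\n", order_base_2(align >> PAGE_SHIFT));
	return 0;
}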
>>  	addr = devm_memremap_pages(dev, pgmap);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 716d30d93616..fb71e21df23d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3442,6 +3442,22 @@ void vmemmap_populate_print_last(void);
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		  struct vmem_altmap *altmap);
>>  #endif
>> +
>> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>> +					struct dev_pagemap *pgmap)
>> +{
>> +	return is_power_of_2(sizeof(struct page)) &&
>> +		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
>> +}
>> +#else
>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>> +					struct dev_pagemap *pgmap)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +
>>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>  				  unsigned long nr_pages);
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ff7b209dec05..99f87c1be1e8 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -461,6 +461,9 @@ config SPARSEMEM_VMEMMAP
>>  	  pfn_to_page and page_to_pfn operations.  This is the most
>>  	  efficient option when sufficient kernel resources are available.
>>
>> +config ARCH_WANT_OPTIMIZE_VMEMMAP
>> +	bool
>> +
>>  config HAVE_MEMBLOCK_PHYS_MAP
>>  	bool
>
> As you mentioned in your other followup email, it probably makes sense that
> you just use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if we switch this to
> per-arch, and then have that kconfig name be renamed into the above. It
> simplifies the fix in case it gets picked up by stable trees (the dax vmemmap
> deduplication came after hugetlb's). The head-page reuse case for hugetlb is
> anyway captured by a hugetlb static key.

Will do.

>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index c5398a5960d0..10d73a0dfcec 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>  		     !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>  		return NULL;
>>
>> -	if (is_power_of_2(sizeof(struct page)) &&
>> -	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>> +	if (vmemmap_can_optimize(altmap, pgmap))
>>  		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>>  	else
>>  		r = vmemmap_populate(start, end, nid, altmap);
>
> Same comment(s) as further above. I think it should be:
>
> 	if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
> 		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> 	else
> 		r = vmemmap_populate(start, end, nid, altmap);

Even if the arch wants vmemmap optimization, the ability to use
vmemmap_populate_compound_pages() is further restricted by:

	if (is_power_of_2(sizeof(struct page)) &&
	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)

Hence we still want if (vmemmap_can_optimize(...)), right?
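In other words, the sketch below is roughly where I am heading: the
compile-time arch opt-in and the runtime restrictions compose in a single
helper (names as in this patch; the Kconfig rename you suggest would apply on
top):

/* sketch only, not the final patch */
static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
					struct dev_pagemap *pgmap)
{
	/* arches that cannot remap their vmemmap never optimize */
	if (!IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
		return false;

	return is_power_of_2(sizeof(struct page)) &&
		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
}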
> In principle all these arches below can support
> vmemmap_populate_compound_pages(), but now it would require them to enable it
> explicitly, which I guess makes sense considering the bug you are fixing. But
> the arch requirement is not really whether we support the optimized vmemmap,
> but whether we support the init_mm-based vmemmap at all. OTOH, tying both
> hugetlb and dax together additionally eliminates confusion around an
> optimization shared by both, even if the dax one could encompass more. I am
> not sure what the arch requirement for hugetlb vmemmap optimization is these
> days.
>
> arch/arm64/mm/mmu.c:      return vmemmap_populate_basepages(start, end, node, altmap);
> arch/ia64/mm/discontig.c: return vmemmap_populate_basepages(start, end, node, NULL);
> arch/loongarch/mm/init.c: return vmemmap_populate_basepages(start, end, node, NULL);
> arch/riscv/mm/init.c:     return vmemmap_populate_basepages(start, end, node, NULL);
> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);
> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);