On 11/04/2023 09:07, Aneesh Kumar K.V wrote:
> Joao Martins <joao.m.martins@xxxxxxxxxx> writes:
>
>> On 07/04/2023 13:23, Aneesh Kumar K.V wrote:
>>> commit c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>>> added support for using optimized vmemmap for devdax devices.
>>
>> Not really. It added *compound pages* for devdax ZONE_DEVICE, which is what the
>> commit describes. It is meant to represent the non-base-page metadata. Should
>> fsdax support the equivalent, we can have GUP performance on the dax path and
>> remove today's special zone device case of ref-per-base-page GUP-fast path.
>>
>> The two were tied together when this was brought in, but in theory[*] we didn't
>> require anything from arches to do this (contrary to vmemmap_populate()). The
>> commit you should refer to is this instead:
>
> ok the compound page dependency came via
> commit 6fd3620b3428 ("mm/page_alloc: reuse tail struct pages for compound devmaps")
>

It's an optimization to remove unnecessary work, not exactly a real
dependency. But I understand what you mean.

> I have now reworked it such that we do
>
> static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
> 					      unsigned long nr_pages)
> {
> #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
> 	/*
> 	 * this depends on how vmemmap is populated by
> 	 * vmemmap_populate_compound_pages()
> 	 */
> 	return is_power_of_2(sizeof(struct page)) &&
> 		!altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
> #else
> 	return nr_pages;
> #endif
> }
>

I would instead just simplify it to:

static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
	return IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
		is_power_of_2(sizeof(struct page)) && !altmap ?
		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
}

As the ternary false condition already handles the non-optimized-vmemmap case.

With your vmemmap_can_optimize() helper, perhaps:

static inline unsigned long compound_nr_pages(struct dev_pagemap *pgmap,
					      struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
	return vmemmap_can_optimize(altmap, pgmap) ?
		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
}

>>
>> 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
>>
>> [*] or so I thought before next paragraph
>>
>>> But how vmemmap mappings are created is architecture specific. For example,
>>> powerpc with hash translation doesn't have vmemmap mappings in the init_mm
>>> page table; instead they are bolted table entries in the hardware page table.
>>>
>>
>> Very interesting, I didn't know this (no init_mm) was an option. And my
>> apologies for regressing s390 :/ So, s390 does not support
>> vmemmap_populate_basepages() then and always needs to go to the arch
>> vmemmap_populate() to create the vmemmap entries.
>>
>>> vmemmap_populate_compound_pages() used by the vmemmap optimization code is not
>>> aware of these architecture-specific mappings. Hence allow architectures to opt
>>> in to this feature. I selected architectures supporting HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>> option as also supporting this feature. I added vmemmap_can_optimize() even
>>> though the page_vmemmap_nr(pgmap) > 1 check should filter architectures not
>>> supporting this. IMHO that brings clarity to the code where we are populating
>>> vmemmap.
>>>
>> I am not sure the last two sentences are right. I would remove them, see above and
>> below comments at the end on why.
>
> That is true because i had vmemmap_shift = 0 if arch didn't support
> vmemmap optimization.
> Based on your reply above, I have now moved that
> out and kept vmemmap_can_optimize()

/me nods

> modified mm/page_alloc.c
> @@ -6846,8 +6846,16 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
>  					      unsigned long nr_pages)
>  {
> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
> +	/*
> +	 * this depends on how vmemmap is populated by
> +	 * vmemmap_populate_compound_pages()
> +	 */
>  	return is_power_of_2(sizeof(struct page)) &&
>  		!altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
> +#else
> +	return nr_pages;
> +#endif
>  }
>
>  static void __ref memmap_init_compound(struct page *head,
> modified mm/sparse-vmemmap.c
> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  			!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>  		return NULL;
>
> -	if (is_power_of_2(sizeof(struct page)) &&
> -	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> +	if (vmemmap_can_optimize(altmap, pgmap))
>  		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>  	else
>  		r = vmemmap_populate(start, end, nid, altmap);
>
> ....
>
>>
>>> Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>>
>> It should be:
>>
>> Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound
>> devmaps")
>
> updated.
>
>>
>> But if what you included in your patch was what led to the crash, then your
>> problem is not the vmemmap optimization ... as the latter came after. If it's
>> the hash I included above, it would reduce the affected trees to v5.19+
>>
>>> Cc: Joao Martins <joao.m.martins@xxxxxxxxxx>
>>> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>>> Reported-by: Tarun Sahu <tsahu@xxxxxxxxxxxxx>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
>>> ---
>>>  arch/arm64/Kconfig     |  1 +
>>>  arch/loongarch/Kconfig |  1 +
>>>  arch/s390/Kconfig      |  1 +
>>>  arch/x86/Kconfig       |  1 +
>>>  drivers/dax/device.c   |  3 ++-
>>>  include/linux/mm.h     | 16 ++++++++++++++++
>>>  mm/Kconfig             |  3 +++
>>>  mm/sparse-vmemmap.c    |  3 +--
>>>  8 files changed, 26 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 27b2592698b0..d3f5945f0aff 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -103,6 +103,7 @@ config ARM64
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>>  	select ARCH_WANTS_NO_INSTR
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>>  	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>>>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
>>>  	select ARM_AMBA
>>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>>> index 9cc8b84f7eb0..ce5802066d0e 100644
>>> --- a/arch/loongarch/Kconfig
>>> +++ b/arch/loongarch/Kconfig
>>> @@ -56,6 +56,7 @@ config LOONGARCH
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>>  	select ARCH_WANTS_NO_INSTR
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>>  	select BUILDTIME_TABLE_SORT
>>>  	select COMMON_CLK
>>>  	select CPU_PM
>>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>>> index 933771b0b07a..abffccd937b2 100644
>>> --- a/arch/s390/Kconfig
>>> +++ b/arch/s390/Kconfig
>>> @@ -127,6 +127,7 @@ config S390
>>>  	select ARCH_WANT_DEFAULT_BPF_JIT
>>>  	select ARCH_WANT_IPC_PARSE_VERSION
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>>  	select BUILDTIME_TABLE_SORT
>>>  	select CLONE_BACKWARDS2
>>>  	select DMA_OPS if PCI
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index a825bf031f49..e8d66d834b4f 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -127,6 +127,7 @@ config X86
>>>  	select ARCH_WANT_HUGE_PMD_SHARE
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP	if X86_64
>>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP	if X86_64
>>>  	select ARCH_WANTS_THP_SWAP	if X86_64
>>>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>>>  	select BUILDTIME_TABLE_SORT
>>
>> I would remove these and instead use the
>> ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, and have your other snip be a second
>> patch to drop the 'HUGETLB_' part.
>>
>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>> index 5494d745ced5..05be8e79d64b 100644
>>> --- a/drivers/dax/device.c
>>> +++ b/drivers/dax/device.c
>>> @@ -448,7 +448,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>>  	}
>>>
>>>  	pgmap->type = MEMORY_DEVICE_GENERIC;
>>> -	if (dev_dax->align > PAGE_SIZE)
>>> +	if (dev_dax->align > PAGE_SIZE &&
>>> +	    IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>>>  		pgmap->vmemmap_shift =
>>>  			order_base_2(dev_dax->align >> PAGE_SHIFT);
>>
>> vmemmap_shift relates to the compound page size we will use in
>> memmap_init_zone_device(). Otherwise we are back to the base struct page for the
>> PMD/PUD case. Instead move the:
>>
>> IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP)
>>
>> further below to __populate_section_memmap() ...
>>
>>>  	addr = devm_memremap_pages(dev, pgmap);
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 716d30d93616..fb71e21df23d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3442,6 +3442,22 @@ void vmemmap_populate_print_last(void);
>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>>  		struct vmem_altmap *altmap);
>>>  #endif
>>> +
>>> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
>>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>>> +					struct dev_pagemap *pgmap)
>>> +{
>>> +	return is_power_of_2(sizeof(struct page)) &&
>>> +		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
>>> +}
>>> +#else
>>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>>> +					struct dev_pagemap *pgmap)
>>> +{
>>> +	return false;
>>> +}
>>> +#endif
>>> +
>>>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>>  				unsigned long nr_pages);
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index ff7b209dec05..99f87c1be1e8 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -461,6 +461,9 @@ config SPARSEMEM_VMEMMAP
>>>  	  pfn_to_page and page_to_pfn operations.  This is the most
>>>  	  efficient option when sufficient kernel resources are available.
>>>
>>> +config ARCH_WANT_OPTIMIZE_VMEMMAP
>>> +	bool
>>> +
>>>  config HAVE_MEMBLOCK_PHYS_MAP
>>>  	bool
>>>
>> As you mentioned in your other followup email, it probably makes sense that you
>> just use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if we switch this to per-arch.
>> And then have that kconfig name be renamed into the above. It simplifies the fix
>> in case it gets picked up by stable trees (the dax vmemmap deduplication came
>> after hugetlb). The head page reuse case for hugetlb is anyway captured by a
>> hugetlb static key.
>
> will do
>
>>
>>
>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>> index c5398a5960d0..10d73a0dfcec 100644
>>> --- a/mm/sparse-vmemmap.c
>>> +++ b/mm/sparse-vmemmap.c
>>> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>>  			!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>>  		return NULL;
>>>
>>> -	if (is_power_of_2(sizeof(struct page)) &&
>>> -	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>>> +	if (vmemmap_can_optimize(altmap, pgmap))
>>>  		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>>>  	else
>>>  		r = vmemmap_populate(start, end, nid, altmap);
>>
>> Same comment(s) as further above. I think it should be:
>>
>> 	if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>> 		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> 	else
>> 		r = vmemmap_populate(start, end, nid, altmap);
>
> Even if the arch wants vmemmap optimization, the ability to use the compound
> vmemmap_populate is further restricted by
>
> 	if (is_power_of_2(sizeof(struct page)) &&
> 	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>
> Hence we still want if (vmemmap_can_optimize(..)) right?
>

Correct. It would become:

	if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
	    is_power_of_2(sizeof(struct page)) &&
	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);

I suppose a helper could have this in vmemmap_can_optimize() as you suggest, as
the condition is getting rather unreadable. Probably worth using it too in
compound_nr_pages(), but you would need to add a pgmap argument to that function
and change its caller; see the sketch at the end of this mail.

vmemmap_can_optimize() is not really the greatest name, but I can't think of
anything better that aligns with the rest of the naming of this feature.

>>
>> In principle all these arches below can support
>> vmemmap_populate_compound_pages(), but now it would require them to enable it
>> explicitly, which I guess makes sense considering this bug you are fixing. But
>> the arch requirement is not really whether we support the optimized vmemmap,
>> but whether we support the init_mm at all. OTOH tying both hugetlb and dax
>> together additionally eliminates confusion around an optimization shared by
>> both, even if the dax one could encompass more. I am not sure what the arch
>> requirement for hugetlb vmemmap opt is these days.
>>
>> arch/arm64/mm/mmu.c:		return vmemmap_populate_basepages(start, end, node, altmap);
>> arch/ia64/mm/discontig.c:	return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/loongarch/mm/init.c:	return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/riscv/mm/init.c:		return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/x86/mm/init_64.c:		err = vmemmap_populate_basepages(start, end, node, NULL);
>> arch/x86/mm/init_64.c:		err = vmemmap_populate_basepages(start, end, node, NULL);
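
To make the helper suggestion above concrete, here is a rough, untested sketch of
what I have in mind: fold the IS_ENABLED() check into your vmemmap_can_optimize()
and reuse it from compound_nr_pages() once the latter grows a pgmap argument.
Treat the exact signatures, placement and comments as a suggestion only, not a
final form:

/* include/linux/mm.h: single implementation, IS_ENABLED() replaces the #ifdef */
static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
					struct dev_pagemap *pgmap)
{
	/*
	 * The compound vmemmap layout only applies when the arch opted in,
	 * struct page size is a power of 2, the devmap is compound and no
	 * altmap is in use.
	 */
	return IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
		is_power_of_2(sizeof(struct page)) &&
		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
}

/* mm/page_alloc.c: pgmap argument added so the caller just passes it through */
static inline unsigned long compound_nr_pages(struct dev_pagemap *pgmap,
					      struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
	/*
	 * With the optimized layout only the first two vmemmap pages per
	 * compound page hold unique struct pages, so only those need init.
	 */
	return vmemmap_can_optimize(altmap, pgmap) ?
		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
}

__populate_section_memmap() then keeps the if (vmemmap_can_optimize(altmap, pgmap))
check from your patch as-is, and dev_dax_probe() keeps setting vmemmap_shift
unconditionally for align > PAGE_SIZE, as mentioned above.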