Joao Martins <joao.m.martins@xxxxxxxxxx> writes:

> On 07/04/2023 13:23, Aneesh Kumar K.V wrote:
>> commit c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>> added support for using optimized vmemmap for devdax devices.
>
> Not really. It added *compound pages* for devdax ZONE_DEVICE, which is what the
> commit describes. It is meant to represent the non-base-page metadata. Should
> fsdax support the equivalent, we can have GUP performance on the dax path and
> remove today's special zone-device case of the ref-per-base-page GUP-fast path.
>
> One was tied to the other when it was brought in, but in theory[*] we didn't
> require anything from arches to do this (contrary to vmemmap_populate()). The
> commit you should refer to is this instead:

Ok, the compound page dependency came via commit 6fd3620b3428 ("mm/page_alloc:
reuse tail struct pages for compound devmaps"). I have now reworked it so that
we do:

static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
	/*
	 * this depends on how vmemmap is populated by
	 * vmemmap_populate_compound_pages()
	 */
	return is_power_of_2(sizeof(struct page)) && !altmap ?
		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
#else
	return nr_pages;
#endif
}
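To spell out what that 2 * (PAGE_SIZE / sizeof(struct page)) buys us, here is
a userspace sketch of the arithmetic (it assumes 4K base pages and a 64-byte
struct page, both of which are configuration dependent):

#include <stdio.h>

#define PAGE_SIZE	4096UL	/* assumed 4K base pages */
#define STRUCT_PAGE_SZ	64UL	/* assumed sizeof(struct page) */

int main(void)
{
	/* a 2M (PMD-sized) devdax compound page spans 512 base pages */
	unsigned long nr_pages = 512;
	/*
	 * With the optimization only the head vmemmap page plus one tail
	 * vmemmap page are backed by unique memory, so only
	 * 2 * (PAGE_SIZE / sizeof(struct page)) struct pages need to be
	 * initialized; the remaining tail struct pages alias the second
	 * vmemmap page.
	 */
	unsigned long initialized = 2 * (PAGE_SIZE / STRUCT_PAGE_SZ);

	printf("initialize %lu of %lu struct pages per 2M compound page\n",
	       initialized, nr_pages);	/* prints 128 of 512 */
	return 0;
}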
> 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
>
> [*] or so I thought before next paragraph
>
>> But how vmemmap
>> mappings are created is architecture specific. For example, powerpc with hash
>> translation doesn't have vmemmap mappings in the init_mm page table; instead
>> they are bolted entries in the hardware page table.
>
> Very interesting, I didn't know this (no init_mm) was an option. And my
> apologies for regressing s390 :/ So, s390 does not support
> vmemmap_populate_basepages() then and always needs to go through the arch
> vmemmap_populate() to create the vmemmap entries.
>
>> vmemmap_populate_compound_pages() used by the vmemmap optimization code is not
>> aware of these architecture-specific mappings. Hence allow architectures to opt
>> in to this feature. I selected architectures supporting the
>> HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting this feature. I added
>> vmemmap_can_optimize() even though the page_vmemmap_nr(pgmap) > 1 check should
>> filter architectures not supporting this. IMHO that brings clarity to the code
>> where we are populating vmemmap.
>
> I am not sure the last two sentences are right. I would remove them; see the
> comments above and at the end below for why.

That is true because I had vmemmap_shift = 0 if the arch didn't support vmemmap
optimization. Based on your reply above, I have now moved that out and kept
vmemmap_can_optimize():

modified   mm/page_alloc.c
@@ -6846,8 +6846,16 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
 					      unsigned long nr_pages)
 {
+#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
+	/*
+	 * this depends on how vmemmap is populated by
+	 * vmemmap_populate_compound_pages()
+	 */
 	return is_power_of_2(sizeof(struct page)) && !altmap ?
 		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
+#else
+	return nr_pages;
+#endif
 }
 
 static void __ref memmap_init_compound(struct page *head,

modified   mm/sparse-vmemmap.c
@@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		     !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (is_power_of_2(sizeof(struct page)) &&
-	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
+	if (vmemmap_can_optimize(altmap, pgmap))
 		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
 	else
 		r = vmemmap_populate(start, end, nid, altmap);

....

>
>> Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>
> It should be:
>
> Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound
> devmaps")

Updated.

> But if what you included in your patch was what led to the crash, then your
> problem is not the vmemmap optimization, as the latter came after. If it's
> the hash I included above, it would reduce the affected trees to v5.19+.
>
>> Cc: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>> Reported-by: Tarun Sahu <tsahu@xxxxxxxxxxxxx>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
>> ---
>>  arch/arm64/Kconfig     |  1 +
>>  arch/loongarch/Kconfig |  1 +
>>  arch/s390/Kconfig      |  1 +
>>  arch/x86/Kconfig       |  1 +
>>  drivers/dax/device.c   |  3 ++-
>>  include/linux/mm.h     | 16 ++++++++++++++++
>>  mm/Kconfig             |  3 +++
>>  mm/sparse-vmemmap.c    |  3 +--
>>  8 files changed, 26 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 27b2592698b0..d3f5945f0aff 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -103,6 +103,7 @@ config ARM64
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>  	select ARCH_WANTS_NO_INSTR
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>  	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
>>  	select ARM_AMBA
>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>> index 9cc8b84f7eb0..ce5802066d0e 100644
>> --- a/arch/loongarch/Kconfig
>> +++ b/arch/loongarch/Kconfig
>> @@ -56,6 +56,7 @@ config LOONGARCH
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>  	select ARCH_WANTS_NO_INSTR
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>  	select BUILDTIME_TABLE_SORT
>>  	select COMMON_CLK
>>  	select CPU_PM
>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>> index 933771b0b07a..abffccd937b2 100644
>> --- a/arch/s390/Kconfig
>> +++ b/arch/s390/Kconfig
>> @@ -127,6 +127,7 @@ config S390
>>  	select ARCH_WANT_DEFAULT_BPF_JIT
>>  	select ARCH_WANT_IPC_PARSE_VERSION
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>  	select BUILDTIME_TABLE_SORT
>>  	select CLONE_BACKWARDS2
>>  	select DMA_OPS if PCI
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index a825bf031f49..e8d66d834b4f 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -127,6 +127,7 @@ config X86
>>  	select ARCH_WANT_HUGE_PMD_SHARE
>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP	if X86_64
>>  	select ARCH_WANT_LD_ORPHAN_WARN
>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP	if X86_64
>>  	select ARCH_WANTS_THP_SWAP	if X86_64
>>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>>  	select BUILDTIME_TABLE_SORT
>
> I would remove these and instead use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP,
> and have your other snippet be a second patch that drops the 'HUGETLB_' part.
>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index 5494d745ced5..05be8e79d64b 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -448,7 +448,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>  	}
>>
>>  	pgmap->type = MEMORY_DEVICE_GENERIC;
>> -	if (dev_dax->align > PAGE_SIZE)
>> +	if (dev_dax->align > PAGE_SIZE &&
>> +	    IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>>  		pgmap->vmemmap_shift =
>>  			order_base_2(dev_dax->align >> PAGE_SHIFT);
>
> vmemmap_shift relates to the compound page size we will use in
> memmap_init_zone_device(). Otherwise we are back to base struct pages for the
> PMD/PUD case. Instead, move the
>
> 	IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP)
>
> check further below, to __populate_section_memmap() ...
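To make the shift concrete: for a 2M-aligned devdax namespace on 4K base
pages, vmemmap_shift comes out as order_base_2(512) = 9, i.e. the compound
order. A quick userspace sketch of that arithmetic (order_base_2() is
reimplemented here purely for illustration):

#include <stdio.h>

#define PAGE_SHIFT	12	/* assumed 4K base pages */

/* userspace stand-in for the kernel's order_base_2() */
static unsigned long order_base_2(unsigned long n)
{
	unsigned long order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}

int main(void)
{
	unsigned long align = 2UL << 20;	/* 2M device alignment */

	/* 2M spans 512 base pages, so the compound order is 9 */
	printf("vmemmap_shift = %lu\n", order_base_2(align >> PAGE_SHIFT));
	return 0;
}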
>>  	addr = devm_memremap_pages(dev, pgmap);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 716d30d93616..fb71e21df23d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3442,6 +3442,22 @@ void vmemmap_populate_print_last(void);
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		  struct vmem_altmap *altmap);
>>  #endif
>> +
>> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>> +					struct dev_pagemap *pgmap)
>> +{
>> +	return is_power_of_2(sizeof(struct page)) &&
>> +		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
>> +}
>> +#else
>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>> +					struct dev_pagemap *pgmap)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +
>>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>  				  unsigned long nr_pages);
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ff7b209dec05..99f87c1be1e8 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -461,6 +461,9 @@ config SPARSEMEM_VMEMMAP
>>  	  pfn_to_page and page_to_pfn operations.  This is the most
>>  	  efficient option when sufficient kernel resources are available.
>>
>> +config ARCH_WANT_OPTIMIZE_VMEMMAP
>> +	bool
>> +
>>  config HAVE_MEMBLOCK_PHYS_MAP
>>  	bool
>
> As you mentioned in your other followup email, it probably makes sense that
> you just use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if we switch this to
> per-arch, and then have that kconfig name be renamed into the above. It
> simplifies the fix in case it gets picked up by stable trees (the dax vmemmap
> deduplication came after hugetlb's). The head-page reuse case for hugetlb is
> anyway captured by a hugetlb static key.

Will do.

>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index c5398a5960d0..10d73a0dfcec 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>  		     !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>  		return NULL;
>>
>> -	if (is_power_of_2(sizeof(struct page)) &&
>> -	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>> +	if (vmemmap_can_optimize(altmap, pgmap))
>>  		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>>  	else
>>  		r = vmemmap_populate(start, end, nid, altmap);
>
> Same comment(s) as further above. I think it should be:
>
> 	if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
> 		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> 	else
> 		r = vmemmap_populate(start, end, nid, altmap);

Even if the arch wants vmemmap optimization, the ability to use
vmemmap_populate_compound_pages() is further restricted by:

	if (is_power_of_2(sizeof(struct page)) &&
	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)

Hence we still want if (vmemmap_can_optimize(...)), right?
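In other words, the sketch below is roughly where I am heading: the
compile-time arch opt-in and the runtime restrictions compose in a single
helper (names as in this patch; the Kconfig rename you suggest would apply on
top):

/* sketch only, not the final patch */
static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
					struct dev_pagemap *pgmap)
{
	/* arches that cannot remap their vmemmap never optimize */
	if (!IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
		return false;

	return is_power_of_2(sizeof(struct page)) &&
		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
}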
> In principle all these arches below can support
> vmemmap_populate_compound_pages(), but now it would require them to enable it
> explicitly, which I guess makes sense considering the bug you are fixing. But
> the arch requirement is not really whether we support the optimized vmemmap,
> but whether we support the init_mm-based vmemmap at all. OTOH, tying both
> hugetlb and dax together additionally eliminates confusion around an
> optimization shared by both, even if the dax one could encompass more. I am
> not sure what the arch requirement for hugetlb vmemmap optimization is these
> days.
>
> arch/arm64/mm/mmu.c:      return vmemmap_populate_basepages(start, end, node, altmap);
> arch/ia64/mm/discontig.c: return vmemmap_populate_basepages(start, end, node, NULL);
> arch/loongarch/mm/init.c: return vmemmap_populate_basepages(start, end, node, NULL);
> arch/riscv/mm/init.c:     return vmemmap_populate_basepages(start, end, node, NULL);
> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);
> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);