Re: [PATCH] mm/vmemmap/devdax: Fix kernel crash when probing devdax devices

On 11/04/2023 09:07, Aneesh Kumar K.V wrote:
> Joao Martins <joao.m.martins@xxxxxxxxxx> writes:
> 
>> On 07/04/2023 13:23, Aneesh Kumar K.V wrote:
>>> commit c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>>> added support for using optimized vmemmap for devdax devices.
>>
>> Not really. It added *compound pages* for devdax ZONE_DEVICE, which is what the
>> commit describes. It is meant to represent the non-base-page metadata. Should
>> fsdax support the equivalent, we can have GUP performance on the dax path and
>> remove today's special zone-device case in the ref-per-base-page GUP-fast path.
>>
>> The two were tied together when this was brought in, but in theory[*] we didn't
>> require anything from the arches to do this (contrary to vmemmap_populate()).
>> The commit you should refer to is this instead:
> 
> ok the compound page dependency came via
> commit 6fd3620b3428 ("mm/page_alloc: reuse tail struct pages for compound devmaps")
> 
It's an optimization to remove unnecessary work, not exactly a real dependency.
But I understand what you mean.

> I have now reworked it such that we do
> 
> static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
> 					      unsigned long nr_pages)
> {
> #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
> 	/*
> 	 * this depends on how vmemmap is populated by
> 	 * vmemmap_populate_compound_pages()
> 	 */
> 	return is_power_of_2(sizeof(struct page)) &&
> 		!altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
> #else
> 	return nr_pages;
> #endif
> }
> 

I would instead just simplify it to:

static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
	return IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
		is_power_of_2(sizeof(struct page)) &&
		!altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
}

As the ternary false condition already handles the non-optimized-vmemmap case.

With your vmemmap_can_optimize() helper perhaps:

static inline unsigned long compound_nr_pages(struct dev_pagemap *pgmap,
					      struct vmem_altmap *altmap,
					      unsigned long nr_pages)
{
	return vmemmap_can_optimize(altmap, pgmap) ?
		2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
}


>>
>> 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
>>
>> [*] or so I thought before next paragraph
>>
>>> But how vmemmap
>>> mappings are created is architecture specific. For example, powerpc with hash
>>> translation doesn't have vmemmap mappings in the init_mm page table; instead,
>>> they are bolted entries in the hardware page table.
>>>
>>
>> Very interesting, I didn't know this (no init_mm) was an option. And my
>> apologies for regressing s390 :/ So, s390 does not support
>> vmemmap_populate_basepages() then, and always needs to go through the arch
>> vmemmap_populate() to create the vmemmap entries.
>>
>>> vmemmap_populate_compound_pages() used by the vmemmap optimization code is not
>>> aware of these architecture-specific mappings. Hence allow architectures to opt
>>> in to this feature. I selected architectures supporting the
>>> HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting this feature. I added
>>> vmemmap_can_optimize() even though the pgmap_vmemmap_nr(pgmap) > 1 check should
>>> filter out architectures not supporting this. IMHO that brings clarity to the
>>> code where we are populating vmemmap.
>>>
>> I am not sure the last two sentences are right. I would remove them; see my
>> comments above and at the end below for why.
> 
> That is true because I had vmemmap_shift = 0 if the arch didn't support
> vmemmap optimization. Based on your reply above, I have now moved that
> out and kept vmemmap_can_optimize()
> 
/me nods

> modified   mm/page_alloc.c
> @@ -6846,8 +6846,16 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
>  					      unsigned long nr_pages)
>  {
> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
> +	/*
> +	 * this depends on how vmemmap is populated by
> +	 * vmemmap_populate_compound_pages()
> +	 */
>  	return is_power_of_2(sizeof(struct page)) &&
>  		!altmap ? 2 * (PAGE_SIZE / sizeof(struct page)) : nr_pages;
> +#else
> +	return nr_pages;
> +#endif
>  }
>  
>  static void __ref memmap_init_compound(struct page *head,
> modified   mm/sparse-vmemmap.c
> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>  		return NULL;
>  
> -	if (is_power_of_2(sizeof(struct page)) &&
> -	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> +	if (vmemmap_can_optimize(altmap, pgmap))
>  		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>  	else
>  		r = vmemmap_populate(start, end, nid, altmap);
> 
> ....
> 
>>
>>> Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
>>
>> It should be:
>>
>> Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound
>> devmaps")
> 
> updated.
> 
>>
>> But if what you included in your patch was what led to the crash, then your
>> problem is not the vmemmap optimization ... as the latter came after. If it's
>> the hash I included above, it would reduce the affected trees to v5.19+
>>
>>> Cc: Joao Martins <joao.m.martins@xxxxxxxxxx>
>>> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>>> Reported-by: Tarun Sahu <tsahu@xxxxxxxxxxxxx>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
>>> ---
>>>  arch/arm64/Kconfig     |  1 +
>>>  arch/loongarch/Kconfig |  1 +
>>>  arch/s390/Kconfig      |  1 +
>>>  arch/x86/Kconfig       |  1 +
>>>  drivers/dax/device.c   |  3 ++-
>>>  include/linux/mm.h     | 16 ++++++++++++++++
>>>  mm/Kconfig             |  3 +++
>>>  mm/sparse-vmemmap.c    |  3 +--
>>>  8 files changed, 26 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 27b2592698b0..d3f5945f0aff 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -103,6 +103,7 @@ config ARM64
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>>  	select ARCH_WANTS_NO_INSTR
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>>  	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>>>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
>>>  	select ARM_AMBA
>>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>>> index 9cc8b84f7eb0..ce5802066d0e 100644
>>> --- a/arch/loongarch/Kconfig
>>> +++ b/arch/loongarch/Kconfig
>>> @@ -56,6 +56,7 @@ config LOONGARCH
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>>  	select ARCH_WANTS_NO_INSTR
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>>  	select BUILDTIME_TABLE_SORT
>>>  	select COMMON_CLK
>>>  	select CPU_PM
>>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>>> index 933771b0b07a..abffccd937b2 100644
>>> --- a/arch/s390/Kconfig
>>> +++ b/arch/s390/Kconfig
>>> @@ -127,6 +127,7 @@ config S390
>>>  	select ARCH_WANT_DEFAULT_BPF_JIT
>>>  	select ARCH_WANT_IPC_PARSE_VERSION
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>>>  	select BUILDTIME_TABLE_SORT
>>>  	select CLONE_BACKWARDS2
>>>  	select DMA_OPS if PCI
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index a825bf031f49..e8d66d834b4f 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -127,6 +127,7 @@ config X86
>>>  	select ARCH_WANT_HUGE_PMD_SHARE
>>>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP	if X86_64
>>>  	select ARCH_WANT_LD_ORPHAN_WARN
>>> +	select ARCH_WANT_OPTIMIZE_VMEMMAP	if X86_64
>>>  	select ARCH_WANTS_THP_SWAP		if X86_64
>>>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>>>  	select BUILDTIME_TABLE_SORT
>>
>> I would remove these and instead use the existing
>> ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, and have your other snippet be a
>> second patch that drops the 'HUGETLB_' part
>>
>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>> index 5494d745ced5..05be8e79d64b 100644
>>> --- a/drivers/dax/device.c
>>> +++ b/drivers/dax/device.c
>>> @@ -448,7 +448,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>>  	}
>>>  
>>>  	pgmap->type = MEMORY_DEVICE_GENERIC;
>>> -	if (dev_dax->align > PAGE_SIZE)
>>> +	if (dev_dax->align > PAGE_SIZE &&
>>> +	    IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>>>  		pgmap->vmemmap_shift =
>>>  			order_base_2(dev_dax->align >> PAGE_SHIFT);
>>
>> vmemmap_shift relates to the compound page size we will use in
>> memmap_init_zone_device(). Otherwise we are back to the base struct page on
>> PMD/PUD case. Instead move the:
>>
>> 	IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP)
>>
>> further below to __populate_section_memmap() ...
>>
>>>  	addr = devm_memremap_pages(dev, pgmap);
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 716d30d93616..fb71e21df23d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3442,6 +3442,22 @@ void vmemmap_populate_print_last(void);
>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>>  		struct vmem_altmap *altmap);
>>>  #endif
>>> +
>>> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
>>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>>> +					   struct dev_pagemap *pgmap)
>>> +{
>>> +	return is_power_of_2(sizeof(struct page)) &&
>>> +		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
>>> +}
>>> +#else
>>> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
>>> +					   struct dev_pagemap *pgmap)
>>> +{
>>> +	return false;
>>> +}
>>> +#endif
>>> +
>>>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>>  				  unsigned long nr_pages);
>>>  
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index ff7b209dec05..99f87c1be1e8 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -461,6 +461,9 @@ config SPARSEMEM_VMEMMAP
>>>  	  pfn_to_page and page_to_pfn operations.  This is the most
>>>  	  efficient option when sufficient kernel resources are available.
>>>  
>>> +config ARCH_WANT_OPTIMIZE_VMEMMAP
>>> +	bool
>>> +
>>>  config HAVE_MEMBLOCK_PHYS_MAP
>>>  	bool
>>>  
>> As you mentioned in your other followup email, it probably makes sense that you
>> just use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if we switch this to per-arch,
>> and then have that kconfig name be renamed into the above. It simplifies the
>> fix in case it gets picked up by stable trees (the dax vmemmap deduplication
>> came after hugetlb's). The head page reuse case for hugetlb is anyway captured
>> by a hugetlb static key.
> 
> will do
> 
>>
>>
>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>> index c5398a5960d0..10d73a0dfcec 100644
>>> --- a/mm/sparse-vmemmap.c
>>> +++ b/mm/sparse-vmemmap.c
>>> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>>  		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>>  		return NULL;
>>>  
>>> -	if (is_power_of_2(sizeof(struct page)) &&
>>> -	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>>> +	if (vmemmap_can_optimize(altmap, pgmap))
>>>  		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>>>  	else
>>>  		r = vmemmap_populate(start, end, nid, altmap);
>>
>> Same comment(s) as further above. I think it should be:
>>
>> if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>> 	r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> else
>> 	r = vmemmap_populate(start, end, nid, altmap);
> 
> Even if arch want vmemmap optimization the ability to use compound
> vmemmap_populate is further restricted by
> 
> 	if (is_power_of_2(sizeof(struct page)) &&
> 	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> 
> Hence we still want if (vmemmap_can_optimize(..)) right? 
> 
Correct. It would become:

if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
    is_power_of_2(sizeof(struct page)) &&
    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
	r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);

I suppose a helper could carry this in vmemmap_can_optimize() as you suggest,
as the condition is getting unreadable. It is probably worth using it in
compound_nr_pages() too, but you would need to add a pgmap argument to that
function and change its caller.

vmemmap_can_optimize() is not really the greatest name, but I can't think of
anything better that aligns with the rest of the naming of this feature.

>>
>> In principle all these arches below can support
>> vmemmap_populate_compound_pages(), but now it would require them to enable it
>> explicitly, which I guess makes sense considering the bug you are fixing. But
>> the arch requirement is not really whether we support optimized vmemmap, but
>> whether we support the init_mm at all. OTOH, tying both hugetlb and dax
>> together additionally eliminates confusion around an optimization shared by
>> both, even if the dax one could encompass more. I am not sure what the arch
>> requirement for hugetlb vmemmap opt is these days.
>>
>> arch/arm64/mm/mmu.c:      return vmemmap_populate_basepages(start, end, node, altmap);
>> arch/ia64/mm/discontig.c: return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/loongarch/mm/init.c: return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/riscv/mm/init.c:     return vmemmap_populate_basepages(start, end, node, NULL);
>> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);
>> arch/x86/mm/init_64.c:    err = vmemmap_populate_basepages(start, end, node, NULL);



