Re: [PATCH] mm/vmemmap/devdax: Fix kernel crash when probing devdax devices

On 07/04/2023 13:23, Aneesh Kumar K.V wrote:
> commit c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")
> added support for using optimized vmemmap for devdax devices. 

Not really. It added *compound pages* for devdax ZONE_DEVICE, which is what the
commit describes. It is meant to represent the non-base-page metadata. Should
fsdax support the equivalent, we could get GUP performance on the dax path and
remove today's special ZONE_DEVICE case of the ref-per-base-page GUP-fast path.

The two were tied together when this was brought in, but in theory[*] we didn't
require anything from the arches to do this (contrary to vmemmap_populate()).
The commit you should refer to instead is:

4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")

[*] or so I thought before next paragraph

> But how vmemmap
> mappings are created are architecture specific. For example, powerpc with hash
> translation doesn't have vmemmap mappings in init_mm page table instead they are
> bolted table entries in the hardware page table
> 

Very interesting, I didn't know this (no init_mm mappings) was an option. And
my apologies for regressing it :/ So, powerpc with hash translation does not
support vmemmap_populate_basepages() and always needs to go through the arch
vmemmap_populate() to create the vmemmap entries.

> vmemmap_populate_compound_pages() used by vmemmap optimization code is not aware
> of these architecture-specific mapping. Hence allow architecture to opt for this
> feature. I selected architectures supporting HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> option as also supporting this feature. I added vmemmap_can_optimize() even
> though the pgmap_vmemmap_nr(pgmap) > 1 check should filter architectures not
> supporting this. IMHO that brings clarity to the code where we are populating
> vmemmap.
> 
I am not sure the last two sentences are right. I would remove them; see the
comments above and at the end below for why.

> This patch fixes the below crash on ppc64.
> 
> BUG: Unable to handle kernel data access on write at 0xc00c000100400038
> Faulting instruction address: 0xc000000001269d90
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in:
> CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
> Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
> NIP:  c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
> REGS: c000000003632c30 TRAP: 0300   Not tainted  (6.3.0-rc5-150500.34-default+)
> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24842228  XER: 00000000
> CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
> ....
> NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
> LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
> Call Trace:
> [c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
> [c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
> [c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
> [c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
> [c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
> [c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
> [c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
> [c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
> [c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
> [c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
> [c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
> [c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
> [c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
> [c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
> [c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
> [c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
> [c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
> [c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
> [c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
> [c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
> [c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
> [c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
> [c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
> [c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
> [c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
> [c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
> [c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
> [c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
> [c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
> [c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
> [c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
> [c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64
> 
> Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages")

It should be:

Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")

But if what you included in your patch is what led to the crash, then your
problem is not the vmemmap optimization, as the latter came after. If it's the
hash I included above, it would reduce the affected trees to v5.19+.

> Cc: Joao Martins <joao.m.martins@xxxxxxxxxx>
> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Reported-by: Tarun Sahu <tsahu@xxxxxxxxxxxxx>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
> ---
>  arch/arm64/Kconfig     |  1 +
>  arch/loongarch/Kconfig |  1 +
>  arch/s390/Kconfig      |  1 +
>  arch/x86/Kconfig       |  1 +
>  drivers/dax/device.c   |  3 ++-
>  include/linux/mm.h     | 16 ++++++++++++++++
>  mm/Kconfig             |  3 +++
>  mm/sparse-vmemmap.c    |  3 +--
>  8 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 27b2592698b0..d3f5945f0aff 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -103,6 +103,7 @@ config ARM64
>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  	select ARCH_WANT_LD_ORPHAN_WARN
>  	select ARCH_WANTS_NO_INSTR
> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>  	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
>  	select ARM_AMBA
> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
> index 9cc8b84f7eb0..ce5802066d0e 100644
> --- a/arch/loongarch/Kconfig
> +++ b/arch/loongarch/Kconfig
> @@ -56,6 +56,7 @@ config LOONGARCH
>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  	select ARCH_WANT_LD_ORPHAN_WARN
>  	select ARCH_WANTS_NO_INSTR
> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>  	select BUILDTIME_TABLE_SORT
>  	select COMMON_CLK
>  	select CPU_PM
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 933771b0b07a..abffccd937b2 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -127,6 +127,7 @@ config S390
>  	select ARCH_WANT_DEFAULT_BPF_JIT
>  	select ARCH_WANT_IPC_PARSE_VERSION
>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +	select ARCH_WANT_OPTIMIZE_VMEMMAP
>  	select BUILDTIME_TABLE_SORT
>  	select CLONE_BACKWARDS2
>  	select DMA_OPS if PCI
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a825bf031f49..e8d66d834b4f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -127,6 +127,7 @@ config X86
>  	select ARCH_WANT_HUGE_PMD_SHARE
>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP	if X86_64
>  	select ARCH_WANT_LD_ORPHAN_WARN
> +	select ARCH_WANT_OPTIMIZE_VMEMMAP	if X86_64
>  	select ARCH_WANTS_THP_SWAP		if X86_64
>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>  	select BUILDTIME_TABLE_SORT

I would remove these and instead use the existing
ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, and have your other snippet be a
second patch that drops the 'HUGETLB_' part.

> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 5494d745ced5..05be8e79d64b 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -448,7 +448,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>  	}
>  
>  	pgmap->type = MEMORY_DEVICE_GENERIC;
> -	if (dev_dax->align > PAGE_SIZE)
> +	if (dev_dax->align > PAGE_SIZE &&
> +	    IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP))
>  		pgmap->vmemmap_shift =
>  			order_base_2(dev_dax->align >> PAGE_SHIFT);

vmemmap_shift relates to the compound page size we will use in
memmap_init_zone_device(). Without it we are back to base struct pages for the
PMD/PUD case. Instead, move the:

	IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP)

further below to __populate_section_memmap() ...

>  	addr = devm_memremap_pages(dev, pgmap);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 716d30d93616..fb71e21df23d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3442,6 +3442,22 @@ void vmemmap_populate_print_last(void);
>  void vmemmap_free(unsigned long start, unsigned long end,
>  		struct vmem_altmap *altmap);
>  #endif
> +
> +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
> +					   struct dev_pagemap *pgmap)
> +{
> +	return is_power_of_2(sizeof(struct page)) &&
> +		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
> +}
> +#else
> +static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
> +					   struct dev_pagemap *pgmap)
> +{
> +	return false;
> +}
> +#endif
> +
>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>  				  unsigned long nr_pages);
>  
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ff7b209dec05..99f87c1be1e8 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -461,6 +461,9 @@ config SPARSEMEM_VMEMMAP
>  	  pfn_to_page and page_to_pfn operations.  This is the most
>  	  efficient option when sufficient kernel resources are available.
>  
> +config ARCH_WANT_OPTIMIZE_VMEMMAP
> +	bool
> +
>  config HAVE_MEMBLOCK_PHYS_MAP
>  	bool
>  
As you mentioned in your other followup email, it probably makes sense that you
just use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if we switch this to
per-arch, and then rename that Kconfig symbol to the above in a second patch.
That simplifies the fix in case it gets picked up by stable trees (the dax
vmemmap deduplication came after hugetlb's). The head-page reuse case for
hugetlb is anyway gated by a hugetlb static key.
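Concretely, the staging being suggested might look like this (a sketch only;
symbol placement and comments are mine, not from either patch):

```
# patch 1 (the fix, easy to backport): gate the dax path on the symbol
# that stable trees already carry
config ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
	bool

# patch 2 (cleanup, not for stable): rename it, dropping the 'HUGETLB_PAGE_'
# part, and update the per-arch selects accordingly
config ARCH_WANT_OPTIMIZE_VMEMMAP
	bool
```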


> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index c5398a5960d0..10d73a0dfcec 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -458,8 +458,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>  		return NULL;
>  
> -	if (is_power_of_2(sizeof(struct page)) &&
> -	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> +	if (vmemmap_can_optimize(altmap, pgmap))
>  		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>  	else
>  		r = vmemmap_populate(start, end, nid, altmap);

Same comment(s) as further above. I think it should keep the existing checks
and add the Kconfig gate here:

if (IS_ENABLED(CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP) &&
    is_power_of_2(sizeof(struct page)) &&
    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
	r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
else
	r = vmemmap_populate(start, end, nid, altmap);

In principle all the arches below can support
vmemmap_populate_compound_pages(), but now it would require them to enable it
explicitly, which I guess makes sense considering the bug you are fixing. But
the arch requirement is not really whether we support the optimized vmemmap,
it is whether we support init_mm vmemmap mappings at all. OTOH, tying hugetlb
and dax together additionally eliminates confusion around an optimization
shared by both, even if the dax one could encompass more. I am not sure what
the arch requirement for the hugetlb vmemmap optimization is these days.

arch/arm64/mm/mmu.c:            return vmemmap_populate_basepages(start, end, node, altmap);
arch/ia64/mm/discontig.c:       return vmemmap_populate_basepages(start, end, node, NULL);
arch/loongarch/mm/init.c:       return vmemmap_populate_basepages(start, end, node, NULL);
arch/riscv/mm/init.c:   return vmemmap_populate_basepages(start, end, node, NULL);
arch/x86/mm/init_64.c:          err = vmemmap_populate_basepages(start, end, node, NULL);
arch/x86/mm/init_64.c:          err = vmemmap_populate_basepages(start, end, node, NULL);



