Re: [PATCH 5.15 108/476] mm/sparsemem: fix race in accessing memory_section->usage

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2/21/24 14:02, Greg Kroah-Hartman wrote:
> 5.15-stable review patch.  If anyone has any objections, please let me know.
> 
> ------------------
> 
> From: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>
> 
> [ Upstream commit 5ec8e8ea8b7783fab150cf86404fc38cb4db8800 ]
> 
> The below race is observed on a PFN which falls into the device memory
> region with the system memory configuration where PFN's are such that
> [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL].  Since normal zone start and end
> pfn contains the device memory PFN's as well, the compaction triggered
> will try on the device memory PFN's too though they end up in NOP(because
> pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections).  When
> from other core, the section mappings are being removed for the
> ZONE_DEVICE region, that the PFN in question belongs to, on which
> compaction is currently being operated is resulting into the kernel crash
> with CONFIG_SPASEMEM_VMEMAP enabled.  The crash logs can be seen at [1].
> 
> compact_zone()			memunmap_pages
> -------------			---------------
> __pageblock_pfn_to_page
>    ......
>  (a)pfn_valid():
>      valid_section()//return true
> 			      (b)__remove_pages()->
> 				  sparse_remove_section()->
> 				    section_deactivate():
> 				    [Free the array ms->usage and set
> 				     ms->usage = NULL]
>      pfn_section_valid()
>      [Access ms->usage which
>      is NULL]
> 
> NOTE: From the above it can be said that the race is reduced to between
> the pfn_valid()/pfn_section_valid() and the section deactivate with
> SPASEMEM_VMEMAP enabled.
> 
> The commit b943f045a9af("mm/sparse: fix kernel crash with
> pfn_section_valid check") tried to address the same problem by clearing
> the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns
> false thus ms->usage is not accessed.
> 
> Fix this issue by the below steps:
> 
> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
> 
> b) RCU protected read side critical section will either return NULL
>    when SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
> 
> c) Free the ->usage with kfree_rcu() and set ms->usage = NULL.  No
>    attempt will be made to access ->usage after this as the
>    SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
> 
> Thanks to David/Pavan for their inputs on this patch.
> 
> [1] https://lore.kernel.org/linux-mm/994410bb-89aa-d987-1f50-f514903c55aa@xxxxxxxxxxx/
> 
> On Snapdragon SoC, with the mentioned memory configuration of PFN's as
> [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of
> issues daily while testing on a device farm.
> 
> For this particular issue below is the log.  Though the below log is
> not directly pointing to the pfn_section_valid(){ ms->usage;}, when we
> loaded this dump on T32 lauterbach tool, it is pointing.
> 
> [  540.578056] Unable to handle kernel NULL pointer dereference at
> virtual address 0000000000000000
> [  540.578068] Mem abort info:
> [  540.578070]   ESR = 0x0000000096000005
> [  540.578073]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  540.578077]   SET = 0, FnV = 0
> [  540.578080]   EA = 0, S1PTW = 0
> [  540.578082]   FSC = 0x05: level 1 translation fault
> [  540.578085] Data abort info:
> [  540.578086]   ISV = 0, ISS = 0x00000005
> [  540.578088]   CM = 0, WnR = 0
> [  540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBSBTYPE=--)
> [  540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
> [  540.579454] lr : compact_zone+0x994/0x1058
> [  540.579460] sp : ffffffc03579b510
> [  540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:000000000000000c
> [  540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:ffffffc03579b640
> [  540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:0000000000000000
> [  540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:ffffffdebf66d140
> [  540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:00000000009f4bff
> [  540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:0000000000000001
> [  540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :ffffff897d2cd440
> [  540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :ffffffc03579b5b4
> [  540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :0000000000000001
> [  540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :0000000000235800
> [  540.579524] Call trace:
> [  540.579527]  __pageblock_pfn_to_page+0x6c/0x14c
> [  540.579533]  compact_zone+0x994/0x1058
> [  540.579536]  try_to_compact_pages+0x128/0x378
> [  540.579540]  __alloc_pages_direct_compact+0x80/0x2b0
> [  540.579544]  __alloc_pages_slowpath+0x5c0/0xe10
> [  540.579547]  __alloc_pages+0x250/0x2d0
> [  540.579550]  __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
> [  540.579561]  iommu_dma_alloc+0xa0/0x320
> [  540.579565]  dma_alloc_attrs+0xd4/0x108
> 
> [quic_charante@xxxxxxxxxxx: use kfree_rcu() in place of synchronize_rcu(), per David]
>   Link: https://lkml.kernel.org/r/1698403778-20938-1-git-send-email-quic_charante@xxxxxxxxxxx
> Link: https://lkml.kernel.org/r/1697202267-23600-1-git-send-email-quic_charante@xxxxxxxxxxx
> Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
> Signed-off-by: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>
> Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Cc: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> ---
>  include/linux/mmzone.h | 14 +++++++++++---
>  mm/sparse.c            | 17 +++++++++--------
>  2 files changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9e1485083398..8b8349ffa1cd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1287,6 +1287,7 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
>  #define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK)
>  
>  struct mem_section_usage {
> +	struct rcu_head rcu;
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  	DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
>  #endif
> @@ -1457,7 +1458,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>  {
>  	int idx = subsection_map_index(pfn);
>  
> -	return test_bit(idx, ms->usage->subsection_map);
> +	return test_bit(idx, READ_ONCE(ms->usage)->subsection_map);
>  }
>  #else
>  static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> @@ -1481,6 +1482,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>  static inline int pfn_valid(unsigned long pfn)
>  {
>  	struct mem_section *ms;
> +	int ret;
>  
>  	/*
>  	 * Ensure the upper PAGE_SHIFT bits are clear in the
> @@ -1494,13 +1496,19 @@ static inline int pfn_valid(unsigned long pfn)
>  	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>  		return 0;
>  	ms = __pfn_to_section(pfn);
> -	if (!valid_section(ms))
> +	rcu_read_lock();
> +	if (!valid_section(ms)) {
> +		rcu_read_unlock();
>  		return 0;
> +	}
>  	/*
>  	 * Traditionally early sections always returned pfn_valid() for
>  	 * the entire section-sized span.
>  	 */
> -	return early_section(ms) || pfn_section_valid(ms, pfn);
> +	ret = early_section(ms) || pfn_section_valid(ms, pfn);
> +	rcu_read_unlock();
> +
> +	return ret;
>  }
>  #endif
>  
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 120bc8ea5293..27092badd15b 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -789,6 +789,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  	if (empty) {
>  		unsigned long section_nr = pfn_to_section_nr(pfn);
>  
> +		/*
> +		 * Mark the section invalid so that valid_section()
> +		 * return false. This prevents code from dereferencing
> +		 * ms->usage array.
> +		 */
> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> +
>  		/*
>  		 * When removing an early section, the usage map is kept (as the
>  		 * usage maps of other sections fall into the same page). It
> @@ -797,16 +804,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  		 * was allocated during boot.
>  		 */
>  		if (!PageReserved(virt_to_page(ms->usage))) {
> -			kfree(ms->usage);
> -			ms->usage = NULL;
> +			kfree_rcu(ms->usage, rcu);
> +			WRITE_ONCE(ms->usage, NULL);
>  		}
>  		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> -		/*
> -		 * Mark the section invalid so that valid_section()
> -		 * return false. This prevents code from dereferencing
> -		 * ms->usage array.
> -		 */
> -		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>  	}
>  
>  	/*

Maybe this is already known, but just FYI I wanted to drop the note that for some
reasons I don't understand, this patch prevents me from compiling the proprietary
Nvidia Unix driver with versions 510.108.03 and 535.104.05 in the 5.15 series.
No problems with the 6.5 series, although the patch is included there as well.

linux-5.15.155 , nvidia 535.104.05 :

    make -j $(nproc) CC=gcc SYSSRC=/scratch/local/linux clean && make -j $(nproc) CC=gcc SYSSRC=/scratch/local/linux V=2 modules
    [...]
    MODPOST /scratch/local/bee-buczek/nvidia/test_535-104-05/kernel/Module.symvers - due to target missing
    ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
    make[2]: *** [scripts/Makefile.modpost:133: /scratch/local/bee-buczek/nvidia/test_535-104-05/kernel/Module.symvers] Error 1
    make[2]: *** Deleting file '/scratch/local/bee-buczek/nvidia/test_535-104-05/kernel/Module.symvers'
    make[1]: *** [Makefile:1826: modules] Error 2
    make[1]: Leaving directory '/scratch/local/linux'
    make: *** [Makefile:82: modules] Error 2


With fd117d03cefd15 ("mm/sparsemem: fix race in accessing memory_section->usage") reverted,
the build completes.

Best
  Donald
-- 
Donald Buczek
buczek@xxxxxxxxxxxxx
Tel: +49 30 8413 1433





[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux