On Fri, Oct 13, 2023 at 06:34:27PM +0530, Charan Teja Kalla wrote:
> The below race is observed on a PFN which falls into the device memory
> region with a system memory configuration where the PFNs are laid out
> as [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since the normal zone's start
> and end pfn contain the device memory PFNs as well, compaction will be
> attempted on the device memory PFNs too, though it ends up being a NOP
> (because pfn_to_online_page() returns NULL for ZONE_DEVICE memory
> sections). When, from another core, the section mappings are being
> removed for the ZONE_DEVICE region that the PFN in question belongs to,
> while compaction is operating on that PFN, the result is a kernel crash
> with CONFIG_SPARSEMEM_VMEMMAP enabled:
>
> compact_zone()			memunmap_pages
> -------------			--------------
> __pageblock_pfn_to_page
> ......
> (a) pfn_valid():
>     valid_section() // returns true
>				(b) __remove_pages()->
>				    sparse_remove_section()->
>				      section_deactivate():
>				      [Free the array ms->usage and set
>				       ms->usage = NULL]
>     pfn_section_valid()
>     [Access ms->usage which is NULL]
>
> NOTE: From the above it can be seen that the race reduces to one
> between pfn_valid()/pfn_section_valid() and section deactivation with
> SPARSEMEM_VMEMMAP enabled.
>
> Commit b943f045a9af ("mm/sparse: fix kernel crash with
> pfn_section_valid check") tried to address the same problem by clearing
> SECTION_HAS_MEM_MAP, with the expectation that valid_section() then
> returns false and thus ms->usage is not accessed.
>
> Fix this issue by the below steps:
>
> a) Clear SECTION_HAS_MEM_MAP before freeing ->usage.
> b) An RCU-protected read-side critical section will either return NULL
>    when SECTION_HAS_MEM_MAP is cleared, or can successfully access
>    ->usage.
> c) Synchronize the RCU on the write side and free ->usage. No attempt
>    will be made to access ->usage after this, as SECTION_HAS_MEM_MAP is
>    cleared and thus valid_section() returns false.
>
> Since section_deactivate() is a rare operation that only comes in on
> the hot-remove path, the impact of synchronize_rcu() should be
> negligible.

struct mem_section_usage has other fields, like pageblock_flags. Do we
need to protect its readers with RCU as well?

Also, can we annotate the usage field in struct mem_section with __rcu
and use RCU accessors like rcu_dereference() when accessing
mem_section::usage?

>
> Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
> Signed-off-by: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>
> ---
>  include/linux/mmzone.h | 11 +++++++++--
>  mm/sparse.c            | 14 ++++++++------
>  2 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc..c877396 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1987,6 +1987,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>  static inline int pfn_valid(unsigned long pfn)
>  {
>  	struct mem_section *ms;
> +	int ret;
>
>  	/*
>  	 * Ensure the upper PAGE_SHIFT bits are clear in the
> @@ -2000,13 +2001,19 @@ static inline int pfn_valid(unsigned long pfn)
>  	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>  		return 0;
>  	ms = __pfn_to_section(pfn);
> -	if (!valid_section(ms))
> +	rcu_read_lock();
> +	if (!valid_section(ms)) {
> +		rcu_read_unlock();
>  		return 0;
> +	}
>  	/*
>  	 * Traditionally early sections always returned pfn_valid() for
>  	 * the entire section-sized span.
>  	 */
> -	return early_section(ms) || pfn_section_valid(ms, pfn);
> +	ret = early_section(ms) || pfn_section_valid(ms, pfn);
> +	rcu_read_unlock();
> +
> +	return ret;
>  }
>  #endif
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e5..ca7dbe1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -792,6 +792,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  	unsigned long section_nr = pfn_to_section_nr(pfn);
>
>  	/*
> +	 * Mark the section invalid so that valid_section()
> +	 * return false. This prevents code from dereferencing
> +	 * ms->usage array.
> +	 */
> +	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> +

This trick may not be needed if we add proper NULL checks around
ms->usage. We are anyway introducing a new rule that this check must be
done under the RCU lock, so why not revisit it?

> +	/*
>  	 * When removing an early section, the usage map is kept (as the
>  	 * usage maps of other sections fall into the same page). It
>  	 * will be re-used when re-adding the section - which is then no
> @@ -799,16 +806,11 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  	 * was allocated during boot.
>  	 */
>  	if (!PageReserved(virt_to_page(ms->usage))) {
> +		synchronize_rcu();
>  		kfree(ms->usage);
>  		ms->usage = NULL;
>  	}

If we add NULL checks around ms->usage, this becomes:

	tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked());
	synchronize_rcu();
	kfree(tmp);

By the way, do we come here holding any global locks? If yes,
synchronize_rcu() can add delays in releasing that lock. In that case we
may have to go for an async RCU free instead.

>  	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> -	/*
> -	 * Mark the section invalid so that valid_section()
> -	 * return false. This prevents code from dereferencing
> -	 * ms->usage array.
> -	 */
> -	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>  }

Thanks,
Pavan