On Wed, 15 Nov 2023, Hugh Dickins wrote:
> On Wed, 15 Nov 2023, Matthew Wilcox wrote:
> > On Wed, Nov 15, 2023 at 06:05:30PM +0200, José Pekkarinen wrote:
> > > > I don't think we should be changing ptlock_ptr().
> > >
> > > This is where the null ptr dereference originates, so the only
> > > alternative I can think of is to protect the life cycle of the ptdesc
> > > to prevent it to die between the pte check and the spin_unlock of
> > > __pte_offset_map_lock. Would that work for you?
>
> Thanks for pursuing this, José, but I agree with Matthew: I don't
> think your patch is right at all. The change in ptlock_ptr() did not
> make sense to me, and the change in __pte_offset_map_lock() leaves us
> still wondering what has gone wrong (and misses an rcu_read_unlock()).
>
> You mentioned "I tested the syzbot reproducer in x86 and it doesn't
> produce this kasan report anymore": are you saying that you were able
> to reproduce the issue on x86 (without your patch)? That would be very
> interesting (and I think would disprove my hypothesis below). I ought
> to try on x86 if you managed to reproduce on it, but it's not worth
> the effort if you did not. If you have an x86 stack and registers,
> please show (though I'm uncertain how much KASAN spoils the output).
>
> > Ah! I think I found the problem.
> >
> > If ALLOC_SPLIT_PTLOCKS is not set, there is no problem as ->ptl
> > is embedded in the struct page. But if ALLOC_SPLIT_PTLOCKS is set
> > (eg you have LOCKDEP enabled), we can _return_ a NULL pointer from
> > ptlock_ptr. The NULL pointer dereference isn't in ptlock_ptr(), it's
> > in __pte_offset_map_lock().
> >
> > So, how to solve this? We can't just check the ptl against NULL; the
> > memory that ptl points to may have been freed. We could grab a reference
> > to the pmd_page, possibly conditionally on ALLOC_SPLIT_LOCK being set,
> > but that will slow down everything. We could make page_ptl_cachep
> > SLAB_TYPESAFE_BY_RCU, preventing the memory from being freed (even if
> > the lock might not be associated with this page any more).
>
> But I don't think that's quite right either: or, I hope it's not right,
> because it would say that all my business of rcu_read_lock() and
> pte_free_defer() etc was a failing waste of everyone's time: if the
> page table (and/or its lock) can be freed while someone is in that RCU-
> protected area, then I've simply got it wrong.
>
> What I thought, when yesterday's flurry of syzbot messages came in,
> was perhaps the problem is at the other end - when the new page table
> is allocated, rather than when it's being freed: a barrier missing there?
>
> But pmd_install() does have an smp_wmb(), with a (suspiciously) long
> comment about why no smp_rmb()s are needed, except on alpha.
>
> I'm not much good on barriers, but the thought now in my mind is that
> that comment is not quite accurate: that at the bottom level an smp_rmb()
> is (very seldom!) needed - not just to avoid a NULL (or previous garbage)
> ptl pointer in the ALLOC_SPLIT_LOCK case, but to make sure that the ptl
> is visibly correctly initialized (or would spin_lock on it achieve that?);
> and that what pte_offset_map() points to is visibly empty. (I'm imagining
> stale cache lines left behind on the oopsing CPU, which need to be
> refetched after pmdval has been read.)
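[Aside, to make the hypothesized ordering concrete: below is a userspace
analogue of the smp_wmb()/smp_rmb() pairing under discussion. This is
not kernel code - the ptdesc_like/pmd_slot names are invented, and C11
fences stand in for the kernel barriers. The writer initializes the
"page table" and its lock before publishing the pointer; a reader that
skips the acquire fence after loading the pointer may observe the
pointer yet still read stale contents, which is the failure mode
imagined above.]

#include <stdatomic.h>
#include <stdlib.h>

struct ptdesc_like {                /* stand-in for a pte page + its ptl */
        int ptl_ready;              /* "lock initialized" flag */
        long entries[512];          /* "pte entries", zeroed on alloc */
};

static _Atomic(struct ptdesc_like *) pmd_slot;  /* stand-in for the pmd */

static void install(void)           /* models pmd_install() */
{
        struct ptdesc_like *p = calloc(1, sizeof(*p));

        p->ptl_ready = 1;           /* initialize lock and contents first */
        atomic_thread_fence(memory_order_release);      /* smp_wmb() */
        atomic_store_explicit(&pmd_slot, p, memory_order_relaxed);
}

static struct ptdesc_like *walk(void)   /* models a lockless walker */
{
        struct ptdesc_like *p;

        p = atomic_load_explicit(&pmd_slot, memory_order_relaxed);
        if (!p)
                return NULL;
        atomic_thread_fence(memory_order_acquire);      /* smp_rmb() */
        /* only after the fence may p->ptl_ready and p->entries be trusted */
        return p;
}

int main(void)
{
        install();
        struct ptdesc_like *p = walk();
        return p && p->ptl_ready ? 0 : 1;
}

[The relaxed load plus acquire fence is the conservative pairing; as the
pmd_install() comment quoted in the patch below notes, a data-dependent
load already guarantees this ordering on everything but alpha, which is
why the absence of a read barrier at the pte level went unquestioned.]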
>
> And that this issue has, strictly speaking, always been there, but in
> practice never a problem, because of mmap_lock held for write while (or
> prior to) freeing page table, and racers holding mmap_lock for read
> (vma lock not changing the calculus); but collapse_pte_mapped_thp()
> can now be done with mmap_lock for read, letting those racers in
> (and maybe your filemap_map_pages() has helped speed these races up -
> any blame I can shift on to you, the better ;)
>
> But I may well be spouting nonsense. And even if I'm right, is adding
> an smp_rmb() in there too high a price to pay on some architectures?
>
> I did make an attempt to reproduce on an arm64, but there's too much
> more I'd need to muck around with to get it working right. Let's ask for
> a syz test on top of v6.7-rc1 - hmm, how do I tell syzbot that it's arm64
> that it needs to try on? I hope it gets that from the Cc'ed syz number.

Syzbot couldn't do it from this mail, but did it from the original thread:
https://lore.kernel.org/linux-mm/000000000000ba0007060a40644f@xxxxxxxxxx/
tells us that I was spouting nonsense: this patch does not fix it.
I have no further idea yet.

>
> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b85ea95d086471afb4ad062012a4d73cd328fa86
>
> [PATCH] mm/pgtable: smp_rmb() to match smp_wmb() in pmd_install()
>
> Not-Yet-Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> ---
>  mm/memory.c          | 2 ++
>  mm/pgtable-generic.c | 5 +++++
>  2 files changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..8939357f1509 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -425,6 +425,8 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>  		 * being the notable exception) will already guarantee loads are
>  		 * seen in-order. See the alpha page table accessors for the
>  		 * smp_rmb() barriers in page table walking code.
> +		 *
> +		 * See __pte_offset_map() for the smp_rmb() at the pte level.
>  		 */
>  		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>  		pmd_populate(mm, pmd, *pte);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4fcd959dcc4d..3330b666e9c3 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -297,6 +297,11 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>  		pmd_clear_bad(pmd);
>  		goto nomap;
>  	}
> +	/*
> +	 * Pair with the smp_wmb() in pmd_install(): make sure that the
> +	 * page table lock and page table contents are visibly initialized.
> +	 */
> +	smp_rmb();
>  	return __pte_map(&pmdval, addr);
>  nomap:
>  	rcu_read_unlock();
> --
> 2.35.3
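[For reference, Matthew's SLAB_TYPESAFE_BY_RCU suggestion quoted above
would amount to roughly the following - a sketch only, from memory of
ptlock_cache_init() in mm/memory.c, untested and not proposed as a
patch:]

void __init ptlock_cache_init(void)
{
	/*
	 * Sketch: SLAB_TYPESAFE_BY_RCU would keep the memory behind a
	 * freed ptl valid as *some* spinlock until a grace period has
	 * elapsed, so a racing __pte_offset_map_lock() could never touch
	 * freed memory - though, per Matthew's caveat above, it might
	 * lock a ptl no longer associated with this page table.
	 */
	page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
					    0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
					    NULL);
}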