On Wed, 15 Nov 2023, Hugh Dickins wrote:
> On Wed, 15 Nov 2023, Matthew Wilcox wrote:
> > On Wed, Nov 15, 2023 at 06:05:30PM +0200, José Pekkarinen wrote:
> > > > I don't think we should be changing ptlock_ptr().
> > >
> > > This is where the null ptr dereference originates, so the only
> > > alternative I can think of is to protect the life cycle of the ptdesc
> > > to prevent it to die between the pte check and the spin_unlock of
> > > __pte_offset_map_lock. Would that work for you?
>
> Thanks for pursuing this, José, but I agree with Matthew: I don't
> think your patch is right at all. The change in ptlock_ptr() did not
> make sense to me, and the change in __pte_offset_map_lock() leaves us
> still wondering what has gone wrong (and misses an rcu_read_unlock()).
>
> You mentioned "I tested the syzbot reproducer in x86 and it doesn't
> produce this kasan report anymore": are you saying that you were able
> to reproduce the issue on x86 (without your patch)? That would be very
> interesting (and I think would disprove my hypothesis below). I ought
> to try on x86 if you managed to reproduce on it, but it's not worth
> the effort if you did not. If you have an x86 stack and registers,
> please show (though I'm uncertain how much KASAN spoils the output).
>
> > Ah! I think I found the problem.
> >
> > If ALLOC_SPLIT_PTLOCKS is not set, there is no problem as ->ptl
> > is embedded in the struct page. But if ALLOC_SPLIT_PTLOCKS is set
> > (eg you have LOCKDEP enabled), we can _return_ a NULL pointer from
> > ptlock_ptr. The NULL pointer dereference isn't in ptlock_ptr(), it's
> > in __pte_offset_map_lock().
> >
> > So, how to solve this? We can't just check the ptl against NULL; the
> > memory that ptl points to may have been freed. We could grab a reference
> > to the pmd_page, possibly conditionally on ALLOC_SPLIT_LOCK being set,
> > but that will slow down everything. We could make page_ptl_cachep
> > SLAB_TYPESAFE_BY_RCU, preventing the memory from being freed (even if
> > the lock might not be associated with this page any more).
>
> But I don't think that's quite right either: or, I hope it's not right,
> because it would say that all my business of rcu_read_lock() and
> pte_free_defer() etc was a failing waste of everyone's time: if the
> page table (and/or its lock) can be freed while someone is in that RCU-
> protected area, then I've simply got it wrong.
>
> What I thought, when yesterday's flurry of syzbot messages came in,
> was perhaps the problem is at the other end - when the new page table
> is allocated, rather than when it's being freed: a barrier missing there?
>
> But pmd_install() does have an smp_wmb(), with a (suspiciously) long
> comment about why no smp_rmb()s are needed, except on alpha.
>
> I'm not much good on barriers, but the thought now in my mind is that
> that comment is not quite accurate: that at the bottom level an smp_rmb()
> is (very seldom!) needed - not just to avoid a NULL (or previous garbage)
> ptl pointer in the ALLOC_SPLIT_LOCK case, but to make sure that the ptl
> is visibly correctly initialized (or would spin_lock on it achieve that?);
> and that what pte_offset_map() points to is visibly empty. (I'm imagining
> stale cache lines left behind on the oopsing CPU, which need to be
> refetched after pmdval has been read.)
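[Aside, to make the hypothesized ordering concrete: below is a userspace
analogue of the smp_wmb()/smp_rmb() pairing under discussion. This is
not kernel code - the ptdesc_like/pmd_slot names are invented, and C11
fences stand in for the kernel barriers. The writer initializes the
"page table" and its lock before publishing the pointer; a reader that
skips the acquire fence after loading the pointer may observe the
pointer yet still read stale contents, which is the failure mode
imagined above.]

#include <stdatomic.h>
#include <stdlib.h>

struct ptdesc_like {                /* stand-in for a pte page + its ptl */
        int ptl_ready;              /* "lock initialized" flag */
        long entries[512];          /* "pte entries", zeroed on alloc */
};

static _Atomic(struct ptdesc_like *) pmd_slot;  /* stand-in for the pmd */

static void install(void)           /* models pmd_install() */
{
        struct ptdesc_like *p = calloc(1, sizeof(*p));

        p->ptl_ready = 1;           /* initialize lock and contents first */
        atomic_thread_fence(memory_order_release);      /* smp_wmb() */
        atomic_store_explicit(&pmd_slot, p, memory_order_relaxed);
}

static struct ptdesc_like *walk(void)   /* models a lockless walker */
{
        struct ptdesc_like *p;

        p = atomic_load_explicit(&pmd_slot, memory_order_relaxed);
        if (!p)
                return NULL;
        atomic_thread_fence(memory_order_acquire);      /* smp_rmb() */
        /* only after the fence may p->ptl_ready and p->entries be trusted */
        return p;
}

int main(void)
{
        install();
        struct ptdesc_like *p = walk();
        return p && p->ptl_ready ? 0 : 1;
}

[The relaxed load plus acquire fence is the conservative pairing; as the
pmd_install() comment quoted in the patch below notes, a data-dependent
load already guarantees this ordering on everything but alpha, which is
why the absence of a read barrier at the pte level went unquestioned.]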
>
> And that this issue has, strictly speaking, always been there, but in
> practice never a problem, because of mmap_lock held for write while (or
> prior to) freeing page table, and racers holding mmap_lock for read
> (vma lock not changing the calculus); but collapse_pte_mapped_thp()
> can now be done with mmap_lock for read, letting those racers in
> (and maybe your filemap_map_pages() has helped speed these races up -
> any blame I can shift on to you, the better ;)
>
> But I may well be spouting nonsense. And even if I'm right, is adding
> an smp_rmb() in there too high a price to pay on some architectures?
>
> I did make an attempt to reproduce on an arm64, but there's too much
> more I'd need to muck around with to get it working right. Let's ask for
> a syz test on top of v6.7-rc1 - hmm, how do I tell syzbot that it's arm64
> that it needs to try on? I hope it gets that from the Cc'ed syz number.

Syzbot couldn't do it from this mail, but did it from the original thread:
https://lore.kernel.org/linux-mm/000000000000ba0007060a40644f@xxxxxxxxxx/
tells us that I was spouting nonsense: this patch does not fix it.
I have no further idea yet.

>
> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b85ea95d086471afb4ad062012a4d73cd328fa86
>
> [PATCH] mm/pgtable: smp_rmb() to match smp_wmb() in pmd_install()
>
> Not-Yet-Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> ---
>  mm/memory.c          | 2 ++
>  mm/pgtable-generic.c | 5 +++++
>  2 files changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..8939357f1509 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -425,6 +425,8 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>  		 * being the notable exception) will already guarantee loads are
>  		 * seen in-order. See the alpha page table accessors for the
>  		 * smp_rmb() barriers in page table walking code.
> +		 *
> +		 * See __pte_offset_map() for the smp_rmb() at the pte level.
>  		 */
>  		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>  		pmd_populate(mm, pmd, *pte);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4fcd959dcc4d..3330b666e9c3 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -297,6 +297,11 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>  		pmd_clear_bad(pmd);
>  		goto nomap;
>  	}
> +	/*
> +	 * Pair with the smp_wmb() in pmd_install(): make sure that the
> +	 * page table lock and page table contents are visibly initialized.
> +	 */
> +	smp_rmb();
>  	return __pte_map(&pmdval, addr);
>  nomap:
>  	rcu_read_unlock();
> --
> 2.35.3
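[For reference, Matthew's SLAB_TYPESAFE_BY_RCU suggestion quoted above
would amount to roughly the following - a sketch only, from memory of
ptlock_cache_init() in mm/memory.c, untested and not proposed as a
patch:]

void __init ptlock_cache_init(void)
{
	/*
	 * Sketch: SLAB_TYPESAFE_BY_RCU would keep the memory behind a
	 * freed ptl valid as *some* spinlock until a grace period has
	 * elapsed, so a racing __pte_offset_map_lock() could never touch
	 * freed memory - though, per Matthew's caveat above, it might
	 * lock a ptl no longer associated with this page table.
	 */
	page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
					    0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
					    NULL);
}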