On Wed, Feb 08, 2023 at 08:25:10PM +0000, Matthew Wilcox wrote: > On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote: > > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote: > > > I've been thinking about this one, and I wonder if we can do it > > > without taking any pgtable locks. The locking environment we're in > > > is the page fault handler, so we have the mmap_lock for read (for now > > > anyway ...). We also hold the folio lock, so _if_ the folio is mapped, > > > those entries can't disappear under us. > > > > Could MADV_DONTNEED do that from another pgtable that we don't hold the > > pgtable lock? > > Oh, ugh, yes. And zap_pte_range() has the PTL first, so we can't sleep > to get the folio lock. And we can't decline to zap the pte on a failed > folio_trylock() (well, we could for MADV_DONTNEED, but not in general). > > So ... how about this for a solution: > > - If the folio overlaps into the next PMD table, spin_lock it. > - If the folio overlaps into the previous PMD table, unlock our > PTL, lock the previous PTL, re-lock our PTL. > - Do the pvmw, telling it we already have the PTLs held (new PVMW flag). > > [explanation simplified; if there is no prior PMD table or if the VMA > limits how far to search, we can skip this] > > We have prior art for taking two PTLs in copy_page_range(). There, > the hierarchy is clear; one VMA belongs to the process parent and one > to the child. I don't believe we have precedent for taking two PTLs > in the same VMA, but I think my proposal (order by ascending address in > the process) is the obvious order to choose. Maybe it'll work? Not sure, but seems be something we'd be extremely careful with. Having a single mmap read lock covering both seems to guarantee that the order of the lock is stable, which is a good start.. But I have no good idea on other implications across the whole kernel. IMHO copy_page_range() is not a great example for proving deadlocks, because the dst_mm should not be exposed to the whole world yet at all when copying. Say, I don't see any case some thread can try to take the dst mm pgtable lock at all until it's all set. I'm even wondering whether it's safe to not take the dst mm pgtable lock at all during a fork().. -- Peter Xu