On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote: > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote: > > I've been thinking about this one, and I wonder if we can do it > > without taking any pgtable locks. The locking environment we're in > > is the page fault handler, so we have the mmap_lock for read (for now > > anyway ...). We also hold the folio lock, so _if_ the folio is mapped, > > those entries can't disappear under us. > > Could MADV_DONTNEED do that from another pgtable that we don't hold the > pgtable lock? Oh, ugh, yes. And zap_pte_range() has the PTL first, so we can't sleep to get the folio lock. And we can't decline to zap the pte on a failed folio_trylock() (well, we could for MADV_DONTNEED, but not in general). So ... how about this for a solution: - If the folio overlaps into the next PMD table, spin_lock it. - If the folio overlaps into the previous PMD table, unlock our PTL, lock the previous PTL, re-lock our PTL. - Do the pvmw, telling it we already have the PTLs held (new PVMW flag). [explanation simplified; if there is no prior PMD table or if the VMA limits how far to search, we can skip this] We have prior art for taking two PTLs in copy_page_range(). There, the hierarchy is clear; one VMA belongs to the process parent and one to the child. I don't believe we have precedent for taking two PTLs in the same VMA, but I think my proposal (order by ascending address in the process) is the obvious order to choose.