Re: Folio mapcount

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Wed, 8 Feb 2023 20:25:10 +0000

On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > I've been thinking about this one, and I wonder if we can do it
> > without taking any pgtable locks.  The locking environment we're in
> > is the page fault handler, so we have the mmap_lock for read (for now
> > anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
> > those entries can't disappear under us.
> 
> Could MADV_DONTNEED do that from another pgtable that we don't hold the
> pgtable lock?

Oh, ugh, yes.  And zap_pte_range() has the PTL first, so we can't sleep
to get the folio lock.  And we can't decline to zap the pte on a failed
folio_trylock() (well, we could for MADV_DONTNEED, but not in general).

So ... how about this for a solution:

 - If the folio overlaps into the next PMD table, spin_lock it.
 - If the folio overlaps into the previous PMD table, unlock our
   PTL, lock the previous PTL, re-lock our PTL.
 - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).

[explanation simplified; if there is no prior PMD table or if the VMA
limits how far to search, we can skip this]

We have prior art for taking two PTLs in copy_page_range().  There,
the hierarchy is clear; one VMA belongs to the process parent and one
to the child.  I don't believe we have precedent for taking two PTLs
in the same VMA, but I think my proposal (order by ascending address in
the process) is the obvious order to choose.