Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code

"Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> · Fri, 4 Oct 2019 16:24:07 +0300

On Fri, Oct 04, 2019 at 02:58:59PM +0200, Thomas Hellström (VMware) wrote:
> On 10/4/19 2:37 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
> > > > > + *   If @mapping allows faulting of huge pmds and puds, it is desirable
> > > > > + *   that its huge_fault() handler blocks while this function is running on
> > > > > + *   @mapping. Otherwise a race may occur where the huge entry is split when
> > > > > + *   it was intended to be handled in a huge entry callback. This requires an
> > > > > + *   external lock, for example that @mapping->i_mmap_rwsem is held in
> > > > > + *   write mode in the huge_fault() handlers.
> > > > Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
> > > > on read) to split PMD entry into PTE table. And it can happen not only
> > > > from fault path.
> > > > 
> > > > If you care about splitting compound page under you, take a pin or lock a
> > > > page. It will block split_huge_page().
> > > > 
> > > > Suggestion to block fault path is not viable (and it will not happen
> > > > magically just because of this comment).
> > > > 
> > > I was specifically thinking of this:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103
> > > 
> > > If a huge pud is concurrently faulted in here, it will immediately get split
> > > without getting processed in pud_entry(). An external lock would protect
> > > against that, but that's perhaps a bug in the pagewalk code?  For pmds the
> > > situation is not the same since when pte_entry is used, all pmds will
> > > unconditionally get split.
> > I *think* it should be fixed with something like this (there's no
> > pud_trans_unstable() yet):
> > 
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index d48c2a986ea3..221a3b945f42 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   					break;
> >   				continue;
> >   			}
> > +		} else {
> > +			split_huge_pud(walk->vma, pud, addr);
> >   		}
> > -		split_huge_pud(walk->vma, pud, addr);
> > -		if (pud_none(*pud))
> > +		if (pud_none(*pud) || pud_trans_unstable(*pud))
> >   			goto again;
> >   		if (ops->pmd_entry || ops->pte_entry)
> 
> Yes, this seems better. I was looking at implementing a pud_trans_unstable()
> as a basis of fixing problems like this, but when I looked at
> pmd_trans_unstable I got a bit confused:
> 
> Why are devmap huge pmds considered stable? I mean, couldn't anybody just
> run madvise() to clear those just like transhuge pmds?

Matthew, Dan, could you comment on this?

> > Or better yet converted to what we do on pmd level.
> > 
> > Honestly, all the code around PUD THP is missing a lot of groundwork.
> > Rushing it upstream for DAX was not the right move.
> > 
> > > There's a similar more scary race in
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931
> > > 
> > > It looks like if a concurrent thread faults in a huge pud just after the
> > > test for pud_none in that pmd_alloc, things might go pretty bad.
> > Hm? It will fail the next pmd_none() check under ptl. Do you have a
> > particular racing scenario?
> > 
> Yes, I misinterpreted the code somewhat, but here's the scenario that looks
> racy:
> 
> Thread 1            Thread 2
> huge_fault(pud)                         - Fell back, for example because of write fault on dirty-tracking.
>                     huge_fault(pud)     - Taken, read fault.
> pmd_alloc()                             - Will fail pmd_none check and return a pmd_offset()

I see. It also misses a pud_trans_unstable() check or some variant of it.

-- 
 Kirill A. Shutemov