On Fri, Oct 04, 2019 at 02:58:59PM +0200, Thomas Hellström (VMware) wrote:
> On 10/4/19 2:37 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
> > > > > + * If @mapping allows faulting of huge pmds and puds, it is desirable
> > > > > + * that its huge_fault() handler blocks while this function is running on
> > > > > + * @mapping. Otherwise a race may occur where the huge entry is split when
> > > > > + * it was intended to be handled in a huge entry callback. This requires an
> > > > > + * external lock, for example that @mapping->i_mmap_rwsem is held in
> > > > > + * write mode in the huge_fault() handlers.
> > > > Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
> > > > on read) to split PMD entry into PTE table. And it can happen not only
> > > > from fault path.
> > > >
> > > > If you care about splitting compound page under you, take a pin or lock a
> > > > page. It will block split_huge_page().
> > > >
> > > > Suggestion to block fault path is not viable (and it will not happen
> > > > magically just because of this comment).
> > > >
> > > I was specifically thinking of this:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103
> > >
> > > If a huge pud is concurrently faulted in here, it will immediately get split
> > > without getting processed in pud_entry(). An external lock would protect
> > > against that, but that's perhaps a bug in the pagewalk code? For pmds the
> > > situation is not the same since when pte_entry is used, all pmds will
> > > unconditionally get split.
> > I *think* it should be fixed with something like this (there's no
> > pud_trans_unstable() yet):
> >
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index d48c2a986ea3..221a3b945f42 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >  					break;
> >  				continue;
> >  			}
> > +		} else {
> > +			split_huge_pud(walk->vma, pud, addr);
> >  		}
> >
> > -		split_huge_pud(walk->vma, pud, addr);
> > -		if (pud_none(*pud))
> > +		if (pud_none(*pud) || pud_trans_unstable(*pud))
> >  			goto again;
> >
> >  		if (ops->pmd_entry || ops->pte_entry)
>
> Yes, this seems better. I was looking at implementing a pud_trans_unstable()
> as a basis of fixing problems like this, but when I looked at
> pmd_trans_unstable I got a bit confused:
>
> Why are devmap huge pmds considered stable? I mean, couldn't anybody just
> run madvise() to clear those just like transhuge pmds?

Matthew, Dan, could you comment on this?

> > Or better yet converted to what we do on pmd level.
> >
> > Honestly, all the code around PUD THP is missing a lot of groundwork.
> > Rushing it upstream for DAX was not the right move.
> >
> > > There's a similar more scary race in
> > >
> > > https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931
> > >
> > > It looks like if a concurrent thread faults in a huge pud just after the
> > > test for pud_none in that pmd_alloc, things might go pretty bad.
> > Hm? It will fail the next pmd_none() check under ptl. Do you have a
> > particular racing scenario?
> >
> Yes, I misinterpreted the code somewhat, but here's the scenario that looks
> racy:
>
> Thread 1                                Thread 2
> huge_fault(pud) - Fell back, for example because of write fault on dirty-tracking.
>                                         huge_fault(pud) - Taken, read fault.
> pmd_alloc() - Will fail pmd_none check and return a pmd_offset()

I see. It also misses a pud_trans_unstable() check or a variant of it.

-- 
 Kirill A. Shutemov
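
To make the Thread 1 / Thread 2 interleaving quoted above concrete, here is a
rough userspace model of it. This is only a sketch, not kernel code: every
model_* name is made up for illustration, the three-state enum is a heavy
simplification of a real pud entry, and model_pud_unstable() merely stands in
for the idea of a pud_trans_unstable()-style check, which does not exist yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified states of one pud slot (not a kernel type). */
enum model_pud { MODEL_PUD_NONE, MODEL_PUD_HUGE, MODEL_PUD_TABLE };

/* What the faulting thread does after its pmd_alloc()-like step. */
enum model_result { MODEL_DESCEND, MODEL_RETRY };

/* Thread 2: a read fault that is served by installing a huge entry. */
static void model_huge_fault_taken(enum model_pud *pud)
{
	if (*pud == MODEL_PUD_NONE)
		*pud = MODEL_PUD_HUGE;
}

/* Stand-in for a pud_trans_unstable()-like test (assumed semantics):
 * anything that is not a page table must not be descended into. */
static bool model_pud_unstable(enum model_pud pud)
{
	return pud != MODEL_PUD_TABLE;
}

/* Thread 1 after its huge fault fell back: populate the slot only if it
 * is empty, then optionally re-check before using it as a pmd table. */
static enum model_result model_pmd_alloc(enum model_pud *pud, bool recheck)
{
	if (*pud == MODEL_PUD_NONE)
		*pud = MODEL_PUD_TABLE;
	if (recheck && model_pud_unstable(*pud))
		return MODEL_RETRY;
	return MODEL_DESCEND;	/* caller would now do pmd_offset() */
}

/* Replays the quoted interleaving. Returns true if thread 1 ends up
 * descending into an entry that is actually huge. */
static bool model_race_descends_into_huge(bool recheck)
{
	enum model_pud pud = MODEL_PUD_NONE;

	/* Thread 1: huge_fault(pud) fell back, slot left untouched. */
	model_huge_fault_taken(&pud);		/* Thread 2: taken */
	assert(pud == MODEL_PUD_HUGE);		/* thread 2 won the race */

	enum model_result r = model_pmd_alloc(&pud, recheck);	/* Thread 1 */
	return r == MODEL_DESCEND && pud == MODEL_PUD_HUGE;
}
```

Calling model_race_descends_into_huge(false) replays the scenario with only
the none check and ends with thread 1 walking into the huge entry; with
recheck set, thread 1 backs off and would retry the fault instead.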