On Wed, Sep 02, 2020 at 04:29:46PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:
>
> > On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> >> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> >>
> >>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >>>
> >>>>> Surprised this doesn't touch mm/pagewalk.c ?
> >>>>
> >>>> 1GB PUD page support is present for DAX purposes, so the code is
> >>>> already there in mm/pagewalk.c. I only needed to supply
> >>>> ops->pud_entry when using the functions in mm/pagewalk.c. :)
> >>>
> >>> Yes, but doesn't this change what is possible under the mmap_sem
> >>> without the page table locks?
> >>>
> >>> ie I would expect something like pmd_trans_unstable() to be required
> >>> as well for lockless walkers. (and I don't think the pmd code is 100%
> >>> right either)
> >>
> >> Right. I missed that. Thanks for pointing it out.
> >> The code should be like this, right?
> >
> > Technically all those *pud's are racy too; the design here with the
> > _unstable function call always seemed weird. I strongly suspect it
> > should mirror how get_user_pages_fast works for lockless walking.
>
> You mean READ_ONCE on the page table entry pointer first, then use the
> value for the rest of the loop? I am not quite familiar with this racy
> check part of the code and am happy to hear more about it.

There are two main issues with THPs and lockless walks:

 - The *pXX value can change at any time, as THPs can be split at any
   moment. However, once observed to be a sub page table pointer, the
   value is fixed under the read side of the mmap lock (I think; I never
   did find the code path supporting this, but everything is busted if
   it isn't true...)

 - Reading the *pXX without load tearing is difficult on 32 bit arches

So doing READ_ONCE() defeats the first problem. However, if
sizeof(*pXX) is 8 on a 32 bit platform, then load tearing remains a
problem.
At least the various pXX_*() test functions operate on a single 32 bit
word, so they don't tear, but to convert the *pXX to a lower level page
table pointer a coherent, untorn read is required.

So, looking again, I remember now: I could never quite figure out why
gup_pmd_range() was safe to do:

		pmd_t pmd = READ_ONCE(*pmdp);
[..]
		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
[..]
	ptem = ptep = pte_offset_map(&pmd, addr);

As I don't see what prevents load tearing the 64 bit pmd.. eg there is
no pmd_trans_unstable() or equivalent here. But we do see gup_get_pte()
using an anti-load-tearing technique..

Jason