On 10/03/22 17:27, Peter Xu wrote: > On Mon, Oct 03, 2022 at 10:20:29AM -0700, Mike Kravetz wrote: > > On 10/03/22 11:56, Peter Xu wrote: > > > After the recent rework patchset of hugetlb locking on pmd sharing, > > > kselftest for userfaultfd sometimes fails on hugetlb private tests with > > > unexpected write fault checks. > > > > > > It turns out there's nothing wrong within the locking series regarding this > > > matter, but it could have changed the timing of threads so it can trigger > > > an old bug. > > > > > > The real bug is when we call hugetlb_no_page() we're not with the pgtable > > > lock. It means we're reading the pte values lockless. It's perfectly fine > > > in most cases because before we do normal page allocations we'll take the > > > lock and check pte_same() again. However before that, there are actually > > > two paths on userfaultfd missing/minor handling that may directly move on > > > with the fault process without checking the pte values. > > > > > > It means for these two paths we may be generating an uffd message based on > > > an unstable pte, while an unstable pte can legally be anything as long as > > > the modifier holds the pgtable lock. > > > > > > One example, which is also what happened in the failing kselftest and > > > caused the test failure, is that for private mappings CoW can happen on one > > > page. CoW requires pte being cleared before being replaced with a new page > > > for TLB coherency, but then there can be a race condition: > > > > > > thread 1 thread 2 > > > -------- -------- > > > > > > hugetlb_fault hugetlb_fault > > > private pte RO > > > hugetlb_wp > > > pgtable_lock() > > > huge_ptep_clear_flush > > > pte=NULL > > > hugetlb_no_page > > > generate uffd missing event > > > even if page existed!! > > > set_huge_pte_at > > > pgtable_unlock() > > > > Thanks for working on this Peter! > > > > I agree with this patch, but I suspect the above race is not possible. Why? > > In both cases, we take the hugetlb fault mutex when processing a huegtlb > > page fault. This means only one thread can execute the fault code for > > a specific mapping/index at a time. This is why I was so confused, and may > > remain a bit confused about how we end up with userfault processing a write > > fault. It was part of the reason for my 'unclear' wording about this being > > more about cpus not seeing updated values. Note that we do drop the fault > > mutex before calling handle_usefault, but by then we have already made the > > 'missing' determination. > > > > Thoughts? Perhaps, I am still confused. > > It's my fault to have the commit message wrong, sorry. :) And thanks for > raising this question, I could have overlooked that. > > It turns out it's not the CoW that's clearing the pte... it's the > wr-protect with huge_ptep_modify_prot_start(). So the race is with > UFFDIO_WRITEPROTECT, not CoW. Thank you! Now I understand. This also explains why the new locking exposes the race. hugetlb_change_protection needs to take the i_mmap_sema in write mode because it could unshare pmds. Previously, hugetlb page faults took i_mmap_sema in read mode so this race could not happen. -- Mike Kravetz