On Mon, Oct 03, 2022 at 10:20:29AM -0700, Mike Kravetz wrote: > On 10/03/22 11:56, Peter Xu wrote: > > After the recent rework patchset of hugetlb locking on pmd sharing, > > kselftest for userfaultfd sometimes fails on hugetlb private tests with > > unexpected write fault checks. > > > > It turns out there's nothing wrong within the locking series regarding this > > matter, but it could have changed the timing of threads so it can trigger > > an old bug. > > > > The real bug is when we call hugetlb_no_page() we're not with the pgtable > > lock. It means we're reading the pte values lockless. It's perfectly fine > > in most cases because before we do normal page allocations we'll take the > > lock and check pte_same() again. However before that, there are actually > > two paths on userfaultfd missing/minor handling that may directly move on > > with the fault process without checking the pte values. > > > > It means for these two paths we may be generating an uffd message based on > > an unstable pte, while an unstable pte can legally be anything as long as > > the modifier holds the pgtable lock. > > > > One example, which is also what happened in the failing kselftest and > > caused the test failure, is that for private mappings CoW can happen on one > > page. CoW requires pte being cleared before being replaced with a new page > > for TLB coherency, but then there can be a race condition: > > > > thread 1 thread 2 > > -------- -------- > > > > hugetlb_fault hugetlb_fault > > private pte RO > > hugetlb_wp > > pgtable_lock() > > huge_ptep_clear_flush > > pte=NULL > > hugetlb_no_page > > generate uffd missing event > > even if page existed!! > > set_huge_pte_at > > pgtable_unlock() > > Thanks for working on this Peter! > > I agree with this patch, but I suspect the above race is not possible. Why? > In both cases, we take the hugetlb fault mutex when processing a huegtlb > page fault. This means only one thread can execute the fault code for > a specific mapping/index at a time. This is why I was so confused, and may > remain a bit confused about how we end up with userfault processing a write > fault. It was part of the reason for my 'unclear' wording about this being > more about cpus not seeing updated values. Note that we do drop the fault > mutex before calling handle_usefault, but by then we have already made the > 'missing' determination. > > Thoughts? Perhaps, I am still confused. It's my fault to have the commit message wrong, sorry. :) And thanks for raising this question, I could have overlooked that. It turns out it's not the CoW that's clearing the pte... it's the wr-protect with huge_ptep_modify_prot_start(). So the race is with UFFDIO_WRITEPROTECT, not CoW. Obviously when I was tracking the hpte changes I overlooked huge_ptep_get_and_clear(), only seeing the CoW path and I'm pretty sure that's already a bug which was obvious enough. I didn't prove they happened at the same time during the MISSING event. Then after I further looked at the CoW code I start to question myself on why CoW would trigger at all even with an available fault mutex, since for private mappings mapcount should be 1: if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { if (!PageAnonExclusive(old_page)) page_move_anon_rmap(old_page, vma); if (likely(!unshare)) set_huge_ptep_writable(vma, haddr, ptep); delayacct_wpcopy_end(); return 0; } There could still be something else I didn't catch, even though that may not be relevant to this patchset anymore. I'll wait a few more hours for other reviewers, then prepare a new version with the corrected commit message. Thanks, -- Peter Xu