On 10/03/22 20:37, Peter Xu wrote: > After the recent rework patchset of hugetlb locking on pmd sharing, > kselftest for userfaultfd sometimes fails on hugetlb private tests with > unexpected write fault checks. > > It turns out there's nothing wrong within the locking series regarding this > matter, but it could have changed the timing of threads so it can trigger > an old bug. > > The real bug is when we call hugetlb_no_page() we're not with the pgtable > lock. It means we're reading the pte values lockless. It's perfectly fine > in most cases because before we do normal page allocations we'll take the > lock and check pte_same() again. However before that, there are actually > two paths on userfaultfd missing/minor handling that may directly move on > with the fault process without checking the pte values. > > It means for these two paths we may be generating an uffd message based on > an unstable pte, while an unstable pte can legally be anything as long as > the modifier holds the pgtable lock. > > One example, which is also what happened in the failing kselftest and > caused the test failure, is that for private mappings wr-protection changes > can happen on one page. While hugetlb_change_protection() generally > requires pte being cleared before being changed, then there can be a race > condition like: > > thread 1 thread 2 > -------- -------- > > UFFDIO_WRITEPROTECT hugetlb_fault > hugetlb_change_protection > pgtable_lock() > huge_ptep_modify_prot_start > pte==NULL > hugetlb_no_page > generate uffd missing event > even if page existed!! > huge_ptep_modify_prot_commit > pgtable_unlock() > > Fix this by recheck the pte after pgtable lock for both userfaultfd missing > & minor fault paths. > > This bug should have been around starting from uffd hugetlb introduced, so > attaching a Fixes to the commit. Also attach another Fixes to the minor > support commit for easier tracking. > > Note that userfaultfd is actually fine with false positives (e.g. caused by > pte changed), but not wrong logical events (e.g. caused by reading a pte > during changing). The latter can confuse the userspace, so the strictness > is very much preferred. E.g., MISSING event should never happen on the > page after UFFDIO_COPY has correctly installed the page and returned. > > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx> > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx> > Cc: Nadav Amit <nadav.amit@xxxxxxxxx> > Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook") > Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode") > Co-developed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > --- > mm/hugetlb.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 49 insertions(+), 6 deletions(-) Thanks. Do note that this will not apply on top of "mm: hugetlb: fix UAF in hugetlb_handle_userfault" which is already in Andrew's tree. However, required changes should be simple. Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx> -- Mike Kravetz > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 9679fe519b90..fa3fcdb0c4b8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -5521,6 +5521,23 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma, > return ret; > } > > +/* > + * Recheck pte with pgtable lock. Returns true if pte didn't change, or > + * false if pte changed or is changing. > + */ > +static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, > + pte_t *ptep, pte_t old_pte) > +{ > + spinlock_t *ptl; > + bool same; > + > + ptl = huge_pte_lock(h, mm, ptep); > + same = pte_same(huge_ptep_get(ptep), old_pte); > + spin_unlock(ptl); > + > + return same; > +} > + > static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > struct vm_area_struct *vma, > struct address_space *mapping, pgoff_t idx, > @@ -5562,9 +5579,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > goto out; > /* Check for page in userfault range */ > if (userfaultfd_missing(vma)) { > - ret = hugetlb_handle_userfault(vma, mapping, idx, > - flags, haddr, address, > - VM_UFFD_MISSING); > + /* > + * Since hugetlb_no_page() was examining pte > + * without pgtable lock, we need to re-test under > + * lock because the pte may not be stable and could > + * have changed from under us. Try to detect > + * either changed or during-changing ptes and retry > + * properly when needed. > + * > + * Note that userfaultfd is actually fine with > + * false positives (e.g. caused by pte changed), > + * but not wrong logical events (e.g. caused by > + * reading a pte during changing). The latter can > + * confuse the userspace, so the strictness is very > + * much preferred. E.g., MISSING event should > + * never happen on the page after UFFDIO_COPY has > + * correctly installed the page and returned. > + */ > + if (hugetlb_pte_stable(h, mm, ptep, old_pte)) > + ret = hugetlb_handle_userfault( > + vma, mapping, idx, flags, haddr, > + address, VM_UFFD_MISSING); > + else > + /* Retry the fault */ > + ret = 0; > goto out; > } > > @@ -5634,9 +5672,14 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > if (userfaultfd_minor(vma)) { > unlock_page(page); > put_page(page); > - ret = hugetlb_handle_userfault(vma, mapping, idx, > - flags, haddr, address, > - VM_UFFD_MINOR); > + /* See comment in userfaultfd_missing() block above */ > + if (hugetlb_pte_stable(h, mm, ptep, old_pte)) > + ret = hugetlb_handle_userfault( > + vma, mapping, idx, flags, haddr, > + address, VM_UFFD_MINOR); > + else > + /* Retry the fault */ > + ret = 0; > goto out; > } > } > -- > 2.37.3 >