On 2/10/21 1:21 PM, Axel Rasmussen wrote: > This feature allows userspace to intercept "minor" faults. By "minor" > faults, I mean the following situation: > > Let there exist two mappings (i.e., VMAs) to the same page(s). One of > the mappings is registered with userfaultfd (in minor mode), and the > other is not. Via the non-UFFD mapping, the underlying pages have > already been allocated & filled with some contents. The UFFD mapping > has not yet been faulted in; when it is touched for the first time, > this results in what I'm calling a "minor" fault. As a concrete > example, when working with hugetlbfs, we have huge_pte_none(), but > find_lock_page() finds an existing page. Do we want to intercept the fault if it is for a private mapping that will COW the page in the page cache? I think 'yes' but just want to confirm. The code added to hugetlb_no_page will intercept these COW accesses. <snip> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index e41b77cf6cc2..f150b10981a8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4366,6 +4366,38 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > VM_FAULT_SET_HINDEX(hstate_index(h)); > goto backout_unlocked; > } > + > + /* Check for page in userfault range. */ > + if (userfaultfd_minor(vma)) { > + u32 hash; > + struct vm_fault vmf = { > + .vma = vma, > + .address = haddr, > + .flags = flags, > + /* > + * Hard to debug if it ends up being used by a > + * callee that assumes something about the > + * other uninitialized fields... same as in > + * memory.c > + */ > + }; > + > + unlock_page(page); > + > + /* > + * hugetlb_fault_mutex and i_mmap_rwsem must be dropped > + * before handling userfault. Reacquire after handling > + * fault to make calling code simpler. > + */ > + > + hash = hugetlb_fault_mutex_hash(mapping, idx); > + mutex_unlock(&hugetlb_fault_mutex_table[hash]); > + i_mmap_unlock_read(mapping); After dropping all the locks, we only hold a reference to the page in the page cache. I 'think' someone else could hole punch the page and remove it from the cache. IIUC, state changing while processing uffd faults is something that users need to deal with? Just need to make sure there are no assumptions in the kernel code. > + ret = handle_userfault(&vmf, VM_UFFD_MINOR); > + i_mmap_lock_read(mapping); > + mutex_lock(&hugetlb_fault_mutex_table[hash]); > + goto out; > + } > } > > /* > -- Mike Kravetz