On Wed, Jan 4, 2023 at 8:03 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > On Wed, Jan 04, 2023 at 07:10:11PM +0000, James Houghton wrote: > > > > I'll see if I can confirm that this is indeed possible and send a > > > > repro if it is. > > > > > > I think your analysis above is correct. The key being the failure to unshare > > > in the non-PUD_SIZE vma after the split. > > > > I do indeed hit the WARN_ON_ONCE (repro attached), and the MADV wasn't > > even needed (the UFFDIO_REGISTER does the VMA split before "unsharing > > all PMDs"). With the fix, we avoid the WARN_ON_ONCE, but the behavior > > is still incorrect: I expect the address range to be write-protected, > > but it isn't. > > > > The reason why is that hugetlb_change_protection uses huge_pte_offset, > > even if it's being called for a UFFDIO_WRITEPROTECT with > > UFFDIO_WRITEPROTECT_MODE_WP. In that particular case, I'm pretty sure > > we should be using huge_pte_alloc, but even so, it's not trivial to > > get an allocation failure back up to userspace. The non-hugetlb > > implementation of UFFDIO_WRITEPROTECT seems to also have this problem. > > > > Peter, what do you think? > > Indeed. Thanks for spotting that, James. > > Non-hugetlb should be fine with having empty pgtable entries. Anon doesn't > need to care about no-pgtable-populated ranges so far. Shmem does it with a > few change_prepare() calls to populate the entries so the markers can be > installed later on. Ah ok! :) > > However I think the fault handling is still not well handled as you pointed > out even for shmem: that's the path I probably never triggered myself yet > before and the code stayed there since a very early version: > > #define change_pmd_prepare(vma, pmd, cp_flags) \ > do { \ > if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ > if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd))) \ > break; \ > } \ > } while (0) > > I think a better thing we can do here (instead of warning and stop the > UFFDIO_WRITEPROTECT at the current stage) is returning with -ENOMEM > properly so the user can know the error. We'll need to touch the stacks up > to uffd_wp_range() as it's the only one that can trigger the -ENOMEM so > far, so as to not ignore retval from change_protection(). > > Meanwhile, I'd also wonder whether we should call pagefault_out_of_memory() > because it should be the same as when pgtable allocation failure happens in > page faults, we may want to OOM already. I can take care of hugetlb part > too along the way. I might be misunderstanding, the only case where hugetlb_change_protection() would *need* to allocate is when it is called from UFFDIO_WRITEPROTECT, not while handling a #pf. So I don't think any calls to pagefault_out_of_memory() need to be added. > > Man page of UFFDIO_WRITEPROTECT may need a fixup too to introduce -ENOMEM. > > I can quickly prepare some patches for this, and hopefully it doesn't need > to block the current fix on split. I don't think it should block this splitting fix. I'll send another version of this fix soon. > > Any thoughts? > > > > > > > > > To me, the fact it was somewhat difficult to come up with this scenario is an > > > argument what we should just unshare at split time as you propose. Who > > > knows what other issues may exist. > > > > > > > 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes") is the > > > > commit that introduced the WARN_ON_ONCE; perhaps it's a good choice > > > > for a Fixes: tag (if above is indeed true). > > > > > > If the key issue in your above scenario is indeed the failure of > > > hugetlb_unshare_all_pmds in the non-PUD_SIZE vma, then perhaps we tag? > > > > > > 6dfeaff93be1 ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when > > > register wp") > > > > SGTM. Thanks Mike. > > Looks good here too. Thanks, Peter!