On Mon, Feb 13, 2023 at 10:50:39PM +0500, Muhammad Usama Anjum wrote: > On 2/13/23 9:54 PM, Peter Xu wrote: > > On Mon, Feb 13, 2023 at 09:31:23PM +0500, Muhammad Usama Anjum wrote: > >> mwriteprotect_range() errors out if [start, end) doesn't fall in one > >> VMA. We are facing a use case where multiple VMAs are present in one > >> range of interest. For example, the following pseudocode reproduces the > >> error which we are trying to fix: > >> > >> - Allocate memory of size 16 pages with PROT_NONE with mmap > >> - Register userfaultfd > >> - Change protection of the first half (1 to 8 pages) of memory to > >> PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs. > >> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors > >> out. > >> > >> This is a simple use case where user may or may not know if the memory > >> area has been divided into multiple VMAs. > >> > >> Reported-by: Paul Gofman <pgofman@xxxxxxxxxxxxxxx> > >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx> > >> --- > >> Changes since v1: > >> - Correct the start and ending values passed to uffd_wp_range() > >> --- > >> mm/userfaultfd.c | 38 ++++++++++++++++++++++---------------- > >> 1 file changed, 22 insertions(+), 16 deletions(-) > >> > >> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > >> index 65ad172add27..bccea08005a8 100644 > >> --- a/mm/userfaultfd.c > >> +++ b/mm/userfaultfd.c > >> @@ -738,9 +738,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, > >> unsigned long len, bool enable_wp, > >> atomic_t *mmap_changing) > >> { > >> + unsigned long end = start + len; > >> + unsigned long _start, _end; > >> struct vm_area_struct *dst_vma; > >> unsigned long page_mask; > >> int err; > > > > I think this needs to be initialized or it can return anything when range > > not mapped. > It is being initialized to -EAGAIN already. It is not visible in this patch. I see, though -EAGAIN doesn't look suitable at all. The old retcode for !vma case is -ENOENT, so I think we'd better keep using it if we want to have this patch. > > > > >> + VMA_ITERATOR(vmi, dst_mm, start); > >> > >> /* > >> * Sanitize the command parameters: > >> @@ -762,26 +765,29 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, > >> if (mmap_changing && atomic_read(mmap_changing)) > >> goto out_unlock; > >> > >> - err = -ENOENT; > >> - dst_vma = find_dst_vma(dst_mm, start, len); > >> + for_each_vma_range(vmi, dst_vma, end) { > >> + err = -ENOENT; > >> > >> - if (!dst_vma) > >> - goto out_unlock; > >> - if (!userfaultfd_wp(dst_vma)) > >> - goto out_unlock; > >> - if (!vma_can_userfault(dst_vma, dst_vma->vm_flags)) > >> - goto out_unlock; > >> + if (!dst_vma->vm_userfaultfd_ctx.ctx) > >> + break; > >> + if (!userfaultfd_wp(dst_vma)) > >> + break; > >> + if (!vma_can_userfault(dst_vma, dst_vma->vm_flags)) > >> + break; > >> > >> - if (is_vm_hugetlb_page(dst_vma)) { > >> - err = -EINVAL; > >> - page_mask = vma_kernel_pagesize(dst_vma) - 1; > >> - if ((start & page_mask) || (len & page_mask)) > >> - goto out_unlock; > >> - } > >> + if (is_vm_hugetlb_page(dst_vma)) { > >> + err = -EINVAL; > >> + page_mask = vma_kernel_pagesize(dst_vma) - 1; > >> + if ((start & page_mask) || (len & page_mask)) > >> + break; > >> + } > >> > >> - uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp); > >> + _start = (dst_vma->vm_start > start) ? dst_vma->vm_start : start; > >> + _end = (dst_vma->vm_end < end) ? dst_vma->vm_end : end; > >> > >> - err = 0; > >> + uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp); > >> + err = 0; > >> + } > >> out_unlock: > >> mmap_read_unlock(dst_mm); > >> return err; > > > > This whole patch also changes the abi, so I'm worried whether there can be > > app that relies on the existing behavior. > Even if a app is dependent on it, this change would just don't return error > if there are multiple VMAs under the hood and handle them correctly. Most > apps wouldn't care about VMAs anyways. I don't know if there would be any > drastic behavior change, other than the behavior becoming nicer. So this logic existed since the initial version of uffd-wp. It has a good thing that it strictly checks everything and it makes sense since uffd-wp is per-vma attribute. In short, the old code fails clearly. While the new proposal is not: if -ENOENT we really have no idea what happened at all; some ranges can be wr-protected but we don't know where starts to go wrong. Now I'm looking at the original problem.. - Allocate memory of size 16 pages with PROT_NONE with mmap - Register userfaultfd - Change protection of the first half (1 to 8 pages) of memory to PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs. - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors out. Why the user app should wr-protect 16 pages at all? If so, uffd_wp_range() will be ran upon a PROT_NONE range which doesn't make sense at all, no matter whether the user is aware of vma concept or not... because it's destined that it's a vain effort. So IMHO it's the user app needs fixing here, not the interface? I think it's the matter of whether the monitor is aware of mprotect() being invoked. In short, I hope we're working on things that helps at least someone, and we should avoid working on things that does not have clear benefit yet. With the WP_ENGAGE new interface being proposed, I just didn't see any benefit of changing the current interface, especially if the change can bring uncertainties itself (e.g., should we fail upon !uffd-wp vmas, or should we skip?). > > > > > Is this for the new pagemap effort? Can this just be done in the new > > interface rather than changing the old? > We found this bug while working on pagemap patches. It is already being > handled in the new interface. We just thought that this use case can happen > pretty easily and unknowingly. So the support should be added. Thanks. My understanding is that it would have been reported if it affected any existing uffd-wp user. > > Also mwriteprotect_range() gives a pretty straight forward way to WP or > un-WP a range. Async WP can be used in coordination with pagemap file > (PM_UFFD_WP flag in PTE) as well. There may be use cases for it. On another > note, I don't see any use cases of WP async and PM_UFFD_WP flag as > !PM_UFFD_WP flag doesn't give direct information if the page is written for > !present pages. Currently we do maintain PM_UFFD_WP even for swap entries, so if it was written then I think we'll know even if the page was swapped out: } else if (is_swap_pte(pte)) { if (pte_swp_uffd_wp(pte)) flags |= PM_UFFD_WP; if (pte_marker_entry_uffd_wp(entry)) flags |= PM_UFFD_WP; So it's working? > > > > > Side note: in your other pagemap series, you can optimize "WP_ENGAGE && > > !GET" to not do generic pgtable walk at all, but use what it does in this > > patch for the initial round or wr-protect. > Yeah, it is implemented with some optimizations. IIUC in your latest public version is not optimized, but I can check the new version when it comes. Thanks, -- Peter Xu