On Wed, Feb 15, 2023 at 12:08:11PM +0500, Muhammad Usama Anjum wrote: > Hi Peter, > > Thank you for your review! > > On 2/15/23 2:50 AM, Peter Xu wrote: > > On Tue, Feb 14, 2023 at 01:49:50PM +0500, Muhammad Usama Anjum wrote: > >> On 2/14/23 2:11 AM, Peter Xu wrote: > >>> On Mon, Feb 13, 2023 at 10:50:39PM +0500, Muhammad Usama Anjum wrote: > >>>> On 2/13/23 9:54 PM, Peter Xu wrote: > >>>>> On Mon, Feb 13, 2023 at 09:31:23PM +0500, Muhammad Usama Anjum wrote: > >>>>>> mwriteprotect_range() errors out if [start, end) doesn't fall in one > >>>>>> VMA. We are facing a use case where multiple VMAs are present in one > >>>>>> range of interest. For example, the following pseudocode reproduces the > >>>>>> error which we are trying to fix: > >>>>>> > >>>>>> - Allocate memory of size 16 pages with PROT_NONE with mmap > >>>>>> - Register userfaultfd > >>>>>> - Change protection of the first half (1 to 8 pages) of memory to > >>>>>> PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs. > >>>>>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors > >>>>>> out. > >>>>>> > >>>>>> This is a simple use case where user may or may not know if the memory > >>>>>> area has been divided into multiple VMAs. > >>>>>> > >>>>>> Reported-by: Paul Gofman <pgofman@xxxxxxxxxxxxxxx> > >>>>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx> > >>>>>> --- > >>>>>> Changes since v1: > >>>>>> - Correct the start and ending values passed to uffd_wp_range() > >>>>>> --- > >>>>>> mm/userfaultfd.c | 38 ++++++++++++++++++++++---------------- > >>>>>> 1 file changed, 22 insertions(+), 16 deletions(-) > >>>>>> > >>>>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > >>>>>> index 65ad172add27..bccea08005a8 100644 > >>>>>> --- a/mm/userfaultfd.c > >>>>>> +++ b/mm/userfaultfd.c > >>>>>> @@ -738,9 +738,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, > >>>>>> unsigned long len, bool enable_wp, > >>>>>> atomic_t *mmap_changing) > >>>>>> { > >>>>>> + unsigned long end = start + len; > >>>>>> + unsigned long _start, _end; > >>>>>> struct vm_area_struct *dst_vma; > >>>>>> unsigned long page_mask; > >>>>>> int err; > >>>>> > >>>>> I think this needs to be initialized or it can return anything when range > >>>>> not mapped. > >>>> It is being initialized to -EAGAIN already. It is not visible in this patch. > >>> > >>> I see, though -EAGAIN doesn't look suitable at all. The old retcode for > >>> !vma case is -ENOENT, so I think we'd better keep using it if we want to > >>> have this patch. > >> I'll update in next version. > >> > >>> > >>>> > >>>>> > >>>>>> + VMA_ITERATOR(vmi, dst_mm, start); > >>>>>> > >>>>>> /* > >>>>>> * Sanitize the command parameters: > >>>>>> @@ -762,26 +765,29 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, > >>>>>> if (mmap_changing && atomic_read(mmap_changing)) > >>>>>> goto out_unlock; > >>>>>> > >>>>>> - err = -ENOENT; > >>>>>> - dst_vma = find_dst_vma(dst_mm, start, len); > >>>>>> + for_each_vma_range(vmi, dst_vma, end) { > >>>>>> + err = -ENOENT; > >>>>>> > >>>>>> - if (!dst_vma) > >>>>>> - goto out_unlock; > >>>>>> - if (!userfaultfd_wp(dst_vma)) > >>>>>> - goto out_unlock; > >>>>>> - if (!vma_can_userfault(dst_vma, dst_vma->vm_flags)) > >>>>>> - goto out_unlock; > >>>>>> + if (!dst_vma->vm_userfaultfd_ctx.ctx) > >>>>>> + break; > >>>>>> + if (!userfaultfd_wp(dst_vma)) > >>>>>> + break; > >>>>>> + if (!vma_can_userfault(dst_vma, dst_vma->vm_flags)) > >>>>>> + break; > >>>>>> > >>>>>> - if (is_vm_hugetlb_page(dst_vma)) { > >>>>>> - err = -EINVAL; > >>>>>> - page_mask = vma_kernel_pagesize(dst_vma) - 1; > >>>>>> - if ((start & page_mask) || (len & page_mask)) > >>>>>> - goto out_unlock; > >>>>>> - } > >>>>>> + if (is_vm_hugetlb_page(dst_vma)) { > >>>>>> + err = -EINVAL; > >>>>>> + page_mask = vma_kernel_pagesize(dst_vma) - 1; > >>>>>> + if ((start & page_mask) || (len & page_mask)) > >>>>>> + break; > >>>>>> + } > >>>>>> > >>>>>> - uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp); > >>>>>> + _start = (dst_vma->vm_start > start) ? dst_vma->vm_start : start; > >>>>>> + _end = (dst_vma->vm_end < end) ? dst_vma->vm_end : end; > >>>>>> > >>>>>> - err = 0; > >>>>>> + uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp); > >>>>>> + err = 0; > >>>>>> + } > >>>>>> out_unlock: > >>>>>> mmap_read_unlock(dst_mm); > >>>>>> return err; > >>>>> > >>>>> This whole patch also changes the abi, so I'm worried whether there can be > >>>>> app that relies on the existing behavior. > >>>> Even if a app is dependent on it, this change would just don't return error > >>>> if there are multiple VMAs under the hood and handle them correctly. Most > >>>> apps wouldn't care about VMAs anyways. I don't know if there would be any > >>>> drastic behavior change, other than the behavior becoming nicer. > >>> > >>> So this logic existed since the initial version of uffd-wp. It has a good > >>> thing that it strictly checks everything and it makes sense since uffd-wp > >>> is per-vma attribute. In short, the old code fails clearly. > >>> > >>> While the new proposal is not: if -ENOENT we really have no idea what > >>> happened at all; some ranges can be wr-protected but we don't know where > >>> starts to go wrong. > >> The return error codes can be made to return in better way somewhat. The > >> return error codes shouldn't block a correct functionality enhancement patch. > >> > >>> > >>> Now I'm looking at the original problem.. > >>> > >>> - Allocate memory of size 16 pages with PROT_NONE with mmap > >>> - Register userfaultfd > >>> - Change protection of the first half (1 to 8 pages) of memory to > >>> PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs. > >>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors > >>> out. > >>> > >>> Why the user app should wr-protect 16 pages at all? > >> Taking arguments from Paul here. > >> > >> The app is free to insert guard pages inside the range (with PROT_NONE) and > >> change the protection of memory freely. Not sure why it is needed to > >> complicate things by denying any flexibility. We should never restrict what > >> is possible and what not. All of these different access attributes and > >> their any combination of interaction _must_ work without question. The > >> application should be free to change protection on any sub-range and it > >> shouldn't break the PAGE_IS_WRITTEN + UFFD_WRITE_PROTECT promise which > >> PAGEMAP_IOCTL (patches are in progress) and UFFD makes. > > > > Because uffd-wp has a limitation on e.g. it cannot nest so far. I'm fine > > with allowing mprotect() happening, but just to mention so far it cannot do > > "any combinations" yet. > > > >> > >>> > >>> If so, uffd_wp_range() will be ran upon a PROT_NONE range which doesn't > >>> make sense at all, no matter whether the user is aware of vma concept or > >>> not... because it's destined that it's a vain effort. > >> It is not a vain effort. The user want to watch/find the dirty pages of a > >> range while working with it: reserve and watch at once while Write > >> protecting or un-protecting as needed. There may be several different use > >> cases. Inserting guard pages to catch out of range access, map something > >> only when it is needed; unmap or PROT_NONE pages when they are set free in > >> the app etc. > > > > Fair enough. > > > >> > >>> > >>> So IMHO it's the user app needs fixing here, not the interface? I think > >>> it's the matter of whether the monitor is aware of mprotect() being > >>> invoked. > >> No. The common practice is to allocate a big memory chunk at once and have > >> own allocator over it (whether it is some specific allocator in a game or a > >> .net allocator with garbage collector). From the usage point of view it is > >> very limiting to demand constant memory attributes for the whole range. > >> > >> That said, if we do have the way to do exactly what we want with reset > >> through pagemap fd and it is just UFFD ioctl will be working differently, > >> it is not a blocker of course, just weird api design. > > > > Do you mean you'll disable ENGAGE_WP && !GET in your other series? Yes, if > > this will service your goal, it'll be perfect to remove that interface. > No, we cannot remove it. If this patch can land, I assume ioctl(UFFDIO_WP) can start to service the dirty tracking purpose, then why do you still need "ENGAGE_WP && !GET"? Note, I'm not asking to drop ENGAGE_WP entirely, only when !GET. > > > > >> > >>> > >>> In short, I hope we're working on things that helps at least someone, and > >>> we should avoid working on things that does not have clear benefit yet. > >>> With the WP_ENGAGE new interface being proposed, I just didn't see any > >>> benefit of changing the current interface, especially if the change can > >>> bring uncertainties itself (e.g., should we fail upon !uffd-wp vmas, or > >>> should we skip?). > >> We can work on solving uncertainties in case of error conditions. Fail if > >> !uffd-wp vma comes. > > > > Let me try to double check with you here: > > > > I assume you want to skip any vma that is not mapped at all, as the loop > > already does so. So it'll succeed if there're memory holes. > > > > You also want to explicitly fail if some vma is not registered with uffd-wp > > when walking the vma list, am I right? IOW, the tracee _won't_ ever have a > > chance to unregister uffd-wp itself, right? > Yes, fail if any VMA doesn't have uffd-wp. This fail means the > write-protection or un-protection failed on a region of memory with error > -ENOENT. This is already happening in this current patch. The unregister > code would remain same. The register and unregister ioctls are already > going over all the VMAs in a range. I'm not rigid on anything. Let me > define the interface below. > > > > >> > >>> > >>>> > >>>>> > >>>>> Is this for the new pagemap effort? Can this just be done in the new > >>>>> interface rather than changing the old? > >>>> We found this bug while working on pagemap patches. It is already being > >>>> handled in the new interface. We just thought that this use case can happen > >>>> pretty easily and unknowingly. So the support should be added. > >>> > >>> Thanks. My understanding is that it would have been reported if it > >>> affected any existing uffd-wp user. > >> I would consider the UFFD WP a recent functionality and it may not being > >> used in wide range of app scenarios. > > > > Yes I think so. > > > > Existing users should in most cases be applying the ioctl upon valid vmas > > somehow. I think the chance is low that someone relies on the errcode to > > make other decisions, but I just cannot really tell because the user app > > can do many weird things. > Correct. The user can use any combination of operation > (mmap/mprotect/uffd). They must work in harmony. No uffd - that's exactly what I'm saying: mprotect is fine here, but uffd is probably not, not in a nested way. When you try to UFFDIO_REGISTER upon some range that already been registered (by the tracee), it'll fail for you immediately: /* * Check that this vma isn't already owned by a * different userfaultfd. We can't allow more than one * userfaultfd to own a single vma simultaneously or we * wouldn't know which one to deliver the userfaults to. */ ret = -EBUSY; if (cur->vm_userfaultfd_ctx.ctx && cur->vm_userfaultfd_ctx.ctx != ctx) goto out_unlock; So if this won't work for you, then AFAICT uffd-wp won't work for you (just like soft-dirty but in another way, sorry), at least not until someone starts to work on the nested. > > > > >> > >>> > >>>> > >>>> Also mwriteprotect_range() gives a pretty straight forward way to WP or > >>>> un-WP a range. Async WP can be used in coordination with pagemap file > >>>> (PM_UFFD_WP flag in PTE) as well. There may be use cases for it. On another > >>>> note, I don't see any use cases of WP async and PM_UFFD_WP flag as > >>>> !PM_UFFD_WP flag doesn't give direct information if the page is written for > >>>> !present pages. > >>> > >>> Currently we do maintain PM_UFFD_WP even for swap entries, so if it was > >>> written then I think we'll know even if the page was swapped out: > >>> > >>> } else if (is_swap_pte(pte)) { > >>> if (pte_swp_uffd_wp(pte)) > >>> flags |= PM_UFFD_WP; > >>> if (pte_marker_entry_uffd_wp(entry)) > >>> flags |= PM_UFFD_WP; > >>> > >>> So it's working? > >>> > >>>> > >>>>> > >>>>> Side note: in your other pagemap series, you can optimize "WP_ENGAGE && > >>>>> !GET" to not do generic pgtable walk at all, but use what it does in this > >>>>> patch for the initial round or wr-protect. > >>>> Yeah, it is implemented with some optimizations. > >>> > >>> IIUC in your latest public version is not optimized, but I can check the > >>> new version when it comes. > >> I've some optimizations in next version keeping the code lines minimum. The > >> best optimization would be to add a page walk dedicated for this engaging > >> write-protect. I don't want to do that. Lets try to improve this patch in > >> how ever way possible. So that WP from UFFD ioctl can be used. > > > > If you want to do this there, I think it can be simply/mostly a copy-paste > > of this patch over there by looping over the vmas and apply the protections. > I wouldn't want to do this. PAGEMAP IOCTL performing an operation > anonymously to UFFD_WP wouldn't look good when we can improve UFFD_WP itself. > > > > > I think it's fine to do it with ioctl(UFFDIO_WP), but I want to be crystal > > clear on what interface you're looking for before changing it, and let's > > define it properly. > Thank you. Let me crystal clear why we have sent this patch and what > difference we want. > > Just like UFFDIO_REGISTER and UFFDIO_UNREGISTER don't care if the requested > range has multiple VMAs in the memory region, we want the same thing with > UFFDIO_WRITEPROTCET. It looks more uniform and obvious to us as user > doesn't care about VMAs. The user only cares about the memory he wants to > write-protect. So just update the inside code of UFFDIO_WRITEPROTECT such > that it starts to act like UFFDIO_REGISTER/UFFDIO_UNREGISTER. It shouldn't > complain if there are multiple VMAs involved under the hood. > > This patch is visiting all the VMAs in the memory range. The attached > patches are my wip v3. If you feel like there can be better way to achieve > this, please don't hesitate to send the v3 yourself. As I said I think you have a point, and let's cross finger this is fine (which I mostly agree with). We can fail on !uffd-wp vmas, sounds reasonable to me. But let's sync up with above to make sure this works for you. Since you've got a patch here, let me comment directly below. > > Thank you so much! > > -- > BR, > Muhammad Usama Anjum > From f69069dddda206b190706eef7b2dad3a67a6df10 Mon Sep 17 00:00:00 2001 > From: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx> > Date: Thu, 9 Feb 2023 16:13:23 +0500 > Subject: [PATCH v3 1/2] mm/userfaultfd: Support WP on multiple VMAs > To: peterx@xxxxxxxxxx, david@xxxxxxxxxx > Cc: usama.anjum@xxxxxxxxxxxxx, kernel@xxxxxxxxxxxxx > > mwriteprotect_range() errors out if [start, end) doesn't fall in one > VMA. We are facing a use case where multiple VMAs are present in one > range of interest. For example, the following pseudocode reproduces the > error which we are trying to fix: > > - Allocate memory of size 16 pages with PROT_NONE with mmap > - Register userfaultfd > - Change protection of the first half (1 to 8 pages) of memory to > PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs. > - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors > out. > > This is a simple use case where user may or may not know if the memory > area has been divided into multiple VMAs. > > Reported-by: Paul Gofman <pgofman@xxxxxxxxxxxxxxx> > Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx> > --- > Changes since v2: > - Correct the return error code > > Changes since v1: > - Correct the start and ending values passed to uffd_wp_range() > --- > mm/userfaultfd.c | 45 ++++++++++++++++++++++++++------------------- > 1 file changed, 26 insertions(+), 19 deletions(-) > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index 65ad172add27..a3b48a99b942 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -738,9 +738,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, > unsigned long len, bool enable_wp, > atomic_t *mmap_changing) > { > + unsigned long end = start + len; > + unsigned long _start, _end; > struct vm_area_struct *dst_vma; > unsigned long page_mask; > - int err; > + int err = -ENOENT; > + VMA_ITERATOR(vmi, dst_mm, start); > > /* > * Sanitize the command parameters: > @@ -758,30 +761,34 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, > * operation (e.g. mremap) running in parallel, bail out and > * request the user to retry later > */ > - err = -EAGAIN; > - if (mmap_changing && atomic_read(mmap_changing)) > + if (mmap_changing && atomic_read(mmap_changing)) { > + err = -EAGAIN; > goto out_unlock; > + } Let's keep the original code and simply set -ENOENT afterwards. > > - err = -ENOENT; > - dst_vma = find_dst_vma(dst_mm, start, len); > + for_each_vma_range(vmi, dst_vma, end) { > + err = -ENOENT; > > - if (!dst_vma) > - goto out_unlock; > - if (!userfaultfd_wp(dst_vma)) > - goto out_unlock; > - if (!vma_can_userfault(dst_vma, dst_vma->vm_flags)) > - goto out_unlock; > + if (!dst_vma->vm_userfaultfd_ctx.ctx) > + break; What is this check for? > + if (!userfaultfd_wp(dst_vma)) > + break; > + if (!vma_can_userfault(dst_vma, dst_vma->vm_flags)) > + break; I think this is not useful at all (even in the old code)? It could have sneaked in somehow when I took the code over from Shaohua/Andrea. Maybe we should clean it up since at it. > > - if (is_vm_hugetlb_page(dst_vma)) { > - err = -EINVAL; > - page_mask = vma_kernel_pagesize(dst_vma) - 1; > - if ((start & page_mask) || (len & page_mask)) > - goto out_unlock; > - } > + if (is_vm_hugetlb_page(dst_vma)) { > + err = -EINVAL; > + page_mask = vma_kernel_pagesize(dst_vma) - 1; > + if ((start & page_mask) || (len & page_mask)) > + break; > + } > > - uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp); > + _start = (dst_vma->vm_start > start) ? dst_vma->vm_start : start; > + _end = (dst_vma->vm_end < end) ? dst_vma->vm_end : end; Maybe: _start = MAX(dst_vma->vm_start, start); _end = MIN(dst_vma->vm_end, end); ? > > - err = 0; > + uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp); > + err = 0; > + } > out_unlock: > mmap_read_unlock(dst_mm); > return err; > -- > 2.39.1 > > From ac119b22aa00248ed67c7ac6e285a12391943b15 Mon Sep 17 00:00:00 2001 > From: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx> > Date: Mon, 13 Feb 2023 21:15:01 +0500 > Subject: [PATCH v3 2/2] mm/userfaultfd: add VM_WARN_ONCE() > To: peterx@xxxxxxxxxx, david@xxxxxxxxxx > Cc: usama.anjum@xxxxxxxxxxxxx, kernel@xxxxxxxxxxxxx > > Add VM_WARN_ONCE() to uffd_wp_range() to detect range (start, len) abuse. > > Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx> > --- > mm/userfaultfd.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index a3b48a99b942..0536e23ba5f4 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -716,6 +716,8 @@ void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, > unsigned int mm_cp_flags; > struct mmu_gather tlb; > > + VM_WARN_ONCE(start < dst_vma->vm_start || start + len > dst_vma->vm_end, > + "The address range exceeds VMA boundary.\n"); > if (enable_wp) > mm_cp_flags = MM_CP_UFFD_WP; > else > -- > 2.39.1 > -- Peter Xu