Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl

Mike Rapoport <rppt@xxxxxxxxxxxxx> · Fri, 25 Jan 2019 09:54:53 +0200

On Thu, Jan 24, 2019 at 05:28:48PM +0800, Peter Xu wrote:
> On Thu, Jan 24, 2019 at 09:27:07AM +0200, Mike Rapoport wrote:
> > On Thu, Jan 24, 2019 at 12:56:15PM +0800, Peter Xu wrote:
> > > On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:
> > > 
> > > [...]
> > > 
> > > > > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > > 
> > > > >  		/* check not compatible vmas */
> > > > >  		ret = -EINVAL;
> > > > > -		if (!vma_can_userfault(cur))
> > > > > +		if (!vma_can_userfault(cur, vm_flags))
> > > > >  			goto out_unlock;
> > > > > 
> > > > >  		/*
> > > > > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  			if (end & (vma_hpagesize - 1))
> > > > >  				goto out_unlock;
> > > > >  		}
> > > > > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > > > > +			goto out_unlock;
> > > > 
> > > > > This is problematic for the non-cooperative use-case. We may still want to
> > > > monitor a read-only area because it may eventually become writable, e.g. if
> > > > the monitored process runs mprotect().
> > > 
> > > Firstly I think I should be able to change it to VM_MAYWRITE which
> > > > seems to suit better.
> > > 
> > > Meanwhile, frankly speaking I didn't think a lot about how to nest the
> > > usages of uffd-wp and mprotect(), so far I was only considering it as
> > > a replacement of mprotect().  But indeed it can happen that the
> > > monitored process calls mprotect().  Is there an existing scenario of
> > > such usage?
> > > 
> > > The problem is I'm uncertain about whether this scenario can work
> > > after all.  Say, the monitor process A write protected process B's
> > > page P, so logically A will definitely receive a message before B
> > > writes to page P.  However here if we allow process B to do
> > > mprotect(PROT_WRITE) upon page P and grant write permission to it on
> > > its own, then A will not be able to capture the write operation at
> > > all?  Then I don't know how it can work here... or whether we should
> > > fail the mprotect() at least upon uffd-wp ranges?
> > 
> > The use-case we've discussed a while ago was to use uffd-wp instead of
> > soft-dirty for tracking memory changes in CRIU for pre-copy migration.
> > Currently, we enable soft-dirty for the migrated process and monitor
> > /proc/pid/pagemap between memory dump iterations to see what memory pages
> > have been changed.
> > With uffd-wp we thought to register all the process memory with uffd-wp and
> > then track changes with uffd-wp notifications. Back then it was considered
> > only at the very general level without paying much attention to details.
> > 
> > So my initial thought was that we do register the entire memory with
> > uffd-wp. If an area changes from RO to RW at some point, uffd-wp will
> > generate notifications to the monitor, it would be able to notice the
> > change and the write will continue normally.
> > 
> > If we are to limit uffd-wp register only to VMAs with VM_WRITE and even
> > VM_MAYWRITE, we'd need a way to handle the possible changes of VMA
> > protection and an ability to add monitoring for areas that changed from RO
> > to RW.
> > 
> > Can't say I have a clear picture in mind at the moment, will continue to
> > think about it.
> 
> Thanks for these details.  Though I have a question about how it's
> used.
> 
> Since we're talking about replacing soft dirty with uffd-wp here, I
> noticed that there's a major interface difference between soft-dirty
> and uffd-wp: the soft-dirty was all about /proc operations so a
> monitor process can easily monitor mostly any process on the system as
> long as knowing its PID.  However I'm unsure about uffd-wp since
> userfaultfd was always bound to a mm_struct.  For example, the syscall
> userfaultfd() will always attach the current process mm_struct to the
> newly created userfaultfd but it cannot be attached to another random
> mm_struct of other processes.  Or is there any way that the CRIU
> monitor process can gain an userfaultfd of any process of the system
> somehow?

Yes, there is. For CRIU to read the process state during snapshot (or on
the source in case of the migration) we inject parasite code into the
victim process. The parasite code communicates with the "main" CRIU monitor
via UNIX socket to pass information that cannot be obtained from outside.
For uffd-wp usage we thought about creating the uffd context in the
parasite code, registering the memory and passing the userfault file
descriptor to the CRIU core via that UNIX socket.
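
A minimal sketch of that hand-off, assuming a connected AF_UNIX socket
between the parasite and the CRIU core. send_fd() and recv_fd() are
hypothetical helper names used for illustration only, not CRIU's actual
functions; the descriptor travels as SCM_RIGHTS ancillary data:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Parasite side (hypothetical helper): send one fd; 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char dummy = 0;	/* at least one byte of real data must accompany the fd */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* CRIU core side (hypothetical helper): returns the received fd, or -1. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor refers to the same open file description as the
one created in the parasite, so the uffd context stays bound to the
victim's mm while the monitor reads events from it.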

> > 
> > > > > Particularly, using uffd-wp as a replacement for soft-dirty would
> > > > > require it.
> > > > 
> > > > > 
> > > > >  		/*
> > > > >  		 * Check that this vma isn't already owned by a
> > > > > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  	do {
> > > > >  		cond_resched();
> > > > > 
> > > > > -		BUG_ON(!vma_can_userfault(vma));
> > > > > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> > > > >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> > > > >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> > > > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > > > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > > > >  	return ret;
> > > > >  }
> > > > > 
> > > > > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > > +				    unsigned long arg)
> > > > > +{
> > > > > +	int ret;
> > > > > +	struct uffdio_writeprotect uffdio_wp;
> > > > > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > > +	struct userfaultfd_wake_range range;
> > > > > +
> > > > 
> > > > In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> > > > layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> > > > was to return -EAGAIN if such race is encountered. I think the same would
> > > > apply here.
> > > 
> > > I tried to understand the problem at [1] but failed... could you help
> > > to clarify it a bit more?
> > > 
> > > I'm quoting some of the discussions from [1] here directly between you
> > > and Pavel:
> > > 
> > >   > Since the monitor cannot assume that the process will access all its memory
> > >   > it has to copy some pages "in the background". A simple monitor may look
> > >   > like:
> > >   > 
> > >   > 	for (;;) {
> > >   > 		wait_for_uffd_events(timeout);
> > >   > 		handle_uffd_events();
> > >   > 		uffd_copy(some not faulted pages);
> > >   > 	}
> > >   > 
> > >   > Then, if the "background" uffd_copy() races with fork, the pages we've
> > >   > copied may be already present in parent's mappings before the call to
> > >   > copy_page_range() and may be not.
> > >   > 
> > >   > If the pages were not present, uffd_copy'ing them again to the child's
> > >   > memory would be ok.
> > >   >
> > >   > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> > >   > again, child process will get memory corruption.
> > > 
> > > Here I don't understand why the child process will get memory
> > > corruption if uffd_copy() caught the mmap_sem first.
> > > 
> > > If it did it, then IMHO when uffd_copy() copies the page again it'll
> > > simply get a -EEXIST showing that the page has already been copied.
> > > Could you explain on why there will be a data corruption?
> > 
> > Let's say we do post-copy migration of a process A with CRIU and its page at
> > address 0x1000 is already copied. Now it modifies the contents of this
> > page. At this point the contents of the page at 0x1000 is different on the
> > source and the destination.
> > Next, process A forks process B. The CRIU's uffd monitor gets
> > UFFD_EVENT_FORK, and starts filling process B memory with UFFDIO_COPY.
> > It may happen, that UFFDIO_COPY to 0x1000 of the process B will occur
> 
> I think this is the place I started to get confused...
> 
> The mmap copy phase and the FORK event path is in dup_mmap() as
> mentioned in the patch too:
> 
>      dup_mmap()
>         down_write(old_mm)
>         down_write(new_mm)
>         foreach(vma)
>             copy_page_range()            (a)
>         up_write(new_mm)
>         up_write(old_mm)
>         dup_userfaultfd_complete()       (b)
> 
> Here if we already received UFFD_EVENT_FORK and started to copy pages
> to process B in the background, then we should have at least passed
> (b) above since otherwise we won't even know the existence of process
> B.  However if so, we should have already passed the point to copy
> data at (a) too, then how could copy_page_range() race?  It seems that
> I might have missed something important out there but it's not easy
> for me to figure out myself...

Apparently, I confused myself as well...
I clearly remember that there was a problem with fork() but the sequence
that causes it keeps evading me :(

Anyway, some means of synchronization between uffd_copy and the
non-cooperative events is required. Take, for example, MADV_DONTNEED. When
it races with uffdio_copy() a process may end up reading non-zero values
right after the MADV_DONTNEED call.

uffd monitor           | process
-----------------------+-------------------------------------------
uffdio_copy(0x1000)    | madvise(MADV_DONTNEED, 0x1000)
                       |    down_read(mmap_sem)
                       |    zap_pte_range(0x1000)
                       |    up_read(mmap_sem)
   down_read(mmap_sem) |
   copy()              |
   up_read(mmap_sem)   |
                       |  read(0x1000) != 0

Similar issues happen with mremap() and munmap().
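
On the monitor side, consuming the -EAGAIN that UFFDIO_COPY returns in
this situation could look roughly like the sketch below. It is an
illustration under assumptions, not CRIU code: uffd_copy_retry() and the
drain_events_fn callback are hypothetical names, and retrying after the
events are handled assumes the range is still valid, which the monitor
has to decide for itself.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical callback: read and handle the pending uffd events
 * (fork, remap, remove, unmap) so the monitor's view of the layout
 * is up to date. Returns < 0 on failure.
 */
typedef int (*drain_events_fn)(int uffd);

/* Copy [src, src + len) to dst, retrying when the kernel reports -EAGAIN. */
static int uffd_copy_retry(int uffd, uint64_t dst, uint64_t src,
			   uint64_t len, drain_events_fn drain)
{
	while (len) {
		struct uffdio_copy c = {
			.dst = dst,
			.src = src,
			.len = len,
			.mode = 0,
			.copy = 0,
		};

		if (ioctl(uffd, UFFDIO_COPY, &c) == 0)
			return 0;	/* whole range copied */
		if (errno != EAGAIN)
			return -1;	/* real error, errno is set */

		/* A partial copy is reported as a positive byte count. */
		if (c.copy > 0) {
			dst += (uint64_t)c.copy;
			src += (uint64_t)c.copy;
			len -= (uint64_t)c.copy;
		}

		/* The layout changed under us: catch up on events first. */
		if (drain && drain(uffd) < 0)
			return -1;
	}
	return 0;
}
```

If the drained event shows that the range was unmapped or moved, the
retry fails with a different error, and the monitor should drop or
re-target the range rather than loop.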

> Thanks,
> 
> > *before* fork() completes and it may race with copy_page_range().
> > If UFFDIO_COPY wins the race, it will fill the page with the contents from
> > the source, although the correct data is what process A set in that page.
> > 
> > Hope it helps.
> 
> > > >  
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028
> > > >
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.