On Thu, Jan 24, 2019 at 05:28:48PM +0800, Peter Xu wrote:
> On Thu, Jan 24, 2019 at 09:27:07AM +0200, Mike Rapoport wrote:
> > On Thu, Jan 24, 2019 at 12:56:15PM +0800, Peter Xu wrote:
> > > On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:
> > >
> > > [...]
> > >
> > > > > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >
> > > > >  		/* check not compatible vmas */
> > > > >  		ret = -EINVAL;
> > > > > -		if (!vma_can_userfault(cur))
> > > > > +		if (!vma_can_userfault(cur, vm_flags))
> > > > >  			goto out_unlock;
> > > > >
> > > > >  		/*
> > > > > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  			if (end & (vma_hpagesize - 1))
> > > > >  				goto out_unlock;
> > > > >  		}
> > > > > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > > > > +			goto out_unlock;
> > > >
> > > > This is problematic for the non-cooperative use-case. We may still want to
> > > > monitor a read-only area because it may eventually become writable, e.g. if
> > > > the monitored process runs mprotect().
> > >
> > > Firstly I think I should be able to change it to VM_MAYWRITE which
> > > seems to suit more.
> > >
> > > Meanwhile, frankly speaking I didn't think a lot about how to nest the
> > > usages of uffd-wp and mprotect(), so far I was only considering it as
> > > a replacement of mprotect().  But indeed it can happen that the
> > > monitored process calls mprotect().  Is there an existing scenario of
> > > such usage?
> > >
> > > The problem is I'm uncertain about whether this scenario can work
> > > after all.  Say, the monitor process A write protected process B's
> > > page P, so logically A will definitely receive a message before B
> > > writes to page P.  However here if we allow process B to do
> > > mprotect(PROT_WRITE) upon page P and grant write permission to it on
> > > its own, then A will not be able to capture the write operation at
> > > all?
> > > Then I don't know how it can work here... or whether we should
> > > fail the mprotect() at least upon uffd-wp ranges?
> >
> > The use-case we've discussed a while ago was to use uffd-wp instead of
> > soft-dirty for tracking memory changes in CRIU for pre-copy migration.
> > Currently, we enable soft-dirty for the migrated process and monitor
> > /proc/pid/pagemap between memory dump iterations to see what memory pages
> > have been changed.
> > With uffd-wp we thought to register all the process memory with uffd-wp and
> > then track changes with uffd-wp notifications. Back then it was considered
> > only at the very general level without paying much attention to details.
> >
> > So my initial thought was that we do register the entire memory with
> > uffd-wp. If an area changes from RO to RW at some point, uffd-wp will
> > generate notifications to the monitor, it would be able to notice the
> > change and the write will continue normally.
> >
> > If we are to limit uffd-wp register only to VMAs with VM_WRITE and even
> > VM_MAYWRITE, we'd need a way to handle the possible changes of VMA
> > protection and an ability to add monitoring for areas that changed from RO
> > to RW.
> >
> > Can't say I have a clear picture in mind at the moment, will continue to
> > think about it.
>
> Thanks for these details.  Though I have a question about how it's
> used.
>
> Since we're talking about replacing soft dirty with uffd-wp here, I
> noticed that there's a major interface difference between soft-dirty
> and uffd-wp: the soft-dirty was all about /proc operations so a
> monitor process can easily monitor mostly any process on the system as
> long as knowing its PID.  However I'm unsure about uffd-wp since
> userfaultfd was always bound to a mm_struct.  For example, the syscall
> userfaultfd() will always attach the current process mm_struct to the
> newly created userfaultfd but it cannot be attached to another random
> mm_struct of other processes.
> Or is there any way that the CRIU
> monitor process can gain an userfaultfd of any process of the system
> somehow?

Yes, there is. For CRIU to read the process state during snapshot (or on
the source in case of migration) we inject parasite code into the victim
process. The parasite code communicates with the "main" CRIU monitor via a
UNIX socket to pass information that cannot be obtained from outside.
For uffd-wp usage we thought about creating the uffd context in the
parasite code, registering the memory and passing the userfault file
descriptor to the CRIU core via that UNIX socket.

> > > >
> > > > Particularly, using uffd-wp as a replacement for soft-dirty would
> > > > require it.
> > > >
> > > > >
> > > > >  		/*
> > > > >  		 * Check that this vma isn't already owned by a
> > > > > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  	do {
> > > > >  		cond_resched();
> > > > >
> > > > > -		BUG_ON(!vma_can_userfault(vma));
> > > > > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> > > > >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> > > > >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> > > > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > > > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > > > >  	return ret;
> > > > >  }
> > > > >
> > > > > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > > +				    unsigned long arg)
> > > > > +{
> > > > > +	int ret;
> > > > > +	struct uffdio_writeprotect uffdio_wp;
> > > > > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > > +	struct userfaultfd_wake_range range;
> > > > > +
> > > >
> > > > In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> > > > layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> > > > was to return -EAGAIN if such race is encountered. I think the same would
> > > > apply here.
> > >
> > > I tried to understand the problem at [1] but failed...
> > > could you help
> > > to clarify it a bit more?
> > >
> > > I'm quoting some of the discussions from [1] here directly between you
> > > and Pavel:
> > >
> > > > Since the monitor cannot assume that the process will access all its memory
> > > > it has to copy some pages "in the background". A simple monitor may look
> > > > like:
> > > >
> > > > 	for (;;) {
> > > > 		wait_for_uffd_events(timeout);
> > > > 		handle_uffd_events();
> > > > 		uffd_copy(some not faulted pages);
> > > > 	}
> > > >
> > > > Then, if the "background" uffd_copy() races with fork, the pages we've
> > > > copied may be already present in parent's mappings before the call to
> > > > copy_page_range() and may be not.
> > > >
> > > > If the pages were not present, uffd_copy'ing them again to the child's
> > > > memory would be ok.
> > > >
> > > > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> > > > again, child process will get memory corruption.
> > >
> > > Here I don't understand why the child process will get memory
> > > corruption if uffd_copy() caught the mmap_sem first.
> > >
> > > If it did it, then IMHO when uffd_copy() copies the page again it'll
> > > simply get a -EEXIST showing that the page has already been copied.
> > > Could you explain on why there will be a data corruption?
> >
> > Let's say we do post-copy migration of a process A with CRIU and its page at
> > address 0x1000 is already copied. Now it modifies the contents of this
> > page. At this point the contents of the page at 0x1000 is different on the
> > source and the destination.
> > Next, process A forks process B. The CRIU's uffd monitor gets
> > UFFD_EVENT_FORK, and starts filling process B memory with UFFDIO_COPY.
> > It may happen, that UFFDIO_COPY to 0x1000 of the process B will occur
>
> I think this is the place I started to get confused...
>
> The mmap copy phase and the FORK event path is in dup_mmap() as
> mentioned in the patch too:
>
>     dup_mmap()
>        down_write(old_mm)
>        down_write(new_mm)
>        foreach(vma)
>            copy_page_range()            (a)
>        up_write(new_mm)
>        up_write(old_mm)
>        dup_userfaultfd_complete()       (b)
>
> Here if we already received UFFD_EVENT_FORK and started to copy pages
> to process B in the background, then we should have at least passed
> (b) above since otherwise we won't even know the existence of process
> B.  However if so, we should have already passed the point to copy
> data at (a) too, then how could copy_page_range() race?  It seems that
> I might have missed something important out there but it's not easy
> for me to figure out myself...

Apparently, I confused myself as well...
I clearly remember that there was a problem with fork() but the sequence
that causes it keeps evading me :(

Anyway, some means of synchronization between uffd_copy and the
non-cooperative events is required. Take, for example, MADV_DONTNEED. When
it races with uffdio_copy(), a process may end up reading non-zero values
right after the MADV_DONTNEED call.

uffd monitor           | process
-----------------------+-------------------------------------------
uffdio_copy(0x1000)    | madvise(MADV_DONTNEED, 0x1000)
                       |    down_read(mmap_sem)
                       |    zap_pte_range(0x1000)
                       |    up_read(mmap_sem)
   down_read(mmap_sem) |
   copy()              |
   up_read(mmap_sem)   |
                       | read(0x1000) != 0

Similar issues happen with mremap() and munmap().

> Thanks,
>
> > *before* fork() completes and it may race with copy_page_range().
> > If UFFDIO_COPY wins the race, it will fill the page with the contents from
> > the source, although the correct data is what process A set in that page.
> >
> > Hope it helps.
> >
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028
> > >
> >
> --
> Peter Xu
>

--
Sincerely yours,
Mike.
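
For illustration only: the descriptor hand-off mentioned above (the
parasite creates the uffd and passes it to the CRIU core over the UNIX
socket) relies on SCM_RIGHTS ancillary data. The following is a minimal
sketch of that mechanism, not CRIU's actual code. The helper names
send_fd() and recv_fd() are hypothetical, the socket is assumed to be an
already connected AF_UNIX socket, and creating and registering the
userfaultfd itself is not shown; any open file descriptor can be passed
the same way.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Buffer for exactly one fd of ancillary data, aligned for struct cmsghdr. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/*
 * Hypothetical helper: send one open descriptor over a connected
 * AF_UNIX socket. Returns 0 on success, -1 on failure.
 */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* one byte of real data accompanies the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one descriptor sent by send_fd().
 * Returns the new descriptor (valid in the receiving process), or -1.
 */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor refers to the same open file description as the
one that was sent, so a monitor holding it would still be operating on
the mm of the process that created the uffd.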