Hello, I found a minor issue with the non cooperative testcase, sometime an userfault would trigger in between UFFD_EVENT_MADVDONTNEED and UFFDIO_UNREGISTER: case UFFD_EVENT_MADVDONTNEED: uffd_reg.range.start = msg.arg.madv_dn.start; uffd_reg.range.len = msg.arg.madv_dn.end - msg.arg.madv_dn.start; if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range)) It always triggered at the nr == 0: for (nr = 0; nr < nr_pages; nr++) { if (my_bcmp(area_dst + nr * page_size, zeropage, page_size)) The userfault still pending after UFFDIO_UNREGISTER returned, lead to poll() getting a UFFD_EVENT_PAGEFAULT and trying to do a UFFDIO_COPY into the unregistered range, which gracefully results in -EINVAL. So this could be all handled in userland, by storing the MADV_DONTNEED range and calling UFFDIO_WAKE instead of UFFDIO_COPY... but I think it's more reliable to fix it into the kernel. If a pending userfault happens before UFFDIO_UNREGISTER it'll just behave like if it happened after. I also noticed the order of uffd notification of MADV_DONTNEED and the pagetable zap was wrong, we've to notify userland first so it won't risk to call UFFDIO_COPY while the process runs zap_page_range. With the two patches appended below the -EINVAL error out of UFFDIO_COPY is gone. >From fc27d209e566d95e8ae0eb83a703aa4e02316b4c Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli <aarcange@xxxxxxxxxx> Date: Thu, 17 Nov 2016 20:15:50 +0100 Subject: [PATCH 1/2] userfaultfd: non-cooperative: avoid MADV_DONTNEED race condition MADV_DONTNEED must be notified to userland before the pages are zapped. This allows userland to immediately stop adding pages to the userfaultfd ranges before the pages are actually zapped or there could be non-zeropage leftovers as result of concurrent UFFDIO_COPY run in between zap_page_range and madvise_userfault_dontneed (both MADV_DONTNEED and UFFDIO_COPY runs under the mmap_sem for reading, so they can run concurrently). Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> --- mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/madvise.c b/mm/madvise.c index 7168bc6..4d4c7f8 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -476,8 +476,8 @@ static long madvise_dontneed(struct vm_area_struct *vma, if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) return -EINVAL; - zap_page_range(vma, start, end - start, NULL); madvise_userfault_dontneed(vma, prev, start, end); + zap_page_range(vma, start, end - start, NULL); return 0; } >From 18e7b30cf82c927af4c0323a6caac20184a03ff4 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli <aarcange@xxxxxxxxxx> Date: Thu, 17 Nov 2016 20:20:40 +0100 Subject: [PATCH 2/2] userfaultfd: non-cooperative: wake userfaults after UFFDIO_UNREGISTER Userfaults may still happen after the userfaultfd monitor thread received a UFFD_EVENT_MADVDONTNEED until UFFDIO_UNREGISTER is run. Wake any pending userfault within UFFDIO_UNREGISTER protected by the mmap_sem for writing, so they will not be reported to userland leading to UFFDIO_COPY returning -EINVAL (as the range was already unregistered) and they will not hang permanently either. Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> --- fs/userfaultfd.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 2b75fab..42168d3 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1267,6 +1267,19 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, start = vma->vm_start; vma_end = min(end, vma->vm_end); + if (userfaultfd_missing(vma)) { + /* + * Wake any concurrent pending userfault while + * we unregister, so they will not hang + * permanently and it avoids userland to call + * UFFDIO_WAKE explicitly. + */ + struct userfaultfd_wake_range range; + range.start = start; + range.len = vma_end - start; + wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range); + } + new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP); prev = vma_merge(mm, prev, start, vma_end, new_flags, vma->anon_vma, vma->vm_file, vma->vm_pgoff, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>