On 09/30/2015 05:06 PM, Andrea Arcangeli wrote: > Hello Mike, > > On Wed, Sep 30, 2015 at 02:56:19PM -0700, Mike Kravetz wrote: >> On 09/08/2015 01:43 PM, Andrea Arcangeli wrote: >>> Here are some pending updates for userfaultfd mostly to the self test, >>> the rest are cleanups. >> >> I have a potential use case for userfualtfd. So, I started experimenting > > Glad to hear you may have one more use case. > > On a side note, there's also a patch posted to CRIU to pagein lazily > anonymous memory during restore using userfaultfd, that's yet another > recent user. > >> with the self test code. I replaced the posix_memalign() calls to allocate >> area_src and area_dst with mmap(). mmap(MAP_PRIVATE | MAP_ANONYMOUS) works >> as expected. However, mmap(MAP_SHARED | MAP_ANONYMOUS) causes the test to >> fail without any errros from the userfaultfd APIs. >> >> -------------------- >> running userfaultfd >> -------------------- >> nr_pages: 32768, nr_pages_per_cpu: 8192 >> bounces: 31, mode: rnd racing ver poll, page_nr 31523 wrong count 0 1 >> >> I would expect some type of error from the ioctl() that registers the >> range, or perhaps the poll/copy code? Just curious about the expected >> behavior. > > That should return an error during UFFDIO_REGISTER and the testcase > shouldn't start, not sure what went wrong. Can you send the > modification to the testcase? > > UFFDIO_REGISTER is the point where userfaultfd is first told which > kind of memory you want to manage with userfaults. It was planned to > fail there (and it cannot fail any earlier). > > This check has to fail and return -EINVAL in the ioctl(UFFDIO_REGISTER). > > /* check not compatible vmas */ > ret = -EINVAL; > if (cur->vm_ops) > goto out_unlock; > > In the testcase you should get an exit 1 and the fprintf printed: > > if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { > fprintf(stderr, "register failure\n"); > return 1; > } > > Could you double check these two paths to find what's wrong? My apologies!!!! This was running in my hacked up kernel. I removed the cur->vm_ops check as a quick and dirty way for me to register hugetlb vmas. Sorry, I forgot I was still running this kernel. >> FYI - My use case is for hugetlbfs. I would like a mechanism to catch all >> new huge page allocations as a result of page faults. I have some very >> rough code to extend userfualtfd and add the required functionality to >> hugetlbfs. Still working on it. > > Adding support for hugetlbfs sounds great to me. The use case I have is pretty simple. Recently, fallocate hole punch support was added to hugetlbfs. The reason for this is that the database people want to 'free up' huge pages they know will no longer be used. However, these huge pages are part of SGA areas sometimes mapped by tens of thousands of tasks. They would like to 'catch' any tasks that (incorrectly) fault in a page after hole punch. The thought is that this can be done with userfaultfd by registering these mappings with UFFDIO_REGISTER_MODE_MISSING. No need for UFFDIO_COPY or UFFDIO_ZEROPAGE. We would just send a signal to the task (such as SIGBUS) and then do a UFFDIO_WAKE. The only downside to this approach is having thousands of threads monitoring userfault fds to catch a database error condition. I believe the MADV_USERFAULT/NOUSERFAULT code you proposed some time back would be the ideal solution for this use case. Unfortunately, I did not know of this use case or your proposal back then. :( -- Mike Kravetz > > Only anonymous memory has null vm_ops, so once you extend the code to > track hugetlbfs (tracking at least tmpfs and not just anonymous memory > is needed for volatile pages which also work on tmpfs) you should > relax the above check to accept &hugetlb_vm_ops. > > You then need to specify which kind of ioctl you supported in the > current kernel for that kind of memory you registered on in the > uffdio_register->ioctl parameter. > > /* > * Now that we scanned all vmas we can already tell > * userland which ioctls methods are guaranteed to > * succeed on this range. > */ > if (put_user(UFFD_API_RANGE_IOCTLS, > &user_uffdio_register->ioctls)) > ret = -EFAULT; > > #define UFFD_API_RANGE_IOCTLS \ > ((__u64)1 << _UFFDIO_WAKE | \ > (__u64)1 << _UFFDIO_COPY | \ > (__u64)1 << _UFFDIO_ZEROPAGE) > > hugetlbfs doesn't seem to support the zeropage. So if vma->vm_ops == > &hugetlb_vm_ops, it should return only WAKE|COPY in > uffdio_register->ioctl. > > hugetlbfs is non standard, there's no sysconf(_SC_PAGE_SIZE) to know > the minimum granularity supported by the UFFDIO_COPY|WAKE of > hugetlbfs. This is a generic issue with hugetlbfs, not really related > to userfaultfd. The same constraints of hugetlbfs minimum granularity > and alignment applies to all other memory management syscalls too. > > So the app itself using hugetlbfs will have to know by other means > (i.e. sysfs mangling) that the minimum granularity supported by > UFFDIO_COPY is 2MB (or 1GB). That is again because it registered > userfaultfd on hugetlbfs, and hugetlbfs has non standard > constraints. In turn UFFDIO_COPY of hugetlbfs has to fail if len is > not a multiple of 2MB (never the case for all other kinds of memory > that userfaultfd could ever manage). > > There's flexibility in the userfaultfd API to gradually expand the > coverage to a variety of types of virtual memory while at the same > time not risking random behavior from a new app if run on a old > kernel. The new app will be able to tell reliably to the user, to > upgrade the kernel (or it can fallback to a non-userfaultfd mode with > just a warning to the user). > > We need to handle the write protection faults too as soon as possible > (VM_UFFD_WP/UFFD_FEATURE_PAGEFAULT_FLAG_WP). The uffdio_api->features > are already prepared to report to userland the availability of the > UFFD_FEATURE_PAGEFAULT_FLAG_WP. Then the app can set > UFFDIO_REGISTER_MODE_WP in uffdio_register.mode. > > I mentioned this because while there's flexibility to expand the > coverage gradually, it'd be great if all kinds of memory supporting > UFFDIO_REGISTER_MODE_MISSING would also support > UFFDIO_REGISTER_MODE_WP once that gets available, as it'd keep > userfaultfd_register() a bit simpler to maintain. > > Thanks, > Andrea > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxx. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>