Hi Dave,

On Fri, Aug 28, 2015 at 08:13:18AM -0700, Dave Hansen wrote:
> Hi Andrea,
>
> Is there a way you can think of to use userfaultfd without having a
> separate thread to sit there and be watching the file descriptor?  The
> current model doesn't seem like it would be possible to use with a
> single-threaded app, for instance.
>
> Is there a reason we couldn't generate a signal and then have the
> userfaultfd handling done inside the signal handler?

Originally it worked precisely like that, much like volatile pages or the regular PROT_NONE+SIGSEGV would do (it only avoided the vma mangling). However, that can't work for syscalls and get_user_pages.

I thought of taking care of get_user_pages called by the KVM shadow page fault handler in a special way, but then there's O_DIRECT (or other get_user_pages users) that may be invoked by userland on top of the userfault memory, which would also run into a get_user_pages done on a userfault area.

How do you run a signal handler in a single-threaded app when get_user_pages finds that it has been called on a userfault region, or when copy-user returns -EFAULT? It's unthinkable to break the syscall API with new retvals and require all apps to change the error checks of the syscalls to use userfaultfd safely. We could try to play tricks with restarting syscalls within glibc, but that SIGBUS would look the same as a real SIGBUS.

Even updating qemu alone to accept new retvals of read/write O_DIRECT syscalls sounds like a bad idea compared to the current userfaultfd API, which is entirely transparent to all syscalls and also to the KVM shadow page fault handler.

I entirely dropped the old MADV_USERFAULT/NOUSERFAULT madvise, and it has basically become the UFFDIO_REGISTER/UNREGISTER ioctls. It looked like a bad idea to start with MADV_USERFAULT and signals, as you may later notice you need to use a syscall and then you have to rewrite the code with userfaultfd... better to start right away with userfaultfd and avoid the risk of wasting time.
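
To make the syscall problem above concrete, here is a minimal sketch of the signal-based scheme, assuming Linux and an anonymous PROT_NONE page standing in for the userfault area. The names on_segv, run_demo and struct demo_result are made up for this illustration and are not part of any kernel or glibc API. A read() whose buffer is the protected page fails with EFAULT without ever running the SIGSEGV handler, while a plain store from user mode does reach the handler:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *guarded_page;                 /* page mapped with PROT_NONE */
static size_t page_size;
static volatile sig_atomic_t handler_runs; /* times the handler resolved a fault */

/* Signal-based "userfault" handler (hypothetical): make the page accessible. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    char *addr = (char *)info->si_addr;

    (void)sig;
    (void)ctx;
    if (addr < guarded_page || addr >= guarded_page + page_size)
        _exit(2);                          /* unrelated crash: give up */
    handler_runs++;
    /* Not guaranteed async-signal-safe by POSIX; fine for a sketch on Linux. */
    mprotect(guarded_page, page_size, PROT_READ | PROT_WRITE);
}

struct demo_result {
    int syscall_errno;      /* errno of read() into the protected page */
    int runs_after_syscall; /* handler runs observed after the read() */
    int runs_after_store;   /* handler runs observed after a user-mode store */
};

/* Returns 0 on success, -1 if the setup itself failed. */
static int run_demo(struct demo_result *res)
{
    int pipefd[2];
    struct sigaction sa;
    ssize_t n;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    guarded_page = mmap(NULL, page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guarded_page == MAP_FAILED || pipe(pipefd) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0 || write(pipefd[1], "x", 1) != 1)
        return -1;

    /* The kernel copies into the protected page: no signal, only EFAULT. */
    errno = 0;
    n = read(pipefd[0], guarded_page, 1);
    res->syscall_errno = (n == -1) ? errno : 0;
    res->runs_after_syscall = handler_runs;

    /* A user-mode store faults, the handler runs, and the store is retried. */
    *(volatile char *)guarded_page = 'y';
    res->runs_after_store = handler_runs;

    signal(SIGSEGV, SIG_DFL);
    close(pipefd[0]);
    close(pipefd[1]);
    munmap(guarded_page, page_size);
    return 0;
}
```

Calling run_demo() from a main() on Linux should report EFAULT for the read() with zero handler runs, and one handler run after the store. An application built on this scheme would have to notice the EFAULT itself and retry every affected syscall.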

Right now copy-user and get_user_pages just block in the kernel, so userfaults are effectively invisible to the single-threaded workflow: the get_user_pages API, from the caller's point of view, is totally userfaultfd agnostic, but you need a separate thread to handle the fault.

The fact that the kernel, in the blocked fault, talks with the userland thread directly over the fd should be more efficient too. The only downside is a schedule() call when blocking, but on the plus side we don't have to invoke the signal code at all, which also isn't free (even if potentially less costly than schedule() when there are lots of runnable tasks and CPU overcommit). In addition, signals themselves can trip on userfaults now, and gdb will also successfully send SIGSTOP to userfault-blocked faults if they didn't happen in kernel context (where, again, signals can't run, not even SIGSTOP/SIGCONT).

The only requirement added has been to have VM_FAULT_RETRY allowed (FAULT_FLAG_ALLOW_RETRY set) in every fault or get_user_pages that can hit on userfaultfd-registered regions, so we can drop the mmap_sem before blocking. We have to drop the mmap_sem or we'd allow a userland thread to leave the mmap_sem held indefinitely, which wouldn't be safe (ps may block indefinitely, etc.). If there's a bug and VM_FAULT_RETRY is missing, SIGBUS is raised, O_DIRECT or the KVM page fault would return some noticeable error (as if it were a real SIGBUS), and a rate-limited printk is printed in the log along with the offending stack trace, which must be fixed to pass VM_FAULT_RETRY.

On a side note, I'm about to extend this VM_FAULT_RETRY so that if the userfault is invoked when FAULT_FLAG_TRIED is set (which means VM_FAULT_RETRY was also previously set) we can still hang and drop the mmap_sem. This is the major not-self-contained change needed to introduce wrprotect fault tracking to userfaultfd (so that postcopy live snapshotting becomes possible, and distributed shared memory with readonly-shared write-exclusive semantics also becomes possible).
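
Going back to the blocking model described at the start of this part, it can be sketched from userland roughly as follows. This assumes a Linux kernel with userfaultfd and enough privilege to create the fd without restrictions; where that is not the case the syscall fails and the function returns 0. The names fault_handler and run_userfault_demo are hypothetical. One thread issues an ordinary read() into a registered, not yet populated page and blocks inside the kernel copy; a second thread reads the fault event from the fd and resolves it with UFFDIO_COPY:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd = -1;
static size_t page_size;
static char *src_page;   /* page contents the handler thread will install */

/* Separate thread: waits on the fd and resolves one missing-page fault. */
static void *fault_handler(void *arg)
{
    struct uffd_msg msg;
    struct uffdio_copy copy;

    (void)arg;
    /* Blocking read: returns once some thread faults on the registered area. */
    if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT)
        return NULL;

    memset(&copy, 0, sizeof(copy));
    copy.dst = msg.arg.pagefault.address & ~((__u64)page_size - 1);
    copy.src = (unsigned long)src_page;
    copy.len = page_size;
    ioctl(uffd, UFFDIO_COPY, &copy);   /* installs the page, wakes the faulter */
    return NULL;
}

/* Returns 1 if read() completed transparently, 0 if userfaultfd is
 * unavailable here, -1 on any other failure. */
static int run_userfault_demo(void)
{
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg;
    pthread_t tid;
    int pipefd[2];
    char *area;
    ssize_t n;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
    if (uffd < 0)
        return 0;                      /* e.g. EPERM or ENOSYS */
    if (ioctl(uffd, UFFDIO_API, &api) != 0)
        return -1;

    area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    src_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED || src_page == MAP_FAILED || pipe(pipefd) != 0)
        return -1;
    memset(src_page, 'A', page_size);

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long)area;
    reg.range.len = page_size;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) != 0)
        return -1;
    if (pthread_create(&tid, NULL, fault_handler, NULL) != 0 ||
        write(pipefd[1], "x", 1) != 1)
        return -1;

    /* Ordinary syscall on top of the userfault area: the kernel blocks in
     * its copy until the handler thread has supplied the page. */
    n = read(pipefd[0], area, 1);
    pthread_join(tid, NULL);

    close(pipefd[0]);
    close(pipefd[1]);
    close(uffd);
    /* Byte 0 comes from the pipe, the rest of the page from UFFDIO_COPY. */
    return (n == 1 && area[0] == 'x' && area[1] == 'A') ? 1 : -1;
}
```

Build with -pthread. The caller of read() needs no userfaultfd-specific error handling, which is the transparency argued for above; error recovery in the handler thread is deliberately minimal in this sketch.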

With wrprotect faults, the pagetable-based vma-less wrprotection can be armed with a UFFDIO_ ioctl while the page fault is, for another reason, already in the VM_FAULT_RETRY path and has already consumed it, so we may hit a wrprotect userfault in the FAULT_FLAG_TRIED case (that cannot happen with the missing fault mode, so we didn't need to alter the page fault logic yet and the stock VM_FAULT_RETRY sufficed so far).

Thanks,
Andrea