On 03/05/2015 10:17 AM, Andrea Arcangeli wrote:
> Add documentation.
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
>  Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 97 insertions(+)
>  create mode 100644 Documentation/vm/userfaultfd.txt

Just a grammar review (no analysis of technical correctness)

>
> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
> new file mode 100644
> index 0000000..2ec296c
> --- /dev/null
> +++ b/Documentation/vm/userfaultfd.txt
> @@ -0,0 +1,97 @@
> += Userfaultfd =
> +
> +== Objective ==
> +
> +Userfaults allow to implement on demand paging from userland and more

s/to implement/the implementation of/
and maybe: s/on demand/on-demand/

> +generally they allow userland to take control various memory page
> +faults, something otherwise only the kernel code could do.
> +
> +For example userfaults allows a proper and more optimal implementation
> +of the PROT_NONE+SIGSEGV trick.
> +
> +== Design ==
> +
> +Userfaults are delivered and resolved through the userfaultfd syscall.
> +
> +The userfaultfd (aside from registering and unregistering virtual
> +memory ranges) provides for two primary functionalities:

s/provides for/provides/

> +
> +1) read/POLLIN protocol to notify an userland thread of the faults

s/an userland/a userland/
(remember, 'a unicorn gets an umbrella' - if the 'u' is pronounced 'you'
the correct article is 'a')

> +   happening
> +
> +2) various UFFDIO_* ioctls that can mangle over the virtual memory
> +   regions registered in the userfaultfd that allows userland to
> +   efficiently resolve the userfaults it receives via 1) or to mangle
> +   the virtual memory in the background

maybe: s/mangle/manage/2

> +
> +The real advantage of userfaults if compared to regular virtual memory
> +management of mremap/mprotect is that the userfaults in all their
> +operations never involve heavyweight structures like vmas (in fact the
> +userfaultfd runtime load never takes the mmap_sem for writing).
> +
> +Vmas are not suitable for page(or hugepage)-granular fault tracking

s/page(or hugepage)-granular/page- (or hugepage-) granular/

> +when dealing with virtual address spaces that could span
> +Terabytes. Too many vmas would be needed for that.
> +
> +The userfaultfd once opened by invoking the syscall, can also be
> +passed using unix domain sockets to a manager process, so the same
> +manager process could handle the userfaults of a multitude of
> +different process without them being aware about what is going on

s/process/processes/

> +(well of course unless they later try to use the userfaultfd themself

s/themself/themselves/

> +on the same region the manager is already tracking, which is a corner
> +case that would currently return -EBUSY).
> +
> +== API ==
> +
> +When first opened the userfaultfd must be enabled invoking the
> +UFFDIO_API ioctl specifying an uffdio_api.api value set to UFFD_API

s/an uffdio/a uffdio/

> +which will specify the read/POLLIN protocol userland intends to speak
> +on the UFFD. The UFFDIO_API ioctl if successful (i.e. if the requested
> +uffdio_api.api is spoken also by the running kernel), will return into
> +uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks of
> +respectively the activated feature bits below PAGE_SHIFT in the
> +userfault addresses returned by read(2) and the generic ioctl
> +available.
> +
> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
> +register a memory range in the userfaultfd by setting the
> +uffdio_register structure accordingly. The uffdio_register.mode
> +bitmask will specify to the kernel which kind of faults to track for
> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
> +pages). The UFFDIO_REGISTER ioctl will return the
> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
> +userfaults on the range reigstered. Not all ioctls will necessarily be

s/reigstered/registered/

> +supported for all memory types depending on the underlying virtual
> +memory backend (anonymous memory vs tmpfs vs real filebacked
> +mappings).
> +
> +Userland can use the uffdio_register.ioctls to mangle the virtual

maybe s/mangle/manage/

> +address space in the background (to add or potentially also remove
> +memory from the userfaultfd registered range). This means an userfault

s/an/a/

> +could be triggering just before userland maps in the background the
> +user-faulted page. To avoid POLLIN resulting in an unexpected blocking
> +read (if the UFFD is not opened in nonblocking mode in the first
> +place), we don't allow the background thread to wake userfaults that
> +haven't been read by userland yet. If we would do that likely the
> +UFFDIO_WAKE ioctl could be dropped. This may change in the future
> +(with a UFFD_API protocol bumb combined with the removal of the

s/bumb/bump/

> +UFFDIO_WAKE ioctl) if it'll be demonstrated that it's a valid
> +optimization and worthy to force userland to use the UFFD always in
> +nonblocking mode if combined with POLLIN.
> +
> +userfaultfd is also a generic enough feature, that it allows KVM to
> +implement postcopy live migration (one form of memory externalization
> +consisting of a virtual machine running with part or all of its memory
> +residing on a different node in the cloud) without having to modify a
> +single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
> +and all other GUP features works just fine in combination with
> +userfaults (userfaults trigger async page faults in the guest
> +scheduler so those guest processes that aren't waiting for userfaults
> +can keep running in the guest vcpus).
> +
> +The primary ioctl to resolve userfaults is UFFDIO_COPY. That
> +atomically copies a page into the userfault registered range and wakes
> +up the blocked userfaults (unless uffdio_copy.mode &
> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
> +UFFDIO_COPY.
>
>
>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org