[LSF/MM TOPIC] userfaultfd

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Thu, 15 Jan 2015 00:01:30 +0100

Hello,

I would like to attend this year's (2015) LSF/MM summit. I'm
particularly interested in the MM track, in order to get help in
finalizing the userfaultfd feature I've been working on lately.

An overview of the userfaultfd feature can be read here:

   http://lwn.net/Articles/615086/

In essence the userfault feature can be thought of as an optimal
implementation of userland-driven on-demand paging, similar to what
can be built today with PROT_NONE+SIGSEGV.
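
For comparison, here is a minimal sketch of the PROT_NONE+SIGSEGV
approach that userfaultfd is meant to improve on. It is only an
illustration, not code from the patchset: the names
lazy_region_create(), lazy_read() and segv_handler() are hypothetical,
and the 0x5a fill stands in for whatever real page contents userland
would fetch.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region_base;	/* start of the lazily populated area */
static size_t region_len;	/* its length in bytes */
static long page_size;
static volatile sig_atomic_t faults_handled;

/* SIGSEGV handler: "page in" the faulting page by unprotecting and filling it. */
static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
	char *addr = (char *)si->si_addr;
	char *page;

	(void)sig;
	(void)ctx;
	/* A fault outside our region is a genuine crash: give up. */
	if (addr < region_base || addr >= region_base + region_len)
		_exit(1);
	page = (char *)((uintptr_t)addr & ~((uintptr_t)page_size - 1));
	/* Every resolved fault costs an mprotect() call on the region. */
	if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
		_exit(1);
	memset(page, 0x5a, page_size);	/* stand-in for the real contents */
	faults_handled++;
}

/* Map npages of inaccessible anonymous memory, then arm the handler. */
static char *lazy_region_create(size_t npages)
{
	struct sigaction sa;
	char *p;

	page_size = sysconf(_SC_PAGESIZE);
	p = mmap(NULL, npages * (size_t)page_size, PROT_NONE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	region_base = p;
	region_len = npages * (size_t)page_size;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) != 0)
		return NULL;
	return p;
}

/* Volatile read so the compiler cannot elide or reorder the access. */
static char lazy_read(const char *p)
{
	return *(const volatile char *)p;
}
```

Calling lazy_region_create(4) and then lazy_read() on two different
pages triggers one SIGSEGV per page; a second read of an already
populated page does not fault again. The signal is process-wide and
every resolved fault goes through mprotect(), which is the overhead
the userfaultfd design wants to avoid.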

userfaultfd fundamentally allows memory to be managed at the
pagetable level: the page fault notification is delivered to userland,
which handles it with the proper userfaultfd commands that mangle the
address space, without involving heavyweight structures like vmas (in
fact the userfaultfd runtime load never takes the mmap_sem for
writing, just like its kernel counterpart wouldn't). The number of
vmas is also limited, so they're not suitable if there are too many
scattered faults and the address space is not limited. userfaultfd
allows all userfaults to happen in parallel from different threads,
and it relies on userland to use atomic copy or move commands to
resolve the userfaults.
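
To make the vma point concrete, the sketch below (an illustration
only; count_vmas() and vmas_added_by_scattered_mprotect() are
hypothetical helper names) resolves "faults" on every other page of a
PROT_NONE mapping with mprotect() and counts how many mappings show
up in /proc/self/maps afterwards. It assumes a Linux system with
/proc mounted.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the mappings (vmas) of this process: one line each in /proc/self/maps. */
static long count_vmas(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	long n = 0;
	int c;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			n++;
	fclose(f);
	return n;
}

/*
 * Map npages as PROT_NONE, unprotect every other page, and return how
 * many vmas that added (or -1 on error).
 */
static long vmas_added_by_scattered_mprotect(size_t npages)
{
	long ps = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)ps;
	long before, after;
	size_t i;
	char *p;

	p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	before = count_vmas();
	/* Alternating protections prevent neighbouring vmas from merging. */
	for (i = 0; i < npages; i += 2) {
		if (mprotect(p + i * ps, ps, PROT_READ | PROT_WRITE) != 0) {
			munmap(p, len);
			return -1;
		}
	}
	after = count_vmas();
	munmap(p, len);
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

With 64 pages the single mapping is split into roughly 64 vmas, so a
sparse enough fault pattern over a large region runs into the
per-process mapping limit (vm.max_map_count, 65530 by default on
Linux).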

By adding more commands to the userfaultfd protocol (spoken on the
fd, like the basic atomic copy command that is needed to resolve the
userfault), in the future we can also mark regions readonly and trap
only wrprotect faults (or both wrprotect and non present faults
simultaneously).

Different userfaultfds can already be used independently by multiple
libraries and the main application within the same process.

Once opened, the userfaultfd can also be passed over unix domain
sockets to a manager process (use case 5 below wants to do this), so
the same manager process could handle the userfaults of a multitude
of different processes without them being aware of what is going on
(unless of course they later try to use the userfaultfd themselves on
the same region the manager is already tracking, which is a corner
case whose relevance should be discussed).
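
The fd passing itself would be the standard SCM_RIGHTS mechanism of
unix domain sockets. A minimal sketch follows, under stated
assumptions: send_fd(), recv_fd() and fd_passing_demo() are
hypothetical names, and the demo passes the write end of a pipe as a
stand-in for a userfaultfd, within a single process for brevity (a
real manager would sit at the other end of the socket in a different
process).

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket; 0 on success. */
static int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* at least one byte of payload must be sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Self-check: the received fd must refer to the same pipe; 0 on success. */
static int fd_passing_demo(void)
{
	int sv[2], p[2], fd;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0 || pipe(p) != 0)
		return -1;
	if (send_fd(sv[0], p[1]) != 0)
		return -1;
	fd = recv_fd(sv[1]);
	if (fd < 0 || write(fd, "x", 1) != 1 || read(p[0], &c, 1) != 1)
		return -1;
	close(fd);
	close(p[0]);
	close(p[1]);
	close(sv[0]);
	close(sv[1]);
	return c == 'x' ? 0 : -1;
}
```

The receiver gets a new descriptor number that refers to the same
open file, so whatever is spoken on the fd by the manager applies to
the object the sender opened.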

There was interest from multiple users, I hope I'm not forgetting any:

1) KVM postcopy live migration (one form of cloud memory
   externalization). KVM postcopy live migration is the primary driver
   of this work:
   http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/

2) KVM postcopy live snapshotting (which allows limiting/throttling
   the memory usage, unlike fork).

3) KVM userfaults on shared memory (currently only anonymous memory
   is handled by the userfaultfd, but nothing prevents extending it
   to allow registering a tmpfs region in the userfaultfd and firing
   a userfault if the tmpfs page is not present).

4) alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access the
   virtual regions marked volatile with the CPU. This also requires
   point 3) to be fulfilled, as volatile pages happily apply to tmpfs.

5) postcopy live migration of binaries inside linux containers
   (provided there is a userfaultfd command [not an external syscall
   like in the original implementation] that allows copying memory
   atomically into the userfaultfd "mm" and not into the manager "mm";
   this is the main reason the external syscalls are going away, and
   in turn the fd-less MADV_USERFAULT is going away as well).

6) qemu linux-user binary emulation was also briefly interested in
   the wrprotect fault notification for non-x86 archs. In this
   context the userfaultfd "might" (not sure) be useful to JIT
   emulation to efficiently protect the translated regions by only
   wrprotecting the page table without having to split or merge vmas
   (the risk of running out of vmas isn't there for this use case as
   the translated cache is probably limited in size and not heavily
   scattered).

7) distributed shared memory that could allow simultaneous mapping of
   regions marked readonly and collapse them on the first exclusive
   write. I'm mentioning it as a corollary, because I'm not aware of
   anybody who is planning to use it that way (still, I'd like this
   to be possible too, just in case it finds its way in later on).

The currently planned API (as hinted above) is already different
from the first version of the code posted a couple of months ago,
thanks to the valuable feedback received from the community so far.

As usual suggestions will be welcome, thanks!
Andrea