Hello up there,

On Thu, May 14, 2015 at 07:30:57PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
>
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
>
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
>
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before
> (PROT_NONE + SIGSEGV trap).
>
> The use cases are:

[...]

> Even though there wasn't a real use case requesting it yet, it also
> allows to implement distributed shared memory in a way that readonly
> shared mappings can exist simultaneously in different hosts and they
> can be become exclusive at the first wrprotect fault.

Sorry for maybe speaking up too late, but here is an additional real
potential use case which in my view overlaps with the above:

Recently we needed to implement persistency for NumPy arrays - that is,
to track changes made to array memory and transactionally either abandon
the changes on transaction abort, or store them back to storage on
transaction commit.

Since arrays can be large, it would be slow and thus impractical to keep
a copy of the original data and compare memory against it to find which
parts of the array have changed.

So I've implemented a scheme where array data is initially PROT_READ
protected; then we catch SIGSEGV, and if it is a write and the address
belongs to array data, we mark that page as PROT_WRITE and continue. At
commit time we know which parts were modified.

Also, since arrays can be large - bigger than RAM - and only sparse parts
of them may be needed to get the required information, for reading it
also makes sense to lazily load data in the SIGSEGV handler, with initial
PROT_NONE protection.
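
To make the write-tracking part of the scheme concrete, here is a minimal
sketch of it. It is not the wendelin.core code: all names (tracked_alloc,
tracked_commit, MAX_PAGES, ...) are hypothetical, it tracks a single
anonymous region only, and it assumes Linux, where calling mprotect() from
a signal handler works in practice although POSIX does not list it as
async-signal-safe. Since the region is PROT_READ, any fault inside it is
taken to be a write.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 1024              /* hypothetical limit for this sketch */

static uint8_t *region;             /* start of the tracked area */
static size_t   region_npages;      /* its length in pages */
static size_t   page_size;
static volatile sig_atomic_t dirty[MAX_PAGES];  /* 1 = page was written */

/* SIGSEGV handler: a fault inside the PROT_READ region must be a write,
 * so record the page as dirty and make it writable; the faulting
 * instruction is then restarted and succeeds. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;

    if (region == NULL || addr < base ||
        addr >= base + region_npages * page_size) {
        /* Not our memory: restore the default action so the re-executed
         * access terminates the process as usual. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    size_t idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(region + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Allocate npages of zeroed memory and start tracking writes to it.
 * A real system would load file data here before write-protecting. */
static void *tracked_alloc(size_t npages)
{
    struct sigaction sa = {0};

    if (npages == 0 || npages > MAX_PAGES)
        return NULL;
    page_size = (size_t)sysconf(_SC_PAGESIZE);

    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;

    void *p = mmap(NULL, npages * page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    for (size_t i = 0; i < npages; i++)
        dirty[i] = 0;
    region = p;
    region_npages = npages;
    if (mprotect(p, npages * page_size, PROT_READ) != 0)
        return NULL;
    return p;
}

static int tracked_is_dirty(size_t idx)
{
    assert(idx < region_npages);
    return dirty[idx] != 0;
}

static size_t tracked_dirty_count(void)
{
    size_t n = 0;
    for (size_t i = 0; i < region_npages; i++)
        n += (dirty[i] != 0);
    return n;
}

/* "Commit": a real system would store the dirty pages back to storage
 * first; here we only forget them and write-protect the region again. */
static int tracked_commit(void)
{
    for (size_t i = 0; i < region_npages; i++)
        dirty[i] = 0;
    return mprotect(region, region_npages * page_size, PROT_READ);
}
```

With this, after tracked_alloc(4) a write to byte 0 and a write somewhere
in page 2 leave exactly pages 0 and 2 marked dirty, and tracked_commit()
resets the state so the next write faults again. Lazy loading with initial
PROT_NONE would need the handler to distinguish reads from writes, which
this sketch leaves out.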

This is very similar to how memory-mapped files work, but adds
transactionality which, as far as I know, is not provided by any
filesystem currently in the Linux kernel.

The system is implemented in terms of files, and arrays are then built on
top of files memory-mapped this way. So from now on we can forget about
NumPy arrays and only talk about files, their mapping, lazy loading and
transactionally storing in-memory changes back to file storage.

To get this working, a custom user-space virtual memory manager was
written, which manages RAM "pages" and file mappings into the virtual
address space, tracks page protection and handles SIGSEGV appropriately.

The gist of the virtual memory manager is this:

    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  (vma_on_pagefault)

For its operation it currently needs

    - establishing virtual memory areas and connecting to tracking them

    - changing page protection

        PROT_NONE or absent                             - initially
        PROT_NONE -> PROT_READ                          - after read
        PROT_READ -> PROT_READWRITE                     - after write
        PROT_READWRITE -> PROT_READ                     - after commit
        PROT_READWRITE -> PROT_NONE or absent (again)   - after abort
        PROT_READ -> PROT_NONE or absent (again)        - on reclaim

    - working with aliasable memory (thus taken from tmpfs)

        there could be two mappings of a file (array), overlapping in
        the file and requested at different times, and changes made
        through one mapping should propagate to the other -> for the
        common parts only 1 page should be memory-mapped into 2 places
        in the address space.

So what is currently lacking on the userfaultfd side is:

    - ability to remove / make PROT_NONE already mapped pages
      (UFFDIO_REMAP was recently dropped)

    - ability to arbitrarily change page protection (e.g. RW -> R)

    - ability to inject aliasable memory from tmpfs (or better hugetlbfs)
      into several places (UFFDIO_REMAP + some mapping copy semantic).

The code is ugly because it is only a prototype.
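
To illustrate the aliasing requirement above (one page visible at two
places in the address space), here is a small sketch using plain mmap. It
is an illustration under assumptions, not what the prototype does
internally: the helper names backing_create and alias_map are made up,
the backing file is created with mkstemp() under /dev/shm (assumed to be
tmpfs), and the /tmp fallback exists only so the sketch still runs where
/dev/shm is missing.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an unlinked backing file of `size` bytes; returns fd or -1.
 * /dev/shm is assumed to be tmpfs; /tmp is only a fallback and may not be. */
static int backing_create(size_t size)
{
    char shm_path[] = "/dev/shm/alias-sketch-XXXXXX";
    char tmp_path[] = "/tmp/alias-sketch-XXXXXX";
    int fd = mkstemp(shm_path);

    if (fd >= 0) {
        unlink(shm_path);
    } else {
        fd = mkstemp(tmp_path);
        if (fd < 0)
            return -1;
        unlink(tmp_path);
    }
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map `npages` pages of the backing file, starting at file page
 * `page_offset`, as shared read-write memory. Two calls whose file
 * ranges overlap yield two virtual addresses for the same pages. */
static void *alias_map(int fd, size_t page_offset, size_t npages)
{
    size_t ps = (size_t)sysconf(_SC_PAGESIZE);

    assert(npages > 0);
    void *p = mmap(NULL, npages * ps, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)(page_offset * ps));
    return p == MAP_FAILED ? NULL : p;
}
```

For example, with a 4-page backing file, mapping A = alias_map(fd, 0, 3)
and mapping B = alias_map(fd, 1, 3) share file pages 1 and 2: a byte
written at A + 1 page + 10 is read back at B + 10. The actual prototype
in wendelin.core additionally tracks protection per page and is more
involved.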
You can clone/read it all from here:

    https://lab.nexedi.cn/kirr/wendelin.core

The virtual memory manager even has tests, and from them one can see how
the system is supposed to work (after each access - which pages are
mapped, where, and how):

    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/tests/test_virtmem.c

The performance is currently not great, partly because of page clearing
when getting RAM from tmpfs, and partly because of mprotect/SIGSEGV/vmas
overhead and other dumb things on my side.

I still wanted to show the case, as userfaultfd here has the potential to
remove the kernel-related overhead.

Thanks beforehand for feedback,

Kirill


P.S. some context

http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret/asEntireHTML