Hello,

On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
>
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface
>
> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.

I see what you mean. The only con I see is that we then couldn't use
recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr,
PAGE_SIZE, ..) and retain the zerocopy behavior. Or how could we?
There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

Ideally, if we could prevent the page data coming from the network
from ever becoming visible in the kernel, we could avoid the TLB flush
and also be zerocopy, but I can't see how we could achieve that. The
page data could come through an ssh pipe or anything (qemu supports all
kinds of network transports for live migration), which is why leaving
the network protocol in userland is preferable.

As things stand now, I'm afraid with a write() syscall we couldn't do
it zerocopy. We'd still need to receive the memory in a temporary page
and then copy it to a kernel page (invisible to userland while we write
to it) to later map it at the userfault address.

If it wasn't for the TLB flush of the old page, the remap_anon_pages
variant would be more efficient than doing a copy through a write
syscall. Is the copy cheaper than a TLB flush? I probably naively
assumed the TLB flush was always cheaper.

Another idea that comes to mind, to allow switching between copy and
TLB flush, is a RAP_FORCE_COPY flag that would do a copy inside
remap_anon_pages and leave the original page mapped in place...
(and such a flag would also disable the -EBUSY error if page_mapcount
is > 1).

So if the RAP_FORCE_COPY flag is set, remap_anon_pages would behave
like you suggested (but with a mremap-like interface instead of a write
syscall) and we could benchmark the difference between copy and TLB
flush too. We could even periodically benchmark it at runtime and
switch over to the faster method (the more CPUs there are in the host
and the more threads the process has, the faster the copy will be
compared to the TLB flush).

Of course, in terms of API I could implement the exact same mechanism
as described above for remap_anon_pages inside a write() to the
userfaultfd (it's a pseudo inode). It'd need two different commands to
prepare for the coming write (with a len multiple of PAGE_SIZE), to
know the address where the page should be mapped and whether to behave
zerocopy or to skip the TLB flush and copy.

Because the copy vs TLB flush trade-off can be achieved with both
interfaces, I think it really boils down to choosing between a
mremap-like interface and a file+commands protocol interface.

I tend to like mremap more, that's why I opted for a remap_anon_pages
syscall kept orthogonal to the userfaultfd functionality
(remap_anon_pages could also be used standalone as an accelerated
mremap in some circumstances), but nothing prevents embedding the same
mechanism inside userfaultfd if a file+commands API is preferable. Or
we could add a different syscall (separate from userfaultfd) that
creates another pseudofd to write a command plus the page data into. I
just wouldn't see the point of creating a pseudofd only to copy a page
atomically; the write() syscall would look more ideal if the
userfaultfd is already open for other reasons and the pseudofd overhead
is required anyway.

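
To make the copy vs TLB flush trade-off a bit more concrete, here is a
rough userspace sketch of the two ways of resolving a fault from a
temporary page. It is only an analogy and makes several assumptions:
it does not use remap_anon_pages or userfaultfd at all, the remap path
uses the existing mremap() with MREMAP_MAYMOVE | MREMAP_FIXED, the copy
path is a plain memcpy(), and memset() stands in for
recv(tmp_addr, PAGE_SIZE). The helper names (alloc_page,
resolve_by_copy, resolve_by_remap, run_demo) are made up for
illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one private anonymous page, NULL on failure. */
static void *alloc_page(size_t page_size)
{
    void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Copy variant: tmp stays mapped in place, dst receives a copy. */
static int resolve_by_copy(void *dst, const void *tmp, size_t page_size)
{
    memcpy(dst, tmp, page_size);
    return 0;
}

/*
 * Remap variant: move the page from tmp to dst without copying the data.
 * The old mapping at tmp goes away, which is where the page table update
 * and TLB invalidate of the old address come from.
 */
static int resolve_by_remap(void *dst, void *tmp, size_t page_size)
{
    void *r = mremap(tmp, page_size, page_size,
                     MREMAP_MAYMOVE | MREMAP_FIXED, dst);
    return r == dst ? 0 : -1;
}

/* Runs both variants once; returns 0 if both destinations hold the data. */
int run_demo(void)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *dst_copy = alloc_page(page_size);
    unsigned char *dst_remap = alloc_page(page_size);
    unsigned char *tmp = alloc_page(page_size);

    if (!dst_copy || !dst_remap || !tmp)
        return -1;

    memset(tmp, 0xAB, page_size); /* stands in for recv(tmp, PAGE_SIZE) */

    if (resolve_by_copy(dst_copy, tmp, page_size) != 0)
        return -1;
    if (resolve_by_remap(dst_remap, tmp, page_size) != 0)
        return -1;
    /* tmp must not be touched from here on: it is no longer mapped. */

    int ok = dst_copy[0] == 0xAB && dst_remap[page_size - 1] == 0xAB;

    munmap(dst_copy, page_size);
    munmap(dst_remap, page_size);
    return ok ? 0 : -1;
}
```

Calling run_demo() from a main() and timing the two resolve_* helpers
in a loop would give a first, very rough idea of where the crossover
between copy and remap sits on a given host.
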
One last thing to keep in mind: when using userfaults with SIGBUS and
without userfaultfd, remap_anon_pages would still have been useful. So
if we retain the SIGBUS behavior for volatile pages and we don't force
the usage of userfaultfd, it may be cleaner not to use userfaultfd but
a separate pseudofd to do the write() syscall. Otherwise the app would
need to open the userfaultfd to resolve the fault even though it's not
using the userfaultfd protocol, which doesn't look like an intuitive
interface to me.

Comments welcome.

Thanks,
Andrea