Hi Grigory,

On Thu, Mar 20, 2014 at 10:50:07AM -0700, Grigory Makarevich wrote:
> Andrea, Paolo,
>
> Thanks a lot for the comments.
>
> I like the idea of userfaultfd a lot. For my prototype I had to solve
> the problem of accessing an ondemand page from paths where exiting is
> not safe (the emulator is one example). I solved it by sending a
> message over a netlink socket, blocking the calling thread, and waking
> it up once the page is delivered. userfaultfd might be a cleaner way
> to achieve the same goal.

I'm glad you like the idea of userfaultfd. The way the syscall works
mirrors the eventfd(2) syscall, but the protocol spoken on the fd is
clearly different. So you also made the vcpu talk from the kernel to the
"ondemand" thread through some file descriptor and protocol. The
difference is that this mechanism shouldn't be limited to the emulator:
it must work for all syscalls that the IO thread may invoke and that
could hit missing guest physical memory.

> My concerns regarding a "general" mm solution are:
>
> - Will it work with any memory mapping scheme or only with anonymous
>   memory?

Currently I have the hooks only into anonymous memory, but there's no
reason why this shouldn't work for other kinds of page faults. The main
issue is not the technicality of waiting in the page fault, but the
semantics of a page not being mapped. A hole in a file-backed mapping is
different from a hole in anonymous memory, because the VM can unmap and
free a file-backed page at any time. But if you're ok with notifying the
migration thread even in such a case, it could work the same. It would
be more tricky if we had to differentiate an initial fault after the vma
is created (in order to notify the migration thread only for initial
faults) from faults triggered after the VM unmapped and freed the page
as a result of VM pressure.
That would require putting a placeholder in the pagetable instead of
keeping the VM code identical to what it is now (which zeroes a
pagetable entry when a page is unmapped from a file-backed mapping).

There is also some similarity between the userfault mechanism and
volatile ranges, which are likewise handled in a fine-grained way in the
pagetables, but it looked like there was no code to share in the end.
Volatile ranges provide very different functionality from userfault; for
example, as far as I know they don't need to behave transparently when
the memory is passed as a parameter to syscalls. They are used only to
store data in memory accessed by userland through the pagetables (not
through syscalls). The objective of userfault is also fundamentally
different from the objective of volatile ranges: we cannot ever lose any
data, while their whole point is to lose data under VM pressure.
However, if we want to extend the userfaultfd functionality to trap the
first access in the pagetables for file-backed pages, we would also need
to mangle the pagetables with some placeholder to differentiate the
first fault, and then we may want to revisit whether there's some code
or placeholder to share across the two features. Ideally, support for
file-backed ranges could be added at a second stage.

> - Before blocking the calling thread while serving the page fault in
>   the host kernel, one would need to carefully release the mmu
>   semaphore (otherwise the user space might be in trouble serving the
>   page fault), which may be not that trivial.

Yes, all locks must be dropped before waiting for the migration thread,
and that includes the mmap_sem. I don't think we'll need to add much
complexity though, because we can rely on the fact that page faults can
be repeated endlessly until they succeed. It's certainly more trivial
for real faults than for gup (because gup can work on more than one page
at a time, so it may require some rolling back or special lock retaking
during the gup loop).
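As a rough illustration of the "drop the locks, sleep, wake when the
page is delivered, retry the fault" cycle described above, here is a
minimal userspace simulation using pthreads. All names here
(`fault_and_wait`, `deliver_page`, `mmap_sem_analog`) are illustrative
stand-ins, not kernel APIs; the condition variable plays the role of the
userfault wait queue.

```c
/* Userspace sketch of the userfault wait/retry cycle: the faulting
 * thread drops its lock while sleeping and repeats the fault check on
 * every wakeup, mirroring how a page fault can be retried endlessly
 * until it succeeds. Illustrative only, not kernel code. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mmap_sem_analog = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t page_delivered = PTHREAD_COND_INITIALIZER;
static bool page_present = false;

/* Faulting side: if the page is missing, release the lock and sleep;
 * on wakeup, the "fault" is simply repeated by re-checking the state. */
int fault_and_wait(void)
{
    int sleeps = 0;

    pthread_mutex_lock(&mmap_sem_analog);
    while (!page_present) {
        sleeps++;
        /* pthread_cond_wait atomically drops the lock while sleeping,
         * mirroring "all locks must be dropped before waiting". */
        pthread_cond_wait(&page_delivered, &mmap_sem_analog);
    }
    pthread_mutex_unlock(&mmap_sem_analog);
    return sleeps;
}

/* Thread entry wrapper: stores the sleep count through arg. */
void *fault_thread(void *arg)
{
    *(int *)arg = fault_and_wait();
    return NULL;
}

/* "Migration thread" side: delivers the page and wakes any sleepers. */
void deliver_page(void)
{
    pthread_mutex_lock(&mmap_sem_analog);
    page_present = true;
    pthread_cond_broadcast(&page_delivered);
    pthread_mutex_unlock(&mmap_sem_analog);
}
```

The while loop (rather than a single check) is what makes the retry
semantics safe against spurious wakeups and against the page being
"unmapped" again between wakeup and lock reacquisition.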
With real faults, the only trick, as an optimization, will be to repeat
the fault without returning to userland.

> Regarding the qemu part of that:
>
> - Yes, indeed, user space would need to be careful accessing
>   "ondemand" pages. However, that should not be a problem, considering
>   that qemu would need to know all "ondemand" regions in advance.
>   Though, I would expect some refactoring of the qemu internal api
>   might be required.

I don't think the migration thread risks messing with missing memory,
but refactoring of some APIs and of the wire protocol will still be
needed to make the protocol bidirectional and to handle the postcopy
mechanism. David is working on it.

Paolo also pointed out one case that won't be entirely transparent: if
gdb is debugging the migration thread and breakpoints into it, and the
gdb user then tries to touch missing memory with ptrace after stopping
the migration thread, gdb will soft lockup. It will require a SIGKILL to
unblock. There's no real way to fix it in my view... and overall it
looks quite reasonable to end up in a soft lockup in such a scenario; I
mean, gdb has other ways to interfere with the app in bad ways, such as
corrupting its memory. Ideally the gdb user should know what they are
doing.

Thanks!
Andrea
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html