Hi,

On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
> On 20/03/2014 00:27, Grigory Makarevich wrote:
> > Hi All,
> >
> > I have been exploring different ways to implement on-demand paging for
> > VMs running in KVM.
> >
> > The core of the idea is to introduce an additional exit
> > KVM_EXIT_MEMORY_NOT_PRESENT to inform VMM's user space to process
> > access to "not yet present" guest's page.
> > Each memory slot may be instructed to keep track of ondemand bit per
> > page. If the page is marked as "ondemand", page fault will generate
> > exit to the host's
> > user-space with the information about the faulting page. Once the page
> > is filled, VMM instructs the KVM to clear "ondemand" bit for the page.
> >
> > I have working prototype and would like to consider upstreaming
> > corresponding KVM changes.

That was the original idea before userfaultfd was introduced. The
problem is what happens when qemu is doing an O_DIRECT read from the
missing memory. It's not just a matter of adding an additional exit:
the whole qemu userland would need to become aware, in various places,
of a new kind of error coming out of legacy syscalls like read(2), not
just out of the KVM ioctl, which would be easy to control by adding a
new exit reason.

> > To start up the discussion before sending the actual patch-set, I'd like
> > to send the patch for the kvm's api.txt. Please, let me know what you
> > think.
>
> Hi, Andrea Arcangeli is considering a similar infrastructure at the
> generic mm level. Last time I discussed it with him, his idea was
> roughly to have:
>
> * a "userfaultfd" syscall that would take a memory range and return a
> file descriptor; the file descriptor becomes readable when the first
> access happens on a page in the region, and the read gives the address
> of the access.
> Any thread that accesses a still-unmapped region remains
> blocked until the address of the faulting page is written back to the
> userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>

Yes. By not returning to userland (no more exit through
KVM_EXIT_MEMORY_NOT_PRESENT), the userfaultfd will allow the kernel,
running inside the vcpu/IO thread, to talk directly to the migration
thread (or, in Grigory's case, to the ondemand paging manager thread).
The kernel will sleep waiting for the page to be present without
returning to userland. Once it has finished (i.e. after the network
transfer and remap_anon_pages have completed), the migration/ondemand
thread will notify the kernel through the userfaultfd to wake up any
vcpu/IO thread that was waiting for the page.

This should solve all the trouble with O_DIRECT or similar syscalls
that may access the missing KVM memory from the I/O thread, and it will
also handle the spte fault case more efficiently, by avoiding a kernel
exit/enter, as KVM_EXIT_MEMORY_NOT_PRESENT will no longer be required.

It's not finished yet, so I have no 100% proof that this will work
exactly as described above, but I don't expect trouble as the design is
pretty straightforward.

The only slight difference compared to the description above is that
userfaultfd won't take a range of memory. Instead the userfault ranges
will still be marked by MADV_USERFAULT. The other option would be to
specify the ranges using iovecs, but it felt less flexible to have to
specify them in the syscall invocation instead of allowing arbitrary
changes to the userfault ranges with madvise at runtime. The
userfaultfd will just bind to the whole mm, so no matter which thread
faults on memory marked MADV_USERFAULT, the faulting thread will engage
in the userfaultfd protocol without exiting to userland.

The actual syscall API will require review later anyway; that's not the
primary concern at this point.

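
To make the intended flow a bit more concrete, here is a rough userland
analogy in Python. It is only a sketch under stated assumptions, not
the kernel interface: the class and method names (UserfaultRegion,
next_fault, install_page) are made up for illustration, a condition
variable stands in for the kernel putting the faulting thread to sleep,
and install_page only mimics the "fill a shadow page, remap it, wake
the waiters" step.

```python
import threading

PAGE_SIZE = 4096


class UserfaultRegion:
    """Hypothetical userland model of a range marked as userfault."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pages = {}         # page index -> bytes, present pages only
        self._pending = []       # faults not yet picked up by the handler
        self._requested = set()  # pages already reported once

    def read(self, addr):
        """Called by a 'vcpu/IO thread'; sleeps until the page is present."""
        page = addr // PAGE_SIZE
        with self._cond:
            if page not in self._pages and page not in self._requested:
                # First access to a missing page: report it exactly once.
                self._requested.add(page)
                self._pending.append(page)
                self._cond.notify_all()
            # The faulting thread never "exits to userland"; it just waits.
            self._cond.wait_for(lambda: page in self._pages)
            return self._pages[page][addr % PAGE_SIZE]

    def next_fault(self):
        """Called by the handler thread; stands in for reading the fd."""
        with self._cond:
            self._cond.wait_for(lambda: self._pending)
            return self._pending.pop(0)

    def install_page(self, page, data):
        """Stands in for the remap step followed by the wakeup."""
        with self._cond:
            self._pages[page] = bytes(data)
            self._cond.notify_all()


def demo():
    region = UserfaultRegion()
    results = {}

    def accessor(name, addr):
        results[name] = region.read(addr)

    def handler(expected_faults):
        for _ in range(expected_faults):
            page = region.next_fault()
            # Stand-in for the network transfer into a shadow page.
            shadow = bytes([page + 1]) * PAGE_SIZE
            region.install_page(page, shadow)

    threads = [
        threading.Thread(target=accessor, args=("vcpu0", 0)),
        threading.Thread(target=accessor, args=("vcpu1", 10)),  # same page
        threading.Thread(target=accessor, args=("io", 2 * PAGE_SIZE + 5)),
        threading.Thread(target=handler, args=(2,)),  # pages 0 and 2
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(demo())
```

Note that the two accessors touching page 0 produce a single fault
report, and both are woken by the same install_page call, which is the
behaviour wanted for several vcpu/IO threads waiting on one page.
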
> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
> handler to make the page accessible. The handler would build the page
> in a "shadow" area with the actual contents of guest memory, and then
> remap the shadow area onto the actual guest memory.
>
> Andrea, please correct me.
>
> QEMU would use this infrastructure for post-copy migration and possibly
> also for live snapshotting of the guests. The advantage in making this
> generic rather than KVM-based is that QEMU could use it also in
> system-emulation mode (and of course anything else needing a read
> barrier could use it too).

Correct.

Comments welcome,
Andrea