Hi Grigory,

On Thu, Mar 20, 2014 at 10:50:07AM -0700, Grigory Makarevich wrote:
> Andrea, Paolo,
>
> Thanks a lot for the comments.
>
> I like the idea of userfaultfd a lot. For my prototype I had to solve
> the problem of accessing an ondemand page from paths where exiting is
> not safe (the emulator is one example). I solved it by sending a
> message over a netlink socket, blocking the calling thread, and waking
> it up once the page is delivered. userfaultfd might be a cleaner way
> to achieve the same goal.

I'm glad you like the idea of userfaultfd. The way the syscall works
mirrors the eventfd(2) syscall, but the protocol spoken on the fd is
clearly different. So you also made the vcpu talk from the kernel to the
"ondemand" thread through some file descriptor and protocol. The
difference is that this mechanism shouldn't be limited to the emulator:
it must work for all syscalls that the IO thread may invoke and that
could hit missing guest physical memory.

> My concerns regarding a "general" mm solution are:
>
> - Will it work with any memory mapping scheme or only with anonymous
>   memory?

Currently I have the hooks only into anonymous memory, but there's no
reason why this shouldn't work for other kinds of page faults. The main
issue is not the technicality of waiting in the page fault, but the
semantics of a page not being mapped. A hole in a file-backed mapping is
different from a hole in anonymous memory, because the VM can unmap and
free a file-backed page at any time. But if you're ok with notifying the
migration thread even in such a case, it could work the same. It would
be more tricky if we had to differentiate an initial fault after the vma
is created (in order to notify the migration thread only for initial
faults) from faults triggered after the VM unmapped and freed the page
as a result of VM pressure.
That would require putting a placeholder in the pagetable instead of
keeping the VM code identical to what it is now (which zeroes a
pagetable entry when a page is unmapped from a file-backed mapping).

There is also some similarity between the userfault mechanism and
volatile ranges, which are likewise handled in a fine-grained way in the
pagetables, but it looked like there was no code to share in the end.
Volatile ranges provide very different functionality from userfault; for
example, as far as I know they don't need to behave transparently when
the memory is passed as a parameter to syscalls. They are used only to
store data in memory accessed by userland through the pagetables (not
through syscalls). The objective of userfault is also fundamentally
different from the objective of volatile ranges: we cannot ever lose any
data, while their whole point is to lose data under VM pressure.
However, if we want to extend the userfaultfd functionality to trap the
first access in the pagetables for file-backed pages, we would also need
to mangle the pagetables with some placeholder to differentiate the
first fault, and then we may want to revisit whether there's some code
or placeholder to share across the two features. Ideally, support for
file-backed ranges could be added at a second stage.

> - Before blocking the calling thread while serving the page fault in
>   the host kernel, one would need to carefully release the mmu
>   semaphore (otherwise the user space might be in trouble serving the
>   page fault), which may be not that trivial.

Yes, all locks must be dropped before waiting for the migration thread,
and that includes the mmap_sem. I don't think we'll need to add much
complexity though, because we can rely on the fact that page faults can
be repeated endlessly until they succeed. It's certainly more trivial
for real faults than for gup (because gup can work on more than one page
at a time, so it may require some rolling back or special lock retaking
during the gup loop).
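As a rough illustration of the "drop the locks, sleep, wake when the
page is delivered, retry the fault" cycle described above, here is a
minimal userspace simulation using pthreads. All names here
(`fault_and_wait`, `deliver_page`, `mmap_sem_analog`) are illustrative
stand-ins, not kernel APIs; the condition variable plays the role of the
userfault wait queue.

```c
/* Userspace sketch of the userfault wait/retry cycle: the faulting
 * thread drops its lock while sleeping and repeats the fault check on
 * every wakeup, mirroring how a page fault can be retried endlessly
 * until it succeeds. Illustrative only, not kernel code. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mmap_sem_analog = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t page_delivered = PTHREAD_COND_INITIALIZER;
static bool page_present = false;

/* Faulting side: if the page is missing, release the lock and sleep;
 * on wakeup, the "fault" is simply repeated by re-checking the state. */
int fault_and_wait(void)
{
    int sleeps = 0;

    pthread_mutex_lock(&mmap_sem_analog);
    while (!page_present) {
        sleeps++;
        /* pthread_cond_wait atomically drops the lock while sleeping,
         * mirroring "all locks must be dropped before waiting". */
        pthread_cond_wait(&page_delivered, &mmap_sem_analog);
    }
    pthread_mutex_unlock(&mmap_sem_analog);
    return sleeps;
}

/* Thread entry wrapper: stores the sleep count through arg. */
void *fault_thread(void *arg)
{
    *(int *)arg = fault_and_wait();
    return NULL;
}

/* "Migration thread" side: delivers the page and wakes any sleepers. */
void deliver_page(void)
{
    pthread_mutex_lock(&mmap_sem_analog);
    page_present = true;
    pthread_cond_broadcast(&page_delivered);
    pthread_mutex_unlock(&mmap_sem_analog);
}
```

The while loop (rather than a single check) is what makes the retry
semantics safe against spurious wakeups and against the page being
"unmapped" again between wakeup and lock reacquisition.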
With real faults, the only trick, as an optimization, will be to repeat
the fault without returning to userland.

> Regarding the qemu part of that:
>
> - Yes, indeed, user space would need to be careful accessing
>   "ondemand" pages. However, that should not be a problem, considering
>   that qemu would need to know all "ondemand" regions in advance.
>   Though, I would expect some refactoring of the qemu internal api
>   might be required.

I don't think the migration thread risks messing with missing memory,
but refactoring of some APIs and of the wire protocol will still be
needed to make the protocol bidirectional and to handle the postcopy
mechanism. David is working on it.

Paolo also pointed out one case that won't be entirely transparent: if
gdb is debugging the migration thread and breakpoints into it, and the
gdb user then tries to touch missing memory with ptrace after stopping
the migration thread, gdb will soft lockup. It will require a SIGKILL to
unblock. There's no real way to fix it in my view... and overall it
looks quite reasonable to end up in a soft lockup in such a scenario; I
mean, gdb has other ways to interfere with the app in bad ways, such as
corrupting its memory. Ideally the gdb user should know what they are
doing.

Thanks!
Andrea
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html