Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

On Thu, Jan 12, 2012 at 03:57:47PM +0200, Avi Kivity wrote:
> On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
> >  
> > > > So the problem is if we do it in
> > > > userland with the current functionality you'll run out of VMAs and
> > > > slow down performance too much.
> > > >
> > > > But all you need is the ability to map single pages in the address
> > > > space.
> > > 
> > > Would this also let you set different pgprots for different pages in the 
> > > same VMA?  It would be useful for write barriers in garbage collectors 
> > > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > > GC cycle could merge all of them back to a single VMA with PROT_READ 
> > > permissions; however, they still put some strain on the VM subsystem.
> >
> > Changing permission sounds more tricky as more code may make
> > assumptions on the vma before checking the pte.
> >
> > Adding a magic unmapped pte entry sounds fairly safe because there's
> > the migration pte already used by migrate, which halts page faults and
> > waits, so that creates a precedent. So I guess we could reuse the same
> > code that already exists for the migration entry, and we'd need to fire
> > a signal and return to userland instead of waiting. The signal should
> > be invoked before the page fault triggers again.
> 
> Delivering signals is slow, and you can't use signalfd for it, because
> that can be routed to a different task.  I would like an fd based
> protocol with an explicit ack so the other end can be implemented by the
> kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
> via a kvm ioeventfd/irqfd.

As long as we tell qemu to run some per-vcpu handler (or io-thread
handler in case of access through the I/O thread) before accessing the
same memory address again, I don't see a problem in using a faster
mechanism than signals. In theory we could even implement this
functionality in kvm itself and just add two ioctls to the kvm fd
instead of a syscall, but then kvm would be mangling with VM internals
like page->index.
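
Just to make that alternative concrete, here is a purely hypothetical
sketch of what "two ioctls on the kvm fd" could look like (none of
these names, structs or ioctl numbers exist in the real kvm ABI; they
only illustrate the shape of the interface):

    /*
     * Hypothetical only: a "register this gpa range as not yet
     * copied, notify me on first access" ioctl plus a "this page
     * has arrived, map it and let the vcpu resume" ioctl.
     */
    #include <linux/kvm.h>          /* for KVMIO */
    #include <linux/types.h>

    struct kvm_postcopy_register {
            __u64 guest_phys_addr;  /* start of not-yet-copied region */
            __u64 size;             /* page aligned length in bytes */
            __s32 notify_fd;        /* eventfd signalled on first access */
            __u32 pad;
    };

    struct kvm_postcopy_resolve {
            __u64 guest_phys_addr;  /* page that has now been received */
            __u64 user_addr;        /* qemu buffer holding the page data */
    };

    #define KVM_POSTCOPY_REGISTER _IOW(KVMIO, 0xf0, struct kvm_postcopy_register)
    #define KVM_POSTCOPY_RESOLVE  _IOW(KVMIO, 0xf1, struct kvm_postcopy_resolve)

The downside stays the same either way: whoever installs the final
mapping has to reach into core VM state like page->index, whether that
code lives in kvm or behind a new syscall.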

> > Of course if the
> > signal returns and does nothing it'll loop at 100% cpu load but that's
> > ok. Maybe it's possible to tweak the permissions but it will need a
> > lot more thought. Specifically for anon pages marking them readonly
> > sounds possible if they are supposed to behave like regular COWs (not
> > segfaulting or anything), as you already can have a mixture of
> > readonly and read-write ptes (not to mention readonly KSM pages), but for
> > any other case it's non trivial. Last but not least, the API here
> > would be like a vma-less-mremap, moving a page from one address to
> > another without modifying the vmas, while the permission tweak sounds more
> > like an mprotect, so I'm unsure if it could do both or if it should be
> > an optimization to consider independently.
> 
> Doesn't this stuff require tlb flushes across all threads?

It does; to do it zerocopy and atomically we must move the pte.
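
To make the vma-less-mremap semantics a bit more concrete, this is
roughly how the "2 syscalls" flow could look from userland (entirely
hypothetical: prepare_fault_area is just the name floated further down
in the quoted text, and remap_single_page is made up here for
illustration; neither exists):

    /*
     * Hypothetical interface sketch, not an existing API.
     *
     * prepare_fault_area() marks [addr, addr+len) so that the first
     * access faults and notifies userland, without splitting the vma.
     *
     * remap_single_page() atomically moves the pte (and fixes up the
     * page's page->index) from src to dst without touching any vma:
     * this is the "vma-less mremap", and the step that needs the TLB
     * flush across all threads.
     */
    long prepare_fault_area(void *addr, unsigned long len);
    long remap_single_page(void *dst, void *src);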

> > In theory I suspect we could also teach mremap to do a
> > not-vma-mangling mremap if we move pages that aren't shared and so we
> > can adjust the page->index of the pages, instead of creating new vmas
> > at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> > syscall that only works on top of fault-unmapped areas is simpler and
> > safer. mremap semantics require nuking the dst region before the move
> > starts. If we taught mremap how to handle the fault-unmapped
> > areas we could just add one syscall prepare_fault_area (or whatever
> > name you choose).
> >
> > The locking of doing a vma-less-mremap still sounds tricky, but I doubt
> > you can avoid that locking complexity by using the chardevice, as long
> > as the chardevice-backed memory still allows THP, migration and swap
> > and you want to do it atomic-zerocopy. And I think zerocopy would be
> > better, especially if the network card is fast and all vcpus are
> > faulting into unmapped pages simultaneously, triggering a heavy amount
> > of copying from all physical cpus.
> >
> > I don't mean the current device driver doing a copy_user won't work or
> > is a bad idea; it's more self contained and maybe easier to merge
> > upstream. I'm just presenting another, more VM-integrated zerocopy
> > option with just 2 syscalls.
> 
> Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
> ptes is cheap, removing them is not.  I wonder if we can make a
> write-only page?  Of course it's unmapped for cpu access, but we can
> allow DMA write access from the NIC.  Probably too weird.

Keeping it mapped in two places gives problems with the non-linearity
of the page->index. We have one page->index and two different
vma->vm_pgoff values, so it's not just a matter of being readonly. We
could even let it stay read-write as long as it is nuked when we swap.
The problem is we can't leave it there: we must update page->index,
and if we don't, the rmap walk breaks and swap/migrate break with
it... we would allow any thread to see random memory through that
window, even if read-only (post-swapout), and that wouldn't be secure
on a multiuser system.
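
For reference, the reason a stale page->index is fatal is that the
anon rmap walk recomputes the virtual address of a page from whatever
vma it finds, roughly like vma_address() in mm/rmap.c (simplified
sketch below, kernel details trimmed); with a single page->index and
two vmas that have different vm_pgoff, at most one of the computed
addresses can point at the pte that really maps the page:

    /*
     * Simplified sketch modelled on vma_address() in mm/rmap.c: how
     * the rmap walk turns (page, vma) back into a user virtual
     * address.  Swap and migration rely on this to find and unmap
     * every pte of the page.
     */
    static unsigned long vma_address(struct page *page,
                                     struct vm_area_struct *vma)
    {
            unsigned long pgoff = page->index;
            unsigned long address;

            address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
            if (address < vma->vm_start || address >= vma->vm_end)
                    return -EFAULT; /* page->index says it's not mapped here */
            return address;
    }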

Maybe we can find a way to have qemu call "recv" directly on the final
destination guest physical address and never expose the received page
to userland during the receive. Not having to pass through a buffer
mapped in userland would avoid the tlb flushes. That however would
prevent lzo compression or any other trick we might do, and it would
force zerocopy behavior all the way through the kernel recv syscall;
after recv we would just return from the "post-copy handler" and
return to guest mode (or go back to accessing the data in the
io-thread case). Maybe splice or sendfile or some other trick allows
that. The problem is we can't have the pte and sptes established until
the full receive is complete, or other vcpu threads will see a partial
receive. I suspect the post-migrate faults will be network I/O
dominated anyway.
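
In other words the ordering constraint forces something like the
following per-page flow in the post-copy fault handler (a sketch only,
reusing the hypothetical remap_single_page() from above and assuming a
4k page size; with plain recv() into a staging buffer this is the
copying variant, the zerocopy variant would need the receive to land
in the final destination inside the kernel):

    #include <sys/socket.h>
    #include <sys/types.h>

    #define POSTCOPY_PAGE_SIZE 4096UL   /* assumed page size */

    /*
     * Sketch of servicing one post-copy fault.  The guest-visible
     * mapping (pte and sptes) is only established after the full page
     * has been received, so no vcpu can observe a partial receive.
     */
    static int service_postcopy_fault(int sock, void *guest_addr,
                                      void *staging_page)
    {
            size_t got = 0;

            /* receive the whole page before it becomes guest visible */
            while (got < POSTCOPY_PAGE_SIZE) {
                    ssize_t n = recv(sock, (char *)staging_page + got,
                                     POSTCOPY_PAGE_SIZE - got, 0);
                    if (n <= 0)
                            return -1;
                    got += n;
            }

            /*
             * Optional: decompress (e.g. lzo) into another private page
             * here; this is exactly the flexibility a fully zerocopy
             * kernel-side receive would give up.
             */

            /* atomically flip the page into the faulting address, then
             * let the vcpu (or io-thread) retry the access */
            return remap_single_page(guest_addr, staging_page);
    }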

As for doing a copy: that doesn't require the TLB flush, but the
problem is doing it atomically without exposing partial data to the
vcpus and io-thread. I can imagine various ways to achieve that, but I
don't see how any of them can be done for "free", and you pay for the
copy too.

