Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

Avi Kivity <avi@xxxxxxxxxx> · Thu, 12 Jan 2012 15:57:47 +0200

On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
>  
> > > So the problem is if we do it in
> > > userland with the current functionality you'll run out of VMAs and
> > > slowdown performance too much.
> > >
> > > But all you need is the ability to map single pages in the address
> > > space.
> > 
> > Would this also let you set different pgprots for different pages in the 
> > same VMA?  It would be useful for write barriers in garbage collectors 
> > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > GC cycles could merge all of them back to a single VMA with PROT_READ 
> > permissions; however, they still put some strain on the VM subsystem.
>
> Changing permission sounds more tricky as more code may make
> assumptions on the vma before checking the pte.
>
> Adding a magic unmapped pte entry sounds fairly safe because there's
> the migration pte already used by migrate which halts page faults and
> wait, that creates a precedent. So I guess we could reuse the same
> code that already exists for the migration entry and we'd need to fire
> a signal and returns to userland instead of waiting. The signal should
> be invoked before the page fault will trigger again. 

Delivering signals is slow, and you can't use signalfd for it, because
that can be routed to a different task.  I would like an fd based
protocol with an explicit ack so the other end can be implemented by the
kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
via a kvm ioeventfd/irqfd.

> Of course if the
> signal returns and does nothing it'll loop at 100% cpu load but that's
> ok. Maybe it's possible to tweak the permissions but it will need a
> lot more thoughts. Specifically for anon pages marking them readonly
> sounds possible if they are supposed to behave like regular COWs (not
> segfaulting or anything), as you already can have a mixture of
> readonly and read-write ptes (not to tell readonly KSM pages), but for
> any other case it's non trivial. Last but not the least the API here
> would be like a vma-less-mremap, moving a page from one address to
> another without modifying the vmas, the permission tweak sounds more
> like an mprotect, so I'm unsure if it could do both or if it should be
> an optimization to consider independently.

Doesn't this stuff require tlb flushes across all threads?

>
> In theory I suspect we could also teach mremap to do a
> not-vma-mangling mremap if we move pages that aren't shared and so we
> can adjust the page->index of the pages, instead of creating new vmas
> at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> syscall that only works on top of fault-unmapped areas is simpler and
> safer. mremap semantics requires nuking the dst region before the move
> starts. If we would teach mremap how to handle the fault-unmapped
> areas we could just add one syscall prepare_fault_area (or whatever
> name you choose).
>
> The locking of doing a vma-less-mremap still sounds tricky but I doubt
> you can avoid that locking complexity by using the chardevice as long
> as the chardevice backed-memory still allows THP, migration and swap,
> if you want to do it atomic-zerocopy and I think zerocopy would be
> better especially if the network card is fast and all vcpus are
> faulting into unmapped pages simultaneously so triggering heavy amount
> of copying from all physical cpus.
>
> I don't mean the current device driver doing a copy_user won't work or
> is bad idea, it's more self contained and maybe easier to merge
> upstream. I'm just presenting another option more VM integrated
> zerocopy with just 2 syscalls.

Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
ptes is cheap, removing them is not.  I wonder if we can make a
write-only page?  Of course it's unmapped for cpu access, but we can
allow DMA write access from the NIC.  Probably too wierd.

>
> vmas must not be involved in the mremap for reliability, or too much
> memory could get pinned in vmas even if we temporary lift the
> /proc/sys/vm/max_map_count for the process. Plus sending another
> signal (not sigsegv or sigbus) should be more reliable in case the
> migration crashes for real.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html