Hello everyone,

So basically this is a tradeoff between, on one side, not adding a long latency before the migration can succeed, reducing the total network traffic (and CPU load) on the migration source and destination, and reducing the memory footprint a bit, and on the other side, adding an initial latency to the memory accesses on the destination of the migration (i.e. causing a more significant and noticeable slowdown to the guest). It's more or less as if, when the guest starts on the destination node, it finds all its memory swapped out to a network swap device, so it needs to do I/O on the first access to each page (side note: and hopefully the destination won't run out of memory while the memory is being copied over, or the guest will crash).

On Thu, Aug 11, 2011 at 11:19:19AM +0900, Isaku Yamahata wrote:
> On Wed, Aug 10, 2011 at 04:55:32PM +0300, Avi Kivity wrote:
> > I'm not 100% sure, but I think that thp and ksm need the vma to be
> > anonymous, not just the page.
>
> Yes, they seem to check that not only the page is anonymous, but also the vma.
> I'd like to hear from Andrea before digging into the code deeply.

The vma doesn't need to be anonymous for THP: an mmap of /dev/zero with MAP_PRIVATE is also backed by THP. But it must be close to anonymous and must not have special VM_IO/PFNMAP flags, or khugepaged/ksm will not scan it. ->vm_file itself isn't checked by THP/KSM (I'm sure about THP because of the /dev/zero example, which I explicitly fixed since it wasn't fully handled initially). NOTE: a chardevice won't work on RHEL6 because I didn't allow /dev/zero to use it there (it wasn't an important enough feature and it was more risky), but upstream it should already work. A chardevice doing this may work, even if it would be simpler/cleaner if this remained an anonymous vma. A chardevice could act similarly to /dev/zero MAP_PRIVATE. In theory KSM should work on /dev/zero too; you can test that if you want.
But a chardevice will require dealing with permissions, when we don't actually need special permissions for this. Another problem is that you can't migrate the data using hugepages, or it would multiply the latency 512 times (with 2M contiguous access it won't make a difference, but if the guest is accessing memory randomly it will). So you will have to rely on khugepaged to collapse the hugepages later. That should work, but initially the guest will run slower even after the migration has fully completed.

> If it is possible to convert the vma into anonymous, swap device or
> backed by device/file wouldn't matter in respect to ksm and thp.
> Acquiring mmap_sem suffices?

A swap device would require root permissions, and we don't want qemu to mess with the swap devices automatically. It'd be bad to add new admin requirements; few people would use it. Ideally the migration API should remain the same, and selecting which migration mode to use beforehand should be an internal tweak in qemu. Even if it was a swap device, it'd still require special operations to set up swap entries in the process pagetables before the pages exist. A swap device may create more complications than it solves.

If it was only KVM accessing the guest physical memory we could just handle it in KVM: call get_user_pages_fast, and if that fails and it's the first ever invocation, talk with QEMU to get the page and establish it by hand. But qemu can also write to memory, and if it's a partial write and the guest reads the not-yet-written part through get_user_pages_fast plus spte establishment, it'll go wrong. Maybe qemu already does checks on all the pages it's going to write, and we could hook there too from the qemu side.
Another more generic (not KVM-centric) way, one that would not require a special chardev or a special daemon, could be two new syscalls:

    sys_set_muserload(unsigned long start, unsigned long len, int signal)
    sys_muserload(void *from, void *to)

When sys_set_muserload is called, the region start..start+len gets covered by muserload swap entries that trigger special page faults. When anything touches memory with a muserload swap entry still set, the thread gets a signal with force_sig_info_fault(si_signo = signal), and the signal handler receives the faulting address in info.si_addr. The signal handler is then responsible for calling sys_muserload after talking with the thread that does the TCP send()/recv() with the qemu source. The recv(mmap(4096), 4096) generates a page on the destination node in some random (page-aligned) mapping. Then muserload(tcp_received_page_address, guest_faulting_physical_address_from_info_si_addr) does get_user_pages on tcp_received_page_address, takes the page away from tcp_received_page_address (clears the pte at that address), adjusts page->index for the new vma, and maps the page zerocopy and atomically into the new "guest_faulting_physical_address_from_info_si_addr" address, if and only if the pagetable entry at that address is still of muserload type. Then the signal handler does munmap(tcp_received_page_address, 4096) to truncate/free the vma (the page has already been moved away, so it's already empty), the signal handler returns, and this time the guest access to the page succeeds.

A second thread in the background calls muserload(tcp_received_page_address, guest_physical_address) while maxing out the network bandwidth, using a second TCP socket for the streaming transfer, and ignoring muserload failures (which happen if the sync page fault arrived first and already loaded the page there). If a double muserload fault happens before the first completes, I guess it'll just hang and kill -9 will solve it (or we could detect a double fault and sigsegv).
qemu had better not touch guest physical ram marked muserload by set_muserload() from signal context. The signal handler may communicate with another thread if there's only one TCP socket per qemu instance for the "sync muserload fault" transfers. Alternatively every vcpu thread plus the iothread could talk with the source over a different TCP socket (which also reduces latency), and if multiple vcpus fault on the same address, the vcpus that didn't ask the source for the page first will just loop a bit. For example, if the source responds "already freed", the loader returns without calling muserload() and the loop continues until the vcpu that actually recv()d the page from the source finally calls muserload().

Too much already for not having thought enough about it yet; these are the first ideas that come to mind. I'm not sure whether the guaranteed slowdown on the destination node could be preferable as a "default" migration mode, but this certainly sounds like a more reliable (i.e. maybe better for enterprise) way of doing migration, because there are no black-magic numbers involved to decide when to stop the source node and transfer all remaining dirty pages; magic numbers aren't good for enterprise because of the potentially enormous VM sizes or very heavy workloads. OTOH for desktop virt (small VM, usually idle, with a tiny working set) the current precopy method is probably less visible to the user, but it's less reliable too. So maybe this is a better default because of the less black magic and more reliability, even if it would almost certainly perform worse for the small use case. Especially on 100mbit networks it may be pretty bad: the equivalent of a ~10MB/sec swap device will be quite a bad initial slowdown. At least it won't be slowed down by seeking, but swapping in at 10MB/sec, like on a really old HD, is still going to be very bad. It'll run almost as slow as a low-end laptop that suspended-to-disk after swapping out half of its RAM. So pretty noticeable.
With 1gigabit and up it'll get better.

Thanks,
Andrea