Hello everyone,

So basically this is a tradeoff between, on one side, not adding a long latency before the migration can succeed, reducing the total network traffic (and CPU load) on the migration source and destination, and reducing the memory footprint a bit, and on the other side, adding an initial latency to the memory accesses on the destination of the migration (i.e. causing a more significant and noticeable slowdown to the guest). It's more or less as if, when the guest starts on the destination node, it finds all its memory swapped out to a network swap device, so it needs to do I/O on the first access to each page (side note: and hopefully the destination won't run out of memory while the memory is being copied over, or the guest will crash).

On Thu, Aug 11, 2011 at 11:19:19AM +0900, Isaku Yamahata wrote:
> On Wed, Aug 10, 2011 at 04:55:32PM +0300, Avi Kivity wrote:
> > I'm not 100% sure, but I think that thp and ksm need the vma to be
> > anonymous, not just the page.
>
> Yes, they seem to check that not only the page is anonymous, but also the vma.
> I'd like to hear from Andrea before digging into the code deeply.

The vma doesn't need to be anonymous for THP: an mmap of /dev/zero with MAP_PRIVATE is also backed by THP. But it must be close to anonymous and must not have special VM_IO/PFNMAP flags, or khugepaged/ksm will not scan it. ->vm_file itself isn't checked by THP/KSM (I'm sure about THP because of the /dev/zero example, which I explicitly fixed since it wasn't fully handled initially). NOTE: a chardevice won't work on RHEL6 because I didn't allow /dev/zero to use it there (it wasn't an important enough feature and it was more risky), but upstream it should already work. A chardevice doing this may work, even if it would be simpler/cleaner if this remained an anonymous vma. A chardevice could act similarly to /dev/zero MAP_PRIVATE. In theory KSM should work on /dev/zero too; you can test that if you want.
But a chardevice will require dealing with permissions, when we don't actually need special permissions for this. Another problem is that you can't migrate the data using hugepages, or it would multiply the latency 512 times (with 2M contiguous access it won't make a difference, but if the guest is accessing memory randomly it will). So you will have to rely on khugepaged to collapse the hugepages later. That should work, but initially the guest will run slower even after the migration has fully completed.

> If it is possible to convert the vma into anonymous, swap device or
> backed by device/file wouldn't matter in respect to ksm and thp.
> Acquiring mmap_sem suffices?

A swap device would require root permissions, and we don't want qemu to mess with the swap devices automatically. It'd be bad to add new admin requirements; few people would use it. Ideally the migration API should remain the same, and selecting which migration mode to use beforehand should be an internal tweak in qemu. Even if it was a swap device, it'd still require special operations to set up swap entries in the process pagetables before the pages exist. A swap device may create more complications than it solves.

If it was only KVM accessing the guest physical memory we could just handle it in KVM: call get_user_pages_fast, and if that fails and it's the first ever invocation, talk with QEMU to get the page and establish it by hand. But qemu can also write to memory, and if it's a partial write and the guest reads the not-yet-written part through get_user_pages_fast plus spte establishment, it'll go wrong. Maybe qemu already does checks on all the pages it's going to write, and we could hook there too from the qemu side.
Another more generic (not KVM-centric) way, one that would not require a special chardev or a special daemon, could be two new syscalls:

    sys_set_muserload(unsigned long start, unsigned long len, int signal)
    sys_muserload(void *from, void *to)

When sys_set_muserload is called, the region start..start+len gets covered by muserload swap entries that trigger special page faults. When anything touches memory with a muserload swap entry still set, the thread gets a signal with force_sig_info_fault(si_signo = signal), and the signal handler receives the faulting address in info.si_addr. The signal handler is then responsible for calling sys_muserload after talking with the thread that does the TCP send()/recv() with the qemu source. The recv(mmap(4096), 4096) generates a page on the destination node in some random (page-aligned) mapping. Then muserload(tcp_received_page_address, guest_faulting_physical_address_from_info_si_addr) does get_user_pages on tcp_received_page_address, takes the page away from tcp_received_page_address (clears the pte at that address), adjusts page->index for the new vma, and maps the page zerocopy and atomically into the new "guest_faulting_physical_address_from_info_si_addr" address, if and only if the pagetable entry at that address is still of muserload type. Then the signal handler does munmap(tcp_received_page_address, 4096) to truncate/free the vma (the page has already been moved away, so it's already empty), the signal handler returns, and this time the guest access to the page succeeds.

A second thread in the background calls muserload(tcp_received_page_address, guest_physical_address) while maxing out the network bandwidth, using a second TCP socket for the streaming transfer, and ignoring muserload failures (which happen if the sync page fault arrived first and already loaded the page there). If a double muserload fault happens before the first completes, I guess it'll just hang and kill -9 will solve it (or we could detect a double fault and sigsegv).
qemu had better not touch guest physical ram marked muserload by set_muserload() from signal context. The signal handler may communicate with another thread if there's only one TCP socket per qemu instance for the "sync muserload fault" transfers. Alternatively every vcpu thread plus the iothread could talk with the source over a different TCP socket (which also reduces latency), and if multiple vcpus fault on the same address, the vcpus that didn't ask the source for the page first will just loop a bit. For example, if the source responds "already freed", the loader returns without calling muserload() and the loop continues until the vcpu that actually recv()d the page from the source finally calls muserload().

Too much already for not having thought enough about it yet; these are the first ideas that come to mind. I'm not sure whether the guaranteed slowdown on the destination node could be preferable as a "default" migration mode, but this certainly sounds like a more reliable (i.e. maybe better for enterprise) way of doing migration, because there are no black-magic numbers involved to decide when to stop the source node and transfer all remaining dirty pages; magic numbers aren't good for enterprise because of the potentially enormous VM sizes or very heavy workloads. OTOH for desktop virt (small VM, usually idle, with a tiny working set) the current precopy method is probably less visible to the user, but it's less reliable too. So maybe this is a better default because of the less black magic and more reliability, even if it would almost certainly perform worse for the small use case. Especially on 100mbit networks it may be pretty bad: the equivalent of a ~10MB/sec swap device will be quite a bad initial slowdown. At least it won't be slowed down by seeking, but swapping in at 10MB/sec, like on a really old HD, is still going to be very bad. It'll run almost as slow as a low-end laptop that suspended-to-disk after swapping out half of its RAM. So pretty noticeable.
With 1gigabit and up it'll get better.

Thanks,
Andrea