On 23/08/2017 14:45, Thomas Gleixner wrote:
> So the real question is how to ensure that:
>
>  1) None of the update functions is in progress
>
>  2) The update is propagated via the existing mechanisms
>
> The whole live migration magic is orchestrated by qemu and the kernel. So
> it's reasonably simple to ensure that.

There is no orchestration whatsoever between QEMU and the host kernel, much
less between anything and the guest.  QEMU just sends ioctls.  The complex
part is above QEMU, just because you have to make sure that the destination
sets up networking and everything else just like the source, but as far as
QEMU is concerned live migration is as simple as

    stop_guest()
    get_guest_state()
    write_data_to_socket()
        read_data_from_socket()
        set_guest_state()
        resume_guest()

and as far as KVM is concerned, it's as simple as

    get a signal, KVM_RUN exits
    KVM_GET_REGS and a bunch more ioctls
        KVM_SET_REGS and a bunch more ioctls
        KVM_SET_KVMCLOCK triggers kvmclock update
        ioctl(KVM_RUN)
            process KVM_REQ_CLOCK_UPDATE
            enter the guest

(a raw-ioctl sketch of the kvmclock save/restore step is appended at the
end of this message).

Adding interrupts and all that is much, much more complicated than just
reusing code that runs all the time (albeit only on hosts with impaired
TSC) and needs no special case at all.

> The reason why you do that is to support live migration of VMs to hosts
> with a different TSC frequency. And for exactly that particular corner case
> the whole conversion magic is implemented and now you need even more duct
> tape to make it work with nested VMs.

The point of the first part of this series is to _remove_ the duct tape and
actually make KVM use generic timekeeper services, namely
ktime_get_snapshot.  If you look at patches 1-2-3-4-5-7 the delta is -120
lines of code, without any nested virtualization stuff.  More duct tape
would have been just:

-	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+	mode = READ_ONCE(pvclock_gtod_data.clock.vclock_mode);
+	if (mode != VCLOCK_TSC &&
+	    (mode != VCLOCK_PVCLOCK || !pvclock_nested_virt_magic()))
 		return false;

-	return do_realtime(ts, cycle_now) == VCLOCK_TSC;
+	switch (mode) {
+	case VCLOCK_TSC:
+		return do_realtime_tsc(ts, cycle_now);
+	case VCLOCK_PVCLOCK:
+		return do_realtime_pvclock(ts, cycle_now);
+	}

Nested virtualization does need a clocksource change notifier on top (see
the notifier sketch appended below), but we can cross that bridge later.
Maybe Denis can post just those patches to begin with.

Paolo

> So now for the nested KVM case. If you follow the above scheme then this
> becomes really simple:
>
>  1) The TSC frequency is merely propagated to the L2 guest. It's the
>     same as the L1 guest TSC frequency. No magic voodoo required.
>
>  2) Migration of a L2 guest to a different L1 guest follows the
>     same scheme as above
>
>  3) Migration of a L2 guest to a physical host follows the same scheme as
>     above - no idea whether that's supported at all
>
>  4) Migration of a L1 guest with an embedded L2 guest is not rocket science
>     either. The above needs some extra code which propagates the time
>     keeper freeze to L2 and then when L1 resumes, the updated frequency
>     data is propagated to L2 and L2 resumed along with it.
>
> You don't need any timestamp snapshot magic and voodoo callbacks. All you
> need is a proper mechanism to update the timekeeper.
>
> I might be missing something really important as usual. If so, I'm happy to
> be educated.
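
For reference, a minimal, illustrative sketch of the userspace side of the
kvmclock save/restore step in the KVM sequence above.  It only shows the two
ioctls involved; the helper names, vm_fd arguments and error handling are
placeholders, not QEMU code:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Source side: read the paravirtual clock while the guest is stopped;
     * the value travels in the migration stream. */
    static int save_kvmclock(int vm_fd, uint64_t *clock_ns)
    {
            struct kvm_clock_data data;

            if (ioctl(vm_fd, KVM_GET_CLOCK, &data) < 0)
                    return -1;
            *clock_ns = data.clock;
            return 0;
    }

    /* Destination side: write it back before KVM_RUN; KVM then requests a
     * clock update (KVM_REQ_CLOCK_UPDATE) for the vCPUs on the next entry. */
    static int restore_kvmclock(int vm_fd, uint64_t clock_ns)
    {
            struct kvm_clock_data data = { .clock = clock_ns };

            return ioctl(vm_fd, KVM_SET_CLOCK, &data);
    }

In QEMU the same two ioctls are issued from the kvmclock device's vmstate
handling; the kernel-visible part of the clock migration boils down to
these two calls.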
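
And a minimal sketch of the clocksource change notifier mentioned above,
using the pvclock_gtod notifier interface that x86 KVM already hooks today.
That notifier fires on every timekeeping update, so the callback has to
check whether the vclock mode actually changed; the module boilerplate and
the my_* names are placeholders:

    #include <linux/module.h>
    #include <linux/notifier.h>
    #include <linux/pvclock_gtod.h>

    /* Called on every timekeeping update; 'priv' points at the updated
     * struct timekeeper.  A real user snapshots the fields it needs and
     * defers the actual clock update work, as KVM's pvclock_gtod_notify()
     * does, instead of doing it in notifier context. */
    static int my_gtod_notify(struct notifier_block *nb, unsigned long unused,
                              void *priv)
    {
            return NOTIFY_OK;
    }

    static struct notifier_block my_gtod_nb = {
            .notifier_call = my_gtod_notify,
    };

    static int __init my_init(void)
    {
            return pvclock_gtod_register_notifier(&my_gtod_nb);
    }

    static void __exit my_exit(void)
    {
            pvclock_gtod_unregister_notifier(&my_gtod_nb);
    }

    module_init(my_init);
    module_exit(my_exit);
    MODULE_LICENSE("GPL");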