On Tue, 22 Aug 2017, Paolo Bonzini wrote:
> Regarding the "why is it best" part. Right now, the hypervisor makes a
> copy of the timekeeper information in order to prepare the stable
> kvmclock. This code is very much tied to the TSC. However, a snapshot
> of the timekeeper information is almost entirely the same thing that
> ktime_get_snapshot returns, so my suggestion to "untie" the hypervisor
> code from the TSC was to use ktime_get_snapshot instead. This way, the
> clocksource itself tells KVM whether it can be the base for a
> vsyscall-happy kvmclock (which means, it must be the TSC or a linear
> transformation of it).
>
> While I am very happy with how the KVM code comes out, it might
> certainly be not the best solution---I definitely need help from the
> clocksource maintainers here, not just approval! In particular, it
> doesn't help that a lot of code surrounding ktime_get_snapshot is
> unused, so that may have sent me off track.
>
> In particular, the return value of the new callback can be defined as
> "is it the TSC or a linear transformation of it". But that's as good a
> definition as "is it good for KVM" (i.e., not very good) without some
> documentation on the meaning of "cycles" in the struct returned by
> ktime_get_snapshot. Once I understand that, I hope I can provide a
> better explanation for the return value of the callback.

This all looks wrong to begin with and you are just bolting and
duct-taping stuff together as you see fit.

I understand the idea of providing a clocksource to the core code which
provides direct nanosecond resolution and does the host -> guest
conversion in the kvmclock implementation. But that's the root cause of
all evil.

The reason why you do that is to support live migration of VMs to hosts
with a different TSC frequency. And for exactly that particular corner
case the whole conversion magic is implemented, and now you need even
more duct tape to make it work with nested VMs.

That's just wrong. We need to sit back, look at the whole kvmclock
mechanism and redesign it properly.

Let's look at the goals:

 1) Provide the information about the host TSC frequency to the guest

 2) Propagate host TSC frequency changes to the guest

That's solely used for migration purposes. It's not there for dealing
with non-constant-frequency TSCs, which would be beyond insane. Neither
is it there to propagate host clocksource adjustments (via NTP & friends)
to a guest at runtime. If you use it that way today, then this needs to
be fixed, simply because that's the wrong approach. We have proper
correction mechanisms (NTP, PPS, PTP ...) which can be utilized to do
so. Aside from that, it'd be interesting to see the interaction between
such a host frequency change and NTP in the guest trying to adjust as
well.

Now let's look at a simple ktime_get() with the current implementation:

  ktime_get()
      do {
          seq = seqcount_begin(tk);
          data = tk->data;
          now = tk->clock->read(tk->clock);
              kvmclock_read()
                  do {
                      seqk = seqcount_begin(kvmclock);
                      nowk = rdtsc();
                      datak = kvmclock_data;
                  } while (seqcount_retry(seqk, kvmclock));
                  nsec = convert(nowk, datak);
                  return nsec;
      } while (seqcount_retry(seq, tk));
      nsec = convert(now, data);
      return nsec;

So you need two sequence counters and two conversions for reading the
clock. Sorry, but that's just crap.

Let's look at the normal use case (no migration) first: the TSC
frequency can be retrieved at guest boot time and does not ever change.
For this case the above is bloat and completely useless.
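For the record, a minimal sketch of what that boils down to on the guest
side, assuming the frequency is queried from the hypervisor exactly once
at boot. kvm_get_tsc_khz() stands for "however the guest learns the TSC
frequency at boot"; the names and the rating are illustrative, not a
concrete proposal:

  #include <linux/clocksource.h>
  #include <linux/init.h>
  #include <asm/msr.h>

  /* Plain TSC read. No second seqcount, no second conversion. */
  static u64 kvm_cs_read(struct clocksource *cs)
  {
          return rdtsc_ordered();
  }

  static struct clocksource kvm_cs = {
          .name   = "kvm-clock",
          .rating = 400,
          .read   = kvm_cs_read,
          .mask   = CLOCKSOURCE_MASK(64),
          .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
  };

  static int __init kvm_cs_init(void)
  {
          /*
           * Frequency retrieved from the hypervisor once, at boot.
           * kvm_get_tsc_khz() is a placeholder for that query.
           */
          return clocksource_register_khz(&kvm_cs, kvm_get_tsc_khz());
  }

With that, ktime_get() does a single seqcount round and a single
mult/shift conversion, exactly like the raw TSC clocksource.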
All you need to do is to register the kvm clocksource with the proper
frequency, and the readout is just the plain TSC read. Nothing else.

So now to the special case of live migration. You need to make sure that
the change of the TSC frequency is properly propagated and that any
reader which is in the middle of a time getter function will retry with
the new parameters. That's not any different from the case where the
timekeeper is adjusted by any of the regular mechanisms:
update_wall_time(), NTP, PPS, PTP ....

The only thing you need to ensure is that the timekeeping core is
properly informed about the change and that code which executes time
getter functions will retry. The core has _ALL_ mechanisms available for
that.

So the real question is how to ensure that:

 1) None of the update functions is in progress

 2) The update is propagated via the existing mechanisms

The whole live migration magic is orchestrated by qemu and the kernel,
so it's reasonably simple to ensure that. Something like the below
should work:

     Host                                Guest

  1: prevent_nmi_delivery_to_guest()
  2: inject_kvmclock_irq()
  3:                                     handle_kvmclock_irq()
  4:                                     timekeeping_freeze()
  5:                                     exit_vm()
  6: stop_and_migrate_guest()
  7: update_guest_data()
  8: resume_guest()
  9:                                     timekeeping_reconfigure()
 10:                                     timekeeping_unfreeze()
 11: resume_nmi_delivery_to_guest()

#1  Ensures that no NMI is delivered to the guest so that the timekeeper
    core does not have to worry about NMI context calling any of the NMI
    safe time getters. If that's not possible, then it's simple enough
    to do something about the NMI safe time getters in the timekeeping
    core code, so you don't have to worry much about it.

#2  Inject a special interrupt vector which initiates the timekeeping
    freeze mechanism in the guest

#3  The kvmclock interrupt handler gets invoked (sketched further down)

#4  Ensures that:

    - All potential timekeeping update mechanisms have left the critical
      region

    - All potential time getters are blocked

    The mechanism for that is simple enough:

    timekeeping_freeze()
    {
            raw_spin_lock(&timekeeper_lock);
            write_seqcount_begin(&tk_core.seq);

            /* Ensure a consistent state */
            timekeeping_forward_now(tk);
            timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
    }

#5  Exit the VM with some magic exit reason so the host side knows that
    timekeeping is frozen

#6  Stop the VM, migrate it

#9  Retrieve the new conversion factor and adjust the timekeeper data
    accordingly

#10 Unfreeze the timekeeper with the new configuration:

    timekeeping_unfreeze()
    {
            /* Ensure a consistent state */
            timekeeping_forward_now(tk);
            timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

            write_seqcount_end(&tk_core.seq);
            raw_spin_unlock(&timekeeper_lock);
    }

#11 Resume NMI delivery

As I said in #1, this can be completely handled in the timekeeper core,
but if we can avoid that then stuff becomes simpler. And simpler is
preferred ....

So now for the nested KVM case. If you follow the above scheme then this
becomes really simple:

 1) The TSC frequency is merely propagated to the L2 guest. It's the
    same as the L1 guest TSC frequency. No magic voodoo required.

 2) Migration of a L2 guest to a different L1 guest follows the same
    scheme as above.

 3) Migration of a L2 guest to a physical host follows the same scheme
    as above - no idea whether that's supported at all.

 4) Migration of a L1 guest with an embedded L2 guest is not rocket
    science either. The above needs some extra code which propagates the
    timekeeper freeze to L2, and then when L1 resumes, the updated
    frequency data is propagated to L2 and L2 is resumed along with it.
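To make the guest side of the above more concrete, here is a rough
sketch of the kvmclock interrupt handler covering steps #3 to #10.
KVM_HC_CLOCK_FROZEN is a made-up hypercall number and the three
timekeeping_*() helpers are the proposed ones from above; none of this
exists today:

  #include <linux/interrupt.h>
  #include <asm/kvm_para.h>

  /* Made-up hypercall nr, for illustration only */
  #define KVM_HC_CLOCK_FROZEN     12

  /* Proposed timekeeping core helpers, sketched above */
  void timekeeping_freeze(void);
  void timekeeping_reconfigure(void);
  void timekeeping_unfreeze(void);

  static irqreturn_t handle_kvmclock_irq(int irq, void *dev_id)
  {
          /* #4: block updaters, make time getters retry */
          timekeeping_freeze();

          /*
           * #5 - #8: tell the host that timekeeping is frozen. The
           * hypercall only returns after the guest has been stopped,
           * migrated and resumed on the new host.
           */
          kvm_hypercall0(KVM_HC_CLOCK_FROZEN);

          /* #9: pick up the new conversion factor / TSC frequency */
          timekeeping_reconfigure();

          /* #10: let updaters and time getters run again */
          timekeeping_unfreeze();

          return IRQ_HANDLED;
  }

The important part is that everything between freeze and unfreeze runs
with timekeeper_lock held and tk_core.seq odd, so every time getter
retries until the new conversion data is in place.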
You don't need any timestamp snapshot magic and voodoo callbacks. All
you need is a proper mechanism to update the timekeeper.

I might be missing something really important as usual. If so, I'm happy
to be educated.

Thanks,

        tglx