On Wed, 2024-07-24 at 15:24 -0700, Sean Christopherson wrote: > /cast <Raise Skeleton> > > On Wed, Jan 17, 2024, David Woodhouse wrote: > > On Thu, 2021-09-16 at 18:15 +0000, Oliver Upton wrote: > > > > > > @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp) > > > * is slightly ahead) here we risk going negative on unsigned > > > * 'system_time' when 'data.clock' is very small. > > > */ > > > - if (kvm->arch.use_master_clock) > > > - now_ns = ka->master_kernel_ns; > > > + if (data.flags & KVM_CLOCK_REALTIME) { > > > + u64 now_real_ns = ktime_get_real_ns(); > > > + > > > + /* > > > + * Avoid stepping the kvmclock backwards. > > > + */ > > > + if (now_real_ns > data.realtime) > > > + data.clock += now_real_ns - data.realtime; > > > + } > > > + > > > + if (ka->use_master_clock) > > > + now_raw_ns = ka->master_kernel_ns; > > > > This looks wrong to me. > > > > > else > > > - now_ns = get_kvmclock_base_ns(); > > > - ka->kvmclock_offset = data.clock - now_ns; > > > + now_raw_ns = get_kvmclock_base_ns(); > > > + ka->kvmclock_offset = data.clock - now_raw_ns; > > > kvm_end_pvclock_update(kvm); > > > return 0; > > > } > > > > We use the host CLOCK_MONOTONIC_RAW plus the boot offset, as a > > 'kvmclock base clock', and get_kvmclock_base_ns() returns that. The KVM > > clocks for each VMs are based on this 'kvmclock base clock', each > > offset by a ka->kvmclock_offset which represents the time at which that > > VM was started — so each VM's clock starts from zero. > > > > The values of ka->master_kernel_ns and ka->master_cycle_now represent a > > single point in time, the former being the value of > > get_kvmclock_base_ns() at that moment and the latter being the host TSC > > value. In pvclock_update_vm_gtod_copy(), kvm_get_time_and_clockread() > > is used to return both values at precisely the same moment, from the > > *same* rdtsc(). > > > > This allows the current 'kvmclock base clock' to be calculated at any > > moment by reading the TSC, calculating a delta to that reading from > > ka->master_cycle_now to determine how much time has elapsed since > > ka->master_kernel_ns. We can then add ka->kvmclock_offset to get the > > kvmclock for this particular VM. > > > > Now, looking at the code quoted above. It's given a kvm_clock_data > > struct which contains a value of the KVM clock which is to be set as > > the time "now", and all it does is adjust ka->kvmclock_offset > > accordingly. Which is really simple: > > > > now_raw_ns = get_kvmclock_base_ns(); > > ka->kvmclock_offset = data.clock - now_raw_ns; > > > > Et voilà, now get_kvmclock_base_ns() + ka->kvmclock_offset at any given > > moment in time will result in a kvmclock value according to what was > > just set. Yay! > > > > Except... in the case where the TSC is constant, we actually set > > 'now_raw_ns' to a value that doesn't represent *now*. Instead, we set > > it to ka->master_kernel_ns which represents some point in the *past*. > > We should add the number of TSC ticks since ka->master_cycle_now if > > we're going to use that, surely? > > Somewhat ironically, without the KVM_CLOCK_REALTIME goo, there's no need to > re-read TSC, because the rdtsc() in pvclock_update_vm_gtod_copy() *just* happened. > But the call to ktime_get_real_ns() could theoretically spin for a non-trivial > amount of time if the clock is being refreshed. Aha, thank you. That makes sense. I note that in similar code in https://lore.kernel.org/lkml/20240522001817.619072-4-dwmw2@xxxxxxxxxxxxx/ Jack explicitly added a comment to precisely that effect: + * The call to pvclock_update_vm_gtod_copy() has created a new time + * reference point in ka->master_cycle_now and ka->master_kernel_ns. Code comments are wonderful things. Someone must have trained him well :) (I'll be casting Raise Skeleton on that series too as soon as I have dispatched the existing horde, FWIW.) So the remaining problem is that we do re-read the TSC for the KVM_CLOCK_REALTIME case, as you pointed out. When we should be calculating the relationship between real time and the KVM clock at precisely the *same* moment. But I don't care about that, because this whole API is suboptimal anyway; we should just set the KVM clock in terms of the guest TSC. Which is what Jack's patch allows.
Attachment:
smime.p7s
Description: S/MIME cryptographic signature