Marcelo Tosatti <mtosatti@xxxxxxxxxx> writes: > Hi Vitaly, > > On Mon, Mar 29, 2021 at 01:47:59PM +0200, Vitaly Kuznetsov wrote: >> When guest time is reset with KVM_SET_CLOCK(0), it is possible for >> hv_clock->system_time to become a small negative number. This happens >> because in KVM_SET_CLOCK handling we set kvm->arch.kvmclock_offset based >> on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled, >> kvm_guest_time_update() does >> >> hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset; > > Hum, masterclock update should always precede KVM_SET_CLOCK() call. > > So when KVM_SET_CLOCK(0) is called, we'd like the guest (or the > following): > > static __always_inline > u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc) > { > u64 delta = tsc - src->tsc_timestamp; > u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul, > src->tsc_shift); > return src->system_time + offset; > } > > To return 0 (which won't happen if pvclock_data->system_time is being > initialized to a negative value). > > KVM_SET_CLOCK(0) > > kvm_gen_update_masterclock(kvm); > now_ns = get_kvmclock_ns(kvm); > kvm->arch.kvmclock_offset += user_ns.clock - now_ns = -now_ns; > > hv_clock.tsc_timestamp = ka->master_cycle_now; > hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset; > > Wonder if the negative value happens due to the switch from > masterclock=off -> on (get_kvmclock_ns reading kernel time directly). > > Perhaps this error is being added to clock when migration is performed. > The issue is very easy to reproduce if you take the selftest (PATCH2 from this series), it is just a simple KVM_SET_CLOCK(0) call which makes Hyper-V TSC page value very very big. The order of events is: KVM_SET_CLOCK(0) is performed and we set now_ns = get_kvmclock_ns(kvm); kvm->arch.kvmclock_offset += user_ns.clock - now_ns; user_ns.clock is zero here. Note, we use get_kvmclock_ns() which gives the precise kvmclock value for the moment , to simplify things a bit: kvm->arch.kvmclock_offset = kvmclock_offset - master_kernel_ns - kvmclock_offset - pvclock_scale_delta() == -master_kernel_ns - pvclock_scale_delta() pvclock_scale_delta() is time since the begining of the time interval. Now we call kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE) and this leads to kvm_guest_time_update(). In kvm_guest_time_update() we do: if (use_master_clock) { host_tsc = ka->master_cycle_now; kernel_ns = ka->master_kernel_ns; } ... vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset; Note, we don't add the delta since the beginning of the time interval (assuming we are still on the same interval) so combined with the above: vcpu->hv_clock.system_time = master_kernel_ns - master_kernel_ns - pvclock_scale_delta() == -pvclock_scale_delta() and this is certainly a negative number. A very small one, but it's negative. We could've solved the problem by reducing the precision a bit and instead of doing now_ns = get_kvmclock_ns(kvm); in KVM_SET_CLOCK() handling we could do now_ns = ka->master_kernel_ns and that would make vcpu->hv_clock.system_time == 0 after kvm_guest_time_update() but it'd hurt 'normal' use-cases to fix the corner case. > But just thinking out loud... > >> And 'master_kernel_ns' represents the last time when masterclock >> got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a >> problem, the difference is very small, e.g. I'm observing >> hv_clock.system_time = -70 ns. The issue comes from the fact that >> 'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in >> compute_tsc_page_parameters() becomes a very big number. > > Which it is (a very big number). > >> Use div_s64() to get the proper result when dividing maybe-negative >> 'hv_clock.system_time' by 100. > > Well hv_clock.system_time is not negative. It is positive. > >> Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> >> --- >> arch/x86/kvm/hyperv.c | 9 ++++++--- >> 1 file changed, 6 insertions(+), 3 deletions(-) >> >> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c >> index f98370a39936..0529b892f634 100644 >> --- a/arch/x86/kvm/hyperv.c >> +++ b/arch/x86/kvm/hyperv.c >> @@ -1070,10 +1070,13 @@ static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock, >> hv_clock->tsc_to_system_mul, >> 100); >> >> - tsc_ref->tsc_offset = hv_clock->system_time; >> - do_div(tsc_ref->tsc_offset, 100); >> - tsc_ref->tsc_offset -= >> + /* >> + * Note: 'hv_clock->system_time' despite being 'u64' can hold a negative >> + * value here, thus div_s64(). >> + */ >> + tsc_ref->tsc_offset = div_s64(hv_clock->system_time, 100) - >> mul_u64_u64_shr(hv_clock->tsc_timestamp, tsc_ref->tsc_scale, 64); >> + >> return true; >> } > > Won't the guest see: > > 0.1, 0.005, 0.0025, 0.0001, ..., 0, 0.0001, 0.0025, 0.005, 0.1, ... > > As system_time progresses from negative value to positive value with > the above fix? > > While the fix seems OK in practice, perhaps the negative system_time > could be fixed instead. Well, that was v1 of the patch where I suggested to clamp system_time value to 0 in kvm_guest_time_update() but Paolo talked me into this v2 :-) -- Vitaly