On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> Oliver,
> 
> Do you have any numbers for the improvement in guests CLOCK_REALTIME
> accuracy across migration, when this is in place?
> 
> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> > Handling the migration of TSCs correctly is difficult, in part because
> > Linux does not provide userspace with the ability to retrieve a (TSC,
> > realtime) clock pair for a single instant in time. In lieu of a more
> > convenient facility, KVM can report similar information in the kvm_clock
> > structure.
> > 
> > Provide userspace with a host TSC & realtime pair iff the realtime clock
> > is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> > realtime value, advance the KVM clock by the amount of elapsed time. Do
> > not step the KVM clock backwards, though, as it is a monotonic
> > oscillator.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Signed-off-by: Oliver Upton <oupton@xxxxxxxxxx>
> > ---
> >  Documentation/virt/kvm/api.rst  | 42 ++++++++++++++++++++++++++-------
> >  arch/x86/include/asm/kvm_host.h |  3 +++
> >  arch/x86/kvm/x86.c              | 36 +++++++++++++++++++++-------
> >  include/uapi/linux/kvm.h        |  7 +++++-
> >  4 files changed, 70 insertions(+), 18 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index a6729c8cf063..d0b9c986cf6c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -993,20 +993,34 @@ such as migration.
> >  When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
> >  set of bits that KVM can return in struct kvm_clock_data's flag member.
> >  
> > -The only flag defined now is KVM_CLOCK_TSC_STABLE. If set, the returned
> > -value is the exact kvmclock value seen by all VCPUs at the instant
> > -when KVM_GET_CLOCK was called. If clear, the returned value is simply
> > -CLOCK_MONOTONIC plus a constant offset; the offset can be modified
> > -with KVM_SET_CLOCK. KVM will try to make all VCPUs follow this clock,
> > -but the exact value read by each VCPU could differ, because the host
> > -TSC is not stable.
> > +FLAGS:
> > +
> > +KVM_CLOCK_TSC_STABLE. If set, the returned value is the exact kvmclock
> > +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
> > +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
> > +offset; the offset can be modified with KVM_SET_CLOCK. KVM will try
> > +to make all VCPUs follow this clock, but the exact value read by each
> > +VCPU could differ, because the host TSC is not stable.
> > +
> > +KVM_CLOCK_REALTIME. If set, the `realtime` field in the kvm_clock_data
> > +structure is populated with the value of the host's real time
> > +clocksource at the instant when KVM_GET_CLOCK was called. If clear,
> > +the `realtime` field does not contain a value.
> > +
> > +KVM_CLOCK_HOST_TSC. If set, the `host_tsc` field in the kvm_clock_data
> > +structure is populated with the value of the host's timestamp counter (TSC)
> > +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
> > +does not contain a value.
> >  
> >  ::
> >  
> >    struct kvm_clock_data {
> >  	__u64 clock;  /* kvmclock current value */
> >  	__u32 flags;
> > -	__u32 pad[9];
> > +	__u32 pad0;
> > +	__u64 realtime;
> > +	__u64 host_tsc;
> > +	__u32 pad[4];
> >    };
> >  
> >  
> > @@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value specified in its parameter.
> >  In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
> >  such as migration.
> >  
> > +FLAGS:
> > +
> > +KVM_CLOCK_REALTIME. If set, KVM will compare the value of the `realtime` field
> > +with the value of the host's real time clocksource at the instant when
> > +KVM_SET_CLOCK was called. The difference in elapsed time is added to the final
> > +kvmclock value that will be provided to guests.
> > +
> >  ::
> >  
> >    struct kvm_clock_data {
> >  	__u64 clock;  /* kvmclock current value */
> >  	__u32 flags;
> > -	__u32 pad[9];
> > +	__u32 pad0;
> > +	__u64 realtime;
> > +	__u64 host_tsc;
> > +	__u32 pad[4];
> >    };
> >  
> >  
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index be6805fc0260..9c34b5b63e39 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1936,4 +1936,7 @@ int kvm_cpu_dirty_log_size(void);
> >  
> >  int alloc_all_memslots_rmaps(struct kvm *kvm);
> >  
> > +#define KVM_CLOCK_VALID_FLAGS \
> > +	(KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 523c4e5c109f..cb5d5cad5124 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2815,10 +2815,20 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
> >  	get_cpu();
> >  
> >  	if (__this_cpu_read(cpu_tsc_khz)) {
> > +#ifdef CONFIG_X86_64
> > +		struct timespec64 ts;
> > +
> > +		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> > +			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> > +			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> > +		} else
> > +#endif
> > +			data->host_tsc = rdtsc();
> > +
> >  		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
> >  				   &hv_clock.tsc_shift,
> >  				   &hv_clock.tsc_to_system_mul);
> > -		data->clock = __pvclock_read_cycles(&hv_clock, rdtsc());
> > +		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> >  	} else {
> >  		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> >  	}
> > @@ -4062,7 +4072,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  		r = KVM_SYNC_X86_VALID_FIELDS;
> >  		break;
> >  	case KVM_CAP_ADJUST_CLOCK:
> > -		r = KVM_CLOCK_TSC_STABLE;
> > +		r = KVM_CLOCK_VALID_FLAGS;
> >  		break;
> >  	case KVM_CAP_X86_DISABLE_EXITS:
> >  		r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE |
> > @@ -5859,12 +5869,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> >  {
> >  	struct kvm_arch *ka = &kvm->arch;
> >  	struct kvm_clock_data data;
> > -	u64 now_ns;
> > +	u64 now_raw_ns;
> >  
> >  	if (copy_from_user(&data, argp, sizeof(data)))
> >  		return -EFAULT;
> >  
> > -	if (data.flags)
> > +	if (data.flags & ~KVM_CLOCK_REALTIME)
> >  		return -EINVAL;
> >  
> >  	kvm_hv_invalidate_tsc_page(kvm);
> > @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> >  	 * is slightly ahead) here we risk going negative on unsigned
> >  	 * 'system_time' when 'data.clock' is very small.
> >  	 */
> > -	if (kvm->arch.use_master_clock)
> > -		now_ns = ka->master_kernel_ns;
> > +	if (data.flags & KVM_CLOCK_REALTIME) {
> > +		u64 now_real_ns = ktime_get_real_ns();
> > +
> > +		/*
> > +		 * Avoid stepping the kvmclock backwards.
> > +		 */
> > +		if (now_real_ns > data.realtime)
> > +			data.clock += now_real_ns - data.realtime;
> > +	}
> 
> Forward jumps can also cause problems, for example:
> 
> * Kernel watchdogs
> 
> * https://patchwork.ozlabs.org/project/qemu-devel/patch/20130618233825.GA19042@xxxxxxxx/
> 
> So perhaps limiting the amount of forward jump that is allowed
> would be a good thing? (which can happen if the two hosts realtime
> clocks are off).
> 
> Now by how much, i am not sure.
> 
> Or, as mentioned earlier, only enable KVM_CLOCK_REALTIME if userspace
> KVM code checks clock synchronization.
> 
> Thomas, CC'ed, has deeper understanding of problems with
> forward time jumps than I do. Thomas, any comments?

Thomas,

Based on the earlier discussion about the problems of synchronizing
the guests' clock via a notification to the NTP/Chrony daemon
(where there is a window where applications can read the stale
value of the clock), a possible solution would be triggering
an NMI on the destination (so that it runs ASAP, with higher
priority than application/kernel).

What would this NMI do, exactly?

> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag
> for either vm pause / vm resume (well, if paused for long periods of time)
> or savevm / restorevm.

Maybe with the NMI above, it would be possible to use the realtime
clock as a way to know the time elapsed between events and advance
the guest clock without the current problematic window.
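
Just to make the "limit the forward jump" idea above concrete, here is a
rough userspace-side sketch of what it could look like if the clamping is
done in the VMM rather than in KVM. MAX_FORWARD_JUMP_NS and the helper
names are made up for the example, and the struct kvm_clock_data fields /
KVM_CLOCK_REALTIME flag assume the uapi additions from this patch:

/*
 * Rough illustration only, not part of the patch: instead of passing
 * KVM_CLOCK_REALTIME to KVM_SET_CLOCK and letting KVM add the full
 * elapsed realtime, userspace computes the delta itself, clamps it,
 * and sets an already-adjusted kvmclock value.
 */
#include <stdint.h>
#include <time.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define NSEC_PER_SEC		1000000000ULL
#define MAX_FORWARD_JUMP_NS	(10 * NSEC_PER_SEC)	/* arbitrary cap */

/* Source host: sample kvmclock (and realtime, if available) before stopping the guest. */
static struct kvm_clock_data save_clock(int vm_fd)
{
	struct kvm_clock_data data = { 0 };

	ioctl(vm_fd, KVM_GET_CLOCK, &data);		/* error handling omitted */
	return data;
}

/* Destination host: restore kvmclock, advancing it by a bounded amount. */
static void restore_clock(int vm_fd, const struct kvm_clock_data *src)
{
	struct kvm_clock_data data = { 0 };
	struct timespec ts;
	uint64_t now_real_ns, elapsed;

	clock_gettime(CLOCK_REALTIME, &ts);
	now_real_ns = (uint64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;

	data.clock = src->clock;
	if ((src->flags & KVM_CLOCK_REALTIME) && now_real_ns > src->realtime) {
		elapsed = now_real_ns - src->realtime;
		if (elapsed > MAX_FORWARD_JUMP_NS)	/* clamp the forward jump */
			elapsed = MAX_FORWARD_JUMP_NS;
		data.clock += elapsed;
	}

	/* Flags left clear: the realtime adjustment was already applied here. */
	ioctl(vm_fd, KVM_SET_CLOCK, &data);		/* error handling omitted */
}

The same structure would also cover the "only trust realtime if the hosts
are synchronized" variant: userspace either applies the (possibly clamped)
delta itself as above, or skips it entirely, depending on its own policy.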