There is a potential for drift between the TSC and a KVM/PV clock when
the guest TSC is scaled (as seen previously in [1], which fixed drift
between the timers over the lifetime of a VM). However, there is another
factor which will cause drift.

In a situation such as a kexec/live-update of the kernel or a
live-migration of a VM, the PV clock information is recalculated by KVM
(KVM_REQ_MASTERCLOCK_UPDATE). This update samples a new system_time &
tsc_timestamp to be used in the structure.

For example, when a guest is running with a TSC frequency of 1.5GHz but
the host frequency is 3.0GHz, a delta of ~3500ns is observed between the
TSC and the KVM/PV clock after an update of the PV time information.
There is no reason why a fixup giving an accuracy of ±1ns cannot be
achieved.

Additional interfaces are added to retrieve & fix up the PV time
information when a VMM believes it is appropriate (deserialization after
live-update/migration). KVM_GET_CLOCK_GUEST can be used by the VMM to
retrieve the currently used PV time information. When the VMM believes a
drift may have occurred, it can then instruct KVM to perform a correction
via the setter KVM_SET_CLOCK_GUEST.

The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
TSC & kernel timestamp are sampled at a singular point in time. Using the
already known scaling/offset for L1, the guest TSC is then derived from
this information.

From here two PV time information structures are created. The first is
the original time information structure from before whatever caused the
PV clock re-calculation (live-update/migration). The second uses the
singular point in time sampled just prior. An individual KVM/PV clock
value is calculated for each of the PV time information structures using
the singular guest TSC.

A delta is then determined between the two calculated PV times, which is
used as a correction offset added onto the kvmclock_offset for the VM.
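The delta calculation described above can be sketched in standalone C as
follows. This is only an illustration under stated assumptions, not the
code in the patch: struct pvti_sketch is a hypothetical, trimmed stand-in
for struct pvclock_vcpu_time_info, the helper names are made up for this
example, and unsigned __int128 (a GCC/Clang extension) is used for the
64x32-bit multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed stand-in for struct pvclock_vcpu_time_info. */
struct pvti_sketch {
	uint64_t tsc_timestamp;     /* guest TSC when the structure was filled */
	uint64_t system_time;       /* guest time in ns at tsc_timestamp */
	uint32_t tsc_to_system_mul; /* 32.32 fixed-point ns-per-tick factor */
	int8_t   tsc_shift;         /* shift applied to the TSC delta first */
};

/* ns = ((delta shifted by 'shift') * mul) >> 32; negative shift = right shift. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	assert(shift > -64 && shift < 64);
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* PV clock value in ns that a structure yields for a given guest TSC. */
static uint64_t read_cycles(const struct pvti_sketch *pvti, uint64_t tsc)
{
	return pvti->system_time +
	       scale_delta(tsc - pvti->tsc_timestamp,
			   pvti->tsc_to_system_mul, pvti->tsc_shift);
}

/*
 * orig:      structure in use before the recalculation
 * guest_tsc: guest TSC derived from the single sample point
 * now_ns:    guest time in ns at that same sample point
 *
 * Returns the value to add to the clock offset so that the recalculated
 * clock agrees with the original one at guest_tsc.
 */
static int64_t clock_correction(const struct pvti_sketch *orig,
				uint64_t guest_tsc, uint64_t now_ns)
{
	struct pvti_sketch fresh = *orig;

	fresh.tsc_timestamp = guest_tsc;
	fresh.system_time = now_ns;

	return (int64_t)(read_cycles(orig, guest_tsc) -
			 read_cycles(&fresh, guest_tsc));
}
```

With a 1.5GHz guest TSC (multiplier 0xAAAAAAAA, shift 0), an original
structure anchored one second before the sample point and a freshly
sampled time that is 3500ns ahead of it, clock_correction() returns
-3500.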

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae

Suggested-by: David Woodhouse <dwmw2@xxxxxxxxxxxxx>
Signed-off-by: Jack Allister <jalliste@xxxxxxxxxx>
CC: Paul Durrant <paul@xxxxxxx>
---
 Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
 arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  3 ++
 3 files changed, 133 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..5f74d8ac1002 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks.
+On x86 a PV clock is derived from the current TSC and is then scaled based
+upon a specified multiplier and shift. The result of this is then added
+to a system time.
+
+The guest needs a way to determine the system time, multiplier and shift. This
+can be done in multiple ways; for KVM guests this can be via an MSR write to
+MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
+address at which KVM shall put the structure. On Xen guests this can be found
+in the Xen vcpu_info.
+
+This structure is useful information for a VMM to also know when taking into
+account potential timer drift on live-update/migration.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
+known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
+
+If a VM is utilizing TSC scaling there is a potential for a drift between the
+KVM/PV clock and the TSC itself. This is due to the loss of precision when
+performing a multiply and shift rather than divide for the TSC.
+
+To perform the correction a delta is calculated between the original time info
+(which is assumed correct) and the current time info at a singular point in
+time X. The KVM clock offset is then adjusted by this delta.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..5d2e10cd1c30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	return 0;
 }
 
+static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = NULL;
+	int i;
+
+	for (i = 0; i < KVM_MAX_VCPUS; i++) {
+		vcpu = kvm_get_vcpu_by_id(kvm, i);
+		if (!vcpu)
+			continue;
+
+		if (kvm_vcpu_is_reset_bsp(vcpu))
+			break;
+	}
+
+	return vcpu;
+}
+
+static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_vcpu *vcpu;
+
+	vcpu = kvm_get_bsp_vcpu(kvm);
+	if (!vcpu)
+		return -EINVAL;
+
+	if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
+		return -EIO;
+
+	if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_vcpu *vcpu;
+	struct pvclock_vcpu_time_info orig_pvti;
+	struct pvclock_vcpu_time_info dummy_pvti;
+	int64_t kernel_ns;
+	uint64_t host_tsc, guest_tsc;
+	uint64_t clock_orig, clock_dummy;
+	int64_t correction;
+	unsigned long i;
+
+	vcpu = kvm_get_bsp_vcpu(kvm);
+	if (!vcpu)
+		return -EINVAL;
+
+	if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
+		return -EFAULT;
+
+	/*
+	 * Sample the kernel time and host TSC at a singular point.
+	 * We then calculate the guest TSC using this exact point in time.
+	 * From here we can then determine the delta using the
+	 * PV time info requested from the user and what we currently have
+	 * using the fixed point in time. This delta is then used as a
+	 * correction factor to fixup the potential drift.
+	 */
+	if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
+		return -EFAULT;
+
+	guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+	dummy_pvti = orig_pvti;
+	dummy_pvti.tsc_timestamp = guest_tsc;
+	dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
+
+	clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
+	clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
+
+	correction = clock_orig - clock_dummy;
+	kvm->arch.kvmclock_offset += correction;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+	return 0;
+}
+
 int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
 	struct kvm *kvm = filp->private_data;
@@ -7246,6 +7327,12 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	case KVM_GET_CLOCK:
 		r = kvm_vm_ioctl_get_clock(kvm, argp);
 		break;
+	case KVM_SET_CLOCK_GUEST:
+		r = kvm_vm_ioctl_set_clock_guest(kvm, argp);
+		break;
+	case KVM_GET_CLOCK_GUEST:
+		r = kvm_vm_ioctl_get_clock_guest(kvm, argp);
+		break;
 	case KVM_SET_TSC_KHZ: {
 		u32 user_tsc_khz;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_SET_CLOCK_GUEST	_IOW(KVMIO, 0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST	_IOR(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
+
 #endif /* __LINUX_KVM_H */
-- 
2.40.1