[PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



There is a potential for drift between the TSC and a KVM/PV clock when the
guest TSC is scaled (as seen previously in [1]). Which fixed drift between
timers over the lifetime of a VM.

However, there is another factor which will cause a drift. In a situation
such as a kexec/live-update of the kernel or a live-migration of a VM the
PV clock information is recalculated by KVM (KVM_REQ_MASTERCLOCK_UPDATE).
This update samples a new system_time & tsc_timestamp to be used in the
structure.

For example, when a guest is running with a TSC frequency of 1.5GHz but the
host frequency is 3.0GHz upon an update of the PV time information a delta
of ~3500ns is observed between the TSC and the KVM/PV clock. There is no
reason why a fixup creating an accuracy of ±1ns cannot be achieved.

Additional interfaces are added to retrieve & fixup the PV time information
when a VMM may believe is appropriate (deserialization after live-update/
migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
currently used PV time information and then when the VMM believes a drift
may occur can then instruct KVM to perform a correction via the setter
KVM_SET_CLOCK_GUEST.

The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
TSC & kernel timstamp are sampled at a singular point in time. Using the
already known scaling/offset for L1 the guest TSC is then derived from this
information.

>From here two PV time information structures are created, one which is the
original time information structure prior to whatever may have caused a PV
clock re-calculation (live-update/migration). The second is then using the
singular point in time sampled just prior. An individual KVM/PV clock for
each of the PV time information structures using the singular guest TSC.

A delta is then determined between the two calculated PV times, which is
then used as a correction offset added onto the kvmclock_offset for the VM.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae

Suggested-by: David Woodhouse <dwmw2@xxxxxxxxxxxxx>
Signed-off-by: Jack Allister <jalliste@xxxxxxxxxx>
CC: Paul Durrant <paul@xxxxxxx>
---
 Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
 arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  3 ++
 3 files changed, 133 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..5f74d8ac1002 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks.
+On x86 a PV clock is derived from the current TSC and is then scaled based
+upon the a specified multiplier and shift. The result of this is then added
+to a system time.
+
+The guest needs a way to determine the system time, multiplier and shift. This
+can be done by multiple ways, for KVM guests this can be via an MSR write to
+MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
+address KVM shall put the structure. On Xen guests this can be found in the Xen
+vcpu_info.
+
+This is structure is useful information for a VMM to also know when taking into
+account potential timer drift on live-update/migration.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
+known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
+
+If a VM is utilizing TSC scaling there is a potential for a drift between the
+KVM/PV clock and the TSC itself. This is due to the loss of precision when
+performing a multiply and shift rather than divide for the TSC.
+
+To perform the correction a delta is calculated between the original time info
+(which is assumed correct) at a singular point in time X. The KVM clock offset
+is then offset by this delta.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..5d2e10cd1c30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	return 0;
 }
 
+static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = NULL;
+	int i;
+
+	for (i = 0; i < KVM_MAX_VCPUS; i++) {
+		vcpu = kvm_get_vcpu_by_id(kvm, i);
+		if (!vcpu)
+			continue;
+
+		if (kvm_vcpu_is_reset_bsp(vcpu))
+			break;
+	}
+
+	return vcpu;
+}
+
+static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_vcpu *vcpu;
+
+	vcpu = kvm_get_bsp_vcpu(kvm);
+	if (!vcpu)
+		return -EINVAL;
+
+	if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
+		return -EIO;
+
+	if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_vcpu *vcpu;
+	struct pvclock_vcpu_time_info orig_pvti;
+	struct pvclock_vcpu_time_info dummy_pvti;
+	int64_t kernel_ns;
+	uint64_t host_tsc, guest_tsc;
+	uint64_t clock_orig, clock_dummy;
+	int64_t correction;
+	unsigned long i;
+
+	vcpu = kvm_get_bsp_vcpu(kvm);
+	if (!vcpu)
+		return -EINVAL;
+
+	if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
+		return -EFAULT;
+
+	/*
+	 * Sample the kernel time and host TSC at a singular point.
+	 * We then calculate the guest TSC using this exact point in time,
+	 * From here we can then determine the delta using the
+	 * PV time info requested from the user and what we currently have
+	 * using the fixed point in time. This delta is then used as a
+	 * correction factor to fixup the potential drift.
+	 */
+	if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
+		return -EFAULT;
+
+	guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+	dummy_pvti = orig_pvti;
+	dummy_pvti.tsc_timestamp = guest_tsc;
+	dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
+
+	clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
+	clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
+
+	correction = clock_orig - clock_dummy;
+	kvm->arch.kvmclock_offset += correction;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+	return 0;
+}
+
 int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
 	struct kvm *kvm = filp->private_data;
@@ -7246,6 +7327,12 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	case KVM_GET_CLOCK:
 		r = kvm_vm_ioctl_get_clock(kvm, argp);
 		break;
+	case KVM_SET_CLOCK_GUEST:
+		r = kvm_vm_ioctl_set_clock_guest(kvm, argp);
+		break;
+	case KVM_GET_CLOCK_GUEST:
+		r = kvm_vm_ioctl_get_clock_guest(kvm, argp);
+		break;
 	case KVM_SET_TSC_KHZ: {
 		u32 user_tsc_khz;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
+
 #endif /* __LINUX_KVM_H */
-- 
2.40.1





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux