On 13/9/2023 5:30 pm, David Woodhouse wrote:
On 13 September 2023 11:17:49 CEST, Like Xu <like.xu.linux@xxxxxxxxx> wrote:
I think you also need to set kvm->arch.user_set_tsc() in
kvm_arch_tsc_set_attr(), don't you?
How about:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c55cc60769db..374965f66137 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5545,6 +5545,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
ns = get_kvmclock_base_ns();
+ kvm->arch.user_set_tsc = true;
__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
Yep, that looks good.
This comment isn't quite right; it wants to use some excerpt of the
commit message I've suggested above.
How about:
@@ -2735,20 +2735,34 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
* kvm_clock stable after CPU hotplug
*/
synchronizing = true;
- } else {
+ } else if (kvm->arch.user_set_tsc) {
u64 tsc_exp = kvm->arch.last_tsc_write +
nsec_to_cycles(vcpu, elapsed);
u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
/*
- * Special case: TSC write with a small delta (1 second)
- * of virtual cycle time against real time is
- * interpreted as an attempt to synchronize the CPU.
+ * Here lies UAPI baggage: user-initiated TSC write with
+ * a small delta (1 second) of virtual cycle time
+ * against real time is interpreted as an attempt to
+ * synchronize the CPU.
Much better, thanks. But I don't much like "an attempt to synchronize the CPU".
In my response to Sean I objected to that classification. Userspace is just *setting* the TSC. There is no dedicated intent to "synchronize" it. It just sets it, and the value just *might* happen to be in sync with another vCPU.
It's just that our API is so fundamentally broken that it *can't* be in sync, so we "help" it a bit if it looks close.
So maybe...
Here lies UAPI baggage: when a user-initiated TSC write has a small delta (1 second) of virtual cycle time against the previously set vCPU, we assume that they were intended to be in sync and the delta was only due to the racy nature of the legacy API.
+ * This trick falls down when restoring a guest which genuinely
+ * has been running for less time than the 1 second of imprecision
+ * which we allow for in the legacy API. In this case, the first
+ * value written by userspace (on any vCPU) should not be subject
+ * to this 'correction' to make it sync up with values that only
+ * from from the kernel's default vCPU creation. Make the 1-second
+ * slop hack only trigger if flag is already set.
+ *
+ * The correct answer is for the VMM not to use the legacy API.
Userspace used to be able to write zero to force a sync. You've removed
that ability from the ABI, and haven't even mentioned it. Must we?
Will continue to use "bool user_initiated" for lack of a better move.
Why? Can't we treat an explicit zero write just the same as when the kernel does it?
Not sure if it meets your simplified expectations:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..0f05cf90d636 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2735,20 +2735,35 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
u64 data)
* kvm_clock stable after CPU hotplug
*/
synchronizing = true;
- } else {
+ } else if (!data || kvm->arch.user_set_tsc) {
u64 tsc_exp = kvm->arch.last_tsc_write +
nsec_to_cycles(vcpu, elapsed);
u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
/*
- * Special case: TSC write with a small delta (1 second)
- * of virtual cycle time against real time is
- * interpreted as an attempt to synchronize the CPU.
+ * Here lies UAPI baggage: when a user-initiated TSC write has
+ * a small delta (1 second) of virtual cycle time against the
+ * previously set vCPU, we assume that they were intended to be
+ * in sync and the delta was only due to the racy nature of the
+ * legacy API.
+ *
+ * This trick falls down when restoring a guest which genuinely
+ * has been running for less time than the 1 second of imprecision
+ * which we allow for in the legacy API. In this case, the first
+ * value written by userspace (on any vCPU) should not be subject
+ * to this 'correction' to make it sync up with values that only
+ * from from the kernel's default vCPU creation. Make the 1-second
+ * slop hack only trigger if flag is already set.
+ *
+ * The correct answer is for the VMM not to use the legacy API.
*/
synchronizing = data < tsc_exp + tsc_hz &&
data + tsc_hz > tsc_exp;
}
}
+ if (data)
+ kvm->arch.user_set_tsc = true;
+
/*
* For a reliable TSC, we can match TSC offsets, and for an unstable
* TSC, we add elapsed time in this computation. We could let the
@@ -5536,6 +5551,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
ns = get_kvmclock_base_ns();
+ kvm->arch.user_set_tsc = true;
__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);