On 5/18/2023 8:48 PM, bugzilla-daemon@xxxxxxxxxx wrote:
[...]
(Sorry for the late response.)
Can you elaborate more on this hrtimer issue/code path?
Below are the steps in detail; I traced them via bpftrace. To simplify the
analysis, the preemption timer on the host is disabled and the guest is
running with the TSC deadline timer.
TSC changes before saving the VM:
1 create VM/VCPU, guest TSC starts from 0 (vCPU initial value)
    host_tsc0 = 0 + offset0
2 pause the VM after guest startup has finished (about 200 ms)
    host_tsc1 = guest_tsc1 + offset0
    guest_tsc1_deadline = guest_tsc1 + expire1
3 save the VM state (sketched in code below)
    save guest_tsc1 by reading MSR_IA32_TSC
    save guest_tsc1_deadline by reading MSR_IA32_TSC_DEADLINE
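(To make step 3 concrete, a rough userspace sketch of reading the two MSRs via
the KVM_GET_MSRS ioctl; vcpu_fd and the helper name are made up for
illustration, the MSR indices are spelled out because they are architectural
constants from the SDM, and error handling is omitted:)

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Architectural MSR indices (Intel SDM). */
#define MSR_IA32_TSC		0x00000010
#define MSR_IA32_TSC_DEADLINE	0x000006e0

/* Step 3: read MSR_IA32_TSC and MSR_IA32_TSC_DEADLINE while the VM is paused. */
static void save_guest_tsc(int vcpu_fd, uint64_t *tsc, uint64_t *tsc_deadline)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entries[2];
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 2;
	msrs.entries[0].index = MSR_IA32_TSC;
	msrs.entries[1].index = MSR_IA32_TSC_DEADLINE;

	ioctl(vcpu_fd, KVM_GET_MSRS, &msrs);

	*tsc = msrs.entries[0].data;		/* guest_tsc1 */
	*tsc_deadline = msrs.entries[1].data;	/* guest_tsc1_deadline */
}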
TSC changes when restoring the VM (to simplify the analysis, steps 4 and 5
ignore the host TSC changes during the restore process):
4 create VM/VCPU, guest TSC starts from 0 (vCPU initial value)
    host_tsc3 = 0 + offset1
5 restore the VM state (sketched in code below)
    set MSR_IA32_TSC to guest_tsc1
    set MSR_IA32_TSC_DEADLINE to guest_tsc1_deadline
6 start the VM
    VCPU_RUN
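(And the matching sketch for step 5, writing the saved values back into the
newly created vCPU via KVM_SET_MSRS, reusing the includes, MSR defines and
buffer layout from the previous sketch:)

/* Step 5: write the saved values into the newly created vCPU.  This
 * host-initiated MSR_IA32_TSC write is what reaches kvm_synchronize_tsc(). */
static void restore_guest_tsc(int vcpu_fd, uint64_t tsc, uint64_t tsc_deadline)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entries[2];
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 2;
	msrs.entries[0].index = MSR_IA32_TSC;
	msrs.entries[0].data = tsc;			/* guest_tsc1 */
	msrs.entries[1].index = MSR_IA32_TSC_DEADLINE;
	msrs.entries[1].data = tsc_deadline;		/* guest_tsc1_deadline */

	ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}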
In step 5, when setting MSR_IA32_TSC, because guest_tsc1 is within 1 second of
the TSC value KVM expects, KVM treats this write as TSC synchronization and
therefore skips updating offset1. This means the guest TSC is still at 0 (the
initial value).
IIUC, here no matter whether synchronizing is true or false, the offset will
always be updated, i.e. __kvm_synchronize_tsc() will be called; but the offset
value will differ.
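(For reference, the "within 1 second" window mentioned above comes from the
synchronizing check earlier in the same function; IIRC it looks roughly like
the snippet below, lightly elided, though the exact form varies by kernel
version:)

	elapsed = ns - kvm->arch.last_tsc_nsec;

	if (vcpu->arch.virtual_tsc_khz) {
		...
		} else {
			u64 tsc_exp = kvm->arch.last_tsc_write +
						nsec_to_cycles(vcpu, elapsed);
			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
			/*
			 * Special case: TSC write with a small delta (1 second)
			 * of virtual cycle time against real time is
			 * interpreted as an attempt to synchronize the CPU.
			 */
			synchronizing = data < tsc_exp + tsc_hz &&
					data + tsc_hz > tsc_exp;
		}
	}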
I guess your environment has a stable TSC, so offset = kvm->arch.cur_tsc_offset,
which is 0. That is to say, the elapsed time isn't accounted for by the
heuristic in the current code; that's the culprit.
static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
{
	...
	offset = kvm_compute_l1_tsc_offset(vcpu, data);
	...
	/*
	 * For a reliable TSC, we can match TSC offsets, and for an unstable
	 * TSC, we add elapsed time in this computation. We could let the
	 * compensation code attempt to catch up if we fall behind, but
	 * it's better to try to match offsets from the beginning.
	 */
	if (synchronizing &&
	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
		if (!kvm_check_tsc_unstable()) {
			offset = kvm->arch.cur_tsc_offset;
		} else {
			u64 delta = nsec_to_cycles(vcpu, elapsed);
			data += delta;
			offset = kvm_compute_l1_tsc_offset(vcpu, data);
		}
		matched = true;
	}

	__kvm_synchronize_tsc(vcpu, offset, data, ns, matched);
	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
}
An alternative, I think, is to bypass this way of setting/syncing the TSC by
writing MSR_IA32_TSC directly, and instead follow, in your VMM, the new
approach introduced by commit 828ca89628bfcb1b8f27535025f69dd00eb55207
Author: Oliver Upton <oliver.upton@xxxxxxxxx>
Date: Thu Sep 16 18:15:38 2021 +0000
KVM: x86: Expose TSC offset controls to userspace
...
Documentation/virt/kvm/devices/vcpu.rst:
4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
:Parameters: 64-bit unsigned TSC offset
...
Specifies the guest's TSC offset relative to the host's TSC. The guest's
TSC is then derived by the following equation:
guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
The following describes a possible algorithm to use for this purpose
...
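For illustration, a rough sketch of reading and writing this per-vCPU attribute
through the generic KVM_GET_DEVICE_ATTR / KVM_SET_DEVICE_ATTR ioctls on the
vCPU fd; vcpu_fd and the helper names are made up, error handling is omitted,
and the uapi headers are assumed to be new enough to define KVM_VCPU_TSC_CTRL
and KVM_VCPU_TSC_OFFSET:

#include <linux/kvm.h>		/* pulls in <asm/kvm.h> for the TSC attribute defines */
#include <stdint.h>
#include <sys/ioctl.h>

static uint64_t get_tsc_offset(int vcpu_fd)
{
	uint64_t offset = 0;
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)&offset,
	};

	/* guest_tsc = host_tsc + offset, per the documentation quoted above */
	ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
	return offset;
}

static void set_tsc_offset(int vcpu_fd, uint64_t offset)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)&offset,
	};

	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}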
"TSC counts the time during which the VM was paused.", This new feature works
for live migration. But if we save/restore a VM via a snapshot, shouldn't the
TSC be paused as well?
Not sure what the host's TSC situation is when the host is, say,
suspended/hibernated; VM save/restore could take that as a reference.
But the key point of this new approach is to use an OFFSET rather than a direct
TSC value; this is similar to why x86 TSC_ADJUST was introduced, and it is the
preferred way.
Via this new interface,
"... Ensure that the KVM_CLOCK_REALTIME flag is set in the provided structure.
KVM will advance the VM's kvmclock to account for elapsed time since recording
the clock values."
Therefore I think it can solve your problem, rather than modifying the ancient,
heuristic code, which would be high risk.
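For completeness, a rough sketch of the kvmclock side of that flow; vm_fd and
the helper names are made up, error handling is omitted, and the kernel is
assumed to be new enough to report KVM_CLOCK_REALTIME from KVM_GET_CLOCK:

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* At save/snapshot time: capture kvmclock together with the host realtime
 * reference that KVM_GET_CLOCK provides when KVM_CLOCK_REALTIME is reported. */
static struct kvm_clock_data save_clock(int vm_fd)
{
	struct kvm_clock_data data = { 0 };

	ioctl(vm_fd, KVM_GET_CLOCK, &data);
	if (!(data.flags & KVM_CLOCK_REALTIME))
		fprintf(stderr, "no realtime pairing; elapsed time won't be accounted\n");
	return data;
}

/* At restore time: hand the saved structure back unmodified; with
 * KVM_CLOCK_REALTIME set, KVM advances the VM's kvmclock by the wall-clock
 * time that elapsed since the values were recorded. */
static void restore_clock(int vm_fd, struct kvm_clock_data data)
{
	ioctl(vm_fd, KVM_SET_CLOCK, &data);
}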