On 5/18/2023 8:48 PM, bugzilla-daemon@xxxxxxxxxx wrote:
[...]
(Sorry for the late response.)
Can you elaborate more on this hrtimer issue/code path?
Below are the steps in detail; I traced them via bpftrace. To simplify the
analysis, the preemption timer on the host is disabled and the guest is
running with the TSC deadline timer.
TSC changes before saving the VM:
1 create VM/VCPU, guest TSC starts from 0 (vCPU initial value)
    host_tsc0 = 0 + offset0
2 pause the VM after guest startup has finished (about 200 ms)
    host_tsc1 = guest_tsc1 + offset0
    guest_tsc1_deadline = guest_tsc1 + expire1
3 save the VM state (sketched in code below)
    save guest_tsc1 by reading MSR_IA32_TSC
    save guest_tsc1_deadline by reading MSR_IA32_TSC_DEADLINE
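(To make step 3 concrete, a rough userspace sketch of reading the two MSRs via
the KVM_GET_MSRS ioctl; vcpu_fd and the helper name are made up for
illustration, the MSR indices are spelled out because they are architectural
constants from the SDM, and error handling is omitted:)

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Architectural MSR indices (Intel SDM). */
#define MSR_IA32_TSC		0x00000010
#define MSR_IA32_TSC_DEADLINE	0x000006e0

/* Step 3: read MSR_IA32_TSC and MSR_IA32_TSC_DEADLINE while the VM is paused. */
static void save_guest_tsc(int vcpu_fd, uint64_t *tsc, uint64_t *tsc_deadline)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entries[2];
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 2;
	msrs.entries[0].index = MSR_IA32_TSC;
	msrs.entries[1].index = MSR_IA32_TSC_DEADLINE;

	ioctl(vcpu_fd, KVM_GET_MSRS, &msrs);

	*tsc = msrs.entries[0].data;		/* guest_tsc1 */
	*tsc_deadline = msrs.entries[1].data;	/* guest_tsc1_deadline */
}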
TSC changes when restoring the VM (to simplify the analysis, steps 4 and 5
ignore the host TSC changes during the restore process):
4 create VM/VCPU, guest TSC starts from 0 (vCPU initial value)
    host_tsc3 = 0 + offset1
5 restore the VM state (sketched in code below)
    set MSR_IA32_TSC to guest_tsc1
    set MSR_IA32_TSC_DEADLINE to guest_tsc1_deadline
6 start the VM
    VCPU_RUN
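(And the matching sketch for step 5, writing the saved values back into the
newly created vCPU via KVM_SET_MSRS, reusing the includes, MSR defines and
buffer layout from the previous sketch:)

/* Step 5: write the saved values into the newly created vCPU.  This
 * host-initiated MSR_IA32_TSC write is what reaches kvm_synchronize_tsc(). */
static void restore_guest_tsc(int vcpu_fd, uint64_t tsc, uint64_t tsc_deadline)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entries[2];
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 2;
	msrs.entries[0].index = MSR_IA32_TSC;
	msrs.entries[0].data = tsc;			/* guest_tsc1 */
	msrs.entries[1].index = MSR_IA32_TSC_DEADLINE;
	msrs.entries[1].data = tsc_deadline;		/* guest_tsc1_deadline */

	ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}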
In step 5, when setting MSR_IA32_TSC, because guest_tsc1 is within 1 second of
the TSC value KVM expects, KVM treats this write as TSC synchronization and
therefore skips updating offset1. This means the guest TSC is still at 0 (the
initial value).
IIUC, here no matter whether synchronizing is true or false, the offset will
always be updated, i.e. __kvm_synchronize_tsc() will be called; but the offset
value will differ.
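(For reference, the "within 1 second" window mentioned above comes from the
synchronizing check earlier in the same function; IIRC it looks roughly like
the snippet below, lightly elided, though the exact form varies by kernel
version:)

	elapsed = ns - kvm->arch.last_tsc_nsec;

	if (vcpu->arch.virtual_tsc_khz) {
		...
		} else {
			u64 tsc_exp = kvm->arch.last_tsc_write +
						nsec_to_cycles(vcpu, elapsed);
			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
			/*
			 * Special case: TSC write with a small delta (1 second)
			 * of virtual cycle time against real time is
			 * interpreted as an attempt to synchronize the CPU.
			 */
			synchronizing = data < tsc_exp + tsc_hz &&
					data + tsc_hz > tsc_exp;
		}
	}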
I guess your environment has a stable TSC, so offset = kvm->arch.cur_tsc_offset,
which is 0. That is to say, the elapsed time isn't accounted for by the
heuristic in the current code; that's the culprit.
static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
{
	...
	offset = kvm_compute_l1_tsc_offset(vcpu, data);
	...
	/*
	 * For a reliable TSC, we can match TSC offsets, and for an unstable
	 * TSC, we add elapsed time in this computation. We could let the
	 * compensation code attempt to catch up if we fall behind, but
	 * it's better to try to match offsets from the beginning.
	 */
	if (synchronizing &&
	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
		if (!kvm_check_tsc_unstable()) {
			offset = kvm->arch.cur_tsc_offset;
		} else {
			u64 delta = nsec_to_cycles(vcpu, elapsed);
			data += delta;
			offset = kvm_compute_l1_tsc_offset(vcpu, data);
		}
		matched = true;
	}

	__kvm_synchronize_tsc(vcpu, offset, data, ns, matched);
	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
}
An alternative, I think, is to bypass this way of setting/syncing the TSC by
writing MSR_IA32_TSC directly, and instead follow, in your VMM, the new
approach introduced by commit 828ca89628bfcb1b8f27535025f69dd00eb55207
Author: Oliver Upton <oliver.upton@xxxxxxxxx>
Date: Thu Sep 16 18:15:38 2021 +0000
KVM: x86: Expose TSC offset controls to userspace
...
Documentation/virt/kvm/devices/vcpu.rst:
4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
:Parameters: 64-bit unsigned TSC offset
...
Specifies the guest's TSC offset relative to the host's TSC. The guest's
TSC is then derived by the following equation:
guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
The following describes a possible algorithm to use for this purpose
...
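For illustration, a rough sketch of reading and writing this per-vCPU attribute
through the generic KVM_GET_DEVICE_ATTR / KVM_SET_DEVICE_ATTR ioctls on the
vCPU fd; vcpu_fd and the helper names are made up, error handling is omitted,
and the uapi headers are assumed to be new enough to define KVM_VCPU_TSC_CTRL
and KVM_VCPU_TSC_OFFSET:

#include <linux/kvm.h>		/* pulls in <asm/kvm.h> for the TSC attribute defines */
#include <stdint.h>
#include <sys/ioctl.h>

static uint64_t get_tsc_offset(int vcpu_fd)
{
	uint64_t offset = 0;
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)&offset,
	};

	/* guest_tsc = host_tsc + offset, per the documentation quoted above */
	ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
	return offset;
}

static void set_tsc_offset(int vcpu_fd, uint64_t offset)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)&offset,
	};

	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}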
"TSC counts the time during which the VM was paused.", This new feature works
for live migration. But if we save/restore a VM via a snapshot, shouldn't the
TSC be paused as well?
Not sure what the host's TSC situation is when the host is, say,
suspended/hibernated; VM save/restore could take that as a reference.
But the key point of this new approach is to use an OFFSET rather than a direct
TSC value; this is similar to why x86 TSC_ADJUST was introduced, and it is the
preferred way.
Via this new interface,
"... Ensure that the KVM_CLOCK_REALTIME flag is set in the provided structure.
KVM will advance the VM's kvmclock to account for elapsed time since recording
the clock values."
Therefore I think it can solve your problem, rather than modifying the ancient,
heuristic code, which would be high risk.
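For completeness, a rough sketch of the kvmclock side of that flow; vm_fd and
the helper names are made up, error handling is omitted, and the kernel is
assumed to be new enough to report KVM_CLOCK_REALTIME from KVM_GET_CLOCK:

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* At save/snapshot time: capture kvmclock together with the host realtime
 * reference that KVM_GET_CLOCK provides when KVM_CLOCK_REALTIME is reported. */
static struct kvm_clock_data save_clock(int vm_fd)
{
	struct kvm_clock_data data = { 0 };

	ioctl(vm_fd, KVM_GET_CLOCK, &data);
	if (!(data.flags & KVM_CLOCK_REALTIME))
		fprintf(stderr, "no realtime pairing; elapsed time won't be accounted\n");
	return data;
}

/* At restore time: hand the saved structure back unmodified; with
 * KVM_CLOCK_REALTIME set, KVM advances the VM's kvmclock by the wall-clock
 * time that elapsed since the values were recorded. */
static void restore_clock(int vm_fd, struct kvm_clock_data data)
{
	ioctl(vm_fd, KVM_SET_CLOCK, &data);
}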