[Bug 217423] TSC synchronization issue in VM restore

bugzilla-daemon@xxxxxxxxxx · Thu, 18 May 2023 12:48:05 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=217423

--- Comment #3 from zhuangel (zhuangel570@xxxxxxxxx) ---
(In reply to robert.hoo.linux from comment #2)
> On 5/9/2023 10:01 PM, bugzilla-daemon@xxxxxxxxxx wrote:
> > Hi
> > 
> > We are using lightweight VM with snapshot feature, the VM will be saved
> with
> > 100ms+, and we found restore such VM will not get correct TSC, which will
> > make
> > the VM world stop about 100ms+ after restore (the stop time is same as time
> > when VM saved).
> > 
> > After Investigation, we found the issue caused by TSC synchronization in
> > setting MSR_IA32_TSC. In VM save, VMM (cloud-hypervisor) will record TSC of
> > each
> > VCPU, then restore the TSC of VCPU in VM restore (about 100ms+ in guest
> > time).
> > But in KVM, setting a TSC within 1 second is identified as TSC
> > synchronization,
> > and the TSC offset will not be updated in stable TSC environment, this will
> > cause the lapic set up a hrtimer expires after 100ms+, 
> 
> Can elaborate more on this hrtimer issue/code path?

Below are the steps in detail, I traced them via bpftrace, to simplify the
analysis, the preemption timer on host is disabled, guest is running with
TSC timer deadline mode.

TSC changes before save VM:
1 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
  host_tsc0 = 0 + offset0
2 pause VM after guest start finished (about 200ms)
  host_tsc1 = guest_tsc1 + offset0
  guest_tsc1_deadline = guest_tsc1 + expire1
3 save VM state
  save guest_tsc1 by reading MSR_IA32_TSC
  save guest_tsc1_deadline by reading MSR_IA32_TSC_DEADLINE

TSC changes in restore VM (to simplify the analysis, step 4
and step 5 ignore the host TSC changes in restore process):
4 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
  host_tsc3 = 0 + offset1
5 restore VM state
  set MSR_IA32_TSC by guest_tsc1
  set MSR_IA32_TSC_DEADLINE by guest_tsc1_deadline
6 start VM
  VCPU_RUN

In step 5 setting MSR_IA32_TSC, because the guest_tsc1 is within 1 second,
KVM will take this update as TSC synchronize, then skip update offset1.
This means the guest TSC is still at 0 (initialize value).

In step 5 setting MSR_IA32_TSC_DEADLINE, VMM just want setup hrtimer after
expire1, but KVM will get the current guest TSC is 0, and then calculate
the expire time as (guest_tsc1_deadline - 0), then the hrtimer introduce
guest_tsc1 latency (the guest kernel stopped if it will only kick by loapic
timer).

> 
> > the restored VM now will
> > in stop state about 100ms+, if no other event to wake guest kernel in NO_HZ
> > mode.
> > 
> > More investigation show, the MSR_IA32_TSC set from guest side has disabled
> > TSC
> > synchronization in commit 0c899c25d754 (KVM: x86: do not attempt TSC
> > synchronization on guest writes), now host side will do TSC synchronization
> > when
> > setting MSR_IA32_TSC.
> > 
> > I think setting MSR_IA32_TSC within 1 second from host side should not be
> > identified as TSC synchronization, like above case, VMM set TSC from host
> > side
> > always should be updated as user want.
> 
> This is heuristics, I think; at the very beginning, it was 5 seconds.
> Perhaps nowadays, can we have some deterministic approach to identify a 
> synchronization? e.g. add a new VM ioctl?

The 5 seconds was original introduced in commit f38e098ff3a3 ("KVM: x86:
TSC reset compensation"), when setting TSC into same value in 5 second. And
in 46543ba45fc49 ("KVM: x86: Robust TSC compensation"), remove the same value
check.

>From changes in these 2 commits, I think the TSC synchronize first introduced
to handle synchronize from guest side (kernel boot), then changed to migration
setting from host side.

In commit 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization
on guest writes"), Paolo has split the guest TSC write from this complicated
synchronization.

Now just migration scenarios left, in migration the TSC restore should be set
before VCPU start to run, one hack is to skip checking whether the VCPU starts
running, such as check last_vmentry_cpu is -1.

> > 
> > The MSR_IA32_TSC set code is complicated and with a long history, so I come
> > here
> > to try to get help about whether my thought is correct. Here is my fix to
> > solve
> > the issue, any comments are welcomed:
> > 
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index ceb7c5e9cf9e..9380a88b9c1f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2722,17 +2722,6 @@ static void kvm_synchronize_tsc(struct kvm_vcpu
> *vcpu,
> > u64 data)
> >                           * kvm_clock stable after CPU hotplug
> >                           */
> >                          synchronizing = true;
> > -               } else {
> > -                       u64 tsc_exp = kvm->arch.last_tsc_write +
> > -                                               nsec_to_cycles(vcpu,
> > elapsed);
> > -                       u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
> > -                       /*
> > -                        * Special case: TSC write with a small delta (1
> > second)
> > -                        * of virtual cycle time against real time is
> > -                        * interpreted as an attempt to synchronize the
> CPU.
> > -                        */
> > -                       synchronizing = data < tsc_exp + tsc_hz &&
> > -                                       data + tsc_hz > tsc_exp;
> >                  }
> >          }
> > 
> This hunk of code is indeed historic and heuristic. But simply removing it 
> isn't the way.
> Is the interval between your "save VM" and "restore VM" less than 1s?
> 
> An alternative, I think, is to bypass this directly write IA32_MSR_TSC way 
> to set/sync TSC offsets, but follow new approach introduced in your VMM by
> 
> commit 828ca89628bfcb1b8f27535025f69dd00eb55207
> Author: Oliver Upton <oliver.upton@xxxxxxxxx>
> Date:   Thu Sep 16 18:15:38 2021 +0000
> 
>      KVM: x86: Expose TSC offset controls to userspace
> 
> ...
> 
> Documentation/virt/kvm/devices/vcpu.rst:
> 
> 4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
> 
> :Parameters: 64-bit unsigned TSC offset
> 
> ...
> 
> Specifies the guest's TSC offset relative to the host's TSC. The guest's
> TSC is then derived by the following equation:
> 
>    guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
> 
> The following describes a possible algorithm to use for this purpose
> ...

"TSC counts the time during which the VM was paused.", This new feature works
for live migration. But if we save/restore VM with snapshot, the TSC should be
paused either?

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.