Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Tue, 07 Nov 2023 08:17:48 +0000

On Mon, 2023-11-06 at 17:44 -0800, Dongli Zhang wrote:
> > +       if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
> > +           static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> 
> If there any reason to use both vcpu->kvm->arch.use_master_clock and
> X86_FEATURE_CONSTANT_TSC?

Er, paranoia? I'll recheck.

> I think even __get_kvmclock() would not require both cases at the same time?
> 
>  3071         if (ka->use_master_clock &&
>  3072             (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> 

But it does. That requires ka->use_master_clock (line 3071) AND that we
know the current CPU's TSC frequency (line 3072).

My code insists on the CONSTANT_TSC form of "knowing the current CPU's
TSC frequency" because even with a get_cpu(), it's not clear the guest
*was* running on this vCPU when it did its calculations. So I don't
want to go anywhere near the !CONSTANT_TSC case; it can use the
fallback.

> > +       } else {
> > +               /*
> > +                * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
> > +                *
> > +                * Also if the guest PV clock hasn't been set up yet, as is
> > +                * likely to be the case during migration when the vCPU has
> > +                * not been run yet. It would be possible to calculate the
> > +                * scaling factors properly in that case but there's not much
> > +                * point in doing so. The get_kvmclock_ns() drift accumulates
> > +                * over time, so it's OK to use it at startup. Besides, on
> > +                * migration there's going to be a little bit of skew in the
> > +                * precise moment at which timers fire anyway. Often they'll
> > +                * be in the "past" by the time the VM is running again after
> > +                * migration.
> > +                */
> > +               guest_now = get_kvmclock_ns(vcpu->kvm);
> > +               kernel_now = ktime_get();
> 
> 1. Can I assume the issue is still there if we fall into the "else" case? That
> is, the increasing inaccuracy as the VM has been up for longer and longer time?
> 
> If that is the case, which may be better?
> 
> (1) get_kvmclock_ns(), or
> (2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
> enabled, regardless whether master clock is used. At least, the inaccurary is
> not going to increase over guest time?

No, those are both wrong, and drifting further away over time. They are
each *differently* wrong, which is why periodically clamping (1) to (2)
is also broken, as previously discussed. I know you've got a patch to
do that clamping more *often* which would slightly reduce the pain
because the kvmclock wouldn't jump backwards so *far* each time... but
it's still wrong to do so at all (in either direction).

> 
> 2. I see 3 scenarios here:
> 
> (1) vcpu->arch.hv_clock.version and master clock is used.
> 
> In this case, the bugfix looks good.
> 
> (2) The master clock is used. However, pv clock is not enabled.
> 
> In this case, the bug is not resolved? ... even the master clock is used.

Under Xen the PV clock is always enabled. It's in the vcpu_info
structure which is required for any kind of event channel setup.

> 
> (3) The master clock is not used.
> 
> We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
> changed. This looks good.
> 
> 
> Just from my own point: as this patch involves relatively complex changes, I
> would suggest resolve the issue, but not use a temporary solution :)
> 

This is the conversation I had with Paul on Tuesday, when he wanted me
to fix up this "(3) / behaviour is not changed" case. And yes, I argued
that we *don't* want a temporary solution for this case. Because yes:

> (I see you mentioned that you will be back with get_kvmclock_ns())

We need to fix get_kvmclock_ns() anyway. The systemic drift *and* the
fact that we periodically clamp it to a different clock and make it
jump. I was working on the former and have something half-done but was
preempted by the realisation that the QEMU soft freeze is today, and I
needed to flush my QEMU patch queue.

But even once we fix get_kvmclock_ns(), *this* patch stands. Because it
*also* addresses the "now" problem, where we get the time by one clock
... and then some time passes ... and we get the time by another clock,
and subtract one from the other as if they were the *same* time.

Using kvm_get_monotonic_and_clockread() gives us a single TSC read
corresponding to the CLOCK_MONOTONIC time, from which we can calculate
the kvmclock time. We just *happen* to calculate it correctly here,
unlike anywhere else in KVM.

> Based on your bug fix, I see the below cases:
> 
> If master clock is not used:
>     get_kvmclock_base_ns() + ka->kvmclock_offset
> 
> If master clock is used:
>     If pvclock is enabled:
>         use the &vcpu->arch.hv_clock to get current guest time
>     Else
>         create a temporary hv_clock, based on masterclock.

I don't want to do that last 'else' clause yet, because that feels like
a temporary workaround. It should be enough to call get_kvmclock_ns(),
once we fix it.

Attachment:
smime.p7s

Description: S/MIME cryptographic signature