Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Fri, 24 Mar 2023 13:08:54 +0000

On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > > 
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET is a
> > > suitable option.
> > 
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity.  In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
> 
> Yeah, it seems like the problem is that KVM_GET_CLOCK returns host
> unscaled TSC units (which was done because the guest TSC frequency is at
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
> 
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?
> 

My criteria for this are that in the case of a live update (serialize
guest, kexec, resume guest on precisely the same hardware a few
milliseconds of steal time later), the guest clocks (KVM_CLOCK, TSC,
etc) shall be *precisely* the same as before with no slop. The same
offset, the same scaling. And ideally precisely the same values
advertised in the pvclock data to the guest.

In the case of live migration, we can't be cycle-accurate because of
course we're limited to the accuracy of the NTP sync between hosts. But
that is the *only* inaccuracy we shall incur, because we can express
clocks in terms of each other, e.g. <KVM clock was X at time of day Y>
and then e.g. <TSC was W at KVM clock X> is unchanged from before.
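As a rough illustration of "expressing clocks in terms of each other",
here is a minimal userspace-side sketch in Python. Everything in it is
hypothetical: ClockSnapshot, guest_clocks_at() and the field names are
made up for this example and are not kernel or VMM APIs, and it assumes
a single fixed guest TSC frequency in Hz. The point is only that the
saved state is a set of pairs taken at one instant, so the result
depends on the wall-clock time passed in, not on when the calculation
happens to run.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000


@dataclass(frozen=True)
class ClockSnapshot:
    """Hypothetical saved state: all values describe the same instant."""
    realtime_ns: int    # time of day Y (host wall clock, ns)
    kvmclock_ns: int    # KVM clock X at time of day Y
    guest_tsc: int      # guest TSC W at KVM clock X
    guest_tsc_hz: int   # guest TSC frequency (assumed constant)


def guest_clocks_at(snap: ClockSnapshot, realtime_ns: int) -> tuple[int, int]:
    """Return (kvmclock_ns, guest_tsc) the guest should observe at realtime_ns.

    Only the wall-clock delta enters the calculation, so any error comes
    from the wall clocks themselves (e.g. NTP skew between hosts).
    """
    elapsed_ns = realtime_ns - snap.realtime_ns
    kvmclock_ns = snap.kvmclock_ns + elapsed_ns
    # Integer arithmetic only; Python ints do not overflow.
    guest_tsc = snap.guest_tsc + elapsed_ns * snap.guest_tsc_hz // NSEC_PER_SEC
    return kvmclock_ns, guest_tsc


if __name__ == "__main__":
    snap = ClockSnapshot(
        realtime_ns=5_000_000_000,
        kvmclock_ns=1_000_000_000,
        guest_tsc=3_000_000_000,
        guest_tsc_hz=3_000_000_000,
    )
    # Resume 0.5 ms of wall-clock time after the snapshot was taken.
    print(guest_clocks_at(snap, 5_000_500_000))
```

Note that the function takes the target time as an argument rather than
reading the clock itself; that keeps the calculation independent of how
long userspace takes to get around to it.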

It's OK to expect userspace to do some calculation, as long as we never
expect userspace to magically perform calculations based on some
concept of "now", and to call kernel APIs, without any actual time
elapsing while it does so.

I said it's OK to expect userspace to do *some* calculation. But that
should be clearly documented, *and* when we document it, that
documentation shouldn't codify too much of the kernel's internal
relationships between clocks, and shouldn't make us ashamed to be
kernel engineers. 

We tried documenting it, in
https://lore.kernel.org/all/20220316045308.2313184-1-oupton@xxxxxxxxxx/

I don't quite know how to summarise that thread, other than "it's too
broken; let's fix it first and *then* document it".

But if it can be done, I'm happy for someone to fix the documentation
in a way which describes how to meet the above criteria using the
existing kernel APIs. And then perhaps we can make a decision about
just how ashamed of ourselves we should be, and whether we want to
provide a better, easier API for userspace to use.
