On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > >
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
> > > suitable options.
> >
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity. In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
>
> Yeah, it seems like the problem is that KVM_GET_CLOCK return host
> unscaled TSC units (which was done because the guest TSC frequency is at
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
>
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?
>

My criteria for this are that in the case of a live update (serialize
guest, kexec, resume guest on precisely the same hardware a few
milliseconds of steal time later), the guest clocks (KVM_CLOCK, TSC,
etc) shall be *precisely* the same as before with no slop. The same
offset, the same scaling. And ideally precisely the same values
advertised in the pvclock data to the guest.

In the case of live migration, we can't be cycle-accurate because of
course we're limited to the accuracy of the NTP sync between hosts. But
that is the *only* inaccuracy we shall incur, because we can express
clocks in terms of each other, e.g. <KVM clock was X at time of day Y>
and then e.g. <TSC was W at KVM clock X> is unchanged from before.

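To make "express clocks in terms of each other" concrete, here is a
minimal sketch of the arithmetic only. Everything in it is an assumption
for illustration: struct clock_snapshot and both helpers are hypothetical
names, not KVM uAPI, and it assumes the KVM clock advances 1:1 with the
time of day between the snapshot and the resume. It also relies on the
GCC/Clang unsigned __int128 extension for the intermediate product.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot taken on the source host; not a KVM uAPI struct. */
struct clock_snapshot {
	uint64_t realtime_ns;   /* time of day Y */
	uint64_t kvmclock_ns;   /* KVM clock X at time of day Y */
	uint64_t guest_tsc;     /* guest TSC W at KVM clock X */
	uint64_t guest_tsc_khz; /* guest TSC frequency in kHz */
};

/*
 * KVM clock value that corresponds to time of day now_ns.
 * Assumption: the KVM clock advances 1:1 with the time of day.
 */
static uint64_t expected_kvmclock_ns(const struct clock_snapshot *s,
				     uint64_t now_ns)
{
	assert(now_ns >= s->realtime_ns);
	return s->kvmclock_ns + (now_ns - s->realtime_ns);
}

/* Guest TSC value that corresponds to KVM clock value kvmclock_ns. */
static uint64_t expected_guest_tsc(const struct clock_snapshot *s,
				   uint64_t kvmclock_ns)
{
	uint64_t delta_ns;
	unsigned __int128 cycles;

	assert(kvmclock_ns >= s->kvmclock_ns);
	delta_ns = kvmclock_ns - s->kvmclock_ns;

	/* cycles = ns * kHz / 10^6; the 128-bit product cannot overflow. */
	cycles = (unsigned __int128)delta_ns * s->guest_tsc_khz / 1000000u;
	return s->guest_tsc + (uint64_t)cycles;
}
```

Both helpers take an explicit reference point rather than sampling a
clock themselves, so nothing here depends on a notion of "now". How such
a pair of values is handed to the kernel without time elapsing in
between is not addressed by this sketch.
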
It's OK to expect userspace to do some calculation, as long as we never
expect userspace to magically perform calculations based on some
concept of "now", and to call kernel APIs, without any actual time
elapsing while it does so.

I said it's OK to expect userspace to do *some* calculation. But that
should be clearly documented, *and* when we document it, that
documentation shouldn't codify too much of the kernel's internal
relationships between clocks, and shouldn't make us ashamed to be
kernel engineers.

We tried documenting it, in
https://lore.kernel.org/all/20220316045308.2313184-1-oupton@xxxxxxxxxx/

I don't quite know how to summarise that thread, other than "it's too
broken; let's fix it first and *then* document it".

But if it can be done, I'm happy for someone to fix the documentation
in a way which describes how to meet the above criteria using the
existing kernel APIs. And then perhaps we can make a decision about
just how ashamed of ourselves we should be, and whether we want to
provide a better, easier API for userspace to use.
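
For reference, the "TSC scaling computations" that the quoted proposal
would have userspace duplicate amount to a fixed-point multiply plus an
offset. The following is a rough sketch of that arithmetic, not the
kernel's code: the function names are made up for illustration, and the
number of fractional bits is left as a parameter because it depends on
the hardware (I believe it is 48 on Intel VMX and 32 on AMD SVM, but
treat that as an assumption). It uses the GCC/Clang unsigned __int128
extension.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point ratio guest_khz / host_khz with frac_bits fractional bits. */
static uint64_t tsc_ratio(uint64_t guest_khz, uint64_t host_khz,
			  unsigned int frac_bits)
{
	assert(host_khz != 0 && frac_bits < 64);
	return (uint64_t)(((unsigned __int128)guest_khz << frac_bits) / host_khz);
}

/* Scale a host TSC reading: (host_tsc * ratio) >> frac_bits. */
static uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio,
			  unsigned int frac_bits)
{
	assert(frac_bits < 64);
	return (uint64_t)(((unsigned __int128)host_tsc * ratio) >> frac_bits);
}

/*
 * Guest TSC = scaled host TSC + offset. The offset is treated as a
 * 64-bit two's-complement value, so a "negative" offset simply wraps.
 */
static uint64_t guest_tsc_from_host(uint64_t host_tsc, uint64_t ratio,
				    unsigned int frac_bits, uint64_t offset)
{
	return scale_tsc(host_tsc, ratio, frac_bits) + offset;
}
```

For example, a 3 GHz guest on a 2 GHz host gives a ratio of 1.5, so a
host TSC reading of 1000 scales to 1500 before the offset is applied.
The arithmetic itself is short; what it cannot do is cope with time
elapsing between reading the host TSC and applying the result.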