On 14/08/20 18:55, Oliver Upton wrote:
> We allow our userspace to decide the host TSC / wall clock pair at
> which the vCPUs were paused. From that host TSC value we reconstruct
> the guest TSCs using the offsets and migrate that info. On the
> destination we grab another host TSC / clock pair and recalculate
> guest TSC offsets, which we then pass to KVM_SET_TSC_OFFSET. This is
> desirable over a per-vCPU read of MSR_IA32_TSC because we've
> effectively recorded all TSCs at the exact same moment in time.

Yes, this should work very well.  But in the end KVM ends up with the
vcpu->arch.cur_tsc_{nsec,write} of the source (only shifted in time),
while losing the notion that the pair is per-VM rather than per-VCPU
for the "already matched" vCPUs.

So that is why I'm thinking of retrieving that pair from the kernel
directly.  If you don't have time to work on it I can try to find
some for 5.10, but I'm not sure exactly when.

Paolo

> Otherwise, we inadvertently add skew between guest TSCs by reading
> each vCPU at different times. It seems that the sync heuristics
> address this issue along with any guest TSC coordination.
>
> Not only that, depending on the heuristics to detect a sync from
> userspace gets a bit tricky if we (KVM) are running nested. Couldn't
> more than a second of time elapse between successive KVM_SET_MSRS when
> running in L1 if L0 decides to pause our vCPUs (suspend/resume,
> migration)? It seems to me that in this case we will fail to detect a
> sync condition and configure the L2 TSCs to be out-of-phase.
>
> Setting the guest TSCs by offset doesn't have these complications.
> Even if L0 were to pause L1 for some inordinate amount of time, the
> relation of L1 -> L2 TSC is never disturbed.
>
>>
>> I am all for improving migration of TSC state, but I think we should do
>> it right, so we should migrate a host clock / TSC pair: then the
>> destination host can use TSC frequency and host clock to compute the new
>> TSC value.  In fact, such a pair is exactly the data that the syncing
>> heuristics track for each "generation" of syncing.
>>
>> To migrate the synchronization state, instead, we only need to migrate
>> the "already_matched" (vcpu->arch.this_tsc_generation ==
>> kvm->arch.cur_tsc_generation) state.
>>
>> Putting all of this together, it would be something like this:
>>
>> - a VM-wide KVM_GET_CLOCK/KVM_SET_CLOCK needs to migrate
>> vcpu->arch.cur_tsc_{nsec,write} in addition to the current kvmclock
>> value (it's unrelated, but I don't think it's worth creating a new
>> ioctl).  A new flag is set if these fields are set in the struct.  If
>> the flag is set, KVM_SET_CLOCK copies the fields back, bumps the
>> generation and clears kvm->arch.nr_vcpus_matched_tsc.
>>
>> - a VCPU-wide KVM_GET_TSC_INFO returns a host clock / guest TSC pair
>> plus the "already matched" state.  KVM_SET_TSC_INFO will only use the
>> host clock / TSC pair if "already matched" is false, to compute the
>> destination-side TSC offset but not otherwise doing anything with it; or
>> if "already matched" is true, it will ignore the pair, compute the TSC
>> offset from the data in kvm->arch, and update
>> kvm->arch.nr_vcpus_matched_tsc.
>>
>
> It seems to me that a per-vCPU ioctl (like you've suggested above) is
> necessary to uphold the guest-facing side of our sync scheme,
> regardless of what we do on the userspace-facing side.
>
>> Paolo
>>
>
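
To make the userspace side of this concrete, here is a rough, untested
sketch of the recomputation Oliver describes above.  The struct, the
helper name and the policy of advancing the guest TSCs by the wall-clock
time spent in the blackout are all invented for illustration; this is
not an existing QEMU or KVM interface.

#include <stdint.h>

#define MAX_VCPUS 288

/* Snapshot taken on the source while every vCPU is paused. */
struct tsc_migration_blob {
	uint64_t wall_ns;              /* CLOCK_REALTIME at the pause instant */
	uint32_t tsc_khz;              /* guest TSC frequency */
	uint64_t guest_tsc[MAX_VCPUS]; /* src host TSC + per-vCPU offset */
};

/*
 * Destination side: a single fresh host TSC / wall clock pair is used
 * for every vCPU, so no inter-vCPU skew is introduced.  Each offset[i]
 * would then be handed to KVM_SET_TSC_OFFSET for vCPU i.
 */
static void recompute_offsets(const struct tsc_migration_blob *b,
			      uint64_t dst_host_tsc, uint64_t dst_wall_ns,
			      int64_t *offset, int nr_vcpus)
{
	/*
	 * Guest cycles elapsed during the blackout: tsc_khz is cycles per
	 * millisecond, so us * tsc_khz / 1000 gives cycles (microsecond
	 * precision is plenty for a sketch).
	 */
	uint64_t elapsed_us = (dst_wall_ns - b->wall_ns) / 1000;
	uint64_t elapsed_cycles = elapsed_us * b->tsc_khz / 1000;
	int i;

	for (i = 0; i < nr_vcpus; i++)
		offset[i] = (int64_t)(b->guest_tsc[i] + elapsed_cycles -
				      dst_host_tsc);
}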
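
And, for completeness, a purely illustrative sketch of what the uAPI
proposed above could look like.  None of these structs, flags or field
names exist today; the kvm_clock_data layout below just assumes the new
fields can be carved out of the existing padding.

#include <linux/types.h>

/* VM-wide: extend KVM_GET_CLOCK/KVM_SET_CLOCK with the sync generation. */
#define KVM_CLOCK_TSC_SYNC_STATE	(1 << 2)	/* hypothetical value */

struct kvm_clock_data {
	__u64 clock;		/* kvmclock value, as today */
	__u32 flags;		/* KVM_CLOCK_TSC_SYNC_STATE if fields below are valid */
	__u32 pad0;
	__u64 cur_tsc_nsec;	/* kvm->arch.cur_tsc_nsec of the current generation */
	__u64 cur_tsc_write;	/* kvm->arch.cur_tsc_write of the current generation */
	__u32 pad[4];
};

/* Per-vCPU: payload of the proposed KVM_GET_TSC_INFO / KVM_SET_TSC_INFO. */
#define KVM_TSC_INFO_MATCHED	(1 << 0)

struct kvm_tsc_info {
	__u32 flags;		/* MATCHED if this_tsc_generation == cur_tsc_generation */
	__u32 pad;
	__u64 host_clock_ns;	/* host clock at which guest_tsc was sampled */
	__u64 guest_tsc;	/* guest TSC value at host_clock_ns */
};

/*
 * KVM_SET_TSC_INFO, as sketched above: if MATCHED is clear, compute the
 * vCPU's TSC offset from (host_clock_ns, guest_tsc) alone; if MATCHED is
 * set, ignore the pair, compute the offset from kvm->arch.cur_tsc_{nsec,
 * write} (restored earlier by KVM_SET_CLOCK) and increment
 * kvm->arch.nr_vcpus_matched_tsc.
 */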