On Mon, Aug 17, 2020 at 11:49 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 14/08/20 18:55, Oliver Upton wrote:
> > We allow our userspace to decide the host TSC / wall clock pair at
> > which the vCPUs were paused. From that host TSC value we reconstruct
> > the guest TSCs using the offsets and migrate that info. On the
> > destination we grab another host TSC / clock pair and recalculate
> > guest TSC offsets, which we then pass to KVM_SET_TSC_OFFSET. This is
> > desirable over a per-vCPU read of MSR_IA32_TSC because we've
> > effectively recorded all TSCs at the exact same moment in time.
>
> Yes, this should work very well. But in the end KVM ends up with the
> vcpu->arch.cur_tsc_{nsec,write} of the source (only shifted in time),
> while losing the notion that the pair is per-VM rather than per-VCPU for
> the "already matched" vCPUs. So that is why I'm thinking of retrieving
> that pair from the kernel directly.
>
> If you don't have time to work on it I can try to find some for 5.10,
> but I'm not sure exactly when.

Shouldn't be an issue, I'll futz around with some changes to the
series and send them out in the coming weeks.

>
> Paolo
>
> > Otherwise, we inadvertently add skew between guest TSCs by reading
> > each vCPU at different times. It seems that the sync heuristics
> > address this issue along with any guest TSC coordination.
> >
> > Not only that, relying on the heuristics to detect a sync from
> > userspace gets a bit tricky if we (KVM) are running nested. Couldn't
> > more than a second of time elapse between successive KVM_SET_MSRS when
> > running in L1 if L0 decides to pause our vCPUs (suspend/resume,
> > migration)? It seems to me that in this case we will fail to detect a
> > sync condition and configure the L2 TSCs to be out-of-phase.
> >
> > Setting the guest TSCs by offset doesn't have these complications.
> > Even if L0 were to pause L1 for some inordinate amount of time, the
> > relation of L1 -> L2 TSC is never disturbed.
> >
> >>
> >> I am all for improving migration of TSC state, but I think we should do
> >> it right, so we should migrate a host clock / TSC pair: then the
> >> destination host can use TSC frequency and host clock to compute the new
> >> TSC value. In fact, such a pair is exactly the data that the syncing
> >> heuristics track for each "generation" of syncing.
> >>
> >> To migrate the synchronization state, instead, we only need to migrate
> >> the "already_matched" (vcpu->arch.this_tsc_generation ==
> >> kvm->arch.cur_tsc_generation) state.
> >>
> >> Putting all of this together, it would be something like this:
> >>
> >> - a VM-wide KVM_CLOCK/KVM_SET_CLOCK needs to migrate
> >> vcpu->arch.cur_tsc_{nsec,write} in addition to the current kvmclock
> >> value (it's unrelated, but I don't think it's worth creating a new
> >> ioctl). A new flag is set if these fields are set in the struct. If
> >> the flag is set, KVM_SET_CLOCK copies the fields back, bumps the
> >> generation and clears kvm->arch.nr_vcpus_matched_tsc.
> >>
> >> - a VCPU-wide KVM_GET_TSC_INFO returns a host clock / guest TSC pair
> >> plus the "already matched" state. KVM_SET_TSC_INFO will only use the
> >> host clock / TSC pair if "already matched" is false, to compute the
> >> destination-side TSC offset but not otherwise do anything with it; or
> >> if "already matched" is true, it will ignore the pair, compute the TSC
> >> offset from the data in kvm->arch, and update
> >> kvm->arch.nr_vcpus_matched_tsc.
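To make the proposal above concrete, here is a rough sketch of what the
per-vCPU KVM_GET_TSC_INFO / KVM_SET_TSC_INFO payload could look like.
The struct name, field names and the flag bit below are all made up for
illustration, nothing here is from the series:

struct kvm_tsc_info {
	__u64 host_clock_ns;	/* host clock read by KVM_GET_TSC_INFO */
	__u64 guest_tsc;	/* guest TSC captured at the same instant */
	__u32 flags;		/* e.g. a hypothetical KVM_TSC_INFO_MATCHED
				 * bit carrying the "already matched" state */
	__u32 pad;
};

KVM_SET_TSC_INFO would then use host_clock_ns/guest_tsc to derive the
destination-side offset when the matched bit is clear, and otherwise
ignore the pair, recompute the offset from kvm->arch, and bump
kvm->arch.nr_vcpus_matched_tsc, as you describe.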
> >>
>
> > It seems to me that a per-vCPU ioctl (like you've suggested above) is
> > necessary to uphold the guest-facing side of our sync scheme,
> > regardless of what we do on the userspace-facing side.
> >
> >> Paolo
> >>
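For completeness, a very rough userspace-side sketch of the offset-based
flow described at the top of this mail. KVM_SET_TSC_OFFSET is the ioctl
from this series; KVM_GET_TSC_OFFSET, read_host_tsc_and_clock() and the
surrounding plumbing are hypothetical stand-ins, and TSC frequency
scaling and overflow handling are ignored:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>		/* assumes this series' uapi additions */

struct tsc_snapshot {
	uint64_t wall_ns;	/* wall clock at the capture point */
	uint64_t host_tsc;	/* host TSC read at the same instant */
};

/* Hypothetical helper: sample wall clock and host TSC together. */
void read_host_tsc_and_clock(struct tsc_snapshot *snap);

/* Source: reconstruct every guest TSC from a single capture point. */
void save_guest_tscs(int *vcpu_fds, int nr_vcpus,
		     struct tsc_snapshot *snap, uint64_t *guest_tsc)
{
	read_host_tsc_and_clock(snap);
	for (int i = 0; i < nr_vcpus; i++) {
		uint64_t off;

		/* hypothetical GET counterpart of KVM_SET_TSC_OFFSET */
		ioctl(vcpu_fds[i], KVM_GET_TSC_OFFSET, &off);
		guest_tsc[i] = snap->host_tsc + off;
	}
}

/* Destination: re-anchor each guest TSC against a fresh capture point. */
void restore_guest_tscs(int *vcpu_fds, int nr_vcpus,
			const struct tsc_snapshot *src,
			const uint64_t *guest_tsc, uint64_t tsc_khz)
{
	struct tsc_snapshot dst;
	uint64_t elapsed_ns;

	read_host_tsc_and_clock(&dst);
	elapsed_ns = dst.wall_ns - src->wall_ns;

	for (int i = 0; i < nr_vcpus; i++) {
		/* advance the guest TSC by the downtime, then re-anchor */
		uint64_t now = guest_tsc[i] + elapsed_ns * tsc_khz / 1000000;
		uint64_t off = now - dst.host_tsc;

		ioctl(vcpu_fds[i], KVM_SET_TSC_OFFSET, &off);
	}
}

Since all guest TSCs are reconstructed from one host TSC / wall clock
sample and re-anchored against another single sample, no skew is added
between vCPUs by reading them at different times.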