On Mon, Aug 17, 2020 at 11:49 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 14/08/20 18:55, Oliver Upton wrote:
> > We allow our userspace to decide the host TSC / wall clock pair at
> > which the vCPUs were paused. From that host TSC value we reconstruct
> > the guest TSCs using the offsets and migrate that info. On the
> > destination we grab another host TSC / clock pair and recalculate
> > guest TSC offsets, which we then pass to KVM_SET_TSC_OFFSET. This is
> > desirable over a per-vCPU read of MSR_IA32_TSC because we've
> > effectively recorded all TSCs at the exact same moment in time.
>
> Yes, this should work very well. But in the end KVM ends up with the
> vcpu->arch.cur_tsc_{nsec,write} of the source (only shifted in time),
> while losing the notion that the pair is per-VM rather than per-VCPU for
> the "already matched" vCPUs. So that is why I'm thinking of retrieving
> that pair from the kernel directly.
>
> If you don't have time to work on it I can try to find some for 5.10,
> but I'm not sure exactly when.

Shouldn't be an issue, I'll futz around with some changes to the
series and send them out in the coming weeks.

>
> Paolo
>
> > Otherwise, we inadvertently add skew between guest TSCs by reading
> > each vCPU at different times. It seems that the sync heuristics
> > address this issue along with any guest TSC coordination.
> >
> > Not only that, relying on the heuristics to detect a sync from
> > userspace gets a bit tricky if we (KVM) are running nested. Couldn't
> > more than a second of time elapse between successive KVM_SET_MSRS when
> > running in L1 if L0 decides to pause our vCPUs (suspend/resume,
> > migration)? It seems to me that in this case we will fail to detect a
> > sync condition and configure the L2 TSCs to be out-of-phase.
> >
> > Setting the guest TSCs by offset doesn't have these complications.
> > Even if L0 were to pause L1 for some inordinate amount of time, the
> > relation of L1 -> L2 TSC is never disturbed.
> >
> >>
> >> I am all for improving migration of TSC state, but I think we should do
> >> it right, so we should migrate a host clock / TSC pair: then the
> >> destination host can use TSC frequency and host clock to compute the new
> >> TSC value. In fact, such a pair is exactly the data that the syncing
> >> heuristics track for each "generation" of syncing.
> >>
> >> To migrate the synchronization state, instead, we only need to migrate
> >> the "already_matched" (vcpu->arch.this_tsc_generation ==
> >> kvm->arch.cur_tsc_generation) state.
> >>
> >> Putting all of this together, it would be something like this:
> >>
> >> - a VM-wide KVM_CLOCK/KVM_SET_CLOCK needs to migrate
> >> vcpu->arch.cur_tsc_{nsec,write} in addition to the current kvmclock
> >> value (it's unrelated, but I don't think it's worth creating a new
> >> ioctl). A new flag is set if these fields are set in the struct. If
> >> the flag is set, KVM_SET_CLOCK copies the fields back, bumps the
> >> generation and clears kvm->arch.nr_vcpus_matched_tsc.
> >>
> >> - a VCPU-wide KVM_GET_TSC_INFO returns a host clock / guest TSC pair
> >> plus the "already matched" state. KVM_SET_TSC_INFO will only use the
> >> host clock / TSC pair if "already matched" is false, to compute the
> >> destination-side TSC offset but not otherwise do anything with it; or
> >> if "already matched" is true, it will ignore the pair, compute the TSC
> >> offset from the data in kvm->arch, and update
> >> kvm->arch.nr_vcpus_matched_tsc.
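To make the proposal above concrete, here is a rough sketch of what the
per-vCPU KVM_GET_TSC_INFO / KVM_SET_TSC_INFO payload could look like.
The struct name, field names and the flag bit below are all made up for
illustration, nothing here is from the series:

struct kvm_tsc_info {
	__u64 host_clock_ns;	/* host clock read by KVM_GET_TSC_INFO */
	__u64 guest_tsc;	/* guest TSC captured at the same instant */
	__u32 flags;		/* e.g. a hypothetical KVM_TSC_INFO_MATCHED
				 * bit carrying the "already matched" state */
	__u32 pad;
};

KVM_SET_TSC_INFO would then use host_clock_ns/guest_tsc to derive the
destination-side offset when the matched bit is clear, and otherwise
ignore the pair, recompute the offset from kvm->arch, and bump
kvm->arch.nr_vcpus_matched_tsc, as you describe.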
> >>
>
> > It seems to me that a per-vCPU ioctl (like you've suggested above) is
> > necessary to uphold the guest-facing side of our sync scheme,
> > regardless of what we do on the userspace-facing side.
> >
> >> Paolo
> >>
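For completeness, a very rough userspace-side sketch of the offset-based
flow described at the top of this mail. KVM_SET_TSC_OFFSET is the ioctl
from this series; KVM_GET_TSC_OFFSET, read_host_tsc_and_clock() and the
surrounding plumbing are hypothetical stand-ins, and TSC frequency
scaling and overflow handling are ignored:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>		/* assumes this series' uapi additions */

struct tsc_snapshot {
	uint64_t wall_ns;	/* wall clock at the capture point */
	uint64_t host_tsc;	/* host TSC read at the same instant */
};

/* Hypothetical helper: sample wall clock and host TSC together. */
void read_host_tsc_and_clock(struct tsc_snapshot *snap);

/* Source: reconstruct every guest TSC from a single capture point. */
void save_guest_tscs(int *vcpu_fds, int nr_vcpus,
		     struct tsc_snapshot *snap, uint64_t *guest_tsc)
{
	read_host_tsc_and_clock(snap);
	for (int i = 0; i < nr_vcpus; i++) {
		uint64_t off;

		/* hypothetical GET counterpart of KVM_SET_TSC_OFFSET */
		ioctl(vcpu_fds[i], KVM_GET_TSC_OFFSET, &off);
		guest_tsc[i] = snap->host_tsc + off;
	}
}

/* Destination: re-anchor each guest TSC against a fresh capture point. */
void restore_guest_tscs(int *vcpu_fds, int nr_vcpus,
			const struct tsc_snapshot *src,
			const uint64_t *guest_tsc, uint64_t tsc_khz)
{
	struct tsc_snapshot dst;
	uint64_t elapsed_ns;

	read_host_tsc_and_clock(&dst);
	elapsed_ns = dst.wall_ns - src->wall_ns;

	for (int i = 0; i < nr_vcpus; i++) {
		/* advance the guest TSC by the downtime, then re-anchor */
		uint64_t now = guest_tsc[i] + elapsed_ns * tsc_khz / 1000000;
		uint64_t off = now - dst.host_tsc;

		ioctl(vcpu_fds[i], KVM_SET_TSC_OFFSET, &off);
	}
}

Since all guest TSCs are reconstructed from one host TSC / wall clock
sample and re-anchored against another single sample, no skew is added
between vCPUs by reading them at different times.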