On 14/08/20 18:55, Oliver Upton wrote:
> We allow our userspace to decide the host TSC / wall clock pair at
> which the vCPUs were paused. From that host TSC value we reconstruct
> the guest TSCs using the offsets and migrate that info. On the
> destination we grab another host TSC / clock pair and recalculate
> guest TSC offsets, which we then pass to KVM_SET_TSC_OFFSET. This is
> desirable over a per-vCPU read of MSR_IA32_TSC because we've
> effectively recorded all TSCs at the exact same moment in time.

Yes, this should work very well.  But in the end KVM ends up with the
vcpu->arch.cur_tsc_{nsec,write} of the source (only shifted in time),
while losing the notion that the pair is per-VM rather than per-VCPU
for the "already matched" vCPUs.

So that is why I'm thinking of retrieving that pair from the kernel
directly.  If you don't have time to work on it I can try to find
some for 5.10, but I'm not sure exactly when.

Paolo

> Otherwise, we inadvertently add skew between guest TSCs by reading
> each vCPU at different times. It seems that the sync heuristics
> address this issue along with any guest TSC coordination.
>
> Not only that, depending on the heuristics to detect a sync from
> userspace gets a bit tricky if we (KVM) are running nested. Couldn't
> more than a second of time elapse between successive KVM_SET_MSRS when
> running in L1 if L0 decides to pause our vCPUs (suspend/resume,
> migration)? It seems to me that in this case we will fail to detect a
> sync condition and configure the L2 TSCs to be out-of-phase.
>
> Setting the guest TSCs by offset doesn't have these complications.
> Even if L0 were to pause L1 for some inordinate amount of time, the
> relation of L1 -> L2 TSC is never disturbed.
>
>>
>> I am all for improving migration of TSC state, but I think we should do
>> it right, so we should migrate a host clock / TSC pair: then the
>> destination host can use TSC frequency and host clock to compute the new
>> TSC value.  In fact, such a pair is exactly the data that the syncing
>> heuristics track for each "generation" of syncing.
>>
>> To migrate the synchronization state, instead, we only need to migrate
>> the "already_matched" (vcpu->arch.this_tsc_generation ==
>> kvm->arch.cur_tsc_generation) state.
>>
>> Putting all of this together, it would be something like this:
>>
>> - a VM-wide KVM_GET_CLOCK/KVM_SET_CLOCK needs to migrate
>> vcpu->arch.cur_tsc_{nsec,write} in addition to the current kvmclock
>> value (it's unrelated, but I don't think it's worth creating a new
>> ioctl).  A new flag is set if these fields are set in the struct.  If
>> the flag is set, KVM_SET_CLOCK copies the fields back, bumps the
>> generation and clears kvm->arch.nr_vcpus_matched_tsc.
>>
>> - a VCPU-wide KVM_GET_TSC_INFO returns a host clock / guest TSC pair
>> plus the "already matched" state.  KVM_SET_TSC_INFO will only use the
>> host clock / TSC pair if "already matched" is false, to compute the
>> destination-side TSC offset but not otherwise doing anything with it; or
>> if "already matched" is true, it will ignore the pair, compute the TSC
>> offset from the data in kvm->arch, and update
>> kvm->arch.nr_vcpus_matched_tsc.
>>
>
> It seems to me that a per-vCPU ioctl (like you've suggested above) is
> necessary to uphold the guest-facing side of our sync scheme,
> regardless of what we do on the userspace-facing side.
>
>> Paolo
>>
>
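
To make the userspace side of this concrete, here is a rough, untested
sketch of the recomputation Oliver describes above.  The struct, the
helper name and the policy of advancing the guest TSCs by the wall-clock
time spent in the blackout are all invented for illustration; this is
not an existing QEMU or KVM interface.

#include <stdint.h>

#define MAX_VCPUS 288

/* Snapshot taken on the source while every vCPU is paused. */
struct tsc_migration_blob {
	uint64_t wall_ns;              /* CLOCK_REALTIME at the pause instant */
	uint32_t tsc_khz;              /* guest TSC frequency */
	uint64_t guest_tsc[MAX_VCPUS]; /* src host TSC + per-vCPU offset */
};

/*
 * Destination side: a single fresh host TSC / wall clock pair is used
 * for every vCPU, so no inter-vCPU skew is introduced.  Each offset[i]
 * would then be handed to KVM_SET_TSC_OFFSET for vCPU i.
 */
static void recompute_offsets(const struct tsc_migration_blob *b,
			      uint64_t dst_host_tsc, uint64_t dst_wall_ns,
			      int64_t *offset, int nr_vcpus)
{
	/*
	 * Guest cycles elapsed during the blackout: tsc_khz is cycles per
	 * millisecond, so us * tsc_khz / 1000 gives cycles (microsecond
	 * precision is plenty for a sketch).
	 */
	uint64_t elapsed_us = (dst_wall_ns - b->wall_ns) / 1000;
	uint64_t elapsed_cycles = elapsed_us * b->tsc_khz / 1000;
	int i;

	for (i = 0; i < nr_vcpus; i++)
		offset[i] = (int64_t)(b->guest_tsc[i] + elapsed_cycles -
				      dst_host_tsc);
}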
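
And, for completeness, a purely illustrative sketch of what the uAPI
proposed above could look like.  None of these structs, flags or field
names exist today; the kvm_clock_data layout below just assumes the new
fields can be carved out of the existing padding.

#include <linux/types.h>

/* VM-wide: extend KVM_GET_CLOCK/KVM_SET_CLOCK with the sync generation. */
#define KVM_CLOCK_TSC_SYNC_STATE	(1 << 2)	/* hypothetical value */

struct kvm_clock_data {
	__u64 clock;		/* kvmclock value, as today */
	__u32 flags;		/* KVM_CLOCK_TSC_SYNC_STATE if fields below are valid */
	__u32 pad0;
	__u64 cur_tsc_nsec;	/* kvm->arch.cur_tsc_nsec of the current generation */
	__u64 cur_tsc_write;	/* kvm->arch.cur_tsc_write of the current generation */
	__u32 pad[4];
};

/* Per-vCPU: payload of the proposed KVM_GET_TSC_INFO / KVM_SET_TSC_INFO. */
#define KVM_TSC_INFO_MATCHED	(1 << 0)

struct kvm_tsc_info {
	__u32 flags;		/* MATCHED if this_tsc_generation == cur_tsc_generation */
	__u32 pad;
	__u64 host_clock_ns;	/* host clock at which guest_tsc was sampled */
	__u64 guest_tsc;	/* guest TSC value at host_clock_ns */
};

/*
 * KVM_SET_TSC_INFO, as sketched above: if MATCHED is clear, compute the
 * vCPU's TSC offset from (host_clock_ns, guest_tsc) alone; if MATCHED is
 * set, ignore the pair, compute the offset from kvm->arch.cur_tsc_{nsec,
 * write} (restored earlier by KVM_SET_CLOCK) and increment
 * kvm->arch.nr_vcpus_matched_tsc.
 */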