On Wed, Jun 9, 2021 at 12:05 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote: > > On 09/06/21 17:11, Oliver Upton wrote: > > Perhaps this will clarify the motivation for my approach: what if the > > kernel wasn't the authoritative source for wall time in a system? > > Furthermore, VMMs may wish to define their own heuristics for counter > > migration (e.g. we only allow the counter to 'jump' by X seconds > > during migration blackout). If a VMM tried to assert its whims on the > > TSC state before handing it down to the kernel, we would inadvertently > > be sampling the host counter twice again. And, anything can happen > > between the time we assert elapsed time is within SLO and KVM > > computing the TSC offset (scheduling, L0 hypervisor preemption). > > > > So, Maxim's changes would address my concerns in the general case, but > > maybe not as much in edge cases where an operator may make decisions > > about how much time can elapse while the guest hasn't had CPU time. > > I think I understand. We still need a way to get a consistent > (host_TSC, nanosecond) pair on the source, the TSC offset is not enough. > This is arguably not a KVM issue, but we're still the one having to > provide a solution, so we would need a slightly more complicated interface. Ah, right, good luck doing that without some help from the kernel. +1 to this _not_ being a KVM issue, but anyways... > 1) In the kernel: > > * KVM_GET_CLOCK should also return kvmclock_ns - realtime_ns and > host_TSC. It should set two flags in struct kvm_clock_data saying that > the respective fields are valid. > > * KVM_SET_CLOCK checks the flag for kvmclock_ns - realtime_ns. If set, > it looks at the kvmclock_ns - realtime_ns field and disregards the > kvmclock_ns field. Yeah, these additions all make sense to me. > > 2) On the source, userspace will: > > * per-VM: invoke KVM_GET_CLOCK. Migrate kvmclock_ns - realtime_ns and > kvmclock_ns. Stash host_TSC for subsequent use. > > * per-vCPU: retrieve guest_TSC - host_TSC with your new ioctl. Sum it > to the stashed host_TSC value; migrate the resulting value (a guest TSC). > > 3) On the destination: > > * per-VM: Pass the migrated kvmclock_ns - realtime_ns to KVM_SET_CLOCK. > Use KVM_GET_CLOCK to get a consistent pair of kvmclock_ns ("newNS" > below) and host TSC ("newHostTSC"). Stash them for subsequent use, > together with the migrated kvmclock_ns value ("sourceNS") that you > haven't used yet. > > * per-vCPU: using the data of the previous step, and the sourceGuestTSC > in the migration stream, compute sourceGuestTSC + (newNS - sourceNS) * > freq - newHostTSC. This is the TSC offset to be passed to your new ioctl. > > Your approach still needs to use the "quirky" approach to host-initiated > MSR_IA32_TSC_ADJUST writes, which write the MSR without affecting the > VMCS offset. This is just a documentation issue. I think I follow what you're saying. To confirm: My suggested ioctl for the vCPU will still exist, and it will still affect the VMCS tsc offset, right? However, we need to do one of the following: - Stash the guest's MSR_IA32_TSC_ADJUST value in the kvm_system_counter_state structure. During KVM_SET_SYSTEM_COUNTER_STATE, check to see if the field is valid. If so, treat it as a dumb value (i.e. show it to the guest but don't fold it into the offset). - Inform userspace that it must still migrate MSR_IA32_TSC_ADJUST, and continue to our quirky behavior around host-initiated writes of the MSR. This is why Maxim's spin migrated a value for IA32_TSC_ADJUST, right? Doing so ensures we don't have any guest-observable consequences due to our migration of TSC state. To me, adding the guest IA32_TSC_ADJUST MSR into the new counter state structure is probably best. No strong opinions in either direction on this point, though :) -- Thanks, Oliver > > Does this make sense? > > Paolo > _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm