Marcelo,

On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>
> On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > > +   [...]
> > > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > > +   time since recording the clock values.
> > >
> > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > TSCs, which would be double counting.
> > >
> > > So you have to either add the elapsed realtime (1) between
> > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > TSCs. If you do both, there is double counting. Am i missing
> > > something?
> >
> > Probably one of these two (but it's worth pointing out both of them):
> >
> > 1) the attribute that's introduced here *replaces*
> > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> >
> > 2) the adjustment formula later in the algorithm does not care about how
> > much time passed between step 1 and step 4. It just takes two well
> > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > the same on the destination as if the guest was still running on the
> > source. It is irrelevant that one of them is before migration and one
> > is after, all it matters is that one is on the source and one is on the
> > destination.
>
> OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay
> which is introduced during migration (which is what i would guess is
> the lower hanging fruit) (for guests using TSC).
The series gives userspace the ability to modify the guest's perception
of the TSC in whatever way it sees fit. The algorithm in the
documentation provides a suggestion to userspace on how to do exactly
that. I kept that advancement logic out of the kernel because IMO it
is an implementation detail: users have differing opinions on how
clocks should behave across a migration and KVM shouldn't have any
baked-in rules around it.

At the same time, userspace can choose to _not_ jump the TSC and use
the available interfaces to just migrate the existing state of the
TSCs.

When I had initially proposed this series upstream, Paolo astutely
pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
pairing, which is critical for the TSC advancement algorithm in the
documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
in userspace [1], hence the missing kvm clock changes. So, in all, the
spirit of the KVM clock changes is to provide missing UAPI around the
clock/TSC, with the side effect of changing the guest-visible value.

[1] https://cloud.google.com/spanner/docs/true-time-external-consistency

> My point was that, by advancing the _TSC value_ by:
>
> T0. stop guest vcpus    (source)
> T1. KVM_GET_CLOCK       (source)
> T2. KVM_SET_CLOCK       (destination)
> T3. Write guest TSCs    (destination)
> T4. resume guest        (destination)
>
> new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
>
> t_0:    host TSC at KVM_GET_CLOCK time.
> off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> TSC offset is fixed).
> ...
>
> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> +   structure. KVM will advance the VM's kvmclock to account for elapsed
> +   time since recording the clock values.
>
> Only kvmclock is advanced (by passing r_0).
> But a guest might not use kvmclock
> (hopefully modern guests on modern hosts will use TSC clocksource,
> whose clock_gettime is faster... some people are using that already).
>

Hopefully the above explanation made it clearer how the TSCs are
supposed to get advanced, and why it isn't done in the kernel.

> At some point QEMU should enable invariant TSC flag by default?
>
> That said, the point is: why not advance the _TSC_ values
> (instead of kvmclock nanoseconds), as doing so would reduce
> the "the CLOCK_REALTIME delay which is introduced during migration"
> for both kvmclock users and modern tsc clocksource users.
>
> So yes, i also like this patchset, but would like it even more
> if it fixed the case above as well (and not sure whether adding
> the migration delta to KVMCLOCK makes it harder to fix TSC case
> later).
>
> > Perhaps we can add to step 6 something like:
> >
> > > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > > +   elapsed since recording state and (2) difference in TSCs between the
> > > +   source and destination machine:
> > > +
> > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > > +
> >
> > "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> > in kvmclock. The above formula ensures that it is the same on the
> > destination as it was on the source.
> >
> > Also, the names are a bit hard to follow. Perhaps
> >
> >       t_0             tsc_src
> >       t_1             tsc_dest
> >       k_0             guest_src
> >       k_1             guest_dest
> >       r_0             host_src
> >       off_n           ofs_src[i]
> >       new_off_n       ofs_dest[i]
> >
> > Paolo
> >

Yeah, sounds good to me. Shall I respin the whole series from what you
have in kvm/queue, or just send you the bits and pieces that ought to
be applied?

--
Thanks,
Oliver
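
For illustration only, the step 6 offset formula quoted above can be
sketched in C using the names Paolo suggested. This is not code from the
series. The function names are hypothetical, expressing "freq" in kHz is
an assumption (the thread does not state its unit), and the arithmetic is
assumed to wrap modulo 2^64, since TSC offsets are commonly stored as
unsigned 64-bit values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a kvmclock delta in nanoseconds to TSC cycles.
 * Assumption: the guest TSC frequency is given in kHz, so
 * cycles = ns * khz / 1e6. The division is split into whole
 * milliseconds and a remainder to keep intermediate products small.
 */
static uint64_t ns_to_tsc_cycles(uint64_t ns, uint64_t tsc_khz)
{
    return (ns / 1000000) * tsc_khz + (ns % 1000000) * tsc_khz / 1000000;
}

/*
 * ofs_dest = tsc_src + ofs_src + (guest_dest - guest_src) * freq - tsc_dest
 *
 * tsc_src / tsc_dest:     host TSC read on the source / destination
 * guest_src / guest_dest: kvmclock nanoseconds paired with those reads
 * ofs_src:                TSC offset of one vCPU on the source
 *
 * Unsigned arithmetic wraps modulo 2^64, so a "negative" offset is
 * represented the same way a raw 64-bit offset field would hold it.
 */
static uint64_t tsc_offset_dest(uint64_t tsc_src, uint64_t ofs_src,
                                uint64_t guest_src, uint64_t guest_dest,
                                uint64_t tsc_dest, uint64_t tsc_khz)
{
    /* kvmclock is expected to be monotonic across the migration. */
    assert(guest_dest >= guest_src);

    uint64_t elapsed = ns_to_tsc_cycles(guest_dest - guest_src, tsc_khz);

    return tsc_src + ofs_src + elapsed - tsc_dest;
}
```

With this offset, tsc_dest + ofs_dest equals the source guest TSC
(tsc_src + ofs_src) plus the cycles elapsed in kvmclock time, which is
the property Paolo describes: the guest TSC on the destination matches
what it would have been had the guest kept running on the source.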