On Fri, Oct 01, 2021 at 12:33:28PM -0700, Oliver Upton wrote:
> Marcelo,
>
> On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> >
> > On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > > > + [...]
> > > > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > > > +   time since recording the clock values.
> > > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > > TSCs, which would be double counting.
> > > >
> > > > So you have to either add the elapsed realtime (1) between
> > > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > > TSCs. If you do both, there is double counting. Am I missing
> > > > something?
> > > Probably one of these two (but it's worth pointing out both of them):
> > >
> > > 1) the attribute that's introduced here *replaces*
> > > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> > >
> > > 2) the adjustment formula later in the algorithm does not care about how
> > > much time passed between step 1 and step 4. It just takes two well
> > > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > > the same on the destination as if the guest was still running on the
> > > source. It is irrelevant that one of them is before migration and one
> > > is after; all that matters is that one is on the source and one is on
> > > the destination.
> > OK, so it still relies on the NTPd daemon to fix the CLOCK_REALTIME delay
> > which is introduced during migration (which is what I would guess is
> > the lower hanging fruit) (for guests using TSC).
>
> The series gives userspace the ability to modify the guest's
> perception of the TSC in whatever way it sees fit. The algorithm in
> the documentation provides a suggestion to userspace on how to do
> exactly that. I kept that advancement logic out of the kernel because
> IMO it is an implementation detail: users have differing opinions on
> how clocks should behave across a migration and KVM shouldn't have any
> baked-in rules around it.

Ok, I was just trying to visualize how this would work with QEMU Linux
guests.

>
> At the same time, userspace can choose to _not_ jump the TSC and use
> the available interfaces to just migrate the existing state of the
> TSCs.
>
> When I had initially proposed this series upstream, Paolo astutely
> pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
> pairing, which is critical for the TSC advancement algorithm in the
> documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
> in userspace [1], hence the missing kvm clock changes. So, in all, the
> spirit of the KVM clock changes is to provide missing UAPI around the
> clock/TSC, with the side effect of changing the guest-visible value.
>
> [1] https://cloud.google.com/spanner/docs/true-time-external-consistency
>
> > My point was that, by advancing the _TSC value_ by:
> >
> > T0. stop guest vcpus (source)
> > T1. KVM_GET_CLOCK (source)
> > T2. KVM_SET_CLOCK (destination)
> > T3. Write guest TSCs (destination)
> > T4. resume guest (destination)
> >
> > new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> >
> > t_0: host TSC at KVM_GET_CLOCK time.
> > off_n: TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> > TSC offset is fixed).
> > ...
> >
> > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > +   time since recording the clock values.
> >
> > Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
> > (hopefully modern guests on modern hosts will use TSC clocksource,
> > whose clock_gettime is faster... some people are using that already).
> >
>
> Hopefully the above explanation made it clearer how the TSCs are
> supposed to get advanced, and why it isn't done in the kernel.
>
> > At some point QEMU should enable invariant TSC flag by default?
> >
> > That said, the point is: why not advance the _TSC_ values
> > (instead of kvmclock nanoseconds), as doing so would reduce
> > "the CLOCK_REALTIME delay which is introduced during migration"
> > for both kvmclock users and modern tsc clocksource users.
> >
> > So yes, I also like this patchset, but would like it even more
> > if it fixed the case above as well (and I'm not sure whether adding
> > the migration delta to KVMCLOCK makes it harder to fix the TSC case
> > later).
> >
> > > Perhaps we can add to step 6 something like:
> > >
> > > > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > > > +   elapsed since recording state and (2) difference in TSCs between the
> > > > +   source and destination machine:
> > > > +
> > > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > > > +
> > > "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> > > in kvmclock. The above formula ensures that it is the same on the
> > > destination as it was on the source.
> > >
> > > Also, the names are a bit hard to follow. Perhaps
> > >
> > > t_0        tsc_src
> > > t_1        tsc_dest
> > > k_0        guest_src
> > > k_1        guest_dest
> > > r_0        host_src
> > > off_n      ofs_src[i]
> > > new_off_n  ofs_dest[i]
> > >
> > > Paolo
> >
>
> Yeah, sounds good to me. Shall I respin the whole series from what you
> have in kvm/queue, or just send you the bits and pieces that ought to
> be applied?
>
> --
> Thanks,
> Oliver
>
>
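
As a purely illustrative aside (not code from this series, from KVM, or
from QEMU), the offset arithmetic discussed above can be written down in
a few lines of C, with names loosely following Paolo's suggested
renaming (tsc/kvmclock pairs on source and destination, ofs_src ->
ofs_dest). It assumes the (TSC, kvmclock) pairs were already captured by
userspace via KVM_GET_CLOCK on the source and on the destination, and
that the guest TSC frequency is given in kHz; those details and the
helper names are assumptions, not anything defined by the patches.

#include <stdint.h>

struct clock_pair {
        uint64_t tsc;       /* host TSC reading, in cycles      */
        uint64_t kvmclock;  /* kvmclock reading, in nanoseconds */
};

/* Convert elapsed kvmclock nanoseconds into guest TSC ticks
 * (guest TSC frequency assumed to be supplied in kHz). */
static uint64_t ns_to_guest_ticks(uint64_t ns, uint64_t freq_khz)
{
        return ns * freq_khz / 1000000ULL;
}

/*
 * new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
 *
 * i.e. keep "off + t - k * freq" (the guest TSC value at kvmclock
 * time 0) the same on the destination as it was on the source.
 */
static uint64_t tsc_offset_dest(struct clock_pair src,   /* t_0, k_0 */
                                struct clock_pair dest,  /* t_1, k_1 */
                                uint64_t ofs_src,        /* off_n    */
                                uint64_t freq_khz)
{
        return src.tsc + ofs_src +
               ns_to_guest_ticks(dest.kvmclock - src.kvmclock, freq_khz) -
               dest.tsc;
}

The return value would be what step 6 calls ofs_dest[i] for vCPU i;
wrap-around handling and the actual ioctls for reading and writing the
per-vCPU offsets are deliberately left out of the sketch.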