On Fri, 2025-02-14 at 12:34 +0100, Thomas Gleixner wrote:
> > 2. In kernel, asking KVM to populate the vmclock structure much like
> >    it does other pvclocks shared with the guest. KVM/x86 already uses
> >    pvclock_gtod_register_notifier() to hook changes; should we expand
> >    on that? The problem with that notifier is that it seems to be
> >    called far more frequently than I'd expect.
>
> It's called once per tick to expose the continuous updates to the
> conversion factors and related internal data.

My recollection (a vague one) is that it's called, and reports "changes", even when there *are* no changes to the underlying conversion factors. Something along the lines of "N ticks at 333 counts per tick, then one tick at 334 counts per tick to catch up", because it can't express the division factor completely without that discontinuity?

The actual 'error' caused by the apparent fluctuation in rate is probably entirely negligible, but I am slightly concerned about the steal time, if the hypervisor then spends stolen CPU time relaying all those "changes" to the guest, and then the guest has to spend time feeding the "changes" into its own timekeeping.

I'd like to strive for a mode where we only adjust what we tell guests when adjtimex actually changes the real timing factors.

In fact, if we have a userspace tool like chrony feeding adjtimex based on external NTP/PPS/whatever, that tool could probably calibrate a stable host TSC directly against the external real time. And in that mode maybe we don't even need to feed the guest from the kernel's CLOCK_REALTIME; that would be just another conversion step to introduce noise.

We might end up with the direct setup for dedicated hosting environments, but I do also want to support the general-purpose QEMU-based setup where we expose the host's CLOCK_REALTIME as efficiently as possible.

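The "333 counts, then 334 to catch up" pattern recalled above can be illustrated with a small arithmetic sketch. This is not taken from the kernel's timekeeping code; the function names and the 1000/3 rate are made up for illustration. It only shows how an integer per-tick increment tracking a non-integer rate has to fluctuate, and how many apparent "changes" that generates even though the real rate never moves.

```python
def per_tick_counts(numer, denom, ticks):
    """Integer counts per tick whose running sum tracks numer/denom per tick.

    Hypothetical helper: the fractional remainder is carried from tick to
    tick, so every so often one tick gets an extra count to catch up.
    """
    counts = []
    remainder = 0
    for _ in range(ticks):
        remainder += numer
        whole, remainder = divmod(remainder, denom)
        counts.append(whole)
    return counts


def apparent_changes(counts):
    """Number of ticks whose count differs from the previous tick's count."""
    return sum(1 for prev, cur in zip(counts, counts[1:]) if prev != cur)


if __name__ == "__main__":
    # Assumed rate: 1000/3 counts per tick (333.33...), purely illustrative.
    counts = per_tick_counts(1000, 3, 6)
    print(counts)                    # [333, 333, 334, 333, 333, 334]
    print(apparent_changes(counts))  # 3
```

With a constant underlying rate, half of the tick boundaries in this toy case still look like a rate change to anything that compares per-tick values, which is the kind of churn one would want to avoid relaying to guests.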
How about this: A KVM feature to provide/populate the VMCLOCK, since only KVM knows the precise TSC scaling (and can immediately flip the VMCLOCK to report invalid state if the TSC becomes unreliable). It can *either* be fed the precise TSC/realtime relationship from userspace (maybe in a vmclock structure that *userspace* populates, so all the kernel has to do is scale/offset to account for the guest TSC being different from the host TSC), *or* it can be in 'automatic' mode, where it derives the relationship from the host's timekeeping. The automatic mode would at the moment have "too many" updates for my liking, but we can worry about that later if necessary.
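The scale/offset step in the userspace-fed mode can be sketched as below. This is a minimal illustration under stated assumptions, not KVM's implementation and not the vmclock ABI: `ClockRelation`, its fields, `host_to_guest_tsc()` and `rebase_for_guest()` are hypothetical names, the 48 fractional bits are just an example fixed-point width, and exact rationals are used so the arithmetic is easy to check.

```python
from dataclasses import dataclass
from fractions import Fraction

FRAC_BITS = 48  # assumed fixed-point width of the scaling ratio (illustrative)


@dataclass(frozen=True)
class ClockRelation:
    """Hypothetical counter/realtime relationship (not the real vmclock layout)."""
    counter_ref: int      # counter value at the reference point
    time_ref_ns: int      # realtime (ns) at the reference point
    period_ns: Fraction   # nanoseconds per counter tick

    def time_at(self, counter):
        return self.time_ref_ns + (counter - self.counter_ref) * self.period_ns


def host_to_guest_tsc(host_tsc, ratio_fixed, offset):
    """Guest TSC = host TSC scaled by a fixed-point ratio, plus an offset."""
    return ((host_tsc * ratio_fixed) >> FRAC_BITS) + offset


def rebase_for_guest(host_rel, ratio_fixed, offset):
    """Re-express a host TSC/realtime relationship in guest TSC terms."""
    ratio = Fraction(ratio_fixed, 1 << FRAC_BITS)
    return ClockRelation(
        counter_ref=host_to_guest_tsc(host_rel.counter_ref, ratio_fixed, offset),
        time_ref_ns=host_rel.time_ref_ns,
        # The guest counter ticks `ratio` times per host tick, so each
        # guest tick is worth 1/ratio host ticks of time.
        period_ns=host_rel.period_ns / ratio,
    )


if __name__ == "__main__":
    # Assumed numbers: 3 GHz host TSC, guest TSC at half that rate.
    host = ClockRelation(3_000_000_000, 1_000_000_000, Fraction(1, 3))
    guest = rebase_for_guest(host, 1 << (FRAC_BITS - 1), -1_000_000)
    later_host_tsc = host.counter_ref + 3_000_000  # 1 ms later on the host
    later_guest_tsc = host_to_guest_tsc(later_host_tsc, 1 << (FRAC_BITS - 1), -1_000_000)
    print(host.time_at(later_host_tsc) == guest.time_at(later_guest_tsc))
```

The point of the sketch is that the realtime side of the relationship is left untouched; only the counter reference and the period are translated, which is why this step belongs with whoever knows the exact scaling ratio and offset. The right shift truncates, so a real implementation would also need to consider the sub-count remainder.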