On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@xxxxxxxxxx> wrote: > Right, but I'd argue that interface has some problems too. It > depends on the guest polling instead of an interrupt from the > hypervisor. It also has no way of informing the kernel exactly how much > time has elapsed. > The whole point of all these hacks that we've done internally is that we, > the hypervisor, know full well how much real time hasv advanced during the > VM blackout. If we can at least let the guest know how much to fudge real > time, it can then poke NTP for better refinement. I worry about using NTP > as the sole source of truth for such a mechanism, since you'll need to go > out to the network and any reads until the response comes back are hosed. (I'm a kernel newbie, so please excuse any ignorance with respect to kernel Internals or kernel/hypervisor interfaces.) We can have it both ways, I think. Let the hypervisor manipulate the guest TSC so as to keep the guest kernel's idea of real time as accurate as possible without any awareness required on the guest's side. *Also* give the guest kernel a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else, and let the kernel propagate this notification to userspace so that the NTP daemon can recombobulate itself as quickly as possible, treating whatever TSC adjustment was received as best-effort only. The KVM_PVCLOCK_STOPPED event should trigger a change in some of the globals kept by kernel/time/ntp.c (which are visible to userspace through adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`. The aforementioned fields get overwritten by the NTP daemon every polling interval, and whatever notification gets sent to userspace is going to be asynchronous, so we need to avoid race conditions wherein userspace clobbers them with its outdated idea of their correct value before the notification arrives. I propose allocating one of the unused fields at the end of `struct timex` as a `clockver` field. This field will be 0 at boot time, and get incremented after clock discontinuity such as a KVM blackout, an ACPI suspend-to-RAM event, or a manual setting of the clock through clock_settime(3) or similar. Augment `modes` with an `ADJ_CLOCKVER` bit, with the semantics "make this call fail If the `clockver` I passed in is not current". As for the form of the notification to userspace... dunno, I think a netlink interface makes the most sense? I'm not too opinionated here. Whatever we decide, I'll be happy to contribute the patches to chrony and ntpd. The NTP daemon should handle this notification by essentially resetting its state to how it was at first startup, considering itself unsynchronized and immediately kicking off new queries to all its peers. One other thing the NTP daemon might want to do with this notification is clobber its drift file, but only if it's running on different hardware than before. But this requires getting that bit of information from the hypervisor to the kernel and I haven't thought about how this should work. There's a minor problem remaining here, which impacts NTP servers but not clients. NTP servers don't use adjtimex to get the timestamps the use to populate NTP packets; they just use clock_gettime, because the latter is a VDSO call and the former is not. That means that they may not learn of the `clockver` step in time to answer any queries that come in immediately post-blackout. So, the server may briefly become a falseticker if the hypervisor did too poor a job with the TSC step. I don't think this is a big deal or something that needs to be addressed in the first iteration. The long-term solution is to make adjtimex also be a VDSO when `modes` is 0, so that NTP daemons can use it for everything.