Hi Daniel,

As we're venturing into the realm of timekeeping and he has noted the
evils of virtualization several times, I've CC'ed Thomas on this mail.
Some comments inline on the ongoing discussion.

On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
> On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@xxxxxxxxxx> wrote:
> > Right, but I'd argue that interface has some problems too. It
> > depends on the guest polling instead of an interrupt from the
> > hypervisor. It also has no way of informing the kernel exactly how much
> > time has elapsed.
> >
> > The whole point of all these hacks that we've done internally is that we,
> > the hypervisor, know full well how much real time has advanced during the
> > VM blackout. If we can at least let the guest know how much to fudge real
> > time, it can then poke NTP for better refinement. I worry about using NTP
> > as the sole source of truth for such a mechanism, since you'll need to go
> > out to the network and any reads until the response comes back are hosed.
>
> (I'm a kernel newbie, so please excuse any ignorance with respect to kernel
> internals or kernel/hypervisor interfaces.)

Welcome :-)

> We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
> so as to keep the guest kernel's idea of real time as accurate as possible
> without any awareness required on the guest's side. *Also* give the guest kernel
> a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
> and let the kernel propagate this notification to userspace so that the NTP
> daemon can recombobulate itself as quickly as possible, treating whatever TSC
> adjustment was received as best-effort only.

But what happens to CLOCK_MONOTONIC in this case? We are still accepting
the fact that live migrations destroy CLOCK_MONOTONIC if we directly
advance the guest TSCs to account for elapsed time. The definition of
CLOCK_MONOTONIC is that the clock does not count while the system is
suspended.
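A toy model makes the trade-off concrete. This is only a sketch in
Python, not kernel code: ToyGuestClock, both migrate_* helpers and the
1 GHz TSC are made up for illustration, and it assumes MONOTONIC is
derived purely from the TSC with REALTIME kept as MONOTONIC plus an
offset.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000


@dataclass
class ToyGuestClock:
    """Hypothetical guest clock: MONOTONIC comes from the TSC and
    REALTIME is MONOTONIC plus an offset the guest maintains."""
    tsc_hz: int
    tsc: int = 0                 # guest-visible TSC value, in cycles
    realtime_offset_ns: int = 0

    def run(self, ns: int) -> None:
        # The guest is executing, so the TSC ticks normally.
        self.tsc += ns * self.tsc_hz // NSEC_PER_SEC

    def monotonic_ns(self) -> int:
        return self.tsc * NSEC_PER_SEC // self.tsc_hz

    def realtime_ns(self) -> int:
        return self.monotonic_ns() + self.realtime_offset_ns


def migrate_advancing_tsc(clock: ToyGuestClock, blackout_ns: int) -> None:
    # Option A: the hypervisor folds the blackout into the guest TSC.
    clock.tsc += blackout_ns * clock.tsc_hz // NSEC_PER_SEC


def migrate_stepping_realtime(clock: ToyGuestClock, blackout_ns: int) -> None:
    # Option B: the TSC is left alone; the guest is told delta_REALTIME
    # and steps REALTIME by that amount.
    clock.realtime_offset_ns += blackout_ns


if __name__ == "__main__":
    for migrate in (migrate_advancing_tsc, migrate_stepping_realtime):
        clock = ToyGuestClock(tsc_hz=NSEC_PER_SEC,
                              realtime_offset_ns=100 * NSEC_PER_SEC)
        clock.run(5 * NSEC_PER_SEC)          # 5 s of guest execution
        migrate(clock, 2 * NSEC_PER_SEC)     # 2 s blackout
        print(migrate.__name__, clock.monotonic_ns(), clock.realtime_ns())
```

Both options end up with the same REALTIME. Only option A makes
MONOTONIC count the 2 s during which the guest was not running.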
From the viewpoint of the guest, a live migration appears to be a
forced suspend operation at an arbitrary instruction boundary. There is
no realistic way for the guest to give the illusion that MONOTONIC has
stopped without help from the hypervisor.

> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> globals kept by kernel/time/ntp.c (which are visible to userspace through
> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.

I do not disagree that NTP needs to throw the book out after a live
migration.

But the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
relies on the guest polling a shared structure, and who knows when the
guest is going to check the structure again? If we inject an interrupt,
the guest is likely to check this state in a reasonable amount of time.

Thomas, we're talking about how to not wreck time (as badly) under
virtualization. I know this has been an area of interest to you for a
while ;-) The idea is that the hypervisor should let the guest know
about time travel.

Let's just assume for now that the hypervisor will *not* quiesce the
guest into an S2IDLE state before migration. I think quiesced
migrations are a good thing to have in the toolbelt, but there will
always be host-side problems that require us to migrate a VM off a host
immediately with no time to inform the guest.

Given that, we're deciding which clock is going to get wrecked during a
migration and what the guest can do afterwards to clean it up.
Whichever clock gets wrecked is going to have a window where reads race
with the eventual fix, and could be completely wrong. My thoughts:

We do not advance the TSC during a migration and notify the guest
(interrupt, shared structure) about how much it has time traveled
(delta_REALTIME).
REALTIME is wrong until the interrupt is handled in the guest, but
handling it should fire off all of the existing mechanisms for a clock
step. Userspace gets notified with TFD_TIMER_CANCEL_ON_SET. I believe
you have proposed something similar as a way to make live migration
less sinister from the guest perspective.

It seems possible to block racing reads of REALTIME if we protect it
with a migration sequence counter. Host raises the sequence after a
migration when control is yielded back to the guest. The sequence is
odd if an update is pending. Guest increments the sequence again after
the interrupt handler accounts for time travel. That has the side
effect of blocking all realtime clock reads until the interrupt is
handled. But what is the value of those reads if we know they're wrong?

There is also the implication that said shared memory interface gets
mapped through to userspace for the vDSO; I haven't thought about that
at all yet.

Doing this the other way around (advance the TSC, tell the guest to fix
MONOTONIC) is fundamentally wrong, as it violates two invariants of the
monotonic clock. Monotonic counts during a migration, which really is a
forced suspend. Additionally, you cannot step the monotonic clock.

Thoughts?

Sorry to revisit this conversation yet again. Virtualization isn't
going away any time soon and the illusion that migrations are invisible
to the guest is simply not true.

--
Thanks,
Oliver
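For reference, the migration sequence counter handshake described above
can be sketched as a single-threaded toy model. This is Python rather
than kernel code, and every name in it (SharedMigrationState, ToyGuest,
host_resume_after_migration) is hypothetical. It also assumes that a
second migration arriving before the guest has handled the first simply
accumulates into the pending delta, which the proposal does not spell
out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SharedMigrationState:
    """Hypothetical host/guest shared structure (layout is made up)."""
    seq: int = 0                 # odd: a time-travel update is pending
    delta_realtime_ns: int = 0   # written by the host before raising seq


def host_resume_after_migration(shared: SharedMigrationState,
                                blackout_ns: int) -> None:
    # Host side: publish the delta, then make seq odd before the guest
    # runs again. Assumption: back-to-back migrations accumulate.
    shared.delta_realtime_ns += blackout_ns
    if not shared.seq & 1:
        shared.seq += 1


class ToyGuest:
    def __init__(self, shared: SharedMigrationState, realtime_ns: int):
        self.shared = shared
        self._realtime_ns = realtime_ns

    def try_read_realtime(self) -> Optional[int]:
        # seqcount-style read: sample seq, read the clock, re-check seq.
        start = self.shared.seq
        if start & 1:
            return None          # update pending; a real reader would spin
        value = self._realtime_ns
        if self.shared.seq != start:
            return None          # raced with a migration; caller retries
        return value

    def handle_migration_irq(self) -> None:
        # Account for the time travel, then make seq even again so that
        # readers are unblocked.
        if not self.shared.seq & 1:
            return               # nothing pending
        self._realtime_ns += self.shared.delta_realtime_ns
        self.shared.delta_realtime_ns = 0
        self.shared.seq += 1


if __name__ == "__main__":
    shared = SharedMigrationState()
    guest = ToyGuest(shared, realtime_ns=1_000)
    host_resume_after_migration(shared, blackout_ns=250)
    print(guest.try_read_realtime())   # blocked while seq is odd
    guest.handle_migration_irq()
    print(guest.try_read_realtime())   # stepped by delta_REALTIME
```

A real implementation would also need memory barriers around the
sequence accesses and a reader that spins or sleeps instead of
returning None; both are left out of the sketch.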