>> The problem starts when the 'force suspended' time becomes excessive, as
>> that causes the mass expiry of clock monotonic timers with all the nasty
>> side effects described in a3ed0e4393d6. In the worst case it's going to
>> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
>> @2GHz) which in fact breaks the world and some more.
>>
>> So let me go back to the use cases:
>>
>>   1) Regular freeze of a VM to disk and resume at some arbitrary point in
>>      the future.
>>
>>   2) Live migration
>>
>> In both cases the proper solution is to make the guest go into a well
>> defined state (suspend) and resume it on restore. Everything just works
>> because it is well defined.
>>
>> Though you say that there is a special case (not that I believe it):
>
> I believe the easier special case to articulate is when the hypervisor
> has already done its due diligence to warn the guest about a
> migration. Guest doesn't heed the warning and doesn't quiesce. The
> most predictable state at this point is probably just to kill the VM
> on the spot, but that is likely to be a _very_ tough sell :)

I'm all for it. It's very well defined.

> So assuming that it's still possible for a non-cooperative suspend
> (live migration, VM freeze, etc.) there's still a need to stop the
> bleeding. I think you touch on what that may look like:
>
>>   1) Trestart - Tstop < TOLERABLE_THRESHOLD
>>
>>      That's the easy case as you just can adjust TSC on restore by that
>>      amount on all vCPUs and be done with it. Just works like scheduling
>>      out all vCPUs for some time.
>>
>>   2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>>
>>      Avoid adjusting TSC for the reasons above.
>
> Which naturally will prompt the question: what is the value of
> TOLERABLE_THRESHOLD? Speaking from experience (Google already does
> something similar, but without a good fallback for exceeding the
> threshold), there's ~zero science in deriving that value.
> But, IMO if it's at least documented we can make the shenanigans at
> least a bit more predictable. It also makes it very easy to define who
> (guest/host) is responsible for cleaning up the mess.

See above, but the hyperscalers with experience on heavy host overload
might have better information when keeping a vCPU scheduled out starts
to create problems in the guest.

> In absence of documentation there's an unlimited license for VM
> operators to do as they please and I fear we will forever perpetuate
> the pain of time in virt.

You can prevent that, by making 'Cooperate within time or die hard' the
policy. :)

>>     if (seq != get_migration_sequence())
>>        do_something_smart();
>>     else
>>        proceed_as_usual();
>
> Agreed pretty much the whole way through. There's no point in keeping
> NTP naive at this point.
>
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

There is no reason why you can't use the #2 scheme for the #1 case too:

>> On destination host:
>>    Restore memory image
>>    Expose metadata in PV:
>>      - migration sequence number + 1

         - Flag whether Tout was compensated already via TSC or just
           set Tout = 0

>>      - Tout (dest/source host delta of clock TAI)
>>    Run guest
>>
>> Guest kernel:
>>
>>    - Keep track of the PV migration sequence number.
>>
>>      If it changed act accordingly by injecting the TAI delta,
>>      which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>>      etc...

       if it was compensated via TSC already, it might be sufficient
       to just reset NTP state.

>> NTP:
>>    - utilize the sequence counter information
....

OTOH, the question is whether it's worth it.
If we assume that the sane case is a cooperative guest and the forced
migration is the last resort, then we can just avoid the extra magic and
the discussion around the correct value for TOLERABLE_THRESHOLD
altogether.

I suggest to start from a TOLERABLE_THRESHOLD=0 assumption to keep it
simple in the first step. Once this has been established, you can still
experiment with the threshold and figure out whether it matters.

In fact, doing the host side TSC compensation is just an excuse for VM
operators not to make the guest cooperative, because it might solve
their main problems for the vast majority of migrations.

Forcing them to do it right is definitely the better option, which
means the 'Cooperate or die hard' policy is the best one you can
choose. :)

>> That still will leave everything else exposed to
>> CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>> do about that and any application which cares about this has to
>> be able to deal with it anyway.
>
> Right. There's no cure-all between hypervisor/guest kernel that could
> fix the problem for userspace entirely.

In the same way as there is no cure for time jumps caused by
settimeofday(), daylight saving changes, leap seconds etc., unless the
application is carefully designed to deal with that.

> Appreciate you chiming in on this topic yet again.

I still hope that this gets fixed _before_ I retire :)

Thanks,

        tglx
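
For illustration only: a minimal sketch of how the sequence-number scheme
quoted above might fit together, covering both the host-side metadata
update and the guest-side check. Every name here (struct
pv_migration_info, host_fill_info(), check_migration(), the enum values)
is hypothetical and does not describe an existing kernel or hypervisor
interface. The threshold is a parameter, so passing 0 models the
TOLERABLE_THRESHOLD=0 starting point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PV metadata exposed by the host; the layout is an assumption. */
struct pv_migration_info {
	uint64_t seq;             /* incremented by one on every migration */
	int64_t  tout_ns;         /* dest/source host delta of clock TAI */
	bool     tsc_compensated; /* true if the host already advanced TSC by Tout */
};

/* What the guest should do after looking at the metadata. */
enum guest_action {
	PROCEED_AS_USUAL,  /* sequence unchanged: no migration happened */
	RESET_NTP_ONLY,    /* Tout already folded in via TSC */
	INJECT_TAI_DELTA,  /* guest injects Tout, resets NTP, wakes cancel-on-set timers */
};

/* Guest-side bookkeeping: the last sequence number the guest acted on. */
struct guest_state {
	uint64_t last_seq;
};

/* Host side on restore: bump the sequence and record how Tout was handled. */
static void host_fill_info(struct pv_migration_info *pv, int64_t t_stop_ns,
			   int64_t t_restart_ns, int64_t threshold_ns)
{
	int64_t delta = t_restart_ns - t_stop_ns;

	pv->seq += 1;
	pv->tout_ns = delta;
	/* Case 1 (below threshold): compensate via TSC. Case 2: leave TSC alone. */
	pv->tsc_compensated = delta < threshold_ns;
}

/* Guest side: the "if (seq != get_migration_sequence())" check from the thread. */
static enum guest_action check_migration(struct guest_state *gs,
					 const struct pv_migration_info *pv)
{
	if (pv->seq == gs->last_seq)
		return PROCEED_AS_USUAL;

	gs->last_seq = pv->seq;
	return pv->tsc_compensated ? RESET_NTP_ONLY : INJECT_TAI_DELTA;
}
```

With a threshold of 0 the host never compensates via TSC, so every
migration ends up in the INJECT_TAI_DELTA path and the guest alone is
responsible for the cleanup.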