Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

Oliver Upton <oupton@xxxxxxxxxx> · Mon, 21 Mar 2022 21:23:24 +0000

On Mon, Mar 21, 2022 at 07:43:21PM +0000, David Woodhouse wrote:
> On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> > On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > > But coincidentally since then I have started having conversations with
> > > people who really want the guest to have an immediate knowledge of the
> > > adjtimex maxerror etc. on the new host immediately after the migration.
> > > Maybe the "if the migration isn't fast enough then let the guest know it's
> > > now unsynced" is OK, but I'll need to work out what "immediately" means
> > > when we have a guest userspace component involved in it.
> > 
> > This has also been an area of interest to me. I think we've all seen the
> > many ways in which doing migrations behind the guest's back can put software
> > in an extremely undesirable state on the other end. If those
> > conversations are taking place on the mailing lists, could you please CC
> > me?
> > 
> > Our (Google) TSC adjustment clamping and userspace notification mechanism
> > was a halfway kludge to keep things happy on the other end. And it
> > generally has worked well, but misses a fundamental point.
> > 
> > The hypervisor should tell the guest kernel about time travel and let it
> > cascade that information throughout the guest system. Regardless of what
> > we do to the TSC, we invariably destroy one of the two guest clocks along
> > the way. If we told the guest "you time traveled X seconds", it could
> > fold that into its own idea of real time. Guest kernel can then fire off
> > events to inform software that wants to keep up with clock changes, and
> > even a new event to let NTP know it's probably running on different
> > hardware.
> > 
> > Time sucks :-)
> 
> So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
> its clock may have experienced a jump. Linux guests will use this to
> kick various watchdogs to prevent them whining. Shouldn't we *also* be
> driving the NTP reset from that same signal?

Right, but I'd argue that interface has some problems too. It
depends on the guest polling instead of an interrupt from the
hypervisor. It also has no way of informing the kernel exactly how much
time has elapsed.

The whole point of all these hacks that we've done internally is that we,
the hypervisor, know full well how much real time has advanced during the
VM blackout. If we can at least let the guest know how much to fudge real
time, it can then poke NTP for better refinement. I worry about using NTP
as the sole source of truth for such a mechanism, since you'll need to go
out to the network, and any clock reads until the response comes back are
hosed.
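
To make that a bit more concrete, here is a rough sketch of the guest side
of such a scheme. Every name in it (blackout_info, guest_clock,
guest_apply_blackout) is made up for illustration; none of this is an
existing KVM or pvclock interface, and how the record would actually reach
the guest (ideally via an interrupt rather than polling) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record the hypervisor would publish to the guest. */
struct blackout_info {
	uint64_t elapsed_ns;	/* real time that passed while the VM was stopped */
	uint32_t seq;		/* bumped by the host for every blackout */
};

/* Hypothetical, heavily simplified guest-side clock state. */
struct guest_clock {
	uint64_t realtime_ns;	/* guest's idea of wall-clock time */
	uint32_t last_seq;	/* last blackout already accounted for */
	bool ntp_synced;	/* cleared to ask NTP for a refinement */
};

/*
 * Fold the host-reported blackout into guest real time exactly once.
 * Returns true if an adjustment was applied, false if this blackout
 * had already been handled.
 */
static bool guest_apply_blackout(struct guest_clock *clk,
				 const struct blackout_info *info)
{
	if (info->seq == clk->last_seq)
		return false;

	/* Coarse fix first, so clock reads are sane right away. */
	clk->realtime_ns += info->elapsed_ns;
	clk->last_seq = info->seq;

	/* Then let NTP refine it; we may be on different hardware now. */
	clk->ntp_synced = false;
	return true;
}
```

The ordering is the point: the coarse adjustment comes from the host
immediately, and NTP is only used to tighten it up afterwards.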

--
Thanks,
Oliver