Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Fri, 11 Dec 2020 11:18:22 -0300

On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 10 2020 at 21:27, Marcelo Tosatti wrote:
> > On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
> >> You really all live in a seperate universe creating your own rules how
> >> things which other people work hard on to get it correct can be screwed
> >> over.
> >
> > 	1. T = read timestamp.
> > 	2. migrate (VM stops for a certain period).
> > 	3. use timestamp T.
> 
> This is exactly the problem. Time stops at pause and continues where it
> stopped on resume.
> 
> But CLOCK_REALTIME and CLOCK_TAI advanced in reality. So up to the point
> where NTP fixes this - if there is NTP at all - the guest CLOCK_REALTIME
> and CLOCK_TAI are off by tpause.
> 
> Now the application gets a packet from the outside world with a
> CLOCK_REALTIME timestamp which is suddenly ahead of the value it reads
> from clock_gettime(CLOCK_REALTIME) by tpause. So what is it supposed to
> do with that? Make stupid assumptions that the other end screwed up
> timekeeping, throw an error that the system it is running on screwed up
> timekeeping? And a second later when NTP catched up it gets the next
> surprise because the systems CLOCK_REALTIME jumped forward unexpectedly
> or if there is no NTP it's confused forever.

This can happen even with a "perfect" solution that syncs time
instantly on the migration destination. See steps 1,2,3.

Unless you notify applications to invalidate their time reads,
i can't see a way to fix this.

Therefore if you use VM migration in the first place, a certain amount of
timestamp accuracy error must be tolerated.

> How can you even assume that this is correct?

As noted above, even without a window of unsynchronized time (due to
delay for NTP to sync time), time reads can be stale.

> It is exactly the same problem as we had many years ago with hardware
> clocks suddenly stopping to tick which caused quite some stuff to go
> belly up.

Customers complained when it was 5 seconds off, now its 0.1ms (and
people seem happy).

> In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced
> (with a certain degree of accuracy) to compensate for the sleep time, so
> the other end of a communication is at least in the same ballpark, but
> not 50 seconds off.

Its 100ms off with migration, and can be reduced further (customers
complained about 5 seconds but seem happy with 0.1ms).

> >> This features first, correctness later frenzy is insane and it better
> >> stops now before you pile even more crap on the existing steaming pile
> >> of insanities.
> >
> > Sure.
> 
> I wish that would be true. OS people - you should know that - are
> fighting forever with hardware people over feature madness and the
> attitude of 'we can fix that in software' which turns often enough out
> to be wrong.
> 
> Now sadly enough people who suffered from that madness work on
> virtualization and instead of trying to avoid the same problem they go
> off and make it even worse.

So you think its important to reduce the 100ms offset? 

> It's the same problem again as with hardware people. Not talking to the
> other people _before_ making uninformed assumptions and decisions.
> 
> We did it that way because big customer asked for it is not a
> justification for inflicting this on everybody else and thereby
> violating correctness. Works for me and my big customer is not a proof
> of correctness either.
> 
> It's another proof that this industry just "works" by chance.
> 
> Thanks,
> 
>         tglx

OK, makes sense, then reducing the 0.1ms window even further
is a useful thing to do. What would be an acceptable 
CLOCK_REALTIME accuracy error, on migration?