Re: Bug in KVM clock backwards compensation

Zachary Amsden <zamsden@xxxxxxxxxx> · Thu, 28 Apr 2011 20:00:57 -0700

On 04/28/2011 01:20 PM, Joerg Roedel wrote:
On Thu, Apr 28, 2011 at 11:34:44AM -0700, Zachary Amsden wrote:

On 04/28/2011 12:13 AM, Roedel, Joerg wrote:

I see it different. This code wants to check if the _guest_ tsc moves
forwared (or at least not backwards). So it is fully legitimate to just
do this by reading the guest-tsc and compare it to the last one the
guest had.

That wasn't the intention when I wrote that code.  It's simply there to
detect backwards motion of the host TSC.  The guest TSC can legally go
backwards whenever the guest decides to change it, so checking the guest
TSC doesn't make sense here.

This code checks how many guest tsc cycles have passed since this vCPU
was de-scheduled last time (and before it is running again). So since
the vCPU hasn't run in the meantime it had no chance to change its TSC.
Further, the other parameters like the TSC offset and the scaling
multiplier havn't changed too, so the only variable in the guest-tsc
calculation is the host-tsc.
So this calculation using the guest-tsc can detect backwards going
host-tsc as good as the old one. The benefit here is that we can feed
consistent values into adjust_tsc_offset().

While true, this is more complex than the original code.  The original 
code here doesn't try to actually compensate for the guest TSC 
difference - instead what it does is NULL any discovered host TSC delta:

        if (tsc_delta < 0)
            mark_tsc_unstable("KVM discovered backwards TSC");
        if (check_tsc_unstable()) {
            kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
            vcpu->arch.tsc_catchup = 1;
        }
        kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

Erasing that delta also erases elapsed time since the VCPU has last been 
run, which isn't desirable, so it then sets tsc_catchup mode, which will 
restore the proper TSC.  The request here triggers code which later 
updates the TSC offset again.

To avoid complexity, I think it's simplest to do the first computation 
in terms of the host TSC.

Yes, with tsc-scaling, the machines already have stable TSCs - the above
test is for older hardware which could have problems, and can be
reverted back to the original code without worrying about switching
units.

This is the case pratically. But architecturally the tsc-scaling feature
does not depend on a constant tsc, so we can make no such assumtions.
Additionally, it may happen that Linux mis-detects an unstable tsc for
some reason (broken BIOS, bug in the code, ...).  Therefore I think it
is dangerous to assume that this code will never run on tsc-scaling
capable hosts. And if it does and we don't manage the tsc-offset units
right, we may see very weird behavior.

I agree, it is best to handle this case - hardware can and will change - 
but the TSC adjustment in terms of guest rate should be done under the 
atomic protection right before entering hardware virtualized mode - here:

I left compute_guest_tsc in place to recompute time in guest units here, 
even if the underlying hardware rate changes.

        /*
         * We may have to catch up the TSC to match elapsed wall clock
         * time for two reasons, even if kvmclock is used.
         *   1) CPU could have been running below the maximum TSC rate
         *   2) Broken TSC compensation resets the base at each VCPU
         *      entry to avoid unknown leaps of TSC even when running
         *      again on the same CPU.  This may cause apparent elapsed
         *      time to disappear, and the guest to stand still or run
         *      very slowly.
         */
        if (vcpu->tsc_catchup) {
                u64 tsc = compute_guest_tsc(v, kernel_ns);
                if (tsc > tsc_timestamp) {
                        kvm_x86_ops->adjust_tsc_offset(v, tsc - 
tsc_timestamp);
                        tsc_timestamp = tsc;
                }
        }

So yeah, the code is getting pretty complex but we'd like to avoid that 
as much as possible - so I would prefer to have the hardware backwards 
compensation separate from the guest rate computation by doing this:

step 1) remove any backwards hardware TSC delta (in hardware units)
step 2) recompute guest TSC from a stable clock (gotten from kernel_ns) 
and apply adjustment (in guest units)

So it appears you can just share most of the logic of guest TSC catchup 
mode.

What do you think?

Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html