2016-09-07 00:29+0200, Paolo Bonzini: > Bad things happen if a guest using the TSC deadline timer is migrated. > The guest doesn't re-calibrate the TSC after migration, and the > TSC frequency can and will change unless your processor supports TSC > scaling (on Intel this is only Skylake) or your data center is perfectly > homogeneous. KVM has software scaling thanks to tsc_catchup and virtual_tsc_khz to allow the guest to use hardware TSC directly. The software scaling should recalibrate TSC on vmexits and is therefore losing precision in between -- are you working around the imprecision? Host should translate the requested tsc deadline into nanoseconds based on vcpu->arch.virtual_tsc_khz, which doesn't change on migration, so the solution shouldn't work, because we'd be scaling twice. Don't we have a bug in the TSC recalibration/scaling on KVM side? > The solution in this patch is to skip tsc_khz, and instead derive the > frequency from kvmclock's (mult, shift) pair. Because kvmclock > parameters convert from tsc to nanoseconds, this needs a division > but that's the least of our problems when the TSC_DEADLINE_MSR write > costs 2000 clock cycles. Luckily tsc_khz is really used by very little > outside the tsc clocksource (which kvmclock replaces) and the TSC > deadline timer. Because KVM's local APIC doesn't need quirks, we > provide a paravirt clockevent that still uses the deadline timer > under the hood (as suggested by Andy Lutomirski). I don't the general idea. TSC and TSC_DEADLINE_TIMER is a standard feature; If we don't emulate it, then we should not expose it in CPUID. rdtsc and wrmsr would still be available for kvmclock, but naive operating systems would not be misled. When we are already going the paravirtual route, we could add an interface that accepts the deadline in kvmclock nanoseconds. It would be much more maintanable than adding a fragile paravirtual layer on top of random interfaces. > This patch does not handle the very first deadline, hoping that it > is masked by the migration downtime (i.e. that the timer fires late > anyway). > > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > --- Nonetheless, I also did a code review ... > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c > @@ -245,6 +246,155 @@ static void kvm_shutdown(void) > + > +static int kvmclock_lapic_timer_stop(struct clock_event_device *evt) > +{ > + kvmclock_lapic_timer_setup(APIC_LVT_MASKED); > + wrmsrl(MSR_IA32_TSC_DEADLINE, -1); This wrmsrl() can't inject an interrupt, because later switch to TSCDEADLINE mode disarms the timer, but it would be better to remove it. > + return 0; > +} > + > +/* > + * We already have the inverse of the (mult,shift) pair, though this means > + * we need a division. To avoid it we could compute a multiplicative inverse > + * every time src->version changes. > + */ > +#define KVMCLOCK_TSC_DEADLINE_MAX_BITS 38 > +#define KVMCLOCK_TSC_DEADLINE_MAX ((1ull << KVMCLOCK_TSC_DEADLINE_MAX_BITS) - 1) > + > +static int kvmclock_lapic_next_ktime(ktime_t expires, > + struct clock_event_device *evt) > +{ > + u64 ns, tsc; > + u32 version; > + int cpu; > + struct pvclock_vcpu_time_info *src; > + > + cpu = smp_processor_id(); > + src = &hv_clock[cpu].pvti; > + ns = ktime_to_ns(expires); > + > + do { > + u64 delta_ns; > + int shift; > + > + version = pvclock_read_begin(src); > + if (unlikely(ns < src->system_time)) { > + tsc = src->tsc_timestamp; > + virt_rmb(); > + continue; > + } > + > + delta_ns = ns - src->system_time; > + > + /* Cap the wait to avoid overflow. */ > + if (unlikely(delta_ns > KVMCLOCK_TSC_DEADLINE_MAX)) > + delta_ns = KVMCLOCK_TSC_DEADLINE_MAX; > + > + /* > + * delta_tsc = delta_ns << (32-tsc_shift) / tsc_to_system_mul. > + * The shift is split in two steps so that a 38 bits (275 s) > + * deadline fits into the 64-bit dividend. > + */ > + shift = 32 - src->tsc_shift; > + > + /* First shift step... */ > + delta_ns <<= 64 - KVMCLOCK_TSC_DEADLINE_MAX_BITS; > + shift -= 64 - KVMCLOCK_TSC_DEADLINE_MAX_BITS; Considering that the usual shift seems to be -1, we'd be losing 7 bits of precision. The nice thing is that the precision is bounded up to the cap ... I dislike the cap and LOC count, though, so the following, although not as tightly bounded seems a bit better: u64 mult = div_u64(1ULL << 63, tsc_to_system_mul); int shift = 63 - (32 - tsc_shift)); tsc = src->tsc_timestamp + mul_u64_u64_shr(delta_ns, mult, shift); mult should be quite stable, so we could cache it. (And if we didn't care about performance loss from 4(?) divisions, we could have much nicer shl_div().) > + > + /* ... division... */ > + tsc = div_u64(delta_ns, src->tsc_to_system_mul); > + > + /* ... and second shift step for the remaining bits. */ > + if (shift >= 0) > + tsc <<= shift; > + else > + tsc >>= -shift; > + > + tsc += src->tsc_timestamp; > + } while (pvclock_read_retry(src, version)); > + wrmsrl(MSR_IA32_TSC_DEADLINE, tsc); > + return 0; > +} > + > +/* > + * The local apic timer can be used for any function which is CPU local. > + */ > +static struct clock_event_device kvm_clockevent = { > + .name = "lapic", "lapic" is already used -- what about "kvm-lapic" or "kvm-tsc-deadline"? > + /* Under KVM the LAPIC timer always runs in deep C-states. */ > + .features = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_KTIME, > + .set_state_shutdown = kvmclock_lapic_timer_stop, > + .set_state_oneshot = kvmclock_lapic_timer_set_oneshot, > + .set_next_ktime = kvmclock_lapic_next_ktime, > + .mult = 1, > + /* Make LAPIC timer preferrable over percpu HPET */ > + .rating = 150, > + .irq = -1, > +}; > +static DEFINE_PER_CPU(struct clock_event_device, kvm_events); > + > +static void kvmclock_local_apic_timer_interrupt(void) > +{ > + int cpu = smp_processor_id(); > + struct clock_event_device *evt = &per_cpu(kvm_events, cpu); > + > + /* > + * Defer to the native clockevent if ours hasn't been setup yet. > + */ > + if (!evt->event_handler) { > + native_local_apic_timer_interrupt(); Don't we completely replace the native apic timer, so it can't even be registered? The comment doesn't explain what we expect from the call, so it's a bit confusing. I think the call expects that per_cpu(lapic_events, cpu).event_handler == NULL, so we do it to print the warning and disable the LAPIC timer. > + return; > + } > + > + inc_irq_stat(apic_timer_irqs); > + evt->event_handler(evt); > +} > + > +/* > + * Setup the local APIC timer for this CPU. Copy the initialized values > + * of the boot CPU and register the clock event in the framework. > + */ > +static void setup_kvmclock_timer(void) > +{ > + struct clock_event_device *evt = this_cpu_ptr(&kvm_events); > + > + kvmclock_lapic_timer_stop(evt); Why stop the timer here? We don't even know if the clockevent device will be used, so it seems out of order. > + memcpy(evt, &kvm_clockevent, sizeof(*evt)); > + evt->cpumask = cpumask_of(smp_processor_id()); > + clockevents_register_device(evt); > +} > +#endif > + > void __init kvmclock_init(void) > { > struct pvclock_vcpu_time_info *vcpu_time; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html