On 22/05/2016, Alan Jenkins <alan.christopher.jenkins@xxxxxxxxx> wrote:
> On 22/05/2016, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
>> 2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@xxxxxxxxx>:
>>> 2016-05-20 19:47 GMT+08:00 Alan Jenkins
>>> <alan.christopher.jenkins@xxxxxxxxx>:
>>>> Hi
>>>>
>>>> I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
>>>> and it didn't seem to reproduce on a second setup. However, I found a
>>>> suspect in the code for TSC deadline timers. Maybe someone is
>>>> interested in my analysis.
>>>>
>>>> If a guest has a TSC deadline timer set, the timer is not recalculated
>>>> when the TSC is adjusted, so it will fire at the wrong time: the
>>>> hrtimer used for emulation is never re-armed. If the TSC was set
>>>> _backwards_, I think it could trigger a lockup in wait_lapic_expire().
>>>> This function is a busy-loop optimization, which could be tricked into
>>>> busy-looping for too long. The expected maximum busy-wait is
>>>> `lapic_timer_advance_ns`, which defaults to 0 (I haven't changed it).
>>>
>>> The default clockevent device for hrtimers is the lapic, and the TSC
>>> works as a clocksource device. Even if the TSC in the guest is set
>>> backwards/adjusted, the clockevent device is not influenced and works
>>> normally, I think, so we just need to make sure the clockevent device
>>> can fire as soon as possible when it expires. How about something like
>>> below? (I can't reproduce your bug, so I can only send out a formal
>>> patch after you confirm.)
>
> I don't understand your explanation. I don't mind testing such a
> patch, as it will fix the issue I experience :).
>
> In my understanding this is only a workaround, so we should also add a
> (ratelimited) warning message.
>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index bbb5b28..02a21d3 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1310,7 +1310,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>>
>>>	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
>>>	if (guest_tsc < tsc_deadline)
>>> -		__delay(tsc_deadline - guest_tsc);
>>> +		__delay(max((tsc_deadline - guest_tsc), lapic_timer_advance_ns));
>>> }

But that's comparing TSC ticks and nanoseconds?

>> s/max/min
>
> wait, I found another one:
>
>>> -		__delay(tsc_deadline - guest_tsc);
>
> (tsc_deadline - guest_tsc) gives a value in terms of the guest TSC rate.
> But guest TSCs are scaled. __delay() must expect a value in terms of the
> host TSC rate. So the busy-wait could take shorter or longer than
> expected, depending on how the guest TSC is scaled. Right?

I wrote an expanded test patch, which is posted with full results on the
Fedora bug:

  https://bugzilla.redhat.com/show_bug.cgi?id=1337667

It confirmed the lockup can be avoided by sanity-checking the delay in
wait_lapic_expire(). With my test patch, I get a warning message but no
lockup.

Wanpeng, I have another description here; maybe it is more convincing?

  https://github.com/torvalds/linux/commit/5dda13b540c0baeebb69f612036e9e1d9d754ad8

My test patch also added checks for the guest TSC being set backwards.
The log confirms this is happening, though I haven't confirmed exactly
how the lockup is created. It's non-obvious whether there's a match
between the adjustments shown and the exact lockup duration.
(Conversion factor = reported guest TSC rate = 1.6 GHz = host TSC rate.)
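To pin down the units issue, here is a small userspace demo of the two
conversions a correct clamp would need. Illustration only: the variable
names and TSC values are made up, the 1.6 GHz rates are taken from my
logs, and I believe the kernel already has an nsec_to_cycles() helper in
arch/x86/kvm/x86.c that could do the first step.

/* units-demo.c: why min(ticks, ns) is meaningless, and what the
 * clamp needs instead.  Illustration only -- names, rates, and TSC
 * values are mine, not the kernel's.
 */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

static const uint64_t guest_tsc_khz = 1600000ULL; /* 1.6 GHz, from my logs */
static const uint64_t host_tsc_khz  = 1600000ULL; /* same on my machine    */
static const uint64_t lapic_timer_advance_ns = 1000ULL; /* module param, ns */

int main(void)
{
	uint64_t guest_tsc    = 1000000ULL;	/* made up */
	uint64_t tsc_deadline = 801000000ULL;	/* made up: 0.5 s ahead */
	uint64_t delta = tsc_deadline - guest_tsc; /* guest TSC ticks */

	/* Step 1: the ns bound must become *guest* TSC ticks before the
	 * comparison means anything (ticks = ns * kHz / 1e6). */
	uint64_t bound = lapic_timer_advance_ns * guest_tsc_khz / NSEC_PER_MSEC;

	if (delta > bound)
		delta = bound;

	/* Step 2: __delay() counts *host* TSC ticks, so a scaled guest
	 * needs its delta converted to the host rate as well. */
	uint64_t host_ticks = delta * host_tsc_khz / guest_tsc_khz;

	printf("clamp in guest TSC ticks: %llu\n", (unsigned long long)bound);
	printf("busy-wait in host ticks:  %llu\n", (unsigned long long)host_ticks);
	return 0;
}

Note that with the default lapic_timer_advance_ns = 0, the correct clamp
collapses the busy-wait to nothing at all.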
Note the kernel log shows a VM being resumed from a snapshot, twice. I
don't stop the VM before I ask virt-manager to resume from the snapshot
the second time. Normally the lockup happens on that second resume.

While I'm ranting: adding the TSC checks made me think I understand
enough to detail the "real" fix, i.e. re-arming the emulated timer when
the guest TSC is written. Not tested:

  https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
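The gist is roughly the following. This is only my sketch of the idea,
untested and not the code on that branch; recompute_tsc_deadline_hrtimer()
is a placeholder for re-running the start_apic_timer() arithmetic, and
where exactly the hook sits in the TSC write paths (kvm_write_tsc(),
TSC_ADJUST handling) is an assumption.

/* Untested sketch: an armed TSC-deadline hrtimer was computed against
 * the old guest TSC value, so re-arm it after a guest TSC write.
 */
static void kvm_lapic_tsc_written(struct kvm_vcpu *vcpu)
{
	struct kvm_lapic *apic = vcpu->arch.apic;

	/* Simplified check: only matters with a pending TSC deadline. */
	if (!lapic_in_kernel(vcpu) || !apic->lapic_timer.tscdeadline)
		return;

	/* The deadline itself (guest TSC units) is unchanged; only its
	 * distance from the new guest TSC value has moved. */
	hrtimer_cancel(&apic->lapic_timer.timer);
	recompute_tsc_deadline_hrtimer(apic);	/* placeholder */
}

Alan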