On 22/05/2016, Alan Jenkins <alan.christopher.jenkins@xxxxxxxxx> wrote:
> On 22/05/2016, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
>> 2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@xxxxxxxxx>:
>>> 2016-05-20 19:47 GMT+08:00 Alan Jenkins
>>> <alan.christopher.jenkins@xxxxxxxxx>:
>>>> Hi
>>>>
>>>> I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
>>>> and it didn't seem to reproduce on a second setup. However, I found a
>>>> suspect in the code for TSC deadline timers. Maybe someone is
>>>> interested in my analysis.
>>>>
>>>> If a guest has a TSC deadline timer set, the timer is not recalculated
>>>> when the TSC is adjusted, so it will fire at the wrong time: the
>>>> hrtimer used for emulation is never re-armed. If the TSC was set
>>>> _backwards_, I think it could trigger a lockup in wait_lapic_expire().
>>>> This function is a busy-loop optimization, which could be tricked into
>>>> busy-looping for too long. The expected maximum busy-wait is
>>>> `lapic_timer_advance_ns`, which defaults to 0 (I haven't changed it).
>>>
>>> The default clockevent device for hrtimers is the lapic, and the TSC
>>> works as a clocksource device. Even if the TSC in the guest is set
>>> backwards/adjusted, the clockevent device is not influenced and works
>>> normally, I think, so we just need to make sure the clockevent device
>>> can fire as soon as possible when it expires. How about something like
>>> below? (I can't reproduce your bug, so I can only send out a formal
>>> patch after you confirm.)
>
> I don't understand your explanation. I don't mind testing such a
> patch, as it will fix the issue I experience :).
>
> In my understanding this is only a workaround, so we should also add a
> (ratelimited) warning message.
>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index bbb5b28..02a21d3 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1310,7 +1310,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>>
>>>	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
>>>	if (guest_tsc < tsc_deadline)
>>> -		__delay(tsc_deadline - guest_tsc);
>>> +		__delay(max((tsc_deadline - guest_tsc), lapic_timer_advance_ns));
>>> }

But that's comparing TSC ticks and nanoseconds?

>> s/max/min
>
> wait, I found another one:
>
>>> -		__delay(tsc_deadline - guest_tsc);
>
> (tsc_deadline - guest_tsc) gives a value in terms of the guest TSC rate.
> But guest TSCs are scaled. __delay() must expect a value in terms of the
> host TSC rate. So the busy-wait could take shorter or longer than
> expected, depending on how the guest TSC is scaled. Right?

I wrote an expanded test patch, which is posted with full results on the
Fedora bug:

  https://bugzilla.redhat.com/show_bug.cgi?id=1337667

It confirmed the lockup can be avoided by sanity-checking the delay in
wait_lapic_expire(). With my test patch, I get a warning message but no
lockup.

Wanpeng, I have another description here; maybe it is more convincing?

  https://github.com/torvalds/linux/commit/5dda13b540c0baeebb69f612036e9e1d9d754ad8

My test patch also added checks for the guest TSC being set backwards.
The log confirms this is happening, though I haven't confirmed exactly
how the lockup is created. It's non-obvious whether there's a match
between the adjustments shown and the exact lockup duration.
(Conversion factor = reported guest TSC rate = 1.6 GHz = host TSC rate.)
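To pin down the units issue, here is a small userspace demo of the two
conversions a correct clamp would need. Illustration only: the variable
names and TSC values are made up, the 1.6 GHz rates are taken from my
logs, and I believe the kernel already has an nsec_to_cycles() helper in
arch/x86/kvm/x86.c that could do the first step.

/* units-demo.c: why min(ticks, ns) is meaningless, and what the
 * clamp needs instead.  Illustration only -- names, rates, and TSC
 * values are mine, not the kernel's.
 */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

static const uint64_t guest_tsc_khz = 1600000ULL; /* 1.6 GHz, from my logs */
static const uint64_t host_tsc_khz  = 1600000ULL; /* same on my machine    */
static const uint64_t lapic_timer_advance_ns = 1000ULL; /* module param, ns */

int main(void)
{
	uint64_t guest_tsc    = 1000000ULL;	/* made up */
	uint64_t tsc_deadline = 801000000ULL;	/* made up: 0.5 s ahead */
	uint64_t delta = tsc_deadline - guest_tsc; /* guest TSC ticks */

	/* Step 1: the ns bound must become *guest* TSC ticks before the
	 * comparison means anything (ticks = ns * kHz / 1e6). */
	uint64_t bound = lapic_timer_advance_ns * guest_tsc_khz / NSEC_PER_MSEC;

	if (delta > bound)
		delta = bound;

	/* Step 2: __delay() counts *host* TSC ticks, so a scaled guest
	 * needs its delta converted to the host rate as well. */
	uint64_t host_ticks = delta * host_tsc_khz / guest_tsc_khz;

	printf("clamp in guest TSC ticks: %llu\n", (unsigned long long)bound);
	printf("busy-wait in host ticks:  %llu\n", (unsigned long long)host_ticks);
	return 0;
}

Note that with the default lapic_timer_advance_ns = 0, the correct clamp
collapses the busy-wait to nothing at all.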
Note the kernel log shows a VM being resumed from a snapshot, twice. I
don't stop the VM before I ask virt-manager to resume from the snapshot
the second time. Normally the lockup happens on that second resume.

While I'm ranting: adding the TSC checks made me think I understand
enough to detail the "real" fix, i.e. re-arming the emulated timer when
the guest TSC is written. Not tested:

  https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
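The gist is roughly the following. This is only my sketch of the idea,
untested and not the code on that branch; recompute_tsc_deadline_hrtimer()
is a placeholder for re-running the start_apic_timer() arithmetic, and
where exactly the hook sits in the TSC write paths (kvm_write_tsc(),
TSC_ADJUST handling) is an assumption.

/* Untested sketch: an armed TSC-deadline hrtimer was computed against
 * the old guest TSC value, so re-arm it after a guest TSC write.
 */
static void kvm_lapic_tsc_written(struct kvm_vcpu *vcpu)
{
	struct kvm_lapic *apic = vcpu->arch.apic;

	/* Simplified check: only matters with a pending TSC deadline. */
	if (!lapic_in_kernel(vcpu) || !apic->lapic_timer.tscdeadline)
		return;

	/* The deadline itself (guest TSC units) is unchanged; only its
	 * distance from the new guest TSC value has moved. */
	hrtimer_cancel(&apic->lapic_timer.timer);
	recompute_tsc_deadline_hrtimer(apic);	/* placeholder */
}

Alan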