Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes

Radim Krčmář <rkrcmar@xxxxxxxxxx> · Fri, 10 Oct 2014 14:50:38 +0200

2014-10-10 11:45+0200, Paolo Bonzini:
> Il 10/10/2014 03:55, Nadav Amit ha scritto:
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index b8345dd..51428dd 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
> >>  		if (likely(tscdeadline > guest_tsc)) {
> >>  			ns = (tscdeadline - guest_tsc) * 1000000ULL;
> >>  			do_div(ns, this_tsc_khz);
> >> +			hrtimer_start(&apic->lapic_timer.timer,
> >> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >> +		} else {
> >> +			atomic_inc(&ktimer->pending);
> >> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
> >>  		}
> >> -		hrtimer_start(&apic->lapic_timer.timer,
> >> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >>
> >>  		local_irq_restore(flags);
> >>  	}
> >> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
> >>  		return;
> >>
> >>  	hrtimer_cancel(&apic->lapic_timer.timer);
> >> -	/* Inject here so clearing tscdeadline won't override new value */
> >> -	if (apic_has_pending_timer(vcpu))
> >> -		kvm_inject_apic_timer_irqs(vcpu);
> >>  	apic->lapic_timer.tscdeadline = data;
> >>  	start_apic_timer(apic);
> >> }
> > 
> > Perhaps I am missing something, but I don’t see how it solves the problem I encountered.
> > Recall the scenario:
> > 1. A TSC deadline timer interrupt is pending.
> > 2. TSC deadline was still not cleared (which happens during vcpu_run).
> > 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> > 
> > It appears that in such scenario the guest would still get spurious
> > interrupt for no reason, as ktimer->pending may already be increased in
> > apic_timer_fn.
> 
> If ktimer->pending == 2 you still get only one interrupt, not two.  Am I
> misunderstanding your objection?

(I don't see it either, the patch removed kvm_inject_apic_timer_irqs(),
 so we inject as if the MSR write didn't happen.)

> > Second, I think that the solution I proposed would perform better.
> > Currently, there are many unnecessary cancellations and setups of the
> > timer. This solution does not resolve this problem.
> 
> I think it does.  You do not get an hrtimer_start if tscdeadline <=
> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
> before calling hrtimer_cancel, or go straight to the source and avoid
> taking the lock in the easy cases:

I think the useless cancels were when the timer is set to the same value
in the future.

Which makes sense to optimize, but I wasn't sure how it would affect the
guest if the TSC offset was be changing or TSC itself wasn't stable ...
so I chose not to do it.

> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 1c2fe7de2842..6ce725007424 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
>  {
>  	struct hrtimer_clock_base *base;
>  	unsigned long flags;
> -	int ret = -1;
> +	unsigned long state = timer->state;
> +	int ret;
> +
> +	if (state & HRTIMER_STATE_ENQUEUED)
> +		return 0;

It doesn't try very hard to cancel it ;)

> +	if (state & HRTIMER_STATE_CALLBACK)
> +		return -1;
> 
>  	base = lock_hrtimer_base(timer, &flags);
> 
> +	ret = -1;
>  	if (!hrtimer_callback_running(timer))
>  		ret = remove_hrtimer(timer, base);
> 
> > Last, I think that having less interrupts on deadline changes is not
> > completely according to the SDM which says: "If software disarms the
> > timer or postpones the deadline, race conditions may result in the
> > delivery of a spurious timer interrupt.” It never says interrupts may
> > be lost if you reprogram the deadline before you check it expired.

Yeah, too bad it wasn't written in a formally defined language ...
They could be just reserving design space for concurrent implementation,
not making race conditions mandatory.

Our race windows are much bigger -- a disarm that is hundreds of cycles
in the future can easily be hit after a msr write vm-exit, but it would
be very unlikely to get an interrupt on bare metal.

And thinking about the meaning of postpone or disarm -- software doesn't
want the old interrupt anymore, so it would just add useless work.

(And I liked the code better without it, but it'd be ok if we fixed all
 the cases.)

> But the case when you rewrite the same value to the MSR is neither
> disarming nor postponing.  You would be getting two interrupts for the
> same event.  That is why I agree with Radim that checking host_initiated
> is wrong.

Current implementation also injects two interrupts when the timer is set
to a lower nonzero value, which should give us just one under both
interpretations.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html