> From: Nadav Har'El [mailto:nyh@xxxxxxxxxxxxxxxxxxx]
> Sent: Wednesday, May 25, 2011 8:34 PM
>
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX:
> Correct handling of interrupt injection":
> > >  static void enable_irq_window(struct kvm_vcpu *vcpu)
> > >  {
> > >  	u32 cpu_based_vm_exec_control;
> > > +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> > > +		/* We can get here when nested_run_pending caused
> > > +		 * vmx_interrupt_allowed() to return false. In this case, do
> > > +		 * nothing - the interrupt will be injected later.
> > > +		 */
> >
> > I think this is not a rare path? When the vcpu is in guest mode with L2
> > as the current vmx context, this function could be invoked multiple
> > times, since the kvm thread can be scheduled out/in randomly.
>
> As I wrote in this comment, this can only happen on nested_run_pending
> (i.e., VMLAUNCH/VMRESUME emulation), because if !nested_run_pending,
> and nested_exit_on_intr(), vmx_interrupt_allowed() would have already
> exited L2, and we wouldn't be in this case.
>
> I don't know whether to classify this as a "rare" path - it's definitely
> not very common. But what does it matter if it's rare or common?

It doesn't matter much; I was just trying to understand your comment.

> > > +	if (to_vmx(vcpu)->nested.nested_run_pending)
> > > +		return 0;
> >
> > Well, now I can see why you require this special 'nested_run_pending'
> > flag: there are places where L0 injects virtual interrupts right after
> > VMLAUNCH/VMRESUME emulation and before entering L2. :-)
>
> Indeed. I tried to explain that in the patch description, where I wrote:
>
>   We keep a new flag, "nested_run_pending", which can override the
>   decision of which should run next, L1 or L2. nested_run_pending=1 means
>   that we *must* run L2 next, not L1. This is necessary in particular when
>   L1 did a VMLAUNCH of L2 and therefore expects L2 to be run (and perhaps
>   be injected with an event it specified, etc.). Nested_run_pending is
>   especially intended to avoid switching to L1 in the injection
>   decision-point described above.

When nested_run_pending is first introduced its usage is simple, which made
me think the field might not be required. But several later patches do
depend on this flag for correctness. :-)
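Just so we are looking at the same thing, this is how I read the whole of
enable_irq_window() with this patch applied. The quoted hunk above stops at
the comment, and the read-modify-write of CPU_BASED_VM_EXEC_CONTROL at the
bottom is reconstructed from my memory of the existing vmx.c code, so
please treat this as a sketch rather than the exact upstream function:

static void enable_irq_window(struct kvm_vcpu *vcpu)
{
        u32 cpu_based_vm_exec_control;

        if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
                /*
                 * Only reachable when nested_run_pending made
                 * vmx_interrupt_allowed() return false: L2 must be entered
                 * once before anything else, so don't request an interrupt
                 * window now; the interrupt is injected after the later
                 * L2->L1 exit.
                 */
                return;

        cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
        cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
        vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
}

Read that way, the early return can only trigger in the narrow window
between VMLAUNCH/VMRESUME emulation and the first entry into L2, which
answers my "rare path" question above.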
> > > +		nested_vmx_vmexit(vcpu);
> > > +		vmcs12 = get_vmcs12(vcpu);
> > > +		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> > > +		vmcs12->vm_exit_intr_info = 0;
> > > +		/* fall through to normal code, but now in L1, not L2 */
> > > +	}
> > > +
> >
> > This is a bad place to add this logic. vmx_interrupt_allowed is simply a
> > query function, but you make it an absolute trigger point for switching
> > from L2 to L1. This is fine for now, since the only point calling
> > vmx_interrupt_allowed is when there's a vNMI pending, but it's dangerous
> > to build such an assumption about pending events into
> > vmx_interrupt_allowed.
>
> Now you're beating a dead horse ;-)
>
> Gleb, and to some degree Avi, already argued that this is the wrong place
> to do this exit, and if anything the exit should be done (or just decided
> on) in enable_irq_window().
>
> My counter-argument was that the right way is *neither* of these
> approaches - any attempt to "commandeer" one of the existing x86 ops, be
> they vmx_interrupt_allowed() or enable_irq_window(), to do in the L2 case
> things they were never designed to do is both ugly, and dangerous if the
> call sites change at some time in the future.
>
> So rather than changing one ugly abuse of one function to the (arguably
> also ugly) abuse of another function, what I'd like to see is a better
> overall design, where the call sites in x86.c know about the possibility
> of a nested guest (they already do - like we previously discussed, an
> is_guest_mode() function was recently added), and when they need, *they*
> will call an exit-to-L1 function, rather than calling a function called
> "enable_irq_window" or "vmx_interrupt_allowed" which mysteriously will do
> the exit.

I agree with your point here.

> > On the other hand, I think there's one area which is not handled in a
> > timely manner. I think you need to kick an L2->L1 transition when L0
> > wants to inject a virtual interrupt. Consider your current logic:
> >
> > a) L2 is running on cpu1
> > b) L0 on cpu0 decides to post a virtual interrupt to L1. An IPI is
> > issued to cpu1 after updating the virtual irqchip
> > c) L2 on cpu1 vmexits to L0, which checks whether L0 or L1 should handle
> > the event. As it's an external interrupt, L0 will handle it. As it's
> > only a notification IPI, nothing more is required.
> > d) L0 on cpu1 then decides to resume, and finds KVM_REQ_EVENT set
> >
> > At this point you only add logic to enable_irq_window, but there's no
> > action to trigger the L2->L1 transition. So what will happen? Will the
> > event be injected into L2 instead, or stay pending until the next switch
> > happens for some other reason?
>
> I'm afraid I'm missing something in your explanation... In step d, L0
> finds an interrupt in the injection queue, so isn't the first thing it
> does to call vmx_interrupt_allowed(), to check whether injection is
> allowed now? In our code, "vmx_interrupt_allowed()" was bastardized to
> exit to L1 in this case. Isn't that the missing exit you were looking for?

This is a false alarm. In my earlier search I thought vmx_interrupt_allowed
was only invoked in vmx.c, for the pending-vNMI check, which led me to
suspect a bigger problem. But .interrupt_allowed is in fact checked in the
common x86.c path as expected, so the problem was on my side. :-)

Thanks
Kevin
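PS: to close the loop on my misreading, here is a tiny user-space model of
the decision point we have been discussing - the nested_run_pending gate
and the exit-to-L1 hidden inside the interrupt_allowed query. This is only
my reading of the patches boiled down to a toy; the names mirror the real
ones but nothing below is code from the series:

/* toy model only - not kvm code */
#include <stdbool.h>
#include <stdio.h>

struct toy_vcpu {
        bool guest_mode;            /* currently running L2 */
        bool nested_exit_on_intr;   /* L1 wants external-interrupt exits */
        bool nested_run_pending;    /* L1 just did VMLAUNCH/VMRESUME */
        bool guest_irq_enabled;     /* RFLAGS.IF of whoever runs next */
};

/* models vmx_interrupt_allowed(): may switch L2->L1 as a side effect */
static bool toy_interrupt_allowed(struct toy_vcpu *v)
{
        if (v->guest_mode && v->nested_exit_on_intr) {
                if (v->nested_run_pending)
                        return false;   /* L2 must be entered once first */
                v->guest_mode = false;  /* like nested_vmx_vmexit(): now in L1 */
        }
        return v->guest_irq_enabled;
}

int main(void)
{
        struct toy_vcpu v = { true, true, true, true };
        bool ok;

        /* L0 wants to inject right after VMLAUNCH emulation: must wait */
        printf("inject now? %d (expect 0)\n", toy_interrupt_allowed(&v));

        /* after the one mandatory L2 entry the flag is cleared ... */
        v.nested_run_pending = false;
        /* ... and the query itself drops us to L1, where injection proceeds */
        ok = toy_interrupt_allowed(&v);
        printf("inject now? %d (expect 1), guest_mode=%d (expect 0)\n",
               ok, v.guest_mode);
        return 0;
}

Compiled with gcc it prints 0 for the VMLAUNCH/VMRESUME window and then 1
with guest_mode cleared, which matches the behaviour described above.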