2017-12-05 10:16+0200, Liran Alon: > When L1 wants to send a posted-interrupt to another L1 CPU running L2, > it sets the relevant bit in vmx->nested.pi_desc->pir and ON bit in > vmx->nested.pi_desc->control. Then it attempts to send a > notification-vector IPI to dest L1 CPU. > This attempt to send IPI will exit to L0 which will reach > vmx_deliver_nested_posted_interrupt() which does the following: > 1. If dest L0 CPU is currently running guest (vcpu->mode == > IN_GUEST_MODE), it sends a physical IPI of PI nested-notification-vector. > 2. It sets KVM_REQ_EVENT in dest vCPU. This is done such that if dest > L0 CPU exits from guest to host and therefore doesn't recognize physical > IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on next > vmentry which will call vmx_check_nested_events() which should call > (in theory) vmx_complete_nested_posted_interrupt(). That function > should see vmx->nested.pi_desc->control ON bit is set and therefore > "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in L1 > virtual-apic-page & update vmcs02 RVI). > > The above logic regarding nested-posted-interrupts contains multiple > issues: > > A) Race-condition: > On step (1) it is possible sender will see vcpu->mode == IN_GUEST_MODE > but before sending physical IPI, the dest CPU will exit to host. > Therefore, physical IPI could be received at host which it's hanlder > does nothing. In addition, assume that dest CPU passes the checks > for pending kvm requests before sender sets KVM_REQ_EVENT. Therefore, > dest CPU will resume L2 without evaluating nested-posted-interrupts. > > B) vmx_complete_nested_posted_interrupt() is not always called > when needed. Even if dest CPU consumed KVM_REQ_EVENT, there is a bug > that vmx_check_nested_events() could exit from L2 to L1 before calling > vmx_complete_nested_posted_interrupt(). Therefore, on next resume of > L1 into L2, nested-posted-interrupts won't be evaluated even though L0 > resume L2 (We may resume L2 without having KVM_REQ_EVENT set). > > This commit removes nested-posted-interrupts processing from > check_nested_events() and instead makes sure to process > nested-posted-interrupts on vmentry after interrupts > disabled. Processing of nested-posted-interrupts is delegated > to hardware by issueing a self-IPI of relevant notification-vector > which will be delivered to CPU when CPU is in guest. > > * Bug (A) is solved by the fact that processing of > nested-posted-interrupts is not dependent on KVM_REQ_EVENT and > happens before every vmentry to L2. > * Bug (B) is now trivially solved by processing > nested-posted-interrupts before each vmentry to L2 guest. > > An alternative could have been to just call > vmx_complete_nested_posted_interrupt() at this call-site. However, we > have decided to go with this approach because: > 1. It would require modifying vmx_complete_nested_posted_interrupt() > to be able to work with interrupts disabled (not-trivial). > 2. We preffer to avoid software-emulation of hardware behavior if it > is possible. Nice solution! Self-IPI was slower on non-nested, but even if it is slower here, the code saving is definitely worth it. > Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") > > Signed-off-by: Liran Alon <liran.alon@xxxxxxxxxx> > Co-authored-by: Nikita Leshenko <nikita.leshchenko@xxxxxxxxxx> > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx> > Signed-off-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> > --- > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) > kvm_vcpu_kick(vcpu); > } > > +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + > + WARN_ON(!vcpu->arch.apicv_active); > + > + if (is_guest_mode(vcpu) && vmx->nested.pi_desc && > + pi_test_on(vmx->nested.pi_desc)) > + kvm_vcpu_trigger_posted_interrupt(vcpu, true); We're always sending to self, so the function would be better with apic->send_IPI_self() as it might be accelerated by hardware. > +} > + > /* > * Set up the vmcs's constant host-state fields, i.e., host-state fields that > * will not change in the lifetime of the guest. > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > smp_mb__after_srcu_read_unlock(); > > /* > - * This handles the case where a posted interrupt was > - * notified with kvm_vcpu_kick. > + * In case guest got the posted-interrupt notification > + * vector while running in host, we need to make sure > + * it arrives to guest. > + * For L1 posted-interrupts, we manually sync PIR to IRR. > + * For L2 posted-interrupts, we send notification-vector > + * again by self IPI such that it will now be received in guest. > */ > - if (kvm_lapic_enabled(vcpu)) { > - if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active) > + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) { > + if (kvm_x86_ops->sync_pir_to_irr) > kvm_x86_ops->sync_pir_to_irr(vcpu); > + if (kvm_x86_ops->complete_nested_posted_interrupt) > + kvm_x86_ops->complete_nested_posted_interrupt(vcpu); I think that vcpu->arch.apicv_active is enough to guarantee presence of both sync_pir_to_irr and complete_nested_posted_interrupt, and considering this is a hot path, I'd check is_guest_mode(vcpu) before calling the second function. (Saving the function call with a new x86_op for those two seems viable too, but would need deeper analysis and benchmarking as it is putting more code into hot cache ...) Thanks.