Hi Sean,

On Thu, May 06, 2021 at 09:35:46PM +0000, Sean Christopherson wrote:
> On Thu, May 06, 2021, Marcelo Tosatti wrote:
> > Index: kvm/arch/x86/kvm/vmx/posted_intr.c
> > ===================================================================
> > --- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
> > +++ kvm/arch/x86/kvm/vmx/posted_intr.c
> > @@ -114,7 +114,7 @@ static void __pi_post_block(struct kvm_v
> >  	} while (cmpxchg64(&pi_desc->control, old.control,
> >  			   new.control) != old.control);
> >
> > -	if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
> > +	if (vcpu->pre_pcpu != -1) {
> >  		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> >  		list_del(&vcpu->blocked_vcpu_list);
> >  		spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> > @@ -135,20 +135,13 @@ static void __pi_post_block(struct kvm_v
> >   * this case, return 1, otherwise, return 0.
> >   *
> >   */
> > -int pi_pre_block(struct kvm_vcpu *vcpu)
> > +static int __pi_pre_block(struct kvm_vcpu *vcpu)
> >  {
> >  	unsigned int dest;
> >  	struct pi_desc old, new;
> >  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> >
> > -	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> > -		!irq_remapping_cap(IRQ_POSTING_CAP) ||
> > -		!kvm_vcpu_apicv_active(vcpu))
> > -		return 0;
> > -
> > -	WARN_ON(irqs_disabled());
> > -	local_irq_disable();
> > -	if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
> > +	if (vcpu->pre_pcpu == -1) {
> >  		vcpu->pre_pcpu = vcpu->cpu;
> >  		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> >  		list_add_tail(&vcpu->blocked_vcpu_list,
> > @@ -188,12 +181,33 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> >  	if (pi_test_on(pi_desc) == 1)
> >  		__pi_post_block(vcpu);
> >
> > +	return (vcpu->pre_pcpu == -1);
>
> Nothing checks the return of __pi_pre_block(), this can be dropped and the
> helper can be a void return.

Done.
> > +}
> > +
> > +int pi_pre_block(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	vmx->in_blocked_section = true;
> > +
> > +	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> > +		!irq_remapping_cap(IRQ_POSTING_CAP) ||
> > +		!kvm_vcpu_apicv_active(vcpu))
>
> Opportunistically fix the indentation?

Done.

> > +		return 0;
> > +
> > +	WARN_ON(irqs_disabled());
> > +	local_irq_disable();
> > +	__pi_pre_block(vcpu);
> >  	local_irq_enable();
> > +
> >  	return (vcpu->pre_pcpu == -1);
> >  }
> >
> >  void pi_post_block(struct kvm_vcpu *vcpu)
> >  {
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	vmx->in_blocked_section = false;
> >  	if (vcpu->pre_pcpu == -1)
> >  		return;
> >
> > @@ -236,6 +250,52 @@ bool pi_has_pending_interrupt(struct kvm
> >  		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
> >  }
> >
> > +static void pi_update_wakeup_vector(void *data)
> > +{
> > +	struct vcpu_vmx *vmx;
> > +	struct kvm_vcpu *vcpu = data;
> > +
> > +	vmx = to_vmx(vcpu);
> > +
> > +	/* race with pi_post_block ? */
> > +	if (vcpu->pre_pcpu != -1)
>
> This seems wrong.  The funky code in __pi_pre_block() regarding pre_pcpu muddies
> the waters, but I don't think it's safe to call __pi_pre_block() from a pCPU
> other than the pCPU that is associated with the vCPU.

From Intel's manual:

"29.6 POSTED-INTERRUPT PROCESSING
...
Use of the posted-interrupt descriptor differs from that of other data
structures that are referenced by pointers in a VMCS. There is a general
requirement that software ensure that each such data structure is modified
only when no logical processor with a current VMCS that references it is in
VMX non-root operation. That requirement does not apply to the
posted-interrupt descriptor. There is a requirement, however, that such
modifications be done using locked read-modify-write instructions."
> If the vCPU is migrated after vmx_pi_start_assignment() grabs vcpu->cpu but
> before the IPI arrives (to run pi_update_wakeup_vector()), then it's possible
> that a different pCPU could be running __pi_pre_block() concurrently with this
> code.  If that happens, both pCPUs could see "vcpu->pre_pcpu == -1" and corrupt
> the list due to a double list_add_tail.

Good point.

> The existing code is unnecessarily confusing, but unless I'm missing something,
> it's guaranteed to call pi_pre_block() from the pCPU that is associated with the
> vCPU, i.e. arguably it could/should use this_cpu_ptr().

Well, the problem is that the vCPU might not exit kvm_vcpu_block(). That can
be fixed, however.

> Because the existing
> code grabs vcpu->cpu with IRQs disabled and is called only from KVM_RUN,
> vcpu->cpu is guaranteed to match the current pCPU since vcpu->cpu will be set to
> the current pCPU when the vCPU is scheduled in.
>
> Assuming my analysis is correct (definitely not guaranteed), I'm struggling to
> come up with an elegant solution.  But, do we need an elegant solution?  E.g.
> can the start_assignment() hook simply kick all vCPUs with APICv active?
>
> > +		return;
> > +
> > +	if (!vmx->in_blocked_section)
> > +		return;
> > +
> > +	__pi_pre_block(vcpu);
> > +}
> > +
> > +void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
> > +{
> > +	struct kvm_vcpu *vcpu;
> > +	int i;
> > +
> > +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> > +		return;
> > +
> > +	/* only care about first device assignment */
> > +	if (device_count != 1)
> > +		return;
> > +
> > +	/* Update wakeup vector and add vcpu to blocked_vcpu_list */
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +		int pcpu;
> > +
> > +		if (!kvm_vcpu_apicv_active(vcpu))
> > +			continue;
> > +
> > +		preempt_disable();
>
> Any reason not to do "cpu = get_cpu()"?  Might make sense to do that outside of
> the for-loop, too.
kvm_vcpu_kick() seems cleaner; it just needs another arch hook to allow
kvm_vcpu_block() to return.

Thanks for the review! Will resend after testing.