+Suravee, +Alejandro

On 29/06/2023 23:35, Sean Christopherson wrote:
> On Thu, May 18, 2023, Joao Martins wrote:
>> On 18/05/2023 09:19, Maxim Levitsky wrote:
>>> I think that we do need a flag indicating if the vCPU is currently
>>> running and if yes, then use svm->vcpu.cpu (or put -1 in it when it's not
>>> running or something) (currently the vcpu->cpu remains set when the vCPU is
>>> put)
>>>
>>> In other words, if a vCPU is running, then avic_pi_update_irte should put the
>>> correct pCPU number, and if it raced with vCPU put/load, then the later one should
>>> win and put the correct value. This can be done either with a lock or
>>> barriers.
>>>
>> If this could be done, it could remove cost from other places and avoid this
>> little dance of the galog (and avoid its usage, as it's not the greatest design
>> aspect of the IOMMU). We anyways already need to do IRT flushes in all these
>> things with regards to updating any piece of the IRTE, but we need some care
>> there too to avoid invalidating too much (which is just as expensive and per-vCPU).
>
> ...
>
>> But still quite expensive (as many IPIs as vCPUs updated), but it works as
>> intended and the guest will immediately see the right vcpu affinity. But I honestly
>> prefer going towards your suggestion (via vcpu.pcpu) if we can have some
>> insurance that vcpu.cpu is safe to use in pi_update_irte if protected against
>> preemption/blocking of the vCPU.
>
> I think we have all the necessary info, and even a handy dandy spinlock to ensure
> ordering.  Disclaimers: compile tested only, I know almost nothing about the IOMMU
> side of things, and I don't know if I understood the needs for the !IsRunning cases.
>

I was avoiding grabbing that lock, but now that I think about it, it shouldn't
do much harm. My only concern has mostly been whether we mark the IRQ
isRunning=1 on a vCPU that is about to block, as then the doorbell rung by the
IOMMU won't do anything to the guest. But IIUC the physical ID cache read-once
should cover that.

> Disclaimers aside, this should point the IOMMU at the right pCPU when the target
> vCPU changes and the new vCPU is actively running.
>

Yeap, it should. I am gonna see if we can test this next week (even if not me,
then someone from my team).

> ---
>  arch/x86/kvm/svm/avic.c | 44 +++++++++++++++++++++++++++++++++--------
>  1 file changed, 36 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index cfc8ab773025..703ad9af73eb 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -791,6 +791,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
>  	int ret = 0;
>  	unsigned long flags;
>  	struct amd_svm_iommu_ir *ir;
> +	u64 entry;
>
>  	/**
>  	 * In some cases, the existing irte is updated and re-set,
> @@ -824,6 +825,18 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
>  	ir->data = pi->ir_data;
>
>  	spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
> +	/*
> +	 * Update the target pCPU for IOMMU doorbells if the vCPU is running.
> +	 * If the vCPU is NOT running, i.e. is blocking or scheduled out, KVM
> +	 * will update the pCPU info when the vCPU is awakened and/or scheduled in.
> +	 * See also avic_vcpu_load().
> +	 */
> +	entry = READ_ONCE(*(svm->avic_physical_id_cache));
> +	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
> +		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
> +				    true, pi->ir_data);
> +

Ah! Totally forgot about the ID cache from AVIC. And it's already paired with
the barriers Maxim was alluding to. Much better than trying to get a safe read
of vcpu::cpu.

>  	list_add(&ir->node, &svm->ir_list);
>  	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>  out:
> @@ -986,10 +999,11 @@ static inline int
>  avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
>  {
>  	int ret = 0;
> -	unsigned long flags;
>  	struct amd_svm_iommu_ir *ir;
>  	struct vcpu_svm *svm = to_svm(vcpu);
>
> +	lockdep_assert_held(&svm->ir_list_lock);
> +
>  	if (!kvm_arch_has_assigned_device(vcpu->kvm))
>  		return 0;
>
> @@ -997,19 +1011,15 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
>  	 * Here, we go through the per-vcpu ir_list to update all existing
>  	 * interrupt remapping table entry targeting this vcpu.
>  	 */
> -	spin_lock_irqsave(&svm->ir_list_lock, flags);
> -
>  	if (list_empty(&svm->ir_list))
> -		goto out;
> +		return 0;
>
>  	list_for_each_entry(ir, &svm->ir_list, node) {
>  		ret = amd_iommu_update_ga(cpu, r, ir->data);
>  		if (ret)
> -			break;
> +			return ret;
>  	}
> -out:
> -	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> -	return ret;
> +	return 0;
>  }
>
>  void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> @@ -1017,6 +1027,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	u64 entry;
>  	int h_physical_id = kvm_cpu_get_apicid(cpu);
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	unsigned long flags;
>
>  	lockdep_assert_preemption_disabled();
>
> @@ -1033,6 +1044,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (kvm_vcpu_is_blocking(vcpu))
>  		return;
>
> +	/*
> +	 * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
> +	 * _currently_ have assigned devices, as that can change.  Holding
> +	 * ir_list_lock ensures that either svm_ir_list_add() will consume
> +	 * up-to-date entry information, or that this task will wait until
> +	 * svm_ir_list_add() completes to set the new target pCPU.
> +	 */
> +	spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
>  	entry = READ_ONCE(*(svm->avic_physical_id_cache));
>  	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
>
> @@ -1042,12 +1062,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>
>  	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
>  	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
> +
> +	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>  }
>
>  void avic_vcpu_put(struct kvm_vcpu *vcpu)
>  {
>  	u64 entry;
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	unsigned long flags;
>
>  	lockdep_assert_preemption_disabled();
>
> @@ -1057,10 +1080,15 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>  	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
>  		return;
>
> +	spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
>  	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
>
>  	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
>  	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
> +
> +	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> +
>  }

Other than that, it looks good. To some extent we are doing an update_ga()
just like Dengqiao was doing, but with the right protection both from the IR
perspective and a safe read of the vCPU's current pCPU.
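
Just to spell out (mostly for my own sake) the ordering we're relying on,
below is a rough user-space model of it. Purely illustrative, not KVM code:
a single-entry "ir_list", a plain mutex standing in for ir_list_lock, and
made-up names and bit layout for the physical ID entry.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define IS_RUNNING	(1ULL << 62)	/* stand-in for the IsRunning bit */
#define PCPU_MASK	0xffULL		/* stand-in for the host physical ID field */

static pthread_mutex_t ir_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ svm->ir_list_lock */
static uint64_t physical_id_entry;	/* ~ *svm->avic_physical_id_cache */
static int64_t irte_dest = -1;		/* ~ the IRTE destination the IOMMU sees */

/* ~ svm_ir_list_add(): only chase the pCPU if the vCPU is currently running */
static void ir_list_add(void)
{
	pthread_mutex_lock(&ir_lock);
	uint64_t entry = physical_id_entry;
	if (entry & IS_RUNNING)
		irte_dest = entry & PCPU_MASK;	/* ~ amd_iommu_update_ga(..., true) */
	pthread_mutex_unlock(&ir_lock);
}

/* ~ avic_vcpu_load(): publish the new pCPU and retarget the IRTEs under the lock */
static void vcpu_load(uint64_t pcpu)
{
	pthread_mutex_lock(&ir_lock);
	physical_id_entry = pcpu | IS_RUNNING;
	irte_dest = pcpu;		/* ~ avic_update_iommu_vcpu_affinity(cpu, true) */
	pthread_mutex_unlock(&ir_lock);
}

/* ~ avic_vcpu_put(): stop doorbells and clear IsRunning, again under the lock */
static void vcpu_put(void)
{
	pthread_mutex_lock(&ir_lock);
	irte_dest = -1;			/* ~ avic_update_iommu_vcpu_affinity(-1, false) */
	physical_id_entry &= ~IS_RUNNING;
	pthread_mutex_unlock(&ir_lock);
}

int main(void)
{
	vcpu_load(3);
	ir_list_add();	/* IsRunning set: the IRTE targets pCPU 3 */
	printf("irte_dest = %lld\n", (long long)irte_dest);

	vcpu_put();
	ir_list_add();	/* not running: left alone, the next vcpu_load() fixes it up */
	printf("irte_dest = %lld\n", (long long)irte_dest);
	return 0;
}

Because the updater and load/put all take the same lock, the updater either
sees the pre-load entry (and leaves the IRTE for vcpu_load() to retarget) or
the post-load one (and programs the new pCPU itself), which is exactly the
"later one wins" behaviour Maxim was asking for.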