On 16/02/2023 11:42, Suthikulpanit, Suravee wrote: > On 2/8/2023 8:19 PM, Joao Martins wrote: >> On KVM GSI routing table updates, specially those where they have vIOMMUs >> with interrupt remapping enabled (e.g. to boot >255vcpus guests without >> relying on KVM_FEATURE_MSI_EXT_DEST_ID), a VMM may update the backing VF >> MSIs with new VCPU affinities. >> >> On AMD this translates to calls to amd_ir_set_vcpu_affinity() and >> eventually to amd_iommu_{de}activate_guest_mode() with a new GATag >> outlining the VM ID and (new) VCPU ID. On vCPU blocking and unblocking >> paths it disables AVIC, and rely on GALog to convey the wakeups to any >> sleeping vCPUs. KVM will store a list of GA-mode IR entries to each >> running/blocked vCPU. So any vCPU Affinity update to a VF interrupt happen >> via KVM, and it will change already-configured-guest-mode IRTEs with a new >> GATag. > > Could we simplify this paragraph to: > > On AMD with AVIC enabled, the new vcpu affinity info is updated via: > avic_pi_update_irte() > irq_set_vcpu_affinity() > amd_ir_set_vcpu_affinity() > amd_iommu_{de}activate_guest_mode() > > where the IRTE[GATag] is updated with the new vcpu affinity. The GATag contains > VM ID and VCPU ID, and is used by IOMMU hardware to signal KVM (via GALog) when > interrupt cannot be delivered due to vCPU is in blocking state. > Will change for v2. >> The issue is that amd_iommu_activate_guest_mode() will essentially only >> change IRTE fields on transitions from non-guest-mode to guest-mode and >> otherwise returns *with no changes to IRTE* on already configured >> guest-mode interrupts. To the guest this means that the VF interrupts >> remain affined to the first vCPU these were first configured, and guest >> will be unable to either VF interrupts and receive messages like this from >> spurious interrupts (e.g. from waking the wrong vCPU in GALog): >> >> [ 167.759472] __common_interrupt: 3.34 No irq handler for vector >> [ 230.680927] mlx5_core 0000:00:02.0: mlx5_cmd_eq_recover:247:(pid >> 3122): Recovered 1 EQEs on cmd_eq >> [ 230.681799] mlx5_core 0000:00:02.0: >> wait_func_handle_exec_timeout:1113:(pid 3122): cmd[0]: CREATE_CQ(0x400) >> recovered after timeout >> [ 230.683266] __common_interrupt: 3.34 No irq handler for vector >> >> Given that amd_ir_set_vcpu_affinity() uses amd_iommu_activate_guest_mode() >> underneath it essentially means that VCPU affinity changes of IRTEs are >> nops if it was called once for the IRTE already (on VMENTER). Fix it by >> dropping the check for guest-mode at amd_iommu_activate_guest_mode(). Same >> thing is applicable to amd_iommu_deactivate_guest_mode() although, even if >> the IRTE doesn't change underlying DestID on the host, the VFIO IRQ handler >> will still be able to poke at the right guest-vCPU. >> >> Fixes: b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC (de-)activation >> code") >> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx> >> --- >> Some notes in other related flaws as I looked at this: >> >> 1) amd_iommu_deactivate_guest_mode() suffers from the same issue as this patch, >> but it should only matter for the case where you rely on irqbalance-like >> daemons balancing VFIO IRQs in the hypervisor. Though, it doesn't translate >> into guest failures, more like performance "misdirection". Happy to fix it, if >> folks also deem it as a problem. >> >> 2) This patch doesn't attempt at changing semantics around what >> amd_iommu_activate_guest_mode() has been doing for a long time [since v5.4] >> (i.e. clear the whole IRTE and then changes its fields). As such when >> updating the IRTEs the interrupts get isRunning and DestId cleared, thus >> we rely on the GALog to inject IRQs into vCPUs /until/ the vCPUs block >> and unblock again (which is when they update the IOMMU affinity), or the >> AVIC gets momentarily disabled. I have patches that improve this part as a >> follow-up, but I thought that this patch had value on its own onto fixing >> what has been broken since v5.4 ... and that it could be easily carried >> to stable trees. >> >> --- >> drivers/iommu/amd/iommu.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c >> index cbeaab55c0db..afe1f35a4dd9 100644 >> --- a/drivers/iommu/amd/iommu.c >> +++ b/drivers/iommu/amd/iommu.c >> @@ -3476,7 +3476,7 @@ int amd_iommu_activate_guest_mode(void *data) >> u64 valid; >> if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) || >> - !entry || entry->lo.fields_vapic.guest_mode) >> + !entry) >> return 0; >> valid = entry->lo.fields_vapic.valid; > > Apart from the commit message change: > > Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@xxxxxxx> > Thanks! With respect to my notes please ignore item 1 as it's not a problem as far as I repro-ed. With respect to item 2 in the notes, I have the diff below to address it under testing, which avoids the inneficiency of marking an IRTE as isRunning=0 (so avoid relying on the GALog after the IRTE update). This is so far 2 additional patches. Essentially it boils down to fixing amd_iommu_update_ga() to avoid @entry and @ref going out of sync and then switching only the GAVector and GATag while not touching the DestID and isRunning bit (which are more tied to the running CPU). @@ -4395,14 +4395,17 @@ int amd_iommu_activate_guest_mode(void *data) !entry) return 0; - valid = entry->lo.fields_vapic.valid; + if (!entry->lo.fields_vapic.guest_mode) { + valid = entry->lo.fields_vapic.valid; - entry->lo.val = 0; - entry->hi.val = 0; + entry->lo.val = 0; + entry->hi.val = 0; + + entry->lo.fields_vapic.valid = valid; + entry->lo.fields_vapic.guest_mode = 1; + entry->lo.fields_vapic.ga_log_intr = 1; + } - entry->lo.fields_vapic.valid = valid; - entry->lo.fields_vapic.guest_mode = 1; - entry->lo.fields_vapic.ga_log_intr = 1; entry->hi.fields.ga_root_ptr = ir_data->ga_root_ptr; entry->hi.fields.vector = ir_data->ga_vector; entry->lo.fields_vapic.ga_tag = ir_data->ga_tag; @@ -4579,6 +4582,7 @@ int amd_iommu_create_irq_domain(struct amd_iommu *iommu) int amd_iommu_update_ga(int cpu, bool is_run, void *data) { + int ret; unsigned long flags; struct amd_iommu *iommu; struct irq_remap_table *table; @@ -4601,15 +4605,18 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data) raw_spin_lock_irqsave(&table->lock, flags); - if (ref->lo.fields_vapic.guest_mode) { + if (entry->lo.fields_vapic.guest_mode) { if (cpu >= 0) { - ref->lo.fields_vapic.destination = + entry->lo.fields_vapic.destination = APICID_TO_IRTE_DEST_LO(cpu); - ref->hi.fields.destination = + entry->hi.fields.destination = APICID_TO_IRTE_DEST_HI(cpu); } - ref->lo.fields_vapic.is_run = is_run; - barrier(); + entry->lo.fields_vapic.is_run = is_run; + ret = cmpxchg_double(&ref->lo.val, &ref->hi.val, + ref->lo.val, ref->hi.val, + entry->lo.val, entry->hi.val); + WARN_ON(!ret); }