On 13/11/2024 9:11 pm, Alex Williamson wrote:
> On Tue, 12 Nov 2024 21:34:30 -0400
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>> On Tue, Nov 12, 2024 at 01:54:58PM -0800, Nicolin Chen wrote:
>>> On Mon, Nov 11, 2024 at 01:09:20PM +0000, Robin Murphy wrote:
>>>> On 2024-11-09 5:48 am, Nicolin Chen wrote:
>>>>> To solve this problem, the VMM should capture the MSI IOVA allocated
>>>>> by the guest kernel and relay it to the GIC driver in the host
>>>>> kernel, so that the correct MSI IOVA gets programmed. This requires
>>>>> a new ioctl via VFIO.
>>>> Once VFIO has that information from userspace, though, do we really need
>>>> the whole complicated dance to push it right down into the irqchip layer
>>>> just so it can be passed back up again? AFAICS
>>>> vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already explicitly
>>>> rewrites MSI-X vectors, so it seems like it should be pretty
>>>> straightforward to override the message address in general at that
>>>> level, without the lower layers having to be aware at all, no?
>>> Didn't see that clearly!! It works with the following simple override:
>>> --------------------------------------------------------------------
>>> @@ -497,6 +497,10 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>>>  		struct msi_msg msg;
>>>  
>>>  		get_cached_msi_msg(irq, &msg);
>>> +		if (vdev->msi_iovas) {
>>> +			msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
>>> +			msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
>>> +		}
>>>  		pci_write_msi_msg(irq, &msg);
>>>  	}
>>> --------------------------------------------------------------------
>>> With that, I think we only need one VFIO change for this part :)
>> Wow, is that really OK from a layering perspective? The comment is
>> pretty clear on the intention that this is to resync the irq layer
>> view of the device with the physical HW.
>>
>> Editing the msi_msg while doing that resync smells bad.
>>
>> Also, this is only doing MSI-X, we should include normal MSI as
>> well. (it probably should have a resync too?)
> This was added for a specific IBM HBA that clears the vector table
> during a built-in self test, so it's possible the MSI table being in
> config space never had the same issue, or we just haven't encountered
> it. I don't expect anything else actually requires this.

Yeah, I wasn't really suggesting to literally hook into this exact case;
it was more just a general observation that if VFIO already has one
justification for tinkering with pci_write_msi_msg() directly without
going through the msi_domain layer, then adding another (wherever it
fits best) can't be *entirely* unreasonable.

At the end of the day, the semantic here is that VFIO does know more
than the IRQ layer, and does need to program the endpoint differently
from what the irqchip assumes, so I don't see much benefit in dressing
that up more than functionally necessary.
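
For what it's worth, covering the plain MSI case Jason mentions looks
like it would just mean doing that resync unconditionally rather than
only for MSI-X; something like the below (untested sketch, reusing the
hypothetical vdev->msi_iovas array from Nicolin's hack above):

--------------------------------------------------------------------
	/*
	 * Sketch only: resync the cached message for both MSI and
	 * MSI-X whenever an override is registered, instead of only
	 * inside the existing if (msix) branch. Both paths end up in
	 * pci_write_msi_msg(), which handles either flavour.
	 */
	if (msix || vdev->msi_iovas) {
		struct msi_msg msg;

		get_cached_msi_msg(irq, &msg);
		if (vdev->msi_iovas) {
			msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
			msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
		}
		pci_write_msi_msg(irq, &msg);
	}
--------------------------------------------------------------------
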
>> I'd want Thomas/Marc/Alex to agree.. (please read the cover letter for
>> context)
> It seems suspect to me too. In a sense it is still just synchronizing
> the MSI address, but to a different address space.
>
> Is it possible to do this with the existing write_msi_msg callback on
> the msi descriptor? For instance we could simply translate the msg
> address and call pci_write_msi_msg() (while avoiding an infinite
> recursion). Or maybe there should be an xlate_msi_msg callback we can
> register. Or I suppose there might be a way to insert an irqchip that
> does the translation on write. Thanks,

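For concreteness, I'd imagine that msi_desc route would look vaguely
like the below - untested, with the vfio_msi_xlate naming invented for
illustration, and all locking elided. AFAICS __pci_write_msi_msg()
caches the message and then invokes desc->write_msi_msg, so a guard
flag is one way to break the recursion you mention:

--------------------------------------------------------------------
struct vfio_msi_xlate {
	u64	iova;		/* userspace-provided doorbell IOVA */
	bool	writing;	/* breaks the write->callback->write loop */
};

static void vfio_msi_xlate_write(struct msi_desc *desc, void *data)
{
	struct vfio_msi_xlate *xlate = data;
	struct msi_msg msg = desc->msg;	/* message as just cached */

	if (xlate->writing)
		return;

	msg.address_lo = lower_32_bits(xlate->iova);
	msg.address_hi = upper_32_bits(xlate->iova);

	xlate->writing = true;
	pci_write_msi_msg(desc->irq, &msg);
	xlate->writing = false;
}

static void vfio_msi_install_xlate(unsigned int irq,
				   struct vfio_msi_xlate *xlate)
{
	struct msi_desc *desc = irq_get_msi_desc(irq);

	desc->write_msi_msg = vfio_msi_xlate_write;
	desc->write_msi_msg_data = xlate;
}
--------------------------------------------------------------------
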
I'm far from keen on the idea, but if there really is an appetite for
more indirection, then I guess the least-worst option would be yet
another type of iommu_dma_cookie to work via the existing
iommu_dma_compose_msi_msg() flow, with some interface for VFIO to update
per-device addresses directly. But then it's still going to need some
kind of "layering violation" for VFIO to poke the IRQ layer into
re-composing and re-writing a message whenever userspace feels like
changing an address, because we're fundamentally stepping outside the
established lifecycle of a kernel-managed IRQ around which said layering
was designed...
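
To be concrete, the shape I have in mind is no more than something like
this (entirely hypothetical - neither the cookie type nor the helpers
below exist today, and offset handling within the doorbell page is
glossed over):

--------------------------------------------------------------------
/* A per-device cookie holding the userspace-provided MSI address */
struct iommu_dma_msi_cookie {
	dma_addr_t	msi_iova;	/* doorbell IOVA the guest expects */
};

/* VFIO-facing: install or update the address for a device */
int iommu_dma_set_msi_iova(struct device *dev, dma_addr_t iova);

/*
 * ...which iommu_dma_compose_msi_msg() would then prefer over the
 * usual kernel-managed MSI page when composing the message:
 */
static void msi_cookie_compose(struct iommu_dma_msi_cookie *cookie,
			       struct msi_msg *msg)
{
	msg->address_hi = upper_32_bits(cookie->msi_iova);
	msg->address_lo = lower_32_bits(cookie->msi_iova);
}
--------------------------------------------------------------------
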
Thanks,
Robin.