Re: [PATCH RFC/RFT] vfio/pci: Create feature to disable MSI virtualization

On Tue, 2024-08-13 at 19:30 +0200, Thomas Gleixner wrote:
> On Tue, Aug 13 2024 at 13:30, Jason Gunthorpe wrote:
> > On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> > > vfio-pci has always virtualized the MSI address and data registers as
> > > MSI programming is performed through the SET_IRQS ioctl.  Often this
> > > virtualization is not used, and in specific cases can be unhelpful.
> > > 
> > > One such case where the virtualization is a hindrance is when the
> > > device contains an onboard interrupt controller programmed by the guest
> > > driver.  Userspace VMMs have a chance to quirk this programming,
> > > injecting the host physical MSI information, but only if the userspace
> > > driver can get access to the host physical address and data registers.
> > > 
> > > This introduces a device feature which allows the userspace driver to
> > > disable virtualization of the MSI capability address and data registers
> > > in order to provide read-only access to the physical values.
> > 
> > Personally, I very much dislike this. Encouraging such hacky driver
> > use of the interrupt subsystem is not a good direction. Enabling this
> > in VMs will further complicate fixing the IRQ usages in these drivers
> > over the long run.
> > 
> > If the device has its own interrupt sources then the device needs to
> > create an irq_chip and related infrastructure and hook them up
> > properly, not hackily read the MSI-X registers and write them
> > someplace else.
> > 
> > Thomas Gleixner has done a lot of great work recently to clean this up.
> > 
> > So if you imagine the driver is fixed, then this is not necessary.
> 
> Yes. I looked at the ath11k driver when I was reworking the PCI/MSI
> subsystem and that's a perfect candidate for a proper device specific
> interrupt domain to replace the horrible MSI hackery it has.

The ath11k hacks may be awful, but in their defence, that's because the
whole way the hardware works is awful.

Q: With PCI passthrough to a guest, how does the guest OS tell the
device where to do DMA?

A: The guest OS just hands the device a guest physical address and the
IOMMU does the rest. Nothing 'intercedes' between the guest and the
device to mess with that address.
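
(Concretely, from the VMM side, that mapping is set up once: a minimal
sketch using the Linux vfio type1 UAPI, with error handling omitted and
the fd/pointer names invented for illustration. The guest-physical
address simply becomes the IOVA:)

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map the host memory backing guest RAM so that the device's view of
 * the address space (the IOVA) is guest-physical.  After this, the
 * IOMMU translates every DMA the device issues. */
static int map_guest_ram(int container_fd, void *host_backing,
                         unsigned long long gpa, unsigned long long size)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (unsigned long long)(uintptr_t)host_backing;
        map.iova  = gpa;        /* what the guest hands to the device */
        map.size  = size;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}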

Q: MSIs are just DMA. So with PCI passthrough to a guest, how does the
guest OS configure the device's MSIs? 

<fantasy>
A: The guest OS just hands the device a standard MSI message encoding
the target guest APIC ID and vector (etc.), and the IOMMU does the
rest. Nothing 'intercedes' between the guest and the device to mess
with that MSI message.
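
(For contrast, today's real compatibility-format message on x86 looks
roughly like this, per the Intel SDM; a sketch, helper names mine.
Note the destination APIC ID gets exactly 8 bits, which is the root of
the >255 CPU whining further down:)

#include <stdint.h>

/* Compatibility-format MSI address: 0xFEExxxxx, with the destination
 * APIC ID squeezed into address bits 19:12. */
static inline uint64_t msi_compat_addr(uint8_t dest_apic_id)
{
        return 0xfee00000ULL | ((uint64_t)dest_apic_id << 12);
}

/* Message data for fixed delivery mode, edge-triggered: just the
 * vector number in bits 7:0. */
static inline uint32_t msi_compat_data(uint8_t vector)
{
        return vector;
}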

And thus ath11k didn't need to do *any* hacks to work around a stupid
hardware design, because the VMM wasn't snooping on stuff it ideally
shouldn't have had any business touching in the first place.

Posted interrupts are almost the *default*, because the IOMMU receives
a <source-id, vCPU APIC ID, vector> tuple on the bus. Only when an
interrupt arrives for a vCPU which isn't currently running does the
IOMMU set a bit in a table somewhere and notify the host OS.

All that special-case MSI handling and routing code, which gave me
nightmares because it fell through a wormhole from a parallel
universe, doesn't exist.

And look, DPDK drivers which run in polling mode and 'abuse' MSIs by
using real memory addresses and asking the device to "write <these> 32
bits to <this> structure if you want attention" just work nicely in
virtual machines too, just as they do on real hardware.
</fantasy>

/me wakes up...

Shit.

And we have to enable this Interrupt Remapping crap even to address
more than 255 CPUs *without* virtualization? Even a *guest* has to see
a virtual IOMMU and enable Interrupt Remapping to be able to use more
than 255 vCPUs? Even though there was a metric shitload of spare bits
in the MSI message we could have used¹.

Wait, so that means we have to offer an IOMMU with *DMA* remapping to
guests, which means 2-stage translations and/or massive overhead, just
for that guest to be able to use >255 vCPUs?

Screw you all, I'm going back to bed.



¹ And *should* use, if we ever do something similar, such as expanding
  the vector# space past 8 bits. Intel and AMD, take note.
