RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()


 



> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Monday, December 13, 2021 7:37 AM
> 
> On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> > Kevin,
> >
> > On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> > >> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> > >> vIR related there.
> > >
> > > Well, virtio-iommu is a para-virtualized vIOMMU implementation.
> > >
> > > In reality there are also fully emulated vIOMMU implementations (e.g.
> > > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > > the IR logic in existing iommu drivers just applies:
> > >
> > > 	drivers/iommu/intel/irq_remapping.c
> > > 	drivers/iommu/amd/iommu.c
> >
> > thanks for the explanation. So that's a full IOMMU emulation. I was more
> > expecting a paravirtualized lightweight one.
> 
> Kevin can you explain what on earth vIR is for and how does it work??
> 
> Obviously we don't expose the IR machinery to userspace, so at best
> this is somehow changing what the MSI trap does?
> 

Initially vIR was introduced to support more than 255 vCPUs. Since it is
fully emulated, this capability can also support the other IR usages
observed on bare metal.

vIR doesn't rely on the presence of physical IR.

First, if the guest has no vfio device, then the physical capability
doesn't matter at all.

Even with a vfio device, IR by definition is only about remapping, not
injection (more on that part later). Interrupts are always routed to the
host handler first (vfio_msihandler() in this case), which then signals
the irqfd to invoke the virtual interrupt injection handler
(irqfd_wakeup()) in kvm.
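
For reference, the host-side plumbing is roughly the following (a
paraphrased sketch of the MSI hard-irq handler in
drivers/vfio/pci/vfio_pci_intrs.c; all it does is kick the eventfd that
kvm's irqfd machinery waits on):

/*
 * Paraphrased sketch: vfio's MSI handler only signals the eventfd that
 * was handed over at SET_IRQS time; kvm's irqfd wait-queue callback
 * (irqfd_wakeup()) then runs off that same eventfd.
 */
#include <linux/interrupt.h>
#include <linux/eventfd.h>

static irqreturn_t vfio_msihandler(int irq, void *arg)
{
	struct eventfd_ctx *trigger = arg;

	eventfd_signal(trigger, 1);
	return IRQ_HANDLED;
}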

This suggests a clear role split between vfio and kvm:

  - vfio is responsible for irq allocation/startup, as it is the device driver;
  - kvm takes care of virtual interrupt injection, being the VMM;

The two are connected via irqfd.
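
In userspace terms that connection is just the KVM_IRQFD ioctl, e.g. (a
minimal sketch; vm_fd, efd and virq are whatever Qemu has at hand):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Tie an eventfd to a guest GSI: once registered, a signal on 'efd'
 * (e.g. from vfio's MSI handler) makes kvm inject 'virq' into the guest.
 */
static int connect_irqfd(int vm_fd, int efd, unsigned int virq)
{
	struct kvm_irqfd irqfd;

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = efd;
	irqfd.gsi = virq;
	return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}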

Following this split, vIR information is completely contained within
userspace. Qemu calculates the routing between vGSI and vCPU (with or
without vIR, and for whatever trapped interrupt storage) and then
registers it with kvm.
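
That registration boils down to KVM_SET_GSI_ROUTING, roughly as below (a
minimal sketch installing a single MSI route; in reality Qemu re-registers
the full routing table, since the ioctl replaces it wholesale):

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Register one vGSI -> vCPU route with kvm.  'addr'/'data' encode the
 * target vCPU/vector as computed by Qemu from the trapped interrupt
 * storage (after vIR translation, if any).
 */
static int set_msi_route(int vm_fd, unsigned int gsi, uint64_t addr,
			 uint32_t data)
{
	struct kvm_irq_routing *r;
	int ret;

	r = calloc(1, sizeof(*r) + sizeof(r->entries[0]));
	if (!r)
		return -1;
	r->nr = 1;
	r->entries[0].gsi = gsi;
	r->entries[0].type = KVM_IRQ_ROUTING_MSI;
	r->entries[0].u.msi.address_lo = (uint32_t)addr;
	r->entries[0].u.msi.address_hi = (uint32_t)(addr >> 32);
	r->entries[0].u.msi.data = data;
	ret = ioctl(vm_fd, KVM_SET_GSI_ROUTING, r);
	free(r);
	return ret;
}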

When kvm receives a notification via irqfd, it follows irqfd->vGSI->vCPU
and injects a virtual interrupt into the target vCPU.

Then comes an interesting scenario: IOMMU posted interrupts (PI). This
capability allows the IR engine to directly convert a physical interrupt
into a virtual one and inject it into the guest, essentially offloading
the virtual routing information into the hardware.

This is currently achieved via the IRQ bypass manager, which connects
vfio (the IRQ producer) to kvm (the IRQ consumer) around a specific
Linux irq number. Once the connection is established, kvm calls
irq_set_vcpu_affinity() to update the IRTE with the virtual routing
information for that irq number.
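
On the kvm side that update is roughly the following (paraphrasing the
x86/VMX posted-interrupt path; the helper name below is made up, but
struct vcpu_data and irq_set_vcpu_affinity() are the real x86 contract):

/*
 * Sketch of kvm's consumer side: once the bypass manager has paired the
 * producer (vfio) and consumer (kvm) for 'host_irq', kvm pushes the
 * guest routing (PI descriptor address + guest vector) down into the
 * IRTE via irq_set_vcpu_affinity().
 */
#include <linux/interrupt.h>
#include <asm/irq_remapping.h>		/* struct vcpu_data */

static int pi_program_irte(unsigned int host_irq, u64 pi_desc_pa,
			   u32 guest_vector)
{
	struct vcpu_data vcpu_info = {
		.pi_desc_addr	= pi_desc_pa,
		.vector		= guest_vector,
	};

	return irq_set_vcpu_affinity(host_irq, &vcpu_info);
}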

With that design, Qemu doesn't know whether IR or PI is enabled
physically. It always talks to vfio to have the IRQ resources allocated
and to kvm to register the virtual routing information.

Now add the new hypercall machinery to this picture (a rough sketch of
the resulting userspace sequence follows the list):

  1) The hypercall needs to carry all the necessary virtual routing
     information, since there is no trap;

  2) Before querying the IRTE data/pair, Qemu needs to complete the same
     operations as today to have the IRTE ready:

	a) Register the irqfd and related GSI routing info with kvm;
	b) Allocate/start up the IRQs via vfio;

     When PI is enabled, the IRTE is ready only after both are completed.

  3) Qemu gets the IRTE data/pair from the kernel and returns it to the
     guest.
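
Put together, the userspace sequence would look roughly like this (a
sketch only, reusing the helpers from the earlier snippets;
query_irte_for_guest() is a purely hypothetical placeholder since no
such interface exists today):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical placeholder for the IRTE query discussed in 3). */
extern int query_irte_for_guest(int device_fd, unsigned int virq);

static int setup_guest_msi(int vm_fd, int device_fd, int efd,
			   unsigned int virq, uint64_t vaddr, uint32_t vdata)
{
	struct vfio_irq_set *is;
	int32_t trigger = efd;
	int ret;

	/* 2a) register the irqfd and GSI routing with kvm (as today) */
	if (connect_irqfd(vm_fd, efd, virq) < 0 ||
	    set_msi_route(vm_fd, virq, vaddr, vdata) < 0)
		return -1;

	/* 2b) allocate/start up the IRQ via vfio (as today), handing the
	 *     same eventfd over as the MSI-X trigger */
	is = calloc(1, sizeof(*is) + sizeof(trigger));
	if (!is)
		return -1;
	is->argsz = sizeof(*is) + sizeof(trigger);
	is->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	is->index = VFIO_PCI_MSIX_IRQ_INDEX;
	is->start = 0;
	is->count = 1;
	memcpy(is->data, &trigger, sizeof(trigger));
	ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, is);
	free(is);
	if (ret < 0)
		return ret;

	/* 3) only now (with PI enabled) is the IRTE ready, so query the
	 *    IRTE data/pair to hand back to the guest -- hypothetical */
	return query_irte_for_guest(device_fd, virq);
}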

Thanks
Kevin



