Hi Jason

On Fri, Nov 06, 2020 at 09:14:15AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> > > The interrupt controller is responsible to create an addr/data pair
> > > for an interrupt message. It sets the message format and ensures it
> > > routes to the proper CPU interrupt handler. Everything about the
> > > addr/data pair is owned by the platform interrupt controller.
> > >
> > > Devices do not create interrupts. They only trigger the addr/data pair
> > > the platform gives them.
> >
> > I guess that we may just view it from different angles. On x86 platform,
> > a MSI/IMS capable device directly composes interrupt messages, with
> > addr/data pair filled by OS.
>
> Yes, all platforms work like that. The addr/data pair is *opaque* to
> the device. Only the platform interrupt controller component
> understands how to form those values.

True, the addr/data pair is opaque. IMS doesn't dictate what the contents
of the addr/data pair are. That is still a platform attribute. IMS simply
controls where the pair is physically stored, which only the device
dictates.

>
> > If there is no IOMMU remapping enabled in the middle, the message
> > just hits the CPU. Your description possibly is from software side,
> > e.g. describing the hierarchical IRQ domain concept?
>
> I suppose you could say that. Technically the APIC doesn't form any
> addr/data pairs, but the configuration of the APIC, IOMMU and other
> platform components define what addr/data pairs are acceptable.
>
> The IRQ domain stuff broadly puts responsibility to form these values
> in the IRQ layer which abstracts all the platform details. In Linux
> we expect the platform to provide the IRQ Domain that can specify
> working addr/data pairs.
>
> > I agree with this point, just as how pci-hyperv.c works. In concept Linux
> > guest driver should be able to use IMS when running on Hyper-v.
> > There is no such thing for KVM, but possibly one day we will need
> > similar stuff.
> > Before that happens the guest could choose to simply disallow devmsi
> > by default in the platform code (inventing a hypercall just for 'disable'
> > doesn't make sense) and ignore the IMS cap. One small open is whether
> > this can be done in one central-place. The detection of running as guest
> > is done in arch-specific code. Do we need disabling devmsi for every arch?
> >
> > But when talking about virtualization it's not good to assume the guest
> > behavior. It's perfectly sane to run a guest OS which doesn't implement
> > any PV stuff (thus don't know running in a VM) but do support IMS. In
> > such scenario the IMS cap allows the hypervisor to educate the guest
> > driver to use MSI instead of IMS, as long as the driver follows the device
> > spec. In this regard I don't think that the IMS cap will be a short-term
> > thing, although Linux may choose to not use it.
>
> The IMS flag belongs in the platform not in the devices.

This support is mostly a SW thing, right? We don't need to muck with
platform/ACPI for that matter.

>
> For instance you could put a "disable IMS" flag in the ACPI tables, in
> the config space of the emulated root port, or any other areas that
> clearly belong to the platform.

Maybe there is a different interpretation for IMS that I'm missing. Some
devices need more interrupt support than the PCIe standards provide, and
how a device organizes the storage for the addr/data pairs is a device
attribute.

I missed why ACPI tables should carry such information. If the kernel
doesn't want to support those devices, that is within the kernel's
control: the kernel will then use only the available MSIx interfaces.
This is legacy support.

>
> The OS logic would be
>  - If no IMS information found then use IMS (Bare metal)
>  - If the IMS disable flag is found then
>    - If (future) hypercall available and the OS knows how to use it
>      then use IMS
>    - If no hypercall found, or no OS knowledge, fail IMS
>
> Our devices can use IMS even in a pure no-emulation

This is true for IMS as well, but it is probably not implemented in the
kernel as such. From a HW point of view (take idxd for instance) the
facility is available to the native OS as well. The early RFC supported
this for native.

Native devices can have both MSIx and IMS capability. But as I understand
it, this isn't how we have partitioned things in SW today. We left IMS
only for mdevs. And I agree this would be very useful, for example in
cases where we want to support interrupt handles for user space
notification (when the application specifies that in the descriptor).
Those could be IMS. The device HW has support for it.

Remember the "Why PASID in IMS entry" discussion?

https://lore.kernel.org/lkml/20201008233210.GH4734@xxxxxxxxxx/

Cheers,
Ashok
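
For reference, the OS logic Jason outlines in the quoted list above could
be sketched roughly as below. This is only an illustration of that
decision tree in plain userspace C. The struct, enum and function names
are hypothetical and are not existing kernel interfaces, and where the
"disable IMS" flag would actually come from (ACPI, root port config
space, ...) is left open.

```c
#include <stdbool.h>

/*
 * Hypothetical summary of what the platform told the OS.
 * None of these fields correspond to a real kernel structure.
 */
struct ims_platform_info {
	bool disable_flag_found;   /* platform carries a "disable IMS" flag */
	bool hypercall_available;  /* (future) hypercall for addr/data pairs */
	bool os_knows_hypercall;   /* OS knows how to use that hypercall */
};

enum ims_decision {
	IMS_USE,
	IMS_FAIL,
};

/* Walk the quoted decision tree, top to bottom. */
static enum ims_decision ims_decide(const struct ims_platform_info *info)
{
	/* No IMS information found: assume bare metal, use IMS. */
	if (!info->disable_flag_found)
		return IMS_USE;

	/* Flag found, but a usable hypercall exists: use IMS. */
	if (info->hypercall_available && info->os_knows_hypercall)
		return IMS_USE;

	/* No hypercall, or no OS knowledge of it: fail IMS. */
	return IMS_FAIL;
}
```

With this shape, a guest that sees the flag and has no hypercall ends up
at IMS_FAIL, which is the case where falling back to the available MSIx
interfaces would apply.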