> From: Raj, Ashok <ashok.raj@xxxxxxxxx>
> Sent: Tuesday, November 10, 2020 10:13 PM
>
> Thomas,
>
> With all these interrupt message storms ;-), I'm missing how to move
> towards an end goal.
>
> On Tue, Nov 10, 2020 at 11:27:29AM +0100, Thomas Gleixner wrote:
> > Ashok,
> >
> > On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> > > On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> > >> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> > > Approach to IMS is more of a phased approach.
> > >
> > > #1 Allow physical device to scale beyond limits of PCIe MSIx
> > >    Follows current methodology for guest interrupt programming and
> > >    evolutionary changes rather than drastic.
> >
> > Trapping MSI[X] writes is there because it allows to hand a device to an
> > unmodified guest OS and to handle the case where the MSI[X] entries
> > storage cannot be mapped exclusively to the guest.
> >
> > But aside of this, it's not required if the storage can be mapped
> > exclusively, the guest is hypervisor aware and can get a host composed
> > message via a hypercall. That works for physical functions and SRIOV,
> > but not for SIOV.
>
> It would greatly help if you can put down what you see is blocking
> to move forward in the following areas.

Agreed. We really need some guidance on how to move forward. I think
everyone in this thread is now aligned that this is not an Intel- or
IDXD-specific thing: we need an architectural solution, enabling IMS
on PF/VF is important, etc. What we are not sure about is whether we
must complete all requirements in one batch, or can evolve
step-by-step as long as the growth path is clearly defined.

IMHO, finding a way to disable IMS in a guest is more important than
supporting IMS on PF/VF, since the latter requires a hypercall that is
not available in all scenarios. Even if Linux gains hypercall support
for all existing archs and hypervisors, it could still run as an
unmodified guest on a new hypervisor before that hypervisor gets its
enlightenments into Linux. So the more pressing task is to find a way
to force the use of MSI/MSI-X inside the guest, which keeps such
PFs/VFs functional, though without the scalability merits of IMS.

If such a two-step plan can be agreed on, the next open question is
how to disable IMS in a guest. We need a sane solution before checking
in the initial host-only IMS support. Several options have been
discussed in this thread:

1. An industry standard (e.g. a vendor-agnostic ACPI flag) followed by
   all platforms, hypervisors and OSes. This requires collaboration
   beyond the Linux community.

2. IOMMU-vendor-specific tables (DMAR, IORT, etc.) reporting whether
   IMS is allowed, implying that IMS is tied to the IOMMU. This
   tradeoff is acceptable since IMS alone cannot make SIOV work; SIOV
   relies on the IOMMU anyway. This might be an easier path forward,
   as it doesn't require waiting for all vendors to extend their
   tables together. On a physical platform the firmware always reports
   IMS as 'allowed', and there is time to change that. On a virtual
   platform the hypervisor can choose to hide IMS in three ways:

   a) do not expose an IOMMU;
   b) expose an IOMMU, but using the old table format;
   c) expose an IOMMU, using the new format with IMS reported
      'disallowed'.

   a) and b) support a legacy software stack well.

   However, there is one potential issue with options 1 and 2. The
   virtual ACPI table is constructed at VM creation time, likely based
   on whether a PV interrupt controller is exposed to this guest. But
   in most cases, when the VM is created, the hypervisor doesn't know
   which guest OS will run or whether it will use the PV controller.
   If IMS is marked 'allowed' in the virtual DMAR table, an unmodified
   guest might just enable it as if it were on a native platform.
   Maybe what we really need is a flag telling the guest that although
   IMS is available, it cannot be used with traditional interrupt
   controllers?

3. Use the IOMMU 'caching mode' as a hint that we are running as a
   guest, and disable IMS by default whenever caching mode is detected
   (a rough sketch of the check follows after this list). IIRC all
   IOMMU vendors provide such a capability for constructing shadow
   IOMMU page tables. Later, when hypercall support is detected for a
   specific hypervisor/arch, that path can override the hint and
   enable IMS. Unlike the first two options, this would be a
   Linux-specific policy, but a self-contained one. Other guest OSes
   may not follow it, though.

4. Use CPUID to detect that we are running as a guest (a small demo
   also follows below). But as Thomas pointed out, this approach is
   less reliable, as not all hypervisors advertise themselves this
   way.
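To make option 3 concrete, a rough sketch of the check (assuming the
Intel IOMMU driver; cap_caching_mode() and struct intel_iommu are the
existing definitions in include/linux/intel-iommu.h, while
intel_iommu_ims_allowed() and ims_hypercall_available() are invented
names for illustration only, not existing interfaces):

#include <linux/intel-iommu.h>

/*
 * Invented hook: would report whether a hypervisor-specific IMS
 * hypercall path has been detected for this guest.
 */
bool ims_hypercall_available(void);

static bool intel_iommu_ims_allowed(struct intel_iommu *iommu)
{
	/*
	 * Caching Mode (CAP_REG bit 7) is only set by virtual IOMMUs
	 * that need the guest to report mapping changes for shadowing,
	 * so treat it as a "running as guest" hint and default IMS to
	 * off unless a hypercall path can override it.
	 */
	if (cap_caching_mode(iommu->cap))
		return ims_hypercall_available();

	/* No caching-mode vIOMMU detected: allow IMS. */
	return true;
}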
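And for option 4, the mechanism itself: in the kernel this is
essentially boot_cpu_has(X86_FEATURE_HYPERVISOR); below is just a tiny
userspace demo of the underlying bit, compilable with gcc, to show
what the check relies on:

#include <stdio.h>
#include <cpuid.h>

/*
 * CPUID leaf 1, ECX bit 31 is the "hypervisor present" bit; it is
 * reserved (0) on bare metal and set by most, but not all,
 * hypervisors -- which is exactly the reliability concern above.
 */
static int running_as_guest(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;

	return !!(ecx & (1u << 31));
}

int main(void)
{
	printf("hypervisor bit: %s\n",
	       running_as_guest() ? "set" : "clear");
	return 0;
}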
Thoughts?

Thanks
Kevin