Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach



On 2024/4/24 10:57, Tian, Kevin wrote:
From: Jason Gunthorpe <jgg@xxxxxxxxxx>
Sent: Wednesday, April 24, 2024 8:12 AM

On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe <jgg@xxxxxxxxxx>
Sent: Tuesday, April 23, 2024 8:02 PM

On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
I'm not sure how userspace can fully handle this w/o certain assistance
from the kernel.

So I kind of agree that emulated PASID capability is probably the only
contract which the kernel should provide:
   - mapped 1:1 at the physical location, or
   - constructed at an offset according to DVSEC, or
   - constructed at an offset according to a look-up table

The VMM always scans the vfio PCI config space to expose vPASID.

Then the remaining open is what VMM could do when a VF supports
PASID but unfortunately it's not reported by vfio. W/o the capability
of inspecting the PASID state of PF, probably the only feasible option
is to maintain a look-up table in the VMM itself and assume that the
kernel always enables the PASID cap on the PF.

I'm still not sure I like doing this in the kernel - we need to do the
same sort of thing for ATS too, right?

VF is allowed to implement ATS.

PRI has the same problem as PASID.

I'm surprised by this, I would have guessed ATS would be the device
global one, PRI not being per-VF seems problematic??? How do you
disable PRI generation to get a clean shutdown?

Here is what the PCIe spec says:

   For SR-IOV devices, a single Page Request Interface is permitted for
   the PF and is shared between the PF and its associated VFs, in which
   case the PF implements this capability and its VFs must not.

I'll let Baolu chime in on the potential impact to his PRI cleanup
effort, e.g. whether disabling PRI generation is mandatory if the
IOMMU side has already been put in a mode that auto-responds with an
error to new PRI requests instead of reporting them to software.

The PRI cleanup steps are defined like this:

 * - Disable new PRI reception: Turn off PRI generation in the IOMMU hardware
 *   and flush any hardware page request queues. This should be done before
 *   calling into this helper.
 * - Acknowledge all outstanding PRQs to the device: Respond to all outstanding
 *   page requests with IOMMU_PAGE_RESP_INVALID, indicating the device should
 *   not retry. This helper function handles this.
 * - Disable PRI on the device: After calling this helper, the caller could
 *   then disable PRI on the device.

Disabling PRI on the device is the last step and optional because the
IOMMU is required to support a PRI blocking state and has already been
put in that state at the first step.

For the VF case, it probably is a no-op except for maintaining a
reference count. Once PRI is disabled on all PFs and VFs, it can then be
physically disabled on the PF.

But I do see another problem for shared capabilities between PF/VFs.

Now those shared capabilities are enabled/disabled when the PF is
attached to/detached from a domain, w/o counting the shared usage
from VFs.

Looks like we have a gap here.

Yes, there's a gap at least for the Intel IOMMU driver. I'll soon fix
this by moving the handling of ATS out of the driver, especially from
the default domain attach/detach paths.

Best regards,
