On Thu, 25 Apr 2024 17:26:54 +0800 Yi Liu <yi.l.liu@xxxxxxxxx> wrote: > On 2024/4/25 02:24, Alex Williamson wrote: > > On Tue, 23 Apr 2024 21:12:21 -0300 > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > >> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > >>>> From: Jason Gunthorpe <jgg@xxxxxxxxxx> > >>>> Sent: Tuesday, April 23, 2024 8:02 PM > >>>> > >>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: > >>>>> I'm not sure how userspace can fully handle this w/o certain assistance > >>>>> from the kernel. > >>>>> > >>>>> So I kind of agree that emulated PASID capability is probably the only > >>>>> contract which the kernel should provide: > >>>>> - mapped 1:1 at the physical location, or > >>>>> - constructed at an offset according to DVSEC, or > >>>>> - constructed at an offset according to a look-up table > >>>>> > >>>>> The VMM always scans the vfio pci config space to expose vPASID. > >>>>> > >>>>> Then the remaining open is what VMM could do when a VF supports > >>>>> PASID but unfortunately it's not reported by vfio. W/o the capability > >>>>> of inspecting the PASID state of PF, probably the only feasible option > >>>>> is to maintain a look-up table in VMM itself and assumes the kernel > >>>>> always enables the PASID cap on PF. > >>>> > >>>> I'm still not sure I like doing this in the kernel - we need to do the > >>>> same sort of thing for ATS too, right? > >>> > >>> VF is allowed to implement ATS. > >>> > >>> PRI has the same problem as PASID. > >> > >> I'm surprised by this, I would have guessed ATS would be the device > >> global one, PRI not being per-VF seems problematic??? How do you > >> disable PRI generation to get a clean shutdown? > >> > >>>> It feels simpler if the indicates if PASID and ATS can be supported > >>>> and userspace builds the capability blocks. > >>> > >>> this routes back to Alex's original question about using different > >>> interfaces (a device feature vs. PCI PASID cap) for VF and PF. > >> > >> I'm not sure it is different interfaces.. > >> > >> The only reason to pass the PF's PASID cap is to give free space to > >> the VMM. If we are saying that gaps are free space (excluding a list > >> of bad devices) then we don't acutally need to do that anymore. > > > > Are we saying that now?? That's new. > > > >> VMM will always create a synthetic PASID cap and kernel will always > >> suppress a real one. > >> > >> An iommufd query will indicate if the vIOMMU can support vPASID on > >> that device. > >> > >> Same for all the troublesome non-physical caps. > >> > >>>> There are migration considerations too - the blocks need to be > >>>> migrated over and end up in the same place as well.. > >>> > >>> Can you elaborate what is the problem with the kernel emulating > >>> the PASID cap in this consideration? > >> > >> If the kernel changes the algorithm, say it wants to do PASID, PRI, > >> something_new then it might change the layout > >> > >> We can't just have the kernel decide without also providing a way for > >> userspace to say what the right layout actually is. :\ > > > > The capability layout is only relevant to migration, right? A variant > > driver that supports migration is a prerequisite and would also be > > responsible for exposing the PASID capability. This isn't as disjoint > > as it's being portrayed. > > > >>> Does it talk about a case where the devices between src/dest are > >>> different versions (but backward compatible) with different unused > >>> space layout and the kernel approach may pick up different offsets > >>> while the VMM can guarantee the same offset? > >> > >> That is also a concern where the PCI cap layout may change a bit but > >> they are still migration compatible, but my bigger worry is that the > >> kernel just lays out the fake caps in a different way because the > >> kernel changes. > > > > Outside of migration, what does it matter if the cap layout is ^^^^^^^^^^^^^^^^^^^^ > > different? A driver should never hard code the address for a > > capability. > > > > But it may store the offset of capability to make next cap access more > convenient. I noticted struct pci_dev stores the offset of PRI and PASID > cap. So if the layout of config space changed between src and dst, it may > result in problem in guest when guest driver uses the offsets to access > PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi, > msix). So there is already a problem even put aside the PRI and PASID cap. Yes, I had noted "outside of migration" above. Config space must be consistent to a running VM. But the possibility of config space changing like this only exists in the case where the driver supports migration, so I think we're inventing an unrealistic concern that a driver that supports migration would arbitrarily modify the config space layout in order to make an argument for VMM managed layout. Thanks, Alex > #ifdef CONFIG_PCI_PRI > u16 pri_cap; /* PRI Capability offset */ > u32 pri_reqs_alloc; /* Number of PRI requests allocated */ > unsigned int pasid_required:1; /* PRG Response PASID Required */ > #endif > #ifdef CONFIG_PCI_PASID > u16 pasid_cap; /* PASID Capability offset */ > u16 pasid_features; > #endif > #ifdef CONFIG_PCI_P2PDMA > struct pci_p2pdma __rcu *p2pdma; > #endif > #ifdef CONFIG_PCI_DOE > struct xarray doe_mbs; /* Data Object Exchange mailboxes */ > #endif > u16 acs_cap; /* ACS Capability offset */ > > https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350 > > >> At least if the VMM is doing this then the VMM can include the > >> information in its migration scheme and use it to recreate the PCI > >> layout withotu having to create a bunch of uAPI to do so. > > > > We're again back to migration compatibility, where again the capability > > layout would be governed by the migration support in the in-kernel > > variant driver. Once migration is involved the location of a PASID > > shouldn't be arbitrary, whether it's provided by the kernel or the VMM. > > > > Regardless, the VMM ultimately has the authority what the guest > > sees in config space. The VMM is not bound to expose the PASID at the > > offset provided by the kernel, or bound to expose it at all. The > > kernel exposed PASID can simply provide an available location and set > > of enabled capabilities. Thanks, > > > > Alex > > >