On Wed, Jun 12, 2024 at 08:23:01PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 12, 2024 at 04:29:03PM -0500, Bjorn Helgaas wrote:
> > [+cc Alex since VFIO entered the conversation; thread at
> > https://lore.kernel.org/r/20240523063528.199908-1-vidyas@xxxxxxxxxx]
> >
> > On Mon, Jun 10, 2024 at 08:38:49AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 07, 2024 at 02:30:55PM -0500, Bjorn Helgaas wrote:
> > > > "Correctly" is not quite the right word here; it's just a fact
> > > > that the ACS settings determined at boot time result in certain
> > > > IOMMU groups. If the user desires different groups, it's not
> > > > that something is "incorrect"; it's just that the user may have
> > > > to accept less isolation to get the desired IOMMU groups.
> > >
> > > That is not quite accurate. There are HW configurations where ACS
> > > needs to be a certain way for the HW to work with P2P at all. It
> > > isn't just an optimization or something the user accepts; if they
> > > want P2P at all, they must get an ACS configuration appropriate
> > > for their system.
> >
> > The current wording of "For iommu_groups to form correctly, the ACS
> > settings in the PCIe fabric need to be setup early" suggests that
> > the way we currently configure ACS is incorrect in general,
> > regardless of P2PDMA.
>
> Yes, I'd agree with this. We don't have enough information to
> configure it properly in the kernel in an automatic way. We don't
> know whether pairs of devices even have SW enablement to do P2P in
> the kernel, and we don't accurately know what issues the root
> complex has. All of this information goes into choosing the right
> ACS bits.
>
> > But my impression is that there's a trade-off between isolation and
> > the ability to do P2PDMA, and users have different requirements,
> > and the preference for less isolation/more P2PDMA is no more
> > "correct" than a preference for more isolation/less P2PDMA.
>
> Sure, that makes sense.
>
> > Maybe something like this:
> >
> >   PCIe ACS settings determine how devices are put into
> >   iommu_groups. The iommu_groups in turn determine which devices
> >   can be passed through to VMs and whether P2PDMA between them is
> >   possible. The iommu_groups are built at enumeration-time and are
> >   currently static.
>
> Not quite; the iommu_groups don't have a lot to do with P2P. Even
> devices in the same kernel group can still have non-working P2P.
>
> Maybe:
>
>   PCIe ACS settings control the level of isolation and the possible
>   P2P paths between devices. With greater isolation the kernel will
>   create smaller iommu_groups, and with less isolation there is more
>   HW that can achieve P2P transfers. From a virtualization
>   perspective, all devices in the same iommu_group must be assigned
>   to the same VM, as they lack security isolation.
>
>   There is no way for the kernel to automatically know the correct
>   ACS settings for any given system and workload. Existing command
>   line options allow only for large-scale change, such as disabling
>   all isolation, but this is not sufficient for more complex cases.
>
>   Add a kernel command-line option to directly control all the ACS
>   bits for specific devices, which allows the operator to set up the
>   right level of isolation to achieve the desired P2P configuration.
>   The definition is future-proof: when new ACS bits are added to the
>   spec, the open syntax can be extended.
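>
>   As an illustration only (the flag encoding and the device
>   addresses below are a sketch of the proposed syntax, not a final
>   definition), a boot command line could carry something like:
>
>     config_acs=1111111@0000:03:00.0;0000000@0000:04:00.0
>
>   where each of the seven characters forces one ACS control bit on
>   ('1'), off ('0'), or leaves it unchanged ('x'), so the first
>   device gets full isolation while the second allows all direct
>   traffic.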
>
>   ACS needs to be set up early in kernel boot, as the ACS settings
>   affect how iommu_groups are formed. iommu_group formation is a
>   one-time event during initial device discovery; changing ACS bits
>   after kernel boot can result in an inaccurate view of the
>   iommu_groups compared to the current isolation configuration.
>
>   ACS applies to PCIe Downstream Ports and multi-function devices.
>   The default ACS settings are strict and deny any direct traffic
>   between two functions. This results in the smallest iommu_groups
>   the HW can support. Frequently these values result in slow or
>   non-working P2PDMA.
>
>   ACS offers a range of security choices controlling how traffic is
>   allowed to go directly between two devices. Some popular choices:
>    - Full prevention
>    - Translated requests can be direct, with various options
>    - Asymmetric direct traffic: A can reach B but not the reverse
>    - All traffic can be direct
>   Along with some other less common ones for special topologies.
>
>   The intention is that this option would be used with expert
>   knowledge of the HW capability and workload to achieve the desired
>   configuration.

That all sounds good. IIUC the current default is full prevention (I
guess you said that a few paragraphs up).

It's unfortunate that this requires so much expert knowledge to use,
but I guess we don't really have a good alternative. The only way I
can think of to help would be some kind of white paper or examples in
Documentation/PCI/.

Bjorn
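P.S. As a starting point for such documentation, here is a minimal
sketch of the userspace side (the device address is purely
illustrative): the ACS bits a port currently has, and the
iommu_groups they produced, can be inspected with:

  # Show the ACS capability and control bits of a downstream port:
  lspci -s 0000:00:01.0 -vvv | grep -i acs

  # List each iommu_group and the devices it contains:
  for g in /sys/kernel/iommu_groups/*; do
          echo "group ${g##*/}: $(ls "$g"/devices)"
  done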