On 20/03/2019 03:36, Alex Williamson wrote:
> On Fri, 15 Mar 2019 19:18:35 +1100
> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>
>> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
>> (on POWER9) NVLinks. In addition to that, the GPUs themselves have direct
>> peer-to-peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
>> platform puts all interconnected GPUs into the same IOMMU group.
>>
>> However the user may want to pass individual GPUs to userspace, so
>> in order to do so we need to put them into separate IOMMU groups and
>> cut off the interconnects.
>>
>> Thankfully V100 GPUs implement an interface to do so, by programming
>> a link-disabling mask into BAR0 of a GPU. Once a link is disabled in
>> a GPU using this interface, it cannot be re-enabled until a secondary
>> bus reset is issued to the GPU.
>>
>> This defines a reset_done() handler for the V100 NVLink2 device which
>> determines which links need to be disabled. This relies on the presence
>> of the new "ibm,nvlink-peers" device tree property of a GPU telling which
>> PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
>>
>> This does not change the existing behaviour and instead adds
>> a new "isolate_nvlink" kernel parameter to allow such isolation.
>>
>> The alternative approaches would be:
>>
>> 1. do this in the system firmware (skiboot), but for that we would need
>> to tell skiboot via an additional OPAL call whether or not we want this
>> isolation - skiboot is unaware of IOMMU groups.
>>
>> 2. do this in the secondary bus reset handler in the POWERNV platform -
>> the problem with that is that at that point the device is not enabled,
>> i.e. config space is not restored, so we would need to enable the device
>> (i.e. set the MMIO bit in the CMD register and program a valid address
>> into BAR0) in order to disable the links, and then perhaps undo all this
>> initialization to bring the device back to the state where
>> pci_try_reset_function() expects it to be.
>
> The trouble seems to be that this approach only maintains the isolation
> exposed by the IOMMU group when vfio-pci is the active driver for the
> device. IOMMU groups can be used by any driver and the IOMMU core is
> incorporating groups in various ways. So, if there's a device specific
> way to configure the isolation reported in the group, which requires
> some sort of active management against things like secondary bus
> resets, then I think we need to manage it above the attached endpoint
> driver.

Fair point. So for now I'll go for 2) then.

> Ideally I'd see this as a set of PCI quirks so that we might
> leverage it beyond POWER platforms. I'm not sure how we get past the
> reliance on device tree properties that we won't have on other
> platforms though, if only NVIDIA could at least open a spec addressing
> the discovery and configuration of NVLink registers on their
> devices :-\  Thanks,

This would be nice, yes...

--
Alexey
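
For illustration, below is a minimal sketch of what 2) could look like when
called from the POWERNV secondary bus reset path: temporarily program BAR0,
set the MMIO enable bit, write the link-disable mask, then undo the setup.
The register offset (NV2_LINK_DISABLE_OFF), the mask value, and the
assumption that BAR0 is a 32-bit BAR are placeholders, since the actual
NVLink register layout on these GPUs is not publicly documented.

```c
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/io.h>

#define NV2_LINK_DISABLE_OFF	0x0	/* placeholder BAR0 offset */
#define NV2_LINK_DISABLE_ALL	0x3f	/* placeholder: all peer links */

/*
 * Disable peer NVLinks right after a secondary bus reset, before config
 * space is restored.  The mask would be derived from the GPU's
 * "ibm,nvlink-peers" property: set a bit for every peer that is not part
 * of the group being passed through.
 */
static void pnv_nvlink_disable_links(struct pci_dev *gpu, u32 mask)
{
	void __iomem *bar0;
	u16 cmd;

	/*
	 * The device has just been reset, so BAR0 holds no valid address
	 * and memory decoding is off.  Temporarily program BAR0 with the
	 * address the platform assigned (assuming a 32-bit BAR0) and flip
	 * the MMIO enable bit.
	 */
	pci_write_config_dword(gpu, PCI_BASE_ADDRESS_0,
			       lower_32_bits(pci_resource_start(gpu, 0)));
	pci_read_config_word(gpu, PCI_COMMAND, &cmd);
	pci_write_config_word(gpu, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);

	bar0 = ioremap(pci_resource_start(gpu, 0), pci_resource_len(gpu, 0));
	if (bar0) {
		/* Write the link-disable mask; sticks until the next SBR */
		iowrite32(mask, bar0 + NV2_LINK_DISABLE_OFF);
		iounmap(bar0);
	}

	/* Undo the temporary enable so the device looks freshly reset */
	pci_write_config_word(gpu, PCI_COMMAND, cmd);
	pci_write_config_dword(gpu, PCI_BASE_ADDRESS_0, 0);
}
```

Whether clearing CMD and BAR0 again afterwards is enough to leave the device
in the state pci_try_reset_function() expects is exactly the open question
raised above.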