On 20/03/2019 03:36, Alex Williamson wrote:
> On Fri, 15 Mar 2019 19:18:35 +1100
> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>
>> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
>> (on POWER9) NVLinks. In addition to that, the GPUs themselves have direct
>> peer-to-peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
>> platform puts all interconnected GPUs into the same IOMMU group.
>>
>> However the user may want to pass individual GPUs to userspace, so
>> in order to do so we need to put them into separate IOMMU groups and
>> cut off the interconnects.
>>
>> Thankfully V100 GPUs implement an interface to do so, by programming
>> a link-disabling mask into BAR0 of a GPU. Once a link is disabled in
>> a GPU using this interface, it cannot be re-enabled until a secondary
>> bus reset is issued to the GPU.
>>
>> This defines a reset_done() handler for the V100 NVLink2 device which
>> determines which links need to be disabled. This relies on the presence
>> of the new "ibm,nvlink-peers" device tree property of a GPU telling which
>> PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
>>
>> This does not change the existing behaviour and instead adds
>> a new "isolate_nvlink" kernel parameter to allow such isolation.
>>
>> The alternative approaches would be:
>>
>> 1. do this in the system firmware (skiboot), but for that we would need
>> to tell skiboot via an additional OPAL call whether or not we want this
>> isolation - skiboot is unaware of IOMMU groups.
>>
>> 2. do this in the secondary bus reset handler in the POWERNV platform -
>> the problem with that is that at that point the device is not enabled,
>> i.e. config space is not restored, so we would need to enable the device
>> (i.e. set the MMIO bit in the CMD register and program a valid address
>> into BAR0) in order to disable the links, and then perhaps undo all this
>> initialization to bring the device back to the state where
>> pci_try_reset_function() expects it to be.
>
> The trouble seems to be that this approach only maintains the isolation
> exposed by the IOMMU group when vfio-pci is the active driver for the
> device. IOMMU groups can be used by any driver and the IOMMU core is
> incorporating groups in various ways. So, if there's a device specific
> way to configure the isolation reported in the group, which requires
> some sort of active management against things like secondary bus
> resets, then I think we need to manage it above the attached endpoint
> driver.

Fair point. So for now I'll go for 2) then.

> Ideally I'd see this as a set of PCI quirks so that we might
> leverage it beyond POWER platforms. I'm not sure how we get past the
> reliance on device tree properties that we won't have on other
> platforms though, if only NVIDIA could at least open a spec addressing
> the discovery and configuration of NVLink registers on their
> devices :-\  Thanks,

This would be nice, yes...

--
Alexey
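
For illustration, below is a minimal sketch of what 2) could look like when
called from the POWERNV secondary bus reset path: temporarily program BAR0,
set the MMIO enable bit, write the link-disable mask, then undo the setup.
The register offset (NV2_LINK_DISABLE_OFF), the mask value, and the
assumption that BAR0 is a 32-bit BAR are placeholders, since the actual
NVLink register layout on these GPUs is not publicly documented.

```c
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/io.h>

#define NV2_LINK_DISABLE_OFF	0x0	/* placeholder BAR0 offset */
#define NV2_LINK_DISABLE_ALL	0x3f	/* placeholder: all peer links */

/*
 * Disable peer NVLinks right after a secondary bus reset, before config
 * space is restored.  The mask would be derived from the GPU's
 * "ibm,nvlink-peers" property: set a bit for every peer that is not part
 * of the group being passed through.
 */
static void pnv_nvlink_disable_links(struct pci_dev *gpu, u32 mask)
{
	void __iomem *bar0;
	u16 cmd;

	/*
	 * The device has just been reset, so BAR0 holds no valid address
	 * and memory decoding is off.  Temporarily program BAR0 with the
	 * address the platform assigned (assuming a 32-bit BAR0) and flip
	 * the MMIO enable bit.
	 */
	pci_write_config_dword(gpu, PCI_BASE_ADDRESS_0,
			       lower_32_bits(pci_resource_start(gpu, 0)));
	pci_read_config_word(gpu, PCI_COMMAND, &cmd);
	pci_write_config_word(gpu, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);

	bar0 = ioremap(pci_resource_start(gpu, 0), pci_resource_len(gpu, 0));
	if (bar0) {
		/* Write the link-disable mask; sticks until the next SBR */
		iowrite32(mask, bar0 + NV2_LINK_DISABLE_OFF);
		iounmap(bar0);
	}

	/* Undo the temporary enable so the device looks freshly reset */
	pci_write_config_word(gpu, PCI_COMMAND, cmd);
	pci_write_config_dword(gpu, PCI_BASE_ADDRESS_0, 0);
}
```

Whether clearing CMD and BAR0 again afterwards is enough to leave the device
in the state pci_try_reset_function() expects is exactly the open question
raised above.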