On 12/11/2018 15:23, David Gibson wrote:
> On Mon, Nov 12, 2018 at 01:36:45PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 12/11/2018 12:08, David Gibson wrote:
>>> On Fri, Oct 19, 2018 at 11:53:53AM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 19/10/2018 05:05, Alex Williamson wrote:
>>>>> On Thu, 18 Oct 2018 10:37:46 -0700
>>>>> Piotr Jaroszynski <pjaroszynski@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On 10/18/18 9:55 AM, Alex Williamson wrote:
>>>>>>> On Thu, 18 Oct 2018 11:31:33 +1100
>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On 18/10/2018 08:52, Alex Williamson wrote:
>>>>>>>>> On Wed, 17 Oct 2018 12:19:20 +1100
>>>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> On 17/10/2018 06:08, Alex Williamson wrote:
>>>>>>>>>>> On Mon, 15 Oct 2018 20:42:33 +1100
>>>>>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>>>>>> +
>>>>>>>>>>>> +	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
>>>>>>>>>>>> +			pdev->device == 0x04ea) {
>>>>>>>>>>>> +		ret = vfio_pci_ibm_npu2_init(vdev);
>>>>>>>>>>>> +		if (ret) {
>>>>>>>>>>>> +			dev_warn(&vdev->pdev->dev,
>>>>>>>>>>>> +				 "Failed to setup NVIDIA NV2 ATSD region\n");
>>>>>>>>>>>> +			goto disable_exit;
>>>>>>>>>>>> 		}
>>>>>>>>>>>
>>>>>>>>>>> So the NPU is also actually owned by vfio-pci and assigned to the VM?
>>>>>>>>>>
>>>>>>>>>> Yes. On a running system it looks like:
>>>>>>>>>>
>>>>>>>>>> 0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0035:00:00.0 PCI bridge: IBM Device 04c1
>>>>>>>>>> 0035:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:02:04.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:02:05.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:02:0d.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>>>>> 0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>>>>> 0035:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>>>>>
>>>>>>>>>> One "IBM Device" bridge represents one NVLink2, i.e. a piece of the NPU.
>>>>>>>>>> They and the 3 GPUs all go to the same IOMMU group and get passed
>>>>>>>>>> through to a guest.
>>>>>>>>>>
>>>>>>>>>> The entire NPU does not have a representation in sysfs as a whole though.
>>>>>>>>>
>>>>>>>>> So the NPU is a bridge, but it uses a normal header type so vfio-pci
>>>>>>>>> will bind to it?
>>>>>>>>
>>>>>>>> An NPU is an NVLink bridge; it is not PCI in any sense. We (the host
>>>>>>>> powerpc firmware known as "skiboot" or "opal") have chosen to emulate
>>>>>>>> one virtual bridge per NVLink at the firmware level. So for each
>>>>>>>> physical NPU there are 6 virtual bridges. So the NVIDIA driver does
>>>>>>>> not need to know much about NPUs.
>>>>>>>>
>>>>>>>>> And the ATSD register that we need on it is not
>>>>>>>>> accessible through these PCI representations of the sub-pieces of the
>>>>>>>>> NPU? Thanks,
>>>>>>>>
>>>>>>>> No, only via the device tree. Skiboot puts the ATSD register address
>>>>>>>> into the PHB's DT property called 'ibm,mmio-atsd' on these virtual
>>>>>>>> bridges.
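
As an illustration of what that lookup could look like on the host side, here is a minimal sketch; the helper name, the assumption that 'ibm,mmio-atsd' is an array of 64-bit MMIO addresses on the bridge's device-tree node, and the mapping size are assumptions rather than details taken from the patch:

#include <linux/io.h>
#include <linux/of.h>
#include <linux/pci.h>
#include <linux/sizes.h>

/*
 * Hypothetical helper (not from the patch): find the first ATSD register
 * that skiboot advertises via the 'ibm,mmio-atsd' device-tree property of
 * an emulated NPU bridge and map it.  The property is assumed to be an
 * array of 64-bit MMIO addresses; the 64K window size is also an
 * assumption.
 */
static void __iomem *npu2_map_first_atsd(struct pci_dev *pdev)
{
	struct device_node *np = pci_device_to_OF_node(pdev);
	u64 mmio_atsd;

	if (!np)
		return NULL;

	if (of_property_read_u64_index(np, "ibm,mmio-atsd", 0, &mmio_atsd))
		return NULL;

	return ioremap(mmio_atsd, SZ_64K);
}
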
>>>>>>>
>>>>>>> Ok, so the NPU is essentially a virtual device already, mostly just a
>>>>>>> stub. But it seems that each NPU is associated with a specific GPU; how
>>>>>>> is that association done? In the use case here it seems like it's just
>>>>>>> a vehicle to provide this ibm,mmio-atsd property to the guest DT and the
>>>>>>> tgt routing information to the GPU. So if both of those were attached to
>>>>>>> the GPU, there'd be no purpose in assigning the NPU other than that it's
>>>>>>> in the same IOMMU group with a type 0 header, so something needs to be
>>>>>>> done with it. If it's a virtual device, perhaps it could have a type 1
>>>>>>> header so vfio wouldn't care about it; then we would only assign the
>>>>>>> GPU with these extra properties, which seems easier for management
>>>>>>> tools and users. If the guest driver needs a visible NPU device, QEMU
>>>>>>> could possibly emulate one to make the GPU association work
>>>>>>> automatically. Maybe this isn't really a problem, but I wonder if
>>>>>>> you've looked up the management stack to see what tools need to know to
>>>>>>> assign these NPU devices and whether specific configurations are
>>>>>>> required to make the NPU-to-GPU association work. Thanks,
>>>>>>
>>>>>> I'm not that familiar with how this was originally set up, but note that
>>>>>> Alexey is just making it work exactly like baremetal does. The baremetal
>>>>>> GPU driver works as-is in the VM and expects the same properties in the
>>>>>> device tree. Obviously it doesn't have to be that way, but there is
>>>>>> value in keeping it identical.
>>>>>>
>>>>>> Another, probably bigger, point is that the NPU device also implements
>>>>>> the nvlink HW interface and is required for actually training the links
>>>>>> and keeping them up. The driver in the guest trains the links by
>>>>>> programming both the GPU end and the NPU end of each link, so the NPU
>>>>>> device needs to be exposed to the guest.
>>>>>
>>>>> Ok, so there is functionality in assigning the NPU device itself; it's
>>>>> not just an attachment point for metadata. But it still seems there
>>>>> must be some association of NPU to GPU: the tgt address seems to pair
>>>>> the NPU with a specific GPU; they're not simply a fungible set of NPUs
>>>>> and GPUs. Is that association explicit anywhere, or is it related to
>>>>> the topology or device numbering that needs to match between the host
>>>>> and guest? Thanks,
>>>>
>>>> It is in the device tree (a phandle is a node ID).
>>>
>>> Hrm. But the device tree just publishes information about the
>>> hardware. What's the device tree value actually exposing here?
>>>
>>> Is there an inherent hardware connection between one NPU and one GPU?
>>> Or is there just an arbitrary assignment performed by the firmware
>>> which it then exposes via the device tree?
>>
>> I am not sure I understood the question...
>>
>> The ibm,gpu and ibm,npu values (which are phandles) of NPUs and GPUs
>> represent physical wiring.
>
> So you're saying there is specific physical wiring between one
> particular NPU and one particular GPU? And the device tree properties
> describe that wiring?

Yes.

> I think what Alex and I are both trying to determine is if the binding
> of NPUs to GPUs is a result of physical wiring constraints, or just
> a firmware-imposed convention.

It is physical wiring which cannot change with a firmware update - there
are NVLink wires between the CPU socket and the GPU socket, with no
bridges in between.

-- 
Alexey
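
To make the association the thread converges on concrete, here is a minimal sketch of following an emulated NPU bridge back to the GPU it is wired to via the device tree; the helper name and the assumption that 'ibm,gpu' holds a single phandle are illustrative, not taken from the patch:

#include <linux/of.h>
#include <linux/pci.h>

/*
 * Hypothetical helper (not from the patch): follow the 'ibm,gpu' phandle
 * on the device-tree node of an emulated NPU bridge to the node of the
 * GPU it is physically wired to.  The property is assumed to contain a
 * single phandle; per the thread, that phandle reflects fixed NVLink
 * wiring between one NPU link and one GPU.
 */
static struct device_node *npu2_bridge_to_gpu_node(struct pci_dev *npu_bridge)
{
	struct device_node *np = pci_device_to_OF_node(npu_bridge);

	if (!np)
		return NULL;

	/* Returns the GPU node with an elevated refcount, or NULL. */
	return of_parse_phandle(np, "ibm,gpu", 0);
}
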