On 12/11/2018 15:23, David Gibson wrote:
> On Mon, Nov 12, 2018 at 01:36:45PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 12/11/2018 12:08, David Gibson wrote:
>>> On Fri, Oct 19, 2018 at 11:53:53AM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 19/10/2018 05:05, Alex Williamson wrote:
>>>>> On Thu, 18 Oct 2018 10:37:46 -0700
>>>>> Piotr Jaroszynski <pjaroszynski@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On 10/18/18 9:55 AM, Alex Williamson wrote:
>>>>>>> On Thu, 18 Oct 2018 11:31:33 +1100
>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On 18/10/2018 08:52, Alex Williamson wrote:
>>>>>>>>> On Wed, 17 Oct 2018 12:19:20 +1100
>>>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> On 17/10/2018 06:08, Alex Williamson wrote:
>>>>>>>>>>> On Mon, 15 Oct 2018 20:42:33 +1100
>>>>>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>>>>>> +
>>>>>>>>>>>> +	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
>>>>>>>>>>>> +			pdev->device == 0x04ea) {
>>>>>>>>>>>> +		ret = vfio_pci_ibm_npu2_init(vdev);
>>>>>>>>>>>> +		if (ret) {
>>>>>>>>>>>> +			dev_warn(&vdev->pdev->dev,
>>>>>>>>>>>> +				 "Failed to setup NVIDIA NV2 ATSD region\n");
>>>>>>>>>>>> +			goto disable_exit;
>>>>>>>>>>>> 		}
>>>>>>>>>>>
>>>>>>>>>>> So the NPU is also actually owned by vfio-pci and assigned to the VM?
>>>>>>>>>>
>>>>>>>>>> Yes. On a running system it looks like:
>>>>>>>>>>
>>>>>>>>>> 0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
>>>>>>>>>> 0035:00:00.0 PCI bridge: IBM Device 04c1
>>>>>>>>>> 0035:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:02:04.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:02:05.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:02:0d.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>>>>>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>>>>> 0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>>>>> 0035:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>>>>>
>>>>>>>>>> One "IBM Device" bridge represents one NVLink2, i.e. a piece of the NPU.
>>>>>>>>>> They and the 3 GPUs all go to the same IOMMU group and get passed
>>>>>>>>>> through to a guest.
>>>>>>>>>>
>>>>>>>>>> The entire NPU does not have a representation in sysfs as a whole though.
>>>>>>>>>
>>>>>>>>> So the NPU is a bridge, but it uses a normal header type so vfio-pci
>>>>>>>>> will bind to it?
>>>>>>>>
>>>>>>>> An NPU is an NVLink bridge; it is not PCI in any sense. We (the host
>>>>>>>> powerpc firmware known as "skiboot" or "opal") have chosen to emulate
>>>>>>>> one virtual bridge per NVLink at the firmware level. So for each
>>>>>>>> physical NPU there are 6 virtual bridges. So the NVIDIA driver does
>>>>>>>> not need to know much about NPUs.
>>>>>>>>
>>>>>>>>> And the ATSD register that we need on it is not
>>>>>>>>> accessible through these PCI representations of the sub-pieces of the
>>>>>>>>> NPU? Thanks,
>>>>>>>>
>>>>>>>> No, only via the device tree. Skiboot puts the ATSD register address
>>>>>>>> into the PHB's DT property called 'ibm,mmio-atsd' on these virtual
>>>>>>>> bridges.
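
As an illustration of what that lookup could look like on the host side, here is a minimal sketch; the helper name, the assumption that 'ibm,mmio-atsd' is an array of 64-bit MMIO addresses on the bridge's device-tree node, and the mapping size are assumptions rather than details taken from the patch:

#include <linux/io.h>
#include <linux/of.h>
#include <linux/pci.h>
#include <linux/sizes.h>

/*
 * Hypothetical helper (not from the patch): find the first ATSD register
 * that skiboot advertises via the 'ibm,mmio-atsd' device-tree property of
 * an emulated NPU bridge and map it.  The property is assumed to be an
 * array of 64-bit MMIO addresses; the 64K window size is also an
 * assumption.
 */
static void __iomem *npu2_map_first_atsd(struct pci_dev *pdev)
{
	struct device_node *np = pci_device_to_OF_node(pdev);
	u64 mmio_atsd;

	if (!np)
		return NULL;

	if (of_property_read_u64_index(np, "ibm,mmio-atsd", 0, &mmio_atsd))
		return NULL;

	return ioremap(mmio_atsd, SZ_64K);
}
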
>>>>>>>
>>>>>>> Ok, so the NPU is essentially a virtual device already, mostly just a
>>>>>>> stub. But it seems that each NPU is associated with a specific GPU; how
>>>>>>> is that association done? In the use case here it seems like it's just
>>>>>>> a vehicle to provide this ibm,mmio-atsd property to the guest DT and the
>>>>>>> tgt routing information to the GPU. So if both of those were attached to
>>>>>>> the GPU, there'd be no purpose in assigning the NPU other than that it's
>>>>>>> in the same IOMMU group with a type 0 header, so something needs to be
>>>>>>> done with it. If it's a virtual device, perhaps it could have a type 1
>>>>>>> header so vfio wouldn't care about it; then we would only assign the
>>>>>>> GPU with these extra properties, which seems easier for management
>>>>>>> tools and users. If the guest driver needs a visible NPU device, QEMU
>>>>>>> could possibly emulate one to make the GPU association work
>>>>>>> automatically. Maybe this isn't really a problem, but I wonder if
>>>>>>> you've looked up the management stack to see what tools need to know to
>>>>>>> assign these NPU devices and whether specific configurations are
>>>>>>> required to make the NPU-to-GPU association work. Thanks,
>>>>>>
>>>>>> I'm not that familiar with how this was originally set up, but note that
>>>>>> Alexey is just making it work exactly like baremetal does. The baremetal
>>>>>> GPU driver works as-is in the VM and expects the same properties in the
>>>>>> device tree. Obviously it doesn't have to be that way, but there is
>>>>>> value in keeping it identical.
>>>>>>
>>>>>> Another, probably bigger, point is that the NPU device also implements
>>>>>> the nvlink HW interface and is required for actually training the links
>>>>>> and keeping them up. The driver in the guest trains the links by
>>>>>> programming both the GPU end and the NPU end of each link, so the NPU
>>>>>> device needs to be exposed to the guest.
>>>>>
>>>>> Ok, so there is functionality in assigning the NPU device itself; it's
>>>>> not just an attachment point for metadata. But it still seems there
>>>>> must be some association of NPU to GPU: the tgt address seems to pair
>>>>> the NPU with a specific GPU; they're not simply a fungible set of NPUs
>>>>> and GPUs. Is that association explicit anywhere, or is it related to
>>>>> the topology or device numbering that needs to match between the host
>>>>> and guest? Thanks,
>>>>
>>>> It is in the device tree (a phandle is a node ID).
>>>
>>> Hrm. But the device tree just publishes information about the
>>> hardware. What's the device tree value actually exposing here?
>>>
>>> Is there an inherent hardware connection between one NPU and one GPU?
>>> Or is there just an arbitrary assignment performed by the firmware
>>> which it then exposes via the device tree?
>>
>> I am not sure I understood the question...
>>
>> The ibm,gpu and ibm,npu values (which are phandles) of NPUs and GPUs
>> represent physical wiring.
>
> So you're saying there is specific physical wiring between one
> particular NPU and one particular GPU? And the device tree properties
> describe that wiring?

Yes.

> I think what Alex and I are both trying to determine is if the binding
> of NPUs to GPUs is a result of physical wiring constraints, or just
> a firmware-imposed convention.

It is physical wiring which cannot change with a firmware update - there
are NVLink wires between the CPU socket and the GPU socket, with no
bridges in between.

-- 
Alexey
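
To make the association the thread converges on concrete, here is a minimal sketch of following an emulated NPU bridge back to the GPU it is wired to via the device tree; the helper name and the assumption that 'ibm,gpu' holds a single phandle are illustrative, not taken from the patch:

#include <linux/of.h>
#include <linux/pci.h>

/*
 * Hypothetical helper (not from the patch): follow the 'ibm,gpu' phandle
 * on the device-tree node of an emulated NPU bridge to the node of the
 * GPU it is physically wired to.  The property is assumed to contain a
 * single phandle; per the thread, that phandle reflects fixed NVLink
 * wiring between one NPU link and one GPU.
 */
static struct device_node *npu2_bridge_to_gpu_node(struct pci_dev *npu_bridge)
{
	struct device_node *np = pci_device_to_OF_node(npu_bridge);

	if (!np)
		return NULL;

	/* Returns the GPU node with an elevated refcount, or NULL. */
	return of_parse_phandle(np, "ibm,gpu", 0);
}
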