On 19/10/2018 05:05, Alex Williamson wrote:
> On Thu, 18 Oct 2018 10:37:46 -0700
> Piotr Jaroszynski <pjaroszynski@xxxxxxxxxx> wrote:
>
>> On 10/18/18 9:55 AM, Alex Williamson wrote:
>>> On Thu, 18 Oct 2018 11:31:33 +1100
>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>
>>>> On 18/10/2018 08:52, Alex Williamson wrote:
>>>>> On Wed, 17 Oct 2018 12:19:20 +1100
>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>
>>>>>> On 17/10/2018 06:08, Alex Williamson wrote:
>>>>>>> On Mon, 15 Oct 2018 20:42:33 +1100
>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>> +
>>>>>>>> +	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
>>>>>>>> +			pdev->device == 0x04ea) {
>>>>>>>> +		ret = vfio_pci_ibm_npu2_init(vdev);
>>>>>>>> +		if (ret) {
>>>>>>>> +			dev_warn(&vdev->pdev->dev,
>>>>>>>> +				"Failed to setup NVIDIA NV2 ATSD region\n");
>>>>>>>> +			goto disable_exit;
>>>>>>>> 		}
>>>>>>>
>>>>>>> So the NPU is also actually owned by vfio-pci and assigned to the VM?
>>>>>>
>>>>>> Yes. On a running system it looks like:
>>>>>>
>>>>>> 0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0035:00:00.0 PCI bridge: IBM Device 04c1
>>>>>> 0035:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:02:04.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:02:05.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:02:0d.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>> 0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>> 0035:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>
>>>>>> Each "IBM Device" bridge represents one NVLink2, i.e. a piece of the NPU.
>>>>>> All of them, plus the 3 GPUs, go to the same IOMMU group and get passed
>>>>>> through to a guest.
>>>>>>
>>>>>> The NPU as a whole has no representation in sysfs though.
>>>>>
>>>>> So the NPU is a bridge, but it uses a normal header type so vfio-pci
>>>>> will bind to it?
>>>>
>>>> An NPU is an NVLink bridge, it is not PCI in any sense. We (the host
>>>> powerpc firmware known as "skiboot" or "opal") have chosen to emulate one
>>>> virtual bridge per NVLink at the firmware level, so for each physical
>>>> NPU there are 6 virtual bridges. This way the NVIDIA driver does not
>>>> need to know much about NPUs.
>>>>
>>>>> And the ATSD register that we need on it is not
>>>>> accessible through these PCI representations of the sub-pieces of the
>>>>> NPU? Thanks,
>>>>
>>>> No, only via the device tree. Skiboot puts the ATSD register address
>>>> into the PHB's DT property called 'ibm,mmio-atsd' for these virtual
>>>> bridges.
>>>
>>> Ok, so the NPU is essentially a virtual device already, mostly just a
>>> stub. But it seems that each NPU is associated with a specific GPU; how
>>> is that association done? In the use case here it seems like it's just
>>> a vehicle to provide this ibm,mmio-atsd property to the guest DT and the
>>> tgt routing information to the GPU. So if both of those were attached to
>>> the GPU, there'd be no purpose in assigning the NPU other than that it's
>>> in the same IOMMU group with a type 0 header, so something needs to be
>>> done with it.
>>> If it's a virtual device, perhaps it could have a type 1
>>> header so vfio wouldn't care about it, then we would only assign the
>>> GPU with these extra properties, which seems easier for management
>>> tools and users. If the guest driver needs a visible NPU device, QEMU
>>> could possibly emulate one to make the GPU association work
>>> automatically. Maybe this isn't really a problem, but I wonder if
>>> you've looked up the management stack to see what tools need to know to
>>> assign these NPU devices and whether specific configurations are
>>> required to make the NPU to GPU association work. Thanks,
>>
>> I'm not that familiar with how this was originally set up, but note that
>> Alexey is just making it work exactly like baremetal does. The baremetal
>> GPU driver works as-is in the VM and expects the same properties in the
>> device-tree. Obviously it doesn't have to be that way, but there is
>> value in keeping it identical.
>>
>> Another, probably bigger, point is that the NPU device also implements
>> the NVLink HW interface and is required for actually training and
>> maintaining the link up. The driver in the guest trains the links by
>> programming both the GPU end and the NPU end of each link, so the NPU
>> device needs to be exposed to the guest.
>
> Ok, so there is functionality in assigning the NPU device itself, it's
> not just an attachment point for metadata. But it still seems there
> must be some association of NPU to GPU; the tgt address seems to pair
> the NPU with a specific GPU, they're not simply a fungible set of NPUs
> and GPUs. Is that association explicit anywhere, or is it related to
> the topology or device numbering that needs to match between the host
> and guest? Thanks,

It is in the device tree (a phandle is a node ID).

NPU:
xscom@623fc00000000/npu@5011000

NVLinks:
xscom@623fc00000000/npu@5011000/link@0
xscom@623fc00000000/npu@5011000/link@1
xscom@623fc00000000/npu@5011000/link@2
xscom@623fc00000000/npu@5011000/link@3
xscom@623fc00000000/npu@5011000/link@5
xscom@623fc00000000/npu@5011000/link@6

GPU RAM:
memory@240000000000
memory@242000000000
memory@244000000000

GPUs:
pciex@620c3c0500000/pci@0/pci@0/pci@4/3d-controller@0
  ibm,npu property - 2 phandles of the associated virtual bridges, as in
  my config a GPU has 2 NVLinks to the CPU (to the NPU in particular)
pciex@620c3c0500000/pci@0/pci@0/pci@5/3d-controller@0
pciex@620c3c0500000/pci@0/pci@0/pci@d/3d-controller@0

Virtual bridges:
pciex@6230200000000/pci@0
  ibm,gpu property - a phandle of the associated GPU
  memory-region property - a phandle of a GPU RAM block
  ibm,nvlink property - a phandle of an NVLink
  ibm,device-tgt-addr property - the short physical address of the GPU RAM
  (0x00000c00.00000000 in this example)
pciex@6230200000000/pci@0,1
pciex@6230200000000/pci@1
pciex@6230200000000/pci@1,1
pciex@6230200000000/pci@2
pciex@6230200000000/pci@2,1

--
Alexey
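
For reference, a minimal sketch of the 'ibm,mmio-atsd' lookup discussed
above, assuming the property lives on the PHB node of the emulated bridge
and that its first 64-bit cell is the ATSD MMIO address; the helper name
npu2_map_mmio_atsd() and the 64K mapping size are illustrative assumptions,
not the actual vfio_pci_ibm_npu2_init() code:

#include <linux/io.h>
#include <linux/of.h>
#include <linux/pci.h>
#include <linux/sizes.h>
#include <asm/pci-bridge.h>	/* pci_bus_to_host() on powerpc */

/*
 * Map the ATSD register advertised for one emulated NVLink bridge.
 * Assumes the first u64 of 'ibm,mmio-atsd' on the PHB node is the
 * register address; the mapping size is a guess.
 */
static void __iomem *npu2_map_mmio_atsd(struct pci_dev *pdev)
{
	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
	u64 mmio_atsd;

	if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", 0,
				       &mmio_atsd))
		return NULL;

	return ioremap(mmio_atsd, SZ_64K);
}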
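
The phandle associations in the listing could likewise be resolved with
standard OF helpers; the struct and function names below are hypothetical
and only illustrate the bridge -> GPU / GPU-RAM / tgt-address linkage:

#include <linux/errno.h>
#include <linux/of.h>
#include <linux/types.h>

/* Hypothetical per-link description built from the properties above. */
struct npu2_link_info {
	struct device_node *gpu;	/* 3d-controller@... node    */
	struct device_node *gpu_ram;	/* memory@... node           */
	u64 tgt;			/* ibm,device-tgt-addr value */
};

static int npu2_parse_link(struct device_node *bridge,
			   struct npu2_link_info *info)
{
	/* Follow the phandles carried by the virtual bridge node. */
	info->gpu = of_parse_phandle(bridge, "ibm,gpu", 0);
	info->gpu_ram = of_parse_phandle(bridge, "memory-region", 0);
	if (!info->gpu || !info->gpu_ram)
		goto err;

	/* Short physical address of the GPU RAM, e.g. 0xc0000000000. */
	if (of_property_read_u64(bridge, "ibm,device-tgt-addr", &info->tgt))
		goto err;

	return 0;

err:
	of_node_put(info->gpu);		/* of_node_put() tolerates NULL */
	of_node_put(info->gpu_ram);
	return -ENODEV;
}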