On Tue, 30 Apr 2019 16:14:35 +1000
Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:

> On 30/04/2019 15:45, Alistair Popple wrote:
> > Alexey,
> >
> >>>>> +void pnv_try_isolate_nvidia_v100(struct pci_dev *bridge)
> >>>>> +{
> >>>>> +        u32 mask, val;
> >>>>> +        void __iomem *bar0_0, *bar0_120000, *bar0_a00000;
> >>>>> +        struct pci_dev *pdev;
> >>>>> +        u16 cmd = 0, cmdmask = PCI_COMMAND_MEMORY;
> >>>>> +
> >>>>> +        if (!bridge->subordinate)
> >>>>> +                return;
> >>>>> +
> >>>>> +        pdev = list_first_entry_or_null(&bridge->subordinate->devices,
> >>>>> +                        struct pci_dev, bus_list);
> >>>>> +        if (!pdev)
> >>>>> +                return;
> >>>>> +
> >>>>> +        if (pdev->vendor != PCI_VENDOR_ID_NVIDIA)
> >
> > Don't you also need to check the PCIe devid to match only [PV]100 devices as
> > well? I doubt there's any guarantee these registers will remain the same for
> > all future (or older) NVIDIA devices.
> >
> I do not have the complete list of IDs, and I have already seen 3
> different device ids. This only works for machines with
> ibm,npu/gpu/nvlinks properties, so for now it works. For the future we
> are hoping to either have an open source nvidia driver or some small
> minidriver (also from nvidia, or maybe a spec allowing us to write one)
> to allow topology discovery on the host, so we would not depend on the
> skiboot's powernv DT.
>
> > IMHO this should really be done in the device driver in the guest. A malicious
> > guest could load a modified driver that doesn't do this, but that should not
> > compromise other guests, which presumably load a non-compromised driver that
> > disables the links on that guest's GPU. However I guess in practice what you
> > have here should work equally well.
>
> Doing it in the guest means a good guest needs to have an updated
> driver; we do not really want to depend on that. The idea of IOMMU
> groups is that the hypervisor provides isolation irrespective of what
> the guest does.

+1

It's not the user/guest driver's responsibility to maintain the
isolation of the device. Thanks,

Alex

> Also vfio+qemu+slof would need to convey the nvlink topology to the
> guest, which seems like an unnecessary complication.
>
> >
> > - Alistair
> >
> >>>>> +                return;
> >>>>> +
> >>>>> +        mask = nvlinkgpu_get_disable_mask(&pdev->dev);
> >>>>> +        if (!mask)
> >>>>> +                return;
> >>>>> +
> >>>>> +        bar0_0 = pci_iomap_range(pdev, 0, 0, 0x10000);
> >>>>> +        if (!bar0_0) {
> >>>>> +                pci_err(pdev, "Error mapping BAR0 @0\n");
> >>>>> +                return;
> >>>>> +        }
> >>>>> +        bar0_120000 = pci_iomap_range(pdev, 0, 0x120000, 0x10000);
> >>>>> +        if (!bar0_120000) {
> >>>>> +                pci_err(pdev, "Error mapping BAR0 @120000\n");
> >>>>> +                goto bar0_0_unmap;
> >>>>> +        }
> >>>>> +        bar0_a00000 = pci_iomap_range(pdev, 0, 0xA00000, 0x10000);
> >>>>> +        if (!bar0_a00000) {
> >>>>> +                pci_err(pdev, "Error mapping BAR0 @A00000\n");
> >>>>> +                goto bar0_120000_unmap;
> >>>>> +        }
> >>>>
> >>>> Is it really necessary to do three separate ioremaps vs one that would
> >>>> cover them all here? I suspect you're just sneaking in PAGE_SIZE with
> >>>> the 0x10000 size mappings anyway. Seems like it would simplify setup,
> >>>> error reporting, and cleanup to ioremap to the PAGE_ALIGN'd range
> >>>> of the highest register accessed. Thanks,
> >>>
> >>> Sure I can map it once, I just do not see the point in mapping/unmapping
> >>> all 0xa10000>>16=161 system pages for a very short period of time while
> >>> we know precisely that we need just 3 pages.
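
(For reference, here is a rough, untested sketch of the single-mapping
variant being discussed. The function name is made up, the offsets are
simply copied from the patch above, and the PCI_COMMAND save/restore is
omitted for brevity.)

/* Sketch only: one mapping covering everything up to the highest
 * register accessed (0xA00000 + 0x148) instead of three 0x10000 maps.
 */
static void nvlinkgpu_isolate_single_map(struct pci_dev *pdev, u32 mask)
{
        void __iomem *bar0;
        u32 val;

        bar0 = pci_iomap_range(pdev, 0, 0, 0xA10000);
        if (!bar0) {
                pci_err(pdev, "Error mapping BAR0\n");
                return;
        }

        iowrite32(0x4, bar0 + 0x120000 + 0x4C);
        iowrite32(0x2, bar0 + 0x120000 + 0x2204);

        val = ioread32(bar0 + 0x200);
        iowrite32(val | 0x02000000, bar0 + 0x200);

        val = ioread32(bar0 + 0xA00000 + 0x148);
        iowrite32(val | mask, bar0 + 0xA00000 + 0x148);

        pci_iounmap(pdev, bar0);
}

(On a 64K-page kernel that one mapping is the 0xa10000 >> 16 = 161 pages
mentioned above, which is the cost being weighed against the simpler
setup and error handling.)
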
> >>>
> >>> Repost?
> >>
> >> Ping?
> >>
> >> Can this go in as it is (i.e. should I ping Michael) or does this
> >> need another round? It would be nice to get some formal acks. Thanks,
> >>
> >>>> Alex
> >>>>
> >>>>> +
> >>>>> +        pci_restore_state(pdev);
> >>>>> +        pci_read_config_word(pdev, PCI_COMMAND, &cmd);
> >>>>> +        if ((cmd & cmdmask) != cmdmask)
> >>>>> +                pci_write_config_word(pdev, PCI_COMMAND, cmd | cmdmask);
> >>>>> +
> >>>>> +        /*
> >>>>> +         * The sequence is from "Tesla P100 and V100 SXM2 NVLink Isolation on
> >>>>> +         * Multi-Tenant Systems".
> >>>>> +         * The register names are not provided there either, hence raw values.
> >>>>> +         */
> >>>>> +        iowrite32(0x4, bar0_120000 + 0x4C);
> >>>>> +        iowrite32(0x2, bar0_120000 + 0x2204);
> >>>>> +        val = ioread32(bar0_0 + 0x200);
> >>>>> +        val |= 0x02000000;
> >>>>> +        iowrite32(val, bar0_0 + 0x200);
> >>>>> +        val = ioread32(bar0_a00000 + 0x148);
> >>>>> +        val |= mask;
> >>>>> +        iowrite32(val, bar0_a00000 + 0x148);
> >>>>> +
> >>>>> +        if ((cmd | cmdmask) != cmd)
> >>>>> +                pci_write_config_word(pdev, PCI_COMMAND, cmd);
> >>>>> +
> >>>>> +        pci_iounmap(pdev, bar0_a00000);
> >>>>> +bar0_120000_unmap:
> >>>>> +        pci_iounmap(pdev, bar0_120000);
> >>>>> +bar0_0_unmap:
> >>>>> +        pci_iounmap(pdev, bar0_0);
> >>>>> +}
> >
> >
>
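
(On Alistair's earlier point about matching only P100/V100 devices: if a
reliable list of device ids ever becomes available, the bare vendor check
in pnv_try_isolate_nvidia_v100() could be tightened with something like
the sketch below. The id in the table is a placeholder only, since, as
noted above, the complete list is not known here.)

/* Illustrative only: the table would need the real SXM2 P100/V100
 * device ids; the single entry below is just a placeholder.
 */
static const struct pci_device_id nvlinkgpu_ids[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x1db1) },  /* placeholder id */
        { }
};

static bool nvlinkgpu_is_known_gpu(struct pci_dev *pdev)
{
        return pci_match_id(nvlinkgpu_ids, pdev) != NULL;
}

(That would keep the raw register pokes limited to devices the isolation
sequence is actually documented for.)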