On 31/07/2018 02:29, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:58:49 +1000
> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>
>> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
>>> On Tue, 10 Jul 2018 16:37:15 -0600
>>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>>>
>>>> On Tue, 10 Jul 2018 14:10:20 +1000
>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>
>>>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>>>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On Fri, 8 Jun 2018 14:14:23 +1000
>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:
>>>>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
>>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:
>>>>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>>>>>>>>> Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>>>>>>>>> connected devices makes sense? AIUI we have a PCI view of these
>>>>>>>>>>>> devices and from that perspective they're isolated. That's the
>>>>>>>>>>>> view of the device used to generate the grouping. However, not
>>>>>>>>>>>> visible to us, these devices are interconnected via NVLink. What
>>>>>>>>>>>> isolation properties does NVLink provide given that its entire
>>>>>>>>>>>> purpose for existing seems to be to provide a high performance
>>>>>>>>>>>> link for p2p between devices?
>>>>>>>>>>>
>>>>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the
>>>>>>>>>>> device and the CPU which is running significantly faster than PCIe.
>>>>>>>>>>>
>>>>>>>>>>> But yes, there are cross-links and those should probably be
>>>>>>>>>>> accounted for in the grouping.
>>>>>>>>>>
>>>>>>>>>> Then after we fix the grouping, can we just let the host driver
>>>>>>>>>> manage this coherent memory range and expose vGPUs to guests? The
>>>>>>>>>> use case of assigning all 6 GPUs to one VM seems pretty limited.
>>>>>>>>>> (Might need to convince NVIDIA to support more than a single vGPU
>>>>>>>>>> per VM though)
>>>>>>>>>
>>>>>>>>> These are physical GPUs, not virtual sriov-alike things they are
>>>>>>>>> implementing as well elsewhere.
>>>>>>>>
>>>>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>>>>>>>> either. That's why we have mdev devices now to implement software
>>>>>>>> defined devices. I don't have first hand experience with V-series,
>>>>>>>> but I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
>>>>>>>
>>>>>>> So assuming V100 can do vGPU, you are suggesting ditching this
>>>>>>> patchset and using mediated vGPUs instead, correct?
>>>>>>
>>>>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>>>>>> account for lack of isolation on the NVLink side and we correct that,
>>>>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>>>>>> useful feature? OTOH, it's entirely an NVIDIA proprietary decision
>>>>>> whether they choose to support vGPU on these GPUs or whether they can
>>>>>> be convinced to support multiple vGPUs per VM.
>>>>>>
>>>>>>>>> My current understanding is that every P9 chip in that box has some
>>>>>>>>> NVLink2 logic on it so each P9 is directly connected to 3 GPUs via
>>>>>>>>> PCIe and 2xNVLink2, and GPUs in that big group are interconnected
>>>>>>>>> by NVLink2 links as well.
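(Just to illustrate the grouping outcome under that topology, and not how
the iommu core actually builds groups: a quick userspace sketch which
unions GPUs over an assumed NVLink2 peer table. With the cross-links
present it yields the 2 groups of 3; drop the links from the table and it
degenerates to 6 single-GPU groups. The peer table and all names below
are made up for illustration.)

/* Sketch only: derive isolation groups from an assumed NVLink2 peer map. */
#include <stdio.h>

#define NGPU 6

/* Hypothetical topology: GPUs 0-2 hang off one P9, GPUs 3-5 off the other,
 * and each triplet is fully cross-linked by NVLink2. */
static const int linked[NGPU][NGPU] = {
        { 0, 1, 1, 0, 0, 0 },
        { 1, 0, 1, 0, 0, 0 },
        { 1, 1, 0, 0, 0, 0 },
        { 0, 0, 0, 0, 1, 1 },
        { 0, 0, 0, 1, 0, 1 },
        { 0, 0, 0, 1, 1, 0 },
};

static int group[NGPU];

/* Union-find root lookup with path halving. */
static int find(int x)
{
        while (group[x] != x)
                x = group[x] = group[group[x]];
        return x;
}

int main(void)
{
        int i, j;

        for (i = 0; i < NGPU; i++)
                group[i] = i;

        /* Union every pair of NVLink2-connected GPUs: no isolation on the
         * link means they must share a group. */
        for (i = 0; i < NGPU; i++)
                for (j = 0; j < NGPU; j++)
                        if (linked[i][j])
                                group[find(i)] = find(j);

        for (i = 0; i < NGPU; i++)
                printf("GPU%d -> group %d\n", i, find(i));
        return 0;
}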
>>>>>>>>>
>>>>>>>>> From small bits of information I have it seems that a GPU can work
>>>>>>>>> perfectly well alone and if the NVIDIA driver does not see these
>>>>>>>>> interconnects (because we do not pass the rest of the big 3xGPU
>>>>>>>>> group to this guest), it continues with a single GPU. There is an
>>>>>>>>> "nvidia-smi -r" big reset hammer which simply refuses to work until
>>>>>>>>> all 3 GPUs are passed so there is some distinction between passing
>>>>>>>>> 1 or 3 GPUs, and I am trying (as we speak) to get a confirmation
>>>>>>>>> from NVIDIA that it is ok to pass just a single GPU.
>>>>>>>>>
>>>>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>>>>>>>> interconnected group).
>>>>>>>>
>>>>>>>> I'm not gaining much confidence that we can rely on isolation between
>>>>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
>>>>>>>> proprietary code from NVIDIA on a proprietary interconnect from
>>>>>>>> NVIDIA is going to play nice and nobody will figure out how to do
>>>>>>>> bad things because... obfuscation? Thanks,
>>>>>>>
>>>>>>> Well, we already believe that the proprietary firmware of an
>>>>>>> SR-IOV-capable adapter like Mellanox ConnectX is not doing bad
>>>>>>> things, how is this different in principle?
>>>>>>
>>>>>> It seems like the scope and hierarchy are different. Here we're
>>>>>> talking about exposing big discrete devices, which are peers of one
>>>>>> another (and have history of being reverse engineered), to userspace
>>>>>> drivers. Once handed to userspace, each of those devices needs to be
>>>>>> considered untrusted. In the case of SR-IOV, we typically have a
>>>>>> trusted host driver for the PF managing untrusted VFs. We do rely on
>>>>>> some sanity in the hardware/firmware in isolating the VFs from each
>>>>>> other and from the PF, but we also often have source code for Linux
>>>>>> drivers for these devices and sometimes even datasheets. Here we have
>>>>>> neither of those and perhaps we won't know the extent of the lack of
>>>>>> isolation between these devices until nouveau (best case) or some
>>>>>> exploit (worst case) exposes it. IOMMU grouping always assumes a lack
>>>>>> of isolation between devices unless the hardware provides some
>>>>>> indication that isolation exists, for example ACS on PCIe. If NVIDIA
>>>>>> wants to expose isolation on NVLink, perhaps they need to document
>>>>>> enough of it that the host kernel can manipulate and test for
>>>>>> isolation, perhaps even enabling virtualization of the NVLink
>>>>>> interconnect interface such that the host can prevent GPUs from
>>>>>> interfering with each other. Thanks,
>>>>>
>>>>>
>>>>> So far I got this from NVIDIA:
>>>>>
>>>>> 1. An NVLink2 link's state can be controlled via MMIO registers, there
>>>>> is an "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
>>>>> "confidential" though) from NVIDIA with the MMIO addresses to block if
>>>>> we want to disable certain links. In order for NVLink to work it needs
>>>>> to be enabled on both sides, so by filtering certain MMIO ranges we
>>>>> can isolate a GPU.
>>>>
>>>> Where are these MMIO registers, on the bridge or on the endpoint device?
>>>
>>> The endpoint GPU device.
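(To make item 1 above concrete, here is a minimal sketch of the kind of
overlap check a vfio-pci-style BAR access path could apply before letting
a read/write through. The window offsets and names are invented
placeholders, not the real V100 registers; in practice the 64K pages
containing such windows would also have to be excluded from mmap so the
accesses actually trap.)

/* Sketch only: reject BAR accesses that overlap an NVLink2 control window.
 * Offsets below are placeholders, not real V100 registers. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct mmio_window {
        uint64_t start;
        uint64_t size;
};

static const struct mmio_window nvlink_ctl[] = {
        { 0x0a0000, 0x10000 },  /* hypothetical link-control block (item 1) */
        { 0x0c0000, 0x10000 },  /* hypothetical firmware-update block (item 2) */
};

/* Returns true if [off, off + len) overlaps any blocked window. */
static bool nvlink_access_blocked(uint64_t off, uint64_t len)
{
        size_t i;

        for (i = 0; i < sizeof(nvlink_ctl) / sizeof(nvlink_ctl[0]); i++) {
                const struct mmio_window *w = &nvlink_ctl[i];

                if (off < w->start + w->size && off + len > w->start)
                        return true;
        }
        return false;
}

int main(void)
{
        /* A 4-byte access at 0xa0010 hits the first (made-up) window. */
        printf("%d\n", nvlink_access_blocked(0xa0010, 4));
        /* An access well outside the windows is allowed. */
        printf("%d\n", nvlink_access_blocked(0x10000, 4));
        return 0;
}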
>>>
>>>> I'm wondering when you say block MMIO if these are ranges on the device
>>>> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
>>>> come with that or if this should essentially be device specific
>>>> enable_acs and acs_enabled quirks, and maybe also potentially used by
>>>> Logan's disable acs series to allow GPUs to be linked and have grouping
>>>> to match.
>>>
>>> An update: I confused P100 and V100. P100 would need filtering but
>>> ours is V100 and it has a couple of registers which we can use to
>>> disable particular links, and once disabled, a link cannot be
>>> re-enabled till the next secondary bus reset.
>>>
>>>
>>>>> 2. We can and should also prohibit the GPU firmware update, this is
>>>>> done via MMIO as well. The protocol is not open but at least the
>>>>> register ranges might be made available in order to filter these
>>>>> accesses, and there is no plan to change this.
>>>>
>>>> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
>>>> along with it.
>>>
>>> Yes, however NVIDIA says there is no performance critical stuff with
>>> this 64K page.
>>>
>>>> Also, there are certainly use cases of updating firmware for an
>>>> assigned device, we don't want to impose a policy, but we should
>>>> figure out the right place for that policy to be specified by the
>>>> admin.
>>>
>>> Maybe, but NVIDIA is talking about some "out-of-band" command to the
>>> GPU to enable firmware update, so firmware update is not really
>>> supported.
>>>
>>>
>>>>> 3. DMA traffic over the NVLink2 link can be of 2 types: UT=1 for
>>>>> PCI-style DMA via our usual TCE tables (one per NVLink2 link),
>>>>> and UT=0 for direct host memory access. UT stands for "use
>>>>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
>>>>> possible over the PCIe link.
>>>>> This UT=0 traffic uses host physical addresses returned by a nest MMU
>>>>> (a piece of NVIDIA logic on a POWER9 chip): this takes an LPID (guest
>>>>> id), an mmu context id (guest userspace mm id) and a virtual address,
>>>>> translates them to the host physical address, and that result is used
>>>>> for UT=0 DMA; this is called "ATS" although it is not PCIe ATS afaict.
>>>>> NVIDIA says that the hardware is designed in a way that it can only do
>>>>> UT=0 DMA to addresses which ATS translated to, and there is no way to
>>>>> override this behavior and this is what guarantees the isolation.
>>>>
>>>> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
>>>> endpoint requests a translation of an IOVA to a physical address, the
>>>> IOMMU returns a lookup based on PCIe requester ID, and there's an
>>>> invalidation protocol to keep things coherent.
>>>
>>> Yes there is. The current approach is to have an MMU notifier in
>>> the kernel which tells an NPU (the IBM piece of logic between
>>> GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations, and
>>> that in turn pokes the GPU till it confirms that it has invalidated its
>>> TLBs and there is no ongoing DMA.
>>>
>>>> In the case above, who provides a guest id and mmu context id?
>>>
>>> We (powerpc/powernv platform) configure the NPU to bind a specific
>>> bus:dev:fn to an LPID (== guest id), and the MMU context id comes from
>>> the guest. The nest MMU knows where the partition table is and this
>>> table contains all the pointers needed for the translation.
>>>
>>>
>>>> Additional software somewhere? Is the virtual address an IOVA or a
>>>> process virtual address?
>>>
>>> A guest kernel or a guest userspace virtual address.
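(A sketch of that invalidation ordering, with the NPU/GPU interfaces
stubbed out since they are not public; every name below is invented for
illustration only. The point is just the sequence: the MMU notifier
fires, the NPU translations for that LPID/context are shot down, and we
wait for the GPU to confirm its TLBs are clean and no UT=0 DMA is in
flight before the host page is reused.)

/* Sketch only: host-side ordering for tearing down a GPU "ATS" translation
 * when a guest mapping changes. The two stubs stand in for the non-public
 * NPU/GPU interfaces; names and signatures are made up. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder: ask the NPU to drop translations for this LPID/context id
 * covering [va, va + len). */
static void npu_shootdown_stub(uint32_t lpid, uint32_t ctx, uint64_t va,
                               uint64_t len)
{
        printf("NPU shootdown: lpid=%u ctx=%u va=0x%llx len=0x%llx\n",
               lpid, ctx, (unsigned long long)va, (unsigned long long)len);
}

/* Placeholder: has the GPU confirmed its TLBs are flushed and no UT=0 DMA
 * against the old translation is still in flight? */
static bool gpu_flush_done_stub(uint32_t lpid, uint32_t ctx)
{
        (void)lpid;
        (void)ctx;
        return true;    /* pretend the GPU acked immediately */
}

/* Called from an MMU-notifier-style hook before the host page is reused. */
static void nvlink2_invalidate_range(uint32_t lpid, uint32_t ctx,
                                     uint64_t va, uint64_t len)
{
        /* 1. Invalidate the NPU so no new UT=0 translations are handed out. */
        npu_shootdown_stub(lpid, ctx, va, len);

        /* 2. Poke the GPU and wait until it confirms the TLBs are clean and
         *    outstanding DMA has drained. */
        while (!gpu_flush_done_stub(lpid, ctx))
                ;       /* a real implementation would sleep and time out */
}

int main(void)
{
        nvlink2_invalidate_range(1, 42, 0x7f0000000000ull, 0x10000);
        return 0;
}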
>>>
>>>> Do we assume some sort of invalidation protocol as well?
>>>
>>> I am a little confused, is this question about the same invalidation
>>> protocol as above or a different one?
>>>
>>>
>>>>> So isolation can be achieved, unless I am missing something.
>>>>>
>>>>> How do we want this to be documented to proceed? I assume if I post
>>>>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
>>>>> documented, will we take this t&c or do we need a GPU API spec (which
>>>>> is not going to happen anyway)?
>>>>
>>>> "t&c"? I think we need what we're actually interacting with to be well
>>>> documented, but that could be _thorough_ comments in the code, enough
>>>> to understand the theory of operation, as far as I'm concerned. A pdf
>>>> lost on a corporate webserver isn't necessarily an improvement over
>>>> that, but there needs to be sufficient detail to understand what we're
>>>> touching such that we can maintain, adapt, and improve the code over
>>>> time. Only item #3 above appears POWER specific, so I'd hope that #1
>>>> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
>>>> vfio-pci, but I'm not sure that's necessary), and I don't know where
>>>> #3 goes. Thanks,
>>>
>>> Ok, understood. Thanks!
>>
>> After some local discussions, it was pointed out that force disabling
>> nvlinks won't bring us much: for an nvlink to work, both sides need to
>> enable it, so a malicious guest cannot penetrate a good one (or the
>> host) unless the good guest enables the link, which won't happen with a
>> well behaving guest. And if two guests became malicious, they can still
>> only harm each other, and they can do that via other ways such as the
>> network anyway. This is different from PCIe: once a PCIe link is
>> (unavoidably) enabled, a well behaving device cannot firewall itself
>> from its peers as it is up to the upstream bridge(s) to decide the
>> routing; with nvlink2, a GPU still has means to protect itself, just
>> like a guest can run "firewalld" for the network.
>>
>> Although it would be a nice feature to have an extra barrier between
>> GPUs, is the inability to block the links in the hypervisor still a
>> blocker for V100 passthrough?
>
> How is the NVLink configured by the guest, is it 'on'/'off' or are
> specific routes configured?

The GPU-GPU links do not need to be blocked; they do need to be enabled
(== trained) by a driver in the guest. There are no routes between GPUs
in the NVLink fabric, these are direct links: there is just a switch on
each side and both switches need to be on for a link to work. As for the
GPU-CPU links, the GPU end is the same switch; the CPU NVLink state is
controlled via the emulated PCI bridges which I pass through together
with the GPU.

> If the former, then isn't a non-malicious
> guest still susceptible to a malicious guest?

A non-malicious guest would have to turn its own switch on for the link
to a GPU which belongs to a malicious guest, so it is not exposed unless
it enables that link itself.

> If the latter, how is
> routing configured by the guest given that the guest view of the
> topology doesn't match physical hardware? Are these routes
> deconfigured by device reset? Are they part of the save/restore
> state? Thanks,


--
Alexey