On 8/6/18 1:44 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:08:54 +1000
> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>
>> On 8/6/18 8:15 am, Alex Williamson wrote:
>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>> Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>>>
>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>> devices and from that perspective they're isolated.  That's the view
>>>>> of the device used to generate the grouping.  However, not visible
>>>>> to us, these devices are interconnected via NVLink.  What isolation
>>>>> properties does NVLink provide given that its entire purpose for
>>>>> existing seems to be to provide a high performance link for p2p
>>>>> between devices?
>>>>
>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>> and the CPU which is running significantly faster than PCIe.
>>>>
>>>> But yes, there are cross-links and those should probably be accounted
>>>> for in the grouping.
>>>
>>> Then after we fix the grouping, can we just let the host driver manage
>>> this coherent memory range and expose vGPUs to guests?  The use case
>>> of assigning all 6 GPUs to one VM seems pretty limited.  (Might need
>>> to convince NVIDIA to support more than a single vGPU per VM though)
>>
>> These are physical GPUs, not virtual sriov-alike things they are
>> implementing as well elsewhere.
>
> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> either.  That's why we have mdev devices now to implement software
> defined devices.  I don't have first hand experience with V-series, but
> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.

So, assuming the V100 can do vGPU, you are suggesting we ditch this
patchset and use mediated vGPUs instead, correct?

>> My current understanding is that every P9 chip in that box has some
>> NVLink2 logic on it, so each P9 is directly connected to 3 GPUs via
>> PCIe and 2xNVLink2, and the GPUs in that big group are interconnected
>> by NVLink2 links as well.
>>
>> From the small bits of information I have, it seems that a GPU can
>> work perfectly well alone, and if the NVIDIA driver does not see these
>> interconnects (because we do not pass the rest of the big 3xGPU group
>> to this guest), it continues with a single GPU. There is an
>> "nvidia-smi -r" big reset hammer which simply refuses to work until
>> all 3 GPUs are passed, so there is some distinction between passing 1
>> or 3 GPUs, and I am trying (as we speak) to get a confirmation from
>> NVIDIA that it is ok to pass just a single GPU.
>>
>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>> interconnected group).
>
> I'm not gaining much confidence that we can rely on isolation between
> NVLink connected GPUs, it sounds like you're simply expecting that
> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> is going to play nice and nobody will figure out how to do bad things
> because... obfuscation?  Thanks,

Well, we already trust that the proprietary firmware of an SR-IOV-capable
adapter such as a Mellanox ConnectX is not doing bad things, so how is
this different in principle?

ps. their obfuscation is funny indeed :)


--
Alexey
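
For reference, a minimal sketch (not from the thread itself; it assumes only
the standard /sys/kernel/iommu_groups sysfs layout, and the PCI addresses it
prints are machine-specific) of how the resulting grouping could be checked
on the host, i.e. whether we end up with 6 single-GPU groups or 2 groups of
3 interconnected GPUs:

#!/usr/bin/env python3
# Walk /sys/kernel/iommu_groups and print which PCI devices share each
# group, so the "6 groups vs 2 groups" outcome is directly visible.
import os

IOMMU_ROOT = "/sys/kernel/iommu_groups"

for group in sorted(os.listdir(IOMMU_ROOT), key=int):
    devices = sorted(os.listdir(os.path.join(IOMMU_ROOT, group, "devices")))
    print("group %s: %s" % (group, " ".join(devices)))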