Re: [PATCH v4 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 5 Jul 2023 16:16:04 -0600

On Wed, 5 Jul 2023 18:37:42 +0000
Ankit Agrawal <ankita@xxxxxxxxxx> wrote:

> > I had also asked in the previous review whether "nvgpu" is already overused.  I
> > see a python tool named nvgpu, an OpenXLA tool, various nvgpu things related
> > to Tegra, an nvgpu dialect for MLIR, etc.  There are over 5,000 hits on google for
> > "nvgpu", only a few of which reference development of this module.  Is there a
> > more unique name we can use?  Thanks,  
> 
> Sorry, had missed this comment. Are you suggesting changing the module name
> or just reduce the number of times we use the nvgpu keyword in all the functions
> of the module? I don't see any in-tree or vfio-pci module with a similar *nvgpu*
> name, and the clash appears to be with items outside of the kernel tree. Given
> that, should we still change the module name as nvgpu-vfio-pci sounds a relevant
> name here? Thanks.

I'm referring to the module name, which in turn would be reflected in
various function names.  The fact that there's no in-tree *nvgpu*
driver seems irrelevant when a web search for the term shows a variety
of tools and drivers.  I believe there's even an out-of-tree NVIDIA
sponsored nvgpu driver for Android, correct?  How does this relate to
that?  I don't think it does, so why generate confusion?

I don't know your future plans for this driver, but it's currently
limited to exposing essentially a single feature on a very, very small
product subset, while "nvgpu" seems to project something much more
generic.

If we're going to see more devices exposing coherent memory with
CXL, does that mean this driver might be short lived and perhaps won't
see further expansion in functionality?  If so maybe it should be named
more specifically for the product it supports.  I see some NVIDIA pages
referring to the GH200 superchip, maybe "GH", e.g. "nvgh", "nvgh-gpu"?

Reading through the datasheet, I'm also reminded of issues we had with
the POWER implementation relative to isolation, since this coherent
memory is enabled via NVLink-C2C, which is opaque to Linux.  The
datasheet claims "[f]ourth-generation NVLink allows accessing peer
memory with direct loads, stores, and atomic operations...".  Are those
direct accesses reflected in the PCI topology, i.e. the PCIe ACS exposed
isolation, or is the peer here limited to the CPU?  Thanks,

Alex