Re: [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

Alex Williamson <alex.williamson@xxxxxxxxxx> · Mon, 18 Sep 2023 08:27:48 -0600

On Mon, 18 Sep 2023 10:02:56 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Fri, Sep 15, 2023 at 08:24:30AM -0600, Alex Williamson wrote:
> > On Thu, 14 Sep 2023 19:54:15 -0700
> > <ankita@xxxxxxxxxx> wrote:
> >   
> > > From: Ankit Agrawal <ankita@xxxxxxxxxx>
> > > 
> > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > > for the on-chip GPU that is the logical OS representation of the
> > > internal proprietary cache coherent interconnect.
> > > 
> > > This representation has a number of limitations compared to a real PCI
> > > device, in particular, it does not model the coherent GPU memory
> > > aperture as a PCI config space BAR, and PCI doesn't know anything
> > > about cacheable memory types.
> > > 
> > > Provide a VFIO PCI variant driver that adapts the unique PCI
> > > representation into a more standard PCI representation facing
> > > userspace. The GPU memory aperture is obtained from ACPI using
> > > device_property_read_u64(), according to the FW specification,
> > > and exported to userspace as a separate VFIO_REGION. Since the device
> > > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> > > to the next available PCI BAR (BAR2). Qemu will then naturally generate a
> > > PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> > > reported in BAR2).
> > > 
> > > Since this memory region is actually cache coherent with the CPU, the
> > > VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> > > mapping is done using remap_pfn_range().
> > > 
> > > PCI BAR are aligned to the power-of-2, but the actual memory on the
> > > device may not. A read or write access to the physical address from the
> > > last device PFN up to the next power-of-2 aligned physical address
> > > results in reading ~0 and dropped writes.
> > > 
> > > Lastly the presence of CPU cache coherent device memory is exposed
> > > through sysfs for use by user space.  
> > 
> > This looks like a giant red flag that this approach of masquerading the
> > coherent memory as a PCI BAR is the wrong way to go.  If the VMM needs
> > to know about this coherent memory, it needs to get that information
> > in-band.   
> 
> The VMM part doesn't need this flag, nor does the VM. The
> orchestration needs to know when to setup the pxm stuff.

Subject: [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
...
+static bool vfio_pci_read_cohmem_support_sysfs(VFIODevice *vdev)
+{
+    gchar *contents = NULL;
+    gsize length;
+    char *path;
+    bool ret = false;
+    uint32_t supported;
+
+    path = g_strdup_printf("%s/coherent_mem", vdev->sysfsdev);
+    if (g_file_get_contents(path, &contents, &length, NULL) && length > 0) {
+        if ((sscanf(contents, "%u", &supported) == 1) && supported) {
+            ret = true;
+        }
+    }

> I think we should drop the sysfs for now until the qemu thread about
> the pxm stuff settles into an idea.
> 
> When the qemu API is clear we can have a discussion on what component
> should detect this driver and setup the pxm things, then answer the
> how should the detection work from the kernel side.
> 
> > be reaching out to arbitrary sysfs attributes.  Minimally this
> > information should be provided via a capability on the region info
> > chain,   
> 
> That definitely isn't suitable, eg libvirt won't have access to inband
> information if it turns out libvirt is supposed to setup the pxm qemu
> arguments?

Why would libvirt look for a "coherent_mem" attribute in sysfs when it
can just look at the driver used by the device.  Part of the QEMU
series is also trying to invoke the VM configuration based only on this
device being attached to avoid libvirt orchestration changes:

Subject: [PATCH v1 2/4] vfio: assign default values to node params

    It may be desirable for some deployments to have QEMU automatically
    pick a range and create the NUMA nodes. So the admin need not care
    about passing any additional params. Another advantage is that the
    feature is not dependent on newer libvirt that support the new
    parameters pxm-ns and pxm-nc.

> > A "coherent_mem" attribute on the device provides a very weak
> > association to the memory region it's trying to describe.  
> 
> That's because it's use has nothing to do with the memory region :)

So we're creating a very generic sysfs attribute, which is meant to be
used by orchestration to invoke device specific configuration, but is
currently only proposed for use by the VMM.  The orchestration problem
doesn't really exist, libvirt could know simply by the driver name that
the device requires this configuration.  And the VMM usage is self
inflicted because we insist on masquerading the coherent memory as a
nondescript PCI BAR rather than providing a device specific region to
enlighten the VMM to this unique feature.  Thanks,

Alex