On Mon, 18 Sep 2023 10:02:56 -0300 Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > On Fri, Sep 15, 2023 at 08:24:30AM -0600, Alex Williamson wrote: > > On Thu, 14 Sep 2023 19:54:15 -0700 > > <ankita@xxxxxxxxxx> wrote: > > > > > From: Ankit Agrawal <ankita@xxxxxxxxxx> > > > > > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device > > > for the on-chip GPU that is the logical OS representation of the > > > internal proprietary cache coherent interconnect. > > > > > > This representation has a number of limitations compared to a real PCI > > > device, in particular, it does not model the coherent GPU memory > > > aperture as a PCI config space BAR, and PCI doesn't know anything > > > about cacheable memory types. > > > > > > Provide a VFIO PCI variant driver that adapts the unique PCI > > > representation into a more standard PCI representation facing > > > userspace. The GPU memory aperture is obtained from ACPI using > > > device_property_read_u64(), according to the FW specification, > > > and exported to userspace as a separate VFIO_REGION. Since the device > > > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped > > > to the next available PCI BAR (BAR2). Qemu will then naturally generate a > > > PCI device in the VM with two 64-bit BARs (where the cacheable aperture > > > reported in BAR2). > > > > > > Since this memory region is actually cache coherent with the CPU, the > > > VFIO variant driver will mmap it into VMA using a cacheable mapping. The > > > mapping is done using remap_pfn_range(). > > > > > > PCI BAR are aligned to the power-of-2, but the actual memory on the > > > device may not. A read or write access to the physical address from the > > > last device PFN up to the next power-of-2 aligned physical address > > > results in reading ~0 and dropped writes. > > > > > > Lastly the presence of CPU cache coherent device memory is exposed > > > through sysfs for use by user space. > > > > This looks like a giant red flag that this approach of masquerading the > > coherent memory as a PCI BAR is the wrong way to go. If the VMM needs > > to know about this coherent memory, it needs to get that information > > in-band. > > The VMM part doesn't need this flag, nor does the VM. The > orchestration needs to know when to setup the pxm stuff. Subject: [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c ... +static bool vfio_pci_read_cohmem_support_sysfs(VFIODevice *vdev) +{ + gchar *contents = NULL; + gsize length; + char *path; + bool ret = false; + uint32_t supported; + + path = g_strdup_printf("%s/coherent_mem", vdev->sysfsdev); + if (g_file_get_contents(path, &contents, &length, NULL) && length > 0) { + if ((sscanf(contents, "%u", &supported) == 1) && supported) { + ret = true; + } + } > I think we should drop the sysfs for now until the qemu thread about > the pxm stuff settles into an idea. > > When the qemu API is clear we can have a discussion on what component > should detect this driver and setup the pxm things, then answer the > how should the detection work from the kernel side. > > > be reaching out to arbitrary sysfs attributes. Minimally this > > information should be provided via a capability on the region info > > chain, > > That definitely isn't suitable, eg libvirt won't have access to inband > information if it turns out libvirt is supposed to setup the pxm qemu > arguments? Why would libvirt look for a "coherent_mem" attribute in sysfs when it can just look at the driver used by the device. Part of the QEMU series is also trying to invoke the VM configuration based only on this device being attached to avoid libvirt orchestration changes: Subject: [PATCH v1 2/4] vfio: assign default values to node params It may be desirable for some deployments to have QEMU automatically pick a range and create the NUMA nodes. So the admin need not care about passing any additional params. Another advantage is that the feature is not dependent on newer libvirt that support the new parameters pxm-ns and pxm-nc. > > A "coherent_mem" attribute on the device provides a very weak > > association to the memory region it's trying to describe. > > That's because it's use has nothing to do with the memory region :) So we're creating a very generic sysfs attribute, which is meant to be used by orchestration to invoke device specific configuration, but is currently only proposed for use by the VMM. The orchestration problem doesn't really exist, libvirt could know simply by the driver name that the device requires this configuration. And the VMM usage is self inflicted because we insist on masquerading the coherent memory as a nondescript PCI BAR rather than providing a device specific region to enlighten the VMM to this unique feature. Thanks, Alex