On Fri, Sep 15, 2023 at 08:24:30AM -0600, Alex Williamson wrote: > On Thu, 14 Sep 2023 19:54:15 -0700 > <ankita@xxxxxxxxxx> wrote: > > > From: Ankit Agrawal <ankita@xxxxxxxxxx> > > > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device > > for the on-chip GPU that is the logical OS representation of the > > internal proprietary cache coherent interconnect. > > > > This representation has a number of limitations compared to a real PCI > > device, in particular, it does not model the coherent GPU memory > > aperture as a PCI config space BAR, and PCI doesn't know anything > > about cacheable memory types. > > > > Provide a VFIO PCI variant driver that adapts the unique PCI > > representation into a more standard PCI representation facing > > userspace. The GPU memory aperture is obtained from ACPI using > > device_property_read_u64(), according to the FW specification, > > and exported to userspace as a separate VFIO_REGION. Since the device > > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped > > to the next available PCI BAR (BAR2). Qemu will then naturally generate a > > PCI device in the VM with two 64-bit BARs (where the cacheable aperture > > reported in BAR2). > > > > Since this memory region is actually cache coherent with the CPU, the > > VFIO variant driver will mmap it into VMA using a cacheable mapping. The > > mapping is done using remap_pfn_range(). > > > > PCI BAR are aligned to the power-of-2, but the actual memory on the > > device may not. A read or write access to the physical address from the > > last device PFN up to the next power-of-2 aligned physical address > > results in reading ~0 and dropped writes. > > > > Lastly the presence of CPU cache coherent device memory is exposed > > through sysfs for use by user space. > > This looks like a giant red flag that this approach of masquerading the > coherent memory as a PCI BAR is the wrong way to go. If the VMM needs > to know about this coherent memory, it needs to get that information > in-band. The VMM part doesn't need this flag, nor does the VM. The orchestration needs to know when to setup the pxm stuff. I think we should drop the sysfs for now until the qemu thread about the pxm stuff settles into an idea. When the qemu API is clear we can have a discussion on what component should detect this driver and setup the pxm things, then answer the how should the detection work from the kernel side. > be reaching out to arbitrary sysfs attributes. Minimally this > information should be provided via a capability on the region info > chain, That definitely isn't suitable, eg libvirt won't have access to inband information if it turns out libvirt is supposed to setup the pxm qemu arguments? > A "coherent_mem" attribute on the device provides a very weak > association to the memory region it's trying to describe. That's because it's use has nothing to do with the memory region :) Jason