Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Tue, Jan 02, 2024 at 09:10:01AM -0700, Alex Williamson wrote:

> Yes, it's possible to add support that these ranges honor the memory
> enable bit, but it's not trivial and unfortunately even vfio-pci isn't
> a great example of this.

We talked about this already; the HW architects here confirm there is
no issue with reset or the memory enable bit. Reads return all 1's and
writes are NOPs. It doesn't need to implement VMA zap.

> around device reset or relative to the PCI command register.  The
> variant driver becomes a trivial implementation that masks BARs 2 & 4
> and exposes the ACPI range as a device specific region with only mmap
> support.  QEMU can then map the device specific region into VM memory
> and create an equivalent ACPI table for the guest.

Well, no, probably not. There is an NVIDIA specification for how the
vPCI function should be set up within the VM, and it uses the BAR
method, not ACPI.

There are a lot of VMMs and OSs this needs to support so it must all
be consistent. For better or worse the decision was taken for the vPCI
spec to use BAR not ACPI, in part due to feedback from the broader VMM
ecosystem, and informed by future product plans.

So, if vfio does special regions then qemu and everyone else have to
change to meet the spec.

> I know Jason had described this device as effectively pre-CXL to
> justify the coherent memory mapping, but it seems like there's still a
> gap here that we can't simply hand wave that this PCI BAR follows a
> different set of semantics.  

I thought all the meaningful differences are fixed now?

The main remaining issue seems to be around the config space
emulation?

> We don't typically endorse complexity in the kernel only for the
> purpose of avoiding work in userspace.  The absolute minimum should
> be some justification of the design decision and analysis relative
> to standard PCI behavior.  Thanks,

If we strictly took that view in VFIO a lot of stuff wouldn't be here
:)

I've made this argument before and gave up - the ecosystem wants to
support multiple VMMs and the sanest path to do that is via VFIO
kernel drivers that plug into existing vfio-pci support in the VMM
ecosystem.

From a HW supplier perspective it is quite vexing to have to support
all these different (and often proprietary!) VMM implementations. It
is not just top of tree qemu.

If we instead did complex userspace drivers and userspace emulation of
config space and so on then things like the IDXD SIOV support would
look *very* different and not use VFIO at all. That would probably be
somewhat better for security, but I was convinced it is a long and
technically complex road.

At least with this approach the only VMM issue is the NUMA nodes, and
as we have discussed that hackery makes up for current Linux kernel SW
limitations; it does not actually reflect anything about the HW. If
some other OS, or a future Linux, doesn't require the ACPI NUMA nodes
to create an OS-visible NUMA object, then the VMM will not require any
changes.

Jason



