On Tue, 6 Jun 2023 14:16:42 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:
>
> > It actually seems more complicated this way.  We're masquerading
> > this region as a BAR, but then QEMU needs to know based on device
> > IDs that it's really not a BAR, it has special size properties,
> > mapping attributes, error handling, etc.
>
> This seems like something has gone wrong then, ie the SIGBUS error
> handling stuff should be totally generic on the qemu side.  Mapping
> attributes are set by the kernel, qemu shouldn't know, doesn't need
> to know.

You asked me to look at the v1 posting to see why there's so much more
going on here than a quirk.  That's what I read from the first public
posting: a coherent memory region masqueraded as a BAR, which requires
different memory mapping attributes and participates in ECC.  I agree
that the actual mapping is done by the kernel, but it doesn't really
make a difference whether that's a vfio-pci variant driver providing a
different mmap callback for a BAR region or a device specific region
handler.

> The size issue is going to be a problem in the future anyhow, I
> expect some new standards coming to support non-power-of-two sizes
> and they will want to map to PCI devices in VMs still.

Ok, but a PCI BAR has specific constraints and a non-power-of-2 BAR is
not software compatible with those constraints (sizing a BAR by
writing all ones and reading back the mask can only describe a
power-of-two, naturally aligned region).  That's obviously not to say
that a new capability couldn't expose arbitrarily sized resources on a
PCI-like device, but I don't see how a non-power-of-2 BAR at this
stage helps or fits within any spec, which is exactly what's being
proposed through this BAR masquerade.

> It seems OK to me if qemu can do this generically for any "BAR"
> region, at least creating an entire "nvidia only" code path just for
> non-power-of-2 BAR sizing seems like a bad ABI choice.

Have you looked at Ankit's QEMU series?  It's entirely NVIDIA-only
code paths.  Also, nothing here precludes shared code in QEMU exposing
some known arbitrarily sized regions as a BAR, or whatever spec
defined thing allows that in the future.  It would only be a slight
modification in the QEMU code to key on the presence of a device
specific region rather than PCI vendor and device IDs, and then
register that region as a PCI BAR and proceed with all this NVIDIA
specific PXM/SRAT setup.  IMO it makes a lot more sense to create
memory-only NUMA nodes based on a device specific region than on a
PCI BAR.

> > I'm not privy to a v1, the earliest I see is this (v3):
> >
> > https://lore.kernel.org/all/20230405180134.16932-1-ankita@xxxxxxxxxx/
> >
> > That outlines that we have a proprietary interconnect exposing
> > cache coherent memory which requires use of special mapping
> > attributes vs a standard PCI BAR and participates in ECC.  All of
> > which seems like it would be easier to set up in QEMU if the
> > vfio-pci representation of the device didn't masquerade this region
> > as a standard BAR.  In fact it also reminds me of NVLink2 coherent
> > RAM on POWER machines that was similarly handled as device specific
> > regions.
>
> It wasn't so good on POWER and if some of that stuff had been done
> more generally we would have been further ahead here..

Specifics?  Nothing here explained why masquerading the coherent
memory as a BAR in the vfio-pci ABI is anything more than a hack that
QEMU could assemble on its own from a device specific region.  Thanks,

Alex
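
To make the "key on the presence of a device specific region" point
above concrete, below is a rough, untested sketch of how userspace
could detect such a region through the existing VFIO region info
capability chain instead of matching PCI vendor and device IDs.  The
coherent memory subtype value is a placeholder, not something the
current uAPI defines; everything else is the standard vfio uAPI from
<linux/vfio.h>.

/*
 * Rough sketch only.  HYPOTHETICAL_SUBTYPE_COHERENT_MEM is a placeholder;
 * the rest is the standard vfio uAPI.
 */
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define NVIDIA_VENDOR_ID			0x10de
#define HYPOTHETICAL_SUBTYPE_COHERENT_MEM	1	/* not in vfio.h */

/* Walk a region info capability chain looking for a capability ID. */
static struct vfio_info_cap_header *
region_info_cap(struct vfio_region_info *info, uint16_t id)
{
	struct vfio_info_cap_header *hdr;
	uint32_t off;

	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
		return NULL;

	for (off = info->cap_offset; off; off = hdr->next) {
		hdr = (struct vfio_info_cap_header *)((char *)info + off);
		if (hdr->id == id)
			return hdr;
	}

	return NULL;
}

/*
 * Return the index of the first device specific region whose type/subtype
 * capability matches, or -errno.  Region size and offset then come from
 * the region info itself, no PCI ID matching required.
 */
static int find_coherent_mem_region(int device_fd)
{
	struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
	uint32_t i;

	if (ioctl(device_fd, VFIO_DEVICE_GET_INFO, &dev_info))
		return -errno;

	/* Device specific regions follow the fixed vfio-pci region set. */
	for (i = VFIO_PCI_NUM_REGIONS; i < dev_info.num_regions; i++) {
		struct vfio_region_info *info, *tmp;
		struct vfio_region_info_cap_type *cap;

		info = calloc(1, sizeof(*info));
		if (!info)
			return -ENOMEM;
		info->argsz = sizeof(*info);
		info->index = i;

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info))
			goto next;

		if (info->argsz > sizeof(*info)) {
			/* Re-query with room for the capability chain. */
			uint32_t argsz = info->argsz;

			tmp = realloc(info, argsz);
			if (!tmp)
				goto next;
			info = tmp;
			memset(info, 0, argsz);
			info->argsz = argsz;
			info->index = i;
			if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO,
				  info))
				goto next;
		}

		cap = (struct vfio_region_info_cap_type *)
			region_info_cap(info, VFIO_REGION_INFO_CAP_TYPE);
		if (cap &&
		    cap->type == (uint32_t)(VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
					    NVIDIA_VENDOR_ID) &&
		    cap->subtype == HYPOTHETICAL_SUBTYPE_COHERENT_MEM) {
			free(info);
			return i;
		}
next:
		free(info);
	}

	return -ENOENT;
}

The returned index would then feed the existing region mmap setup and
the NVIDIA specific PXM/SRAT wiring in QEMU, with the region's size
and offset coming from the region info rather than from a fake BAR
keyed on device IDs.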