Thanks Kevin for the review. Comments inline.

>>
>> Note that the usemem memory is added by the VM Nvidia device driver [5]
>> to the VM kernel as memblocks. Hence make the usable memory size
>> memblock aligned.
>
> Is memblock size defined in spec or purely a guest implementation choice?

The MEMBLOCK value is a hardwired, constant ABI value between the GPU FW
and the VFIO driver.

>>
>> If the bare metal properties are not present, the driver registers the
>> vfio-pci-core function pointers.
>
> so if qemu doesn't generate such property the variant driver running
> inside guest will always go to use core functions and guest vfio userspace
> will observe both resmem and usemem bars. But then there is nothing
> in field to prohibit mapping resmem bar as cacheable.
>
> should this driver check the presence of either ACPI property or
> resmem/usemem bars to enable variant function pointers?

Maybe I am missing something here, but if the ACPI property is absent, the
real physical BARs present on the device will be exposed to the VM by the
vfio-pci-core functions. So I think that if the variant driver is run within
the VM, it should not see the fake usemem and resmem BARs.

>> +config NVGRACE_GPU_VFIO_PCI
>> +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
>> +	depends on ARM64 || (COMPILE_TEST && 64BIT)
>> +	select VFIO_PCI_CORE
>> +	help
>> +	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
>> +	  required to assign the GPU device using KVM/qemu/etc.
>
> "assign the GPU device to userspace"

Ack, will make the change.

>> +
>> +/* Memory size expected as non cached and reserved by the VM driver */
>> +#define RESMEM_SIZE 0x40000000
>> +#define MEMBLK_SIZE 0x20000000
>
> also add a comment for MEMBLK_SIZE

Yes.

>> +
>> +struct nvgrace_gpu_vfio_pci_core_device {
>
> will nvgrace refer to a non-gpu device? if not probably all prefixes with
> 'nvgrace_gpu' can be simplified to 'nvgrace'.

nvgrace_gpu indicates the GPU associated with the Grace CPU here. That is
one of the reasons behind keeping the nomenclature as nvgrace_gpu. Calling
it just nvgrace may not be apt, as Grace is the CPU.

> btw following other variant drivers 'vfio' can be removed too.

Ack.

>> +
>> +/*
>> + * Both the usable (usemem) and the reserved (resmem) device memory region
>> + * are 64b fake BARs in the VM. These fake BARs must respond
>
> s/VM/device/

I can make it clearer with something like "64b fake device BARs in the VM".

>> +	int ret;
>> +
>> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> +	if (ret < 0)
>> +		return ret;
>
> here if core_read succeeds *ppos has been updated...

Thanks for pointing that out, will fix the code handling *ppos.
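The idea is to derive the region index and offset from *ppos once, before
vfio_pci_core_read() has a chance to advance it, and have the device memory
path work on the saved values. A rough sketch of that pattern, reusing the
names from the current patch (illustrative only, not the final fix):

/*
 * Sketch only: snapshot the index/offset derived from *ppos before the
 * core read, since a successful vfio_pci_core_read() advances *ppos.
 * The device memory path afterwards uses the saved values.
 */
unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
int ret;

ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
if (ret < 0)
	return ret;

/* ... continue with 'index'/'offset' instead of re-deriving them from *ppos ... */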
>> + * Read the data from the device memory (mapped either through ioremap
>> + * or memremap) into the user buffer.
>> + */
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> +			 char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	int ret;
>> +
>> +	/*
>> +	 * Handle read on the BAR regions. Map to the target device memory
>> +	 * physical address and copy to the request read buffer.
>> +	 */
>
> duplicate with the earlier comment for the function.

Ack.

>> +/*
>> + * Read count bytes from the device memory at an offset. The actual device
>> + * memory size (available) may not be a power-of-2. So the driver fakes
>> + * the size to a power-of-2 (reported) when exposing to a user space driver.
>> + *
>> + * Reads extending beyond the reported size are truncated; reads starting
>> + * beyond the reported size generate -EINVAL; reads extending beyond the
>> + * actual device size is filled with ~0.
>
> slightly clearer to order the description: read starting beyond reported
> size, then read extending beyond device size, and read extending beyond
> reported size.

Yes, will make the change.

>> + * exposing the rest (termed as usable memory and represented using usemem)
>> + * as cacheable 64b BAR (region 4 and 5).
>> + *
>> + *               devmem (memlength)
>> + * |-------------------------------------------------|
>> + * |                                                 |
>> + * usemem.phys/memphys                         resmem.phys
>
> there is no usemem.phys and resmem.phys

Silly miss. Will fix that.

>> + */
>> +	nvdev->usemem.memphys = memphys;
>> +
>> +	/*
>> +	 * The device memory exposed to the VM is added to the kernel by the
>> +	 * VM driver module in chunks of memory block size. Only the usable
>> +	 * memory (usemem) is added to the kernel for usage by the VM
>> +	 * workloads. Make the usable memory size memblock aligned.
>> +	 */
>
> If memblock size is defined by hw spec then say so.
> otherwise this sounds a broken contract if it's a guest-decided value.

Answered above. Will update the comment to mention that.

>> +	if (check_sub_overflow(memlength, RESMEM_SIZE,
>> +			       &nvdev->usemem.memlength)) {
>> +		ret = -EOVERFLOW;
>> +		goto done;
>> +	}
>
> does resmem require 1G-aligned?

No, the requirement is only that it be 1G or larger; there are no alignment
restrictions on it.

> if usemem.memlength becomes 0 then should return error too.

I suppose you mean if it becomes 0 after the subtraction above. We are
otherwise already checking for a 0 size in the host ACPI table.
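To spell out what I have in mind for this part, below is a rough sketch that
folds in the zero-length check you suggested together with the memblock
alignment mentioned earlier. It is illustrative only: the error code for the
new check and the rounding direction may differ in the actual patch;
'memlength', 'nvdev', 'ret' and the 'done' label are from the current code.

if (check_sub_overflow(memlength, RESMEM_SIZE,
		       &nvdev->usemem.memlength)) {
	ret = -EOVERFLOW;
	goto done;
}

/* Reject a zero-sized usable region, as suggested (error code illustrative). */
if (!nvdev->usemem.memlength) {
	ret = -EINVAL;
	goto done;
}

/*
 * Align the usable memory size to the memory block size, which is a
 * hardwired ABI value between the GPU FW and the VFIO driver
 * (round_down() here is for illustration).
 */
nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
				     MEMBLK_SIZE);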