> From: ankita@xxxxxxxxxx <ankita@xxxxxxxxxx>
> Sent: Tuesday, February 6, 2024 7:01 AM
>
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size
> memblock
> aligned.

Is the memblock size defined in a spec or is it purely a guest implementation
choice?

>
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.

So if QEMU doesn't generate such a property, the variant driver running inside
the guest will always fall back to the core function pointers, and guest vfio
userspace will observe both the resmem and usemem BARs. But then there is
nothing to prohibit mapping the resmem BAR as cacheable. Should this driver
check for the presence of either the ACPI property or the resmem/usemem BARs
before registering the variant function pointers?

> +config NVGRACE_GPU_VFIO_PCI
> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper
> Superchip"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
> + select VFIO_PCI_CORE
> + help
> + VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> + required to assign the GPU device using KVM/qemu/etc.

"assign the GPU device to userspace"

> +
> +/* Memory size expected as non cached and reserved by the VM driver */
> +#define RESMEM_SIZE 0x40000000
> +#define MEMBLK_SIZE 0x20000000

Also add a comment for MEMBLK_SIZE.

> +
> +struct nvgrace_gpu_vfio_pci_core_device {

Will nvgrace ever refer to a non-GPU device? If not, all the 'nvgrace_gpu'
prefixes can probably be simplified to 'nvgrace'. Btw, following the other
variant drivers, 'vfio' can be dropped from the name too.

> +
> +/*
> + * Both the usable (usemem) and the reserved (resmem) device memory
> region
> + * are exposed as a 64b fake BARs in the VM. These fake BARs must respond

s/VM/device/

> + * to the accesses on their respective PCI config space offsets.
> + *
> + * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
> + * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
> + */
> +static ssize_t
> +nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev =
> + container_of(core_vdev, struct
> nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> + struct mem_region *memregion = NULL;
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;

Here, if core_read succeeds, *ppos has been updated...

> +
> + if (vfio_pci_core_range_intersect_range(pos, count,
> PCI_BASE_ADDRESS_2,
> + sizeof(val64),
> + &copy_offset, &copy_count,
> + &register_offset))
> + memregion =
> nvgrace_gpu_memregion(RESMEM_REGION_INDEX, nvdev);
> + else if (vfio_pci_core_range_intersect_range(pos, count,
> + PCI_BASE_ADDRESS_4,
> + sizeof(val64),
> + &copy_offset,
> &copy_count,
> + &register_offset))
> + memregion =
> nvgrace_gpu_memregion(USEMEM_REGION_INDEX, nvdev);
> +
> + if (memregion) {
> + val64 = nvgrace_gpu_get_read_value(memregion->bar_size,
> +
> PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +
> PCI_BASE_ADDRESS_MEM_PREFETCH,
> + memregion->bar_val);
> + if (copy_to_user(buf + copy_offset,
> + (void *)&val64 + register_offset, copy_count))
> + return -EFAULT;

...but here it's not adjusted back on the error path.
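Something along these lines might be needed, e.g. (untested sketch, assuming
vfio_pci_core_read() advanced *ppos by ret on success):

		if (copy_to_user(buf + copy_offset,
				 (void *)&val64 + register_offset, copy_count)) {
			/* roll back the *ppos advance done by the core read */
			*ppos -= ret;
			return -EFAULT;
		}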
> +
> +/*
> + * Read the data from the device memory (mapped either through ioremap
> + * or memremap) into the user buffer.
> + */
> +static int
> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device
> *nvdev,
> + char __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + /*
> + * Handle read on the BAR regions. Map to the target device memory
> + * physical address and copy to the request read buffer.
> + */

This duplicates the earlier comment above the function.

> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Reads extending beyond the reported size are truncated; reads starting
> + * beyond the reported size generate -EINVAL; reads extending beyond the
> + * actual device size is filled with ~0.

Slightly clearer to order the description as: reads starting beyond the
reported size, then reads extending beyond the actual device size, and then
reads extending beyond the reported size.

> +static int
> +nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + u64 memphys, u64 memlength)
> +{
> + int ret = 0;
> +
> + /*
> + * The VM GPU device driver needs a non-cacheable region to
> support
> + * the MIG feature. Since the device memory is mapped as NORMAL
> cached,
> + * carve out a region from the end with a different NORMAL_NC
> + * property (called as reserved memory and represented as resmem).
> This
> + * region then is exposed as a 64b BAR (region 2 and 3) to the VM,
> while
> + * exposing the rest (termed as usable memory and represented
> using usemem)
> + * as cacheable 64b BAR (region 4 and 5).
> + *
> + *               devmem (memlength)
> + * |-------------------------------------------------|
> + * |                                                 |
> + * usemem.phys/memphys                         resmem.phys

There is no usemem.phys or resmem.phys.

> + */
> + nvdev->usemem.memphys = memphys;
> +
> + /*
> + * The device memory exposed to the VM is added to the kernel by
> the
> + * VM driver module in chunks of memory block size. Only the usable
> + * memory (usemem) is added to the kernel for usage by the VM
> + * workloads. Make the usable memory size memblock aligned.
> + */

If the memblock size is defined by a hw spec then say so. Otherwise this
sounds like a broken contract if it's a guest-decided value.

> + if (check_sub_overflow(memlength, RESMEM_SIZE,
> + &nvdev->usemem.memlength)) {
> + ret = -EOVERFLOW;
> + goto done;
> + }

Does resmem require 1G alignment? If usemem.memlength becomes 0 then an error
should be returned too.
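Something like the below perhaps (untested sketch, the error code chosen for
the zero-size case is just a placeholder):

	if (check_sub_overflow(memlength, RESMEM_SIZE,
			       &nvdev->usemem.memlength) ||
	    !nvdev->usemem.memlength) {
		/* no usable memory is left after carving out resmem */
		ret = -EOVERFLOW;
		goto done;
	}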