>> Signed-off-by: Ankit Agrawal <ankita@xxxxxxxxxx>
>> Signed-off-by: Aniket Agashe <aniketa@xxxxxxxxxx>
>> Tested-by: Ankit Agrawal <ankita@xxxxxxxxxx>
>
> Dunno about others, but I sure hope and assume the author tests ;)
> Sometimes I'm proven wrong.

Yeah, does not hurt to keep it then I suppose. :)

>> +
>> +#include "nvgrace_gpu_vfio_pci.h"
>
> Could probably just shorten this to nvgrace_gpu.h, but with just a
> single source file, we don't need a separate header.  Put it inline here.

Ack.

>> +
>> +/* Choose the structure corresponding to the fake BAR with a given index. */
>> +struct mem_region *
>
> static

Yes.

>> + *
>> + * The available GPU memory size may not be power-of-2 aligned. Map up
>> + * to the size of the device memory. If the memory access is beyond the
>> + * actual GPU memory size, it will be handled by the vfio_device_ops
>> + * read/write.
>
> The phrasing "[m]ap up to the size" suggests the behavior of previous
> versions where we'd truncate mappings.  Maybe something like:
>
>  * The available GPU memory size may not be power-of-2 aligned.
>  * The remainder is only backed by read/write handlers.

Got it. Will fix.

>> +
>> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
>> +				  &copy_offset, &copy_count,
>> +				  &register_offset)) {
>> +		val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(nvdev->resmem.memlength),
>> +						   PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> +						   PCI_BASE_ADDRESS_MEM_PREFETCH,
>> +						   nvdev->resmem.u64_reg);
>> +		if (copy_to_user(buf + copy_offset,
>> +				 (void *)&val64 + register_offset, copy_count))
>> +			return -EFAULT;
>> +	}
>> +
>> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
>> +				  &copy_offset, &copy_count,
>> +				  &register_offset)) {
>> +		val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(nvdev->usemem.memlength),
>> +						   PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> +						   PCI_BASE_ADDRESS_MEM_PREFETCH,
>> +						   nvdev->usemem.u64_reg);
>> +		if (copy_to_user(buf + copy_offset,
>> +				 (void *)&val64 + register_offset, copy_count))
>> +			return -EFAULT;
>> +	}
>
> Both read and write could be simplified a bit:
>
> 	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> 				  &copy_offset, &copy_count,
> 				  &register_offset))
> 		memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(RESMEM_REGION_INDEX, nvdev);
> 	else if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
> 				       &copy_offset, &copy_count,
> 				       &register_offset))
> 		memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(USEMEM_REGION_INDEX, nvdev);
>
> 	if (memregion) {
> 		val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(memregion->memlength),
> 						   PCI_BASE_ADDRESS_MEM_TYPE_64 |
> 						   PCI_BASE_ADDRESS_MEM_PREFETCH,
> 						   memregion->u64_reg);
> 		if (copy_to_user(buf + copy_offset,
> 				 (void *)&val64 + register_offset, copy_count))
> 			return -EFAULT;
> 	}

Yes thanks, will make the change.
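
For reference, here is roughly how I plan to implement the lookup helper
those branches rely on (an untested sketch, with the static qualifier you
asked for above and the shorter nvgrace_gpu_memregion() name you suggest
further down; returning NULL for any other index keeps non-BAR accesses
on the vfio-pci-core path):

	static struct mem_region *
	nvgrace_gpu_memregion(int index,
			      struct nvgrace_gpu_vfio_pci_core_device *nvdev)
	{
		/* Fake BAR2 is backed by the non-cached resmem region. */
		if (index == RESMEM_REGION_INDEX)
			return &nvdev->resmem;

		/* Fake BAR4 is backed by the cacheable usemem region. */
		if (index == USEMEM_REGION_INDEX)
			return &nvdev->usemem;

		return NULL;
	}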
>> +static ssize_t
>> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
>> +			     const char __user *buf, size_t count, loff_t *ppos)
>> +{
>> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev =
>> +		container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
>> +			     core_device.vdev);
>> +	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	size_t register_offset;
>> +	loff_t copy_offset;
>> +	size_t copy_count;
>> +
>> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(u64),
>> +				  &copy_offset, &copy_count,
>> +				  &register_offset)) {
>> +		if (copy_from_user((void *)&nvdev->resmem.u64_reg + register_offset,
>> +				   buf + copy_offset, copy_count))
>> +			return -EFAULT;
>> +		*ppos += copy_count;
>> +		return copy_count;
>> +	}
>> +
>> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(u64),
>> +				  &copy_offset, &copy_count, &register_offset)) {
>> +		if (copy_from_user((void *)&nvdev->usemem.u64_reg + register_offset,
>> +				   buf + copy_offset, copy_count))
>> +			return -EFAULT;
>> +		*ppos += copy_count;
>> +		return copy_count;
>> +	}
>
> Likewise:
>
> 	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(u64),
> 				  &copy_offset, &copy_count,
> 				  &register_offset))
> 		memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(RESMEM_REGION_INDEX, nvdev);
> 	else if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(u64),
> 				       &copy_offset, &copy_count, &register_offset))
> 		memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(USEMEM_REGION_INDEX, nvdev);
>
> 	if (memregion) {
> 		if (copy_from_user((void *)&memregion->u64_reg + register_offset,
> 				   buf + copy_offset, copy_count))
> 			return -EFAULT;
>
> 		*ppos += copy_count;
> 		return copy_count;
> 	}

Ack.

>> +
>> +	return vfio_pci_core_write(core_vdev, buf, count, ppos);
>> +}
>> +
>> +/*
>> + * Ad hoc map the device memory in the module kernel VA space. Primarily needed
>> + * to support Qemu's device x-no-mmap=on option.
>
> In general we try not to assume QEMU is the userspace driver.  This
> certainly supports x-no-mmap=on in QEMU, but this is needed because
> vfio does not require the userspace driver to only perform accesses
> through mmaps of the vfio-pci BAR regions and existing userspace driver
> precedent requires read/write implementations.

Makes sense. Will rephrase it.

>> + *
>> + * The usemem region is cacheable memory and hence is memremaped.
>> + * The resmem region is non-cached and is mapped using ioremap_wc (NORMAL_NC).
>> + */
>> +static int
>> +nvgrace_gpu_map_device_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> +			   int index)
>> +{
>> +	int ret = 0;
>> +
>> +	mutex_lock(&nvdev->remap_lock);
>> +	if (index == USEMEM_REGION_INDEX &&
>> +	    !nvdev->usemem.bar_remap.memaddr) {
>> +		nvdev->usemem.bar_remap.memaddr =
>> +			memremap(nvdev->usemem.memphys,
>> +				 nvdev->usemem.memlength, MEMREMAP_WB);
>> +		if (!nvdev->usemem.bar_remap.memaddr)
>> +			ret = -ENOMEM;
>> +	} else if (index == RESMEM_REGION_INDEX &&
>> +		   !nvdev->resmem.bar_remap.ioaddr) {
>> +		nvdev->resmem.bar_remap.ioaddr =
>> +			ioremap_wc(nvdev->resmem.memphys,
>> +				   nvdev->resmem.memlength);
>> +		if (!nvdev->resmem.bar_remap.ioaddr)
>> +			ret = -ENOMEM;
>> +	}
>
> With an anonymous union we could reduce a lot of the redundancy here:
>
> 	struct mem_region *memregion;
> 	int ret = 0;
>
> 	memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> 	if (!memregion)
> 		return -EINVAL;
>
> 	mutex_lock(&nvdev->remap_lock);
>
> 	if (memregion->memaddr)
> 		goto unlock;
>
> 	if (index == USEMEM_REGION_INDEX)
> 		memregion->memaddr = memremap(memregion->memphys,
> 					      memregion->memlength,
> 					      MEMREMAP_WB);
> 	else
> 		memregion->ioaddr = ioremap_wc(memregion->memphys,
> 					       memregion->memlength);
>
> 	if (!memregion->memaddr)
> 		ret = -ENOMEM;
>
> unlock:
> 	...

Great suggestion, thanks. Will update.

> BTW, why does this function have args (nvdev, index) but
> nvgrace_gpu_vfio_pci_fake_bar_mem_region has args (index, nvdev)?

It shouldn't. I missed keeping the argument order uniform; will fix.

> nvgrace_gpu_vfio_pci_fake_bar_mem_region could also be shortened to just
> nvgrace_gpu_memregion and I think we could use nvgrace_gpu in place of
> nvgrace_gpu_vfio_pci for function names throughout.

Ack.

>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> +			 char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * Handle read on the BAR regions. Map to the target device memory
>> +	 * physical address and copy to the request read buffer.
>> +	 */
>> +	ret = nvgrace_gpu_map_device_mem(nvdev, index);
>> +	if (ret)
>> +		return ret;
>> +
>
> This seems like a good place for a comment regarding COMMAND_MEM being
> ignored, especially since we're passing 'false' for test_mem in the
> second branch.

Good point. Will add the comment.

>> + *
>> + * A read from a negative or an offset greater than reported size, a negative
>> + * count are considered error conditions and returned with an -EINVAL.
>
> This needs some phrasing help, I can't parse.

Yeah, I'll itemize the error conditions to make it more readable.

>> +static int nvgrace_gpu_vfio_pci_probe(struct pci_dev *pdev,
>> +				      const struct pci_device_id *id)
>> +{
>> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev;
>> +	int ret;
>> +
>> +	nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device, core_device.vdev,
>> +				  &pdev->dev, &nvgrace_gpu_vfio_pci_ops);
>> +	if (IS_ERR(nvdev))
>> +		return PTR_ERR(nvdev);
>> +
>> +	dev_set_drvdata(&pdev->dev, nvdev);
>> +
>> +	ret = nvgrace_gpu_vfio_pci_fetch_memory_property(pdev, nvdev);
>> +	if (ret)
>> +		goto out_put_vdev;
>
> As a consequence of exposing the device differently in the host vs
> guest, we need to consider nested assignment here.  The device table
> below will force userspace to select this driver for the device, but
> binding to it would fail because these bare metal properties are not
> present.
> We addressed this in the virtio-vfio-pci driver and decided
> the driver needs to support the device regardless of the availability
> of support for the legacy aspects of that driver.  There's no protocol
> defined for userspace to pick a second best driver for a device.
>
> Therefore, like virtio-vfio-pci, this should be able to register a
> straight vfio-pci-core ops when these bare metal properties are not
> present.

Sounds reasonable, will make the change.

>> +struct mem_region {
>> +	phys_addr_t memphys;	/* Base physical address of the region */
>> +	size_t memlength;	/* Region size */
>> +	__le64 u64_reg;		/* Emulated BAR offset registers */
>
> s/u64_reg/bar_val/ ?

Fine with me.

> We could also include bar_size so we don't recalculate the power-of-2 size.

Sure.
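
Pulling these renames together with the anonymous union you suggested
for nvgrace_gpu_map_device_mem(), I'm thinking the struct ends up
roughly like this (untested sketch; bar_size would be filled once at
probe time from roundup_pow_of_two(memlength) so the config space
handlers stop recalculating it):

	struct mem_region {
		phys_addr_t memphys;	/* Base physical address of the region */
		size_t memlength;	/* Region size */
		size_t bar_size;	/* Reported power-of-2 BAR size */
		__le64 bar_val;		/* Emulated BAR offset register */
		union {
			void *memaddr;		/* Cacheable usemem mapping */
			void __iomem *ioaddr;	/* Non-cached resmem mapping */
		};
	};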