On Sun, Jun 23, 2024 at 11:53:11PM -0700, Vivek Kasireddy wrote:
> From Jason Gunthorpe:
> "dma-buf has become a way to safely acquire a handle to non-struct page
> memory that can still have lifetime controlled by the exporter. Notably
> RDMA can now import dma-buf FDs and build them into MRs which allows for
> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> from PCI device BARs.
>
> The patch design loosely follows the pattern in commit
> db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
> does not support pinning.
>
> Instead, this implements what, in the past, we've called a revocable
> attachment using move. In normal situations the attachment is pinned, as a
> BAR does not change physical address. However when the VFIO device is
> closed, or a PCI reset is issued, access to the MMIO memory is revoked.
>
> Revoked means that move occurs, but an attempt to immediately re-map the
> memory will fail. In the reset case a future move will be triggered when
> MMIO access returns. As both close and reset are under userspace control
> it is expected that userspace will suspend use of the dma-buf before doing
> these operations, the revoke is purely for kernel self-defense against a
> hostile userspace."
>
> Following enhancements are made to the original patch:
> - Add support for creating dmabuf from multiple areas (or ranges)
> - Add a mmap handler to provide CPU access to the dmabuf
>
> Original-patch-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx>
> ---
>  drivers/vfio/pci/Makefile          |   1 +
>  drivers/vfio/pci/dma_buf.c         | 438 +++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_config.c |  22 +-
>  drivers/vfio/pci/vfio_pci_core.c   |  20 +-
>  drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
>  include/linux/vfio_pci_core.h      |   1 +
>  include/uapi/linux/vfio.h          |  25 ++
>  7 files changed, 525 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/vfio/pci/dma_buf.c

<...>

> +static int populate_sgt(struct dma_buf_attachment *attachment,
> +			enum dma_data_direction dir,
> +			struct sg_table *sgt, size_t sgl_size)
> +{
> +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> +	struct vfio_region_dma_range *dma_ranges = priv->dma_ranges;
> +	size_t offset, chunk_size;
> +	struct scatterlist *sgl;
> +	dma_addr_t dma_addr;
> +	phys_addr_t phys;
> +	int i, j, ret;
> +
> +	for_each_sgtable_sg(sgt, sgl, j)
> +		sgl->length = 0;
> +
> +	sgl = sgt->sgl;
> +	for (i = 0; i < priv->nr_ranges; i++) {
> +		phys = pci_resource_start(priv->vdev->pdev,
> +					  dma_ranges[i].region_index);
> +		phys += dma_ranges[i].offset;
> +
> +		/*
> +		 * Break the BAR's physical range up into max sized SGL's
> +		 * according to the device's requirement.
> +		 */
> +		for (offset = 0; offset != dma_ranges[i].length;) {
> +			chunk_size = min(dma_ranges[i].length - offset,
> +					 sgl_size);
> +
> +			/*
> +			 * Since the memory being mapped is a device memory
> +			 * it could never be in CPU caches.
> +			 */
> +			dma_addr = dma_map_resource(attachment->dev,
> +						    phys + offset,
> +						    chunk_size, dir,
> +						    DMA_ATTR_SKIP_CPU_SYNC);
> +			ret = dma_mapping_error(attachment->dev, dma_addr);
> +			if (ret)
> +				goto err;
> +
> +			sg_set_page(sgl, NULL, chunk_size, 0);
> +			sg_dma_address(sgl) = dma_addr;
> +			sg_dma_len(sgl) = chunk_size;
> +			sgl = sg_next(sgl);
> +			offset += chunk_size;
> +		}
> +	}
> +
> +	return 0;
> +err:
> +	for_each_sgtable_sg(sgt, sgl, j) {
> +		if (!sg_dma_len(sgl))
> +			continue;
> +
> +		dma_unmap_resource(attachment->dev, sg_dma_address(sgl),
> +				   sg_dma_len(sgl),
> +				   dir, DMA_ATTR_SKIP_CPU_SYNC);
> +	}
> +
> +	return ret;
> +}
> +
> +static struct sg_table *
> +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> +		     enum dma_data_direction dir)
> +{
> +	size_t sgl_size = dma_get_max_seg_size(attachment->dev);
> +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> +	struct sg_table *sgt;
> +	unsigned int nents;
> +	int ret;
> +
> +	dma_resv_assert_held(priv->dmabuf->resv);
> +
> +	if (!attachment->peer2peer)
> +		return ERR_PTR(-EPERM);
> +
> +	if (priv->revoked)
> +		return ERR_PTR(-ENODEV);
> +
> +	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
> +	if (!sgt)
> +		return ERR_PTR(-ENOMEM);
> +
> +	nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size);
> +	ret = sg_alloc_table(sgt, nents, GFP_KERNEL);
> +	if (ret)
> +		goto err_kfree_sgt;
> +
> +	ret = populate_sgt(attachment, dir, sgt, sgl_size);

One of the outcomes of the discussion around Jason's original series was the
decision not to use a scatter-gather list, but instead to extend the DMA API
so that it can be used for non-struct-page memory, which eliminates the need
for an intermediate SG list.

This is why we had this series:
https://lore.kernel.org/all/cover.1709635535.git.leon@xxxxxxxxxx

A new version is available here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1

Thanks
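To make that a bit more concrete: with a DMA API that can link physical
ranges into a single IOVA, the exporter would hand the importer DMA addresses
directly instead of building an sg_table. The sketch below only shows the
shape of that approach; the dma_iova_* names are illustrative placeholders
following the general direction of the dma-split work (they may not match the
interface in the branch above), and the vfio_pci_dma_buf fields are reused
from the quoted patch.

static int vfio_pci_dma_buf_map_ranges(struct vfio_pci_dma_buf *priv,
				       struct device *dev,
				       struct dma_iova_state *state,
				       enum dma_data_direction dir)
{
	struct vfio_region_dma_range *dma_ranges = priv->dma_ranges;
	size_t mapped = 0;
	phys_addr_t phys;
	int i, ret;

	/* One contiguous IOVA range for the whole dma-buf, no sg_table. */
	if (!dma_iova_try_alloc(dev, state, 0, priv->dmabuf->size))
		return -ENOMEM;	/* could fall back to dma_map_resource() */

	for (i = 0; i < priv->nr_ranges; i++) {
		phys = pci_resource_start(priv->vdev->pdev,
					  dma_ranges[i].region_index);
		phys += dma_ranges[i].offset;

		/* Link each BAR range at the next offset inside the IOVA. */
		ret = dma_iova_link(dev, state, phys, mapped,
				    dma_ranges[i].length, dir,
				    DMA_ATTR_SKIP_CPU_SYNC);
		if (ret)
			goto err_destroy;

		mapped += dma_ranges[i].length;
	}

	ret = dma_iova_sync(dev, state, 0, mapped);
	if (ret)
		goto err_destroy;

	/* The importer consumes state->addr and the total size directly. */
	return 0;

err_destroy:
	dma_iova_destroy(dev, state, mapped, dir, DMA_ATTR_SKIP_CPU_SYNC);
	return ret;
}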
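As an aside, for anyone wiring up an importer against this exporter: the
revoke-by-move scheme described in the quoted commit message ends up looking
roughly like the sketch below on the importer side. This is a minimal
illustration only, not code from this series; struct my_importer and
my_importer_invalidate() are made-up placeholders, while the dma-buf calls
themselves are the stock dynamic-importer API.

struct my_importer {			/* hypothetical importer state */
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

static void my_importer_invalidate(struct my_importer *imp)
{
	/*
	 * Tear down whatever was built on top of the old mapping (an MR,
	 * device page tables, ...). Re-mapping under the resv lock fails
	 * with -ENODEV for as long as the exporter keeps the buffer revoked.
	 */
}

static void my_move_notify(struct dma_buf_attachment *attach)
{
	/*
	 * Called by the exporter with the dma-buf reservation lock held,
	 * e.g. from vfio-pci on device close or PCI reset.
	 */
	struct my_importer *imp = attach->importer_priv;

	my_importer_invalidate(imp);
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = my_move_notify,
};

static int my_importer_map(struct my_importer *imp, struct dma_buf *dmabuf,
			   struct device *dev)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	dma_resv_lock(dmabuf->resv, NULL);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		return PTR_ERR(sgt);
	}

	imp->attach = attach;
	imp->sgt = sgt;
	return 0;
}

The point of the quoted design is that a well-behaved userspace quiesces the
importer before close/reset, so move_notify is only the kernel-self-defense
slow path.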