Thanks, Pavel, for the recommendation!
We are very interested in collaborating on this. We are working on a prototype of your recommendation, but progress is a little slow due to vacations and limited resources.

Thanks,
Wilson

-----Original Message-----
From: Pavel Begunkov <asml.silence@xxxxxxxxx>
Sent: Thursday, June 23, 2022 3:35 AM
To: Fang, Wilson <wilson.fang@xxxxxxxxx>; io-uring@xxxxxxxxxxxxxxx
Cc: Jens Axboe <axboe@xxxxxxxxx>
Subject: Re: dma_buf support with io_uring

On 6/23/22 07:17, Fang, Wilson wrote:
> Hi Jens,
>
> We are exploring a kernel-native mechanism to support peer-to-peer data transfer between an NVMe SSD and another device supporting dma_buf, connected on the same PCIe root complex.
> The NVMe SSD DMA engine requires a physical memory address, and there is no easy way to pass a non-system memory address through VFS to the block device driver.
> One of the ideas is to use io_uring and the dma_buf mechanism, which is supported by the peer device of the SSD.

Interesting, that aligns quite well with what we're doing, which is a more generic way for p2p with some non-p2p optimisations on the way. The approach we tried before is to let userspace register a dma-buf fd inside io_uring as a registered buffer, prepare everything in advance (e.g. the dmabuf attach), and then rw/send/etc. can use that.

> The flow is as below:
> 1. Application passes the dma_buf fd to the kernel through liburing.
> 2. io_uring adds two new options, IORING_OP_READ_DMA and IORING_OP_WRITE_DMA, to support read/write operations that DMA to/from the peer device memory.
> 3. If the dma_buf fd is valid, io_uring attaches the dma_buf and gets the sgl, which contains the physical memory addresses to be passed down to the block device driver.
> 4. The NVMe SSD DMA engine DMAs the data to/from the physical memory address.
>
> The roadblock we are facing is that the dma_buf_attach() and dma_buf_map_attachment() APIs expect the caller to provide a struct device *dev as an input parameter, pointing to the device which does the DMA (in this case the block/NVMe device that holds the source data).
> But since io_uring operates at the VFS layer, there is no straightforward way of finding the block/NVMe device object (struct device *) from the source file descriptor.
>
> Do you have any recommendations? Much appreciated!

For finding a device pointer, we added an optional file operation callback. I think that's much better than parsing it on the io_uring side, especially since we need a guarantee that the device is the only one which will be targeted and won't change (e.g. the network stack may choose a device dynamically based on the target address).

I think we have space to cooperate here :)

--
Pavel Begunkov
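
For readers following the thread, the "optional file operation callback" idea can be illustrated with a small userspace mock. This is only a sketch under stated assumptions, not the code from either prototype: the struct definitions are simplified stand-ins for the kernel types, and the names get_dma_device and file_dma_device are hypothetical, invented here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for kernel types; these are NOT the real definitions. */
struct device {
    const char *name;
};

struct file;

struct file_operations {
    /*
     * Hypothetical optional callback: returns the single device that will
     * perform DMA for this file. Left NULL by file types that cannot
     * guarantee a fixed target device.
     */
    struct device *(*get_dma_device)(struct file *file);
};

struct file {
    const struct file_operations *f_op;
    void *private_data;
};

/* A block-device-like file knows its one backing device. */
static struct device *blk_get_dma_device(struct file *file)
{
    return file->private_data;
}

static const struct file_operations blk_fops = {
    .get_dma_device = blk_get_dma_device,
};

/* A socket-like file may pick a device dynamically, so it opts out. */
static const struct file_operations sock_fops = {
    .get_dma_device = NULL,
};

/*
 * What the io_uring side might call at buffer registration time
 * (hypothetical helper): ask the file for its DMA device instead of
 * trying to derive it from the file descriptor.
 */
static struct device *file_dma_device(struct file *file)
{
    assert(file != NULL);
    if (!file->f_op || !file->f_op->get_dma_device)
        return NULL; /* callback is optional: no fixed device available */
    return file->f_op->get_dma_device(file);
}
```

In this mock, a non-NULL result would be the struct device * handed to dma_buf_attach(), while a NULL result means the file cannot be used for a pre-attached dma-buf and registration would presumably be rejected.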