https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/diff/drivers/infiniband/hw/mlx5/odp.c?h=dma-split-v1&id=a0d719a406133cdc3ef2328dda3ef082a034c45e
>
> Thanks,
> Oak
>
> > -----Original Message-----
> > From: Leon Romanovsky <leon@xxxxxxxxxx>
> > Sent: Monday, June 10, 2024 12:18 PM
> > To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
> >
> > On Mon, Jun 10, 2024 at 03:12:25PM +0000, Zeng, Oak wrote:
> > > Hi Jason, Leon,
> > >
> > > I am coming back to this thread to ask a question. Per the discussion in
> > > another thread, I have integrated the new dma-mapping API (the first 6
> > > patches of this series) into the DRM subsystem. The new API fits our
> > > purpose quite well, better than scatter-gather dma-mapping, so we want to
> > > continue working with you to adopt it.
> >
> > Sounds great, thanks for the feedback.
> >
> > > Did you test the new API in the RDMA subsystem?
> >
> > This version was tested in our regression tests, but there is a chance
> > that you are hitting flows that were not relevant for the RDMA case.
> >
> > > Or was this RFC series just untested code sent out to get people's
> > > design feedback?
> >
> > The RFC was fully tested in the VFIO and RDMA paths, but not the NVMe patch.
> >
> > > Do you have a refined version for us to try? I ask because we are seeing
> > > some issues but are not sure whether they are caused by the new API. We
> > > are debugging, but it would be good to ask here at the same time.
> >
> > Yes, as an outcome of the feedback in this thread, I implemented a new
> > version. Unfortunately, some personal matters are preventing me from
> > sending it right away.
> > https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1
> >
> > There are some differences in the API, but the main idea is the same.
> > This version is not fully tested yet.
> >
> > Thanks
> >
> > > Cc Himal/Krishna, who are also working on/testing the new API.
> > >
> > > Thanks,
> > > Oak
> > >
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe <jgg@xxxxxxxx>
> > > > Sent: Friday, May 3, 2024 12:43 PM
> > > > To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> > > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
> > > >
> > > > On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
> > > >
> > > > > > Instead of teaching DMA to know these specific datatypes, let's
> > > > > > separate the existing DMA mapping routine into two steps and give
> > > > > > advanced callers (subsystems) the option to perform all calculations
> > > > > > internally in advance and map pages later when needed.
> > > > >
> > > > > I looked into how this scheme can be applied to the DRM subsystem and
> > > > > GPU drivers.
> > > > >
> > > > > I figured RDMA can apply this scheme because RDMA can calculate the
> > > > > iova size. Per my limited knowledge of RDMA, the user can register a
> > > > > memory region (the reg_user_mr vfunc), and the memory region's size is
> > > > > used to pre-allocate the iova space. And in the RDMA use case, it seems
> > > > > the user-registered region can be very big, e.g., 512MiB or even GiBs.
> > > >
> > > > In RDMA the iova would be linked to the SVA granule we discussed
> > > > previously.
> > > >
> > > > > In GPU drivers, we have a few use cases where we need dma-mapping.
> > > > > Just to name two:
> > > > >
> > > > > 1) userptr: this is user malloc'ed/mmap'ed memory that is registered
> > > > > to the GPU (in Intel's driver through a vm_bind API, similar to mmap).
> > > > > A userptr can be of any random size, depending on the user's malloc
> > > > > size. Today we use dma-map-sg for this use case. The downside of our
> > > > > approach is that during userptr invalidation, even if the user only
> > > > > munmaps part of a userptr, we invalidate the whole userptr from the
> > > > > GPU page table, because there is no way for us to partially dma-unmap
> > > > > the whole sg list. I think we can try your new API in this case.
> > > > > The main benefit of the new approach is the partial munmap case.
> > > >
> > > > Yes, this is one of the main things it will improve.
> > > >
> > > > > We will have to pre-allocate an iova range for each userptr, and we
> > > > > have many userptrs of random size... So we might not be as efficient
> > > > > as the RDMA case, where I assume the user registers a few big memory
> > > > > regions.
> > > >
> > > > You are already doing this. dma_map_sg() does exactly the same IOVA
> > > > allocation under the covers.
> > > >
> > > > > 2) system allocator: this is malloc'ed/mmap'ed memory used by a GPU
> > > > > program directly, without any extra driver API call. We call this
> > > > > use case the system allocator.
> > > > >
> > > > > For the system allocator, the driver has no knowledge of which virtual
> > > > > address range is valid in advance. So when the GPU accesses a
> > > > > malloc'ed/mmap'ed address, we get a page fault. We then look up the
> > > > > CPU vma which contains the fault address. I guess we can use the CPU
> > > > > vma size to allocate an iova space of the same size?
> > > >
> > > > No. You'd follow what we discussed in the other thread.
> > > >
> > > > If you do full SVA then you'd split your MM space into granules, and
> > > > when a fault hits a granule you'd allocate the IOVA for the whole
> > > > granule. RDMA ODP is using a 512M granule currently.
> > > >
> > > > If you are doing sub-ranges then you'd probably allocate the IOVA for
> > > > the well-defined sub-range (assuming the typical use case isn't huge).
> > > >
> > > > > But there will be a real difficulty in applying your scheme to this
> > > > > use case. It is related to the STICKY flag. As I understand it, the
> > > > > sticky flag is designed for the driver to mark "this page/pfn has
> > > > > been populated, no need to re-populate again", roughly... Unlike the
> > > > > userptr and RDMA use cases, where the backing store of a buffer is
> > > > > always in system memory, in the system allocator use case the backing
> > > > > store can change between system memory and the GPU's device-private
> > > > > memory. Even worse, we have to assume the data migration between
> > > > > system and GPU is dynamic. When data is migrated to the GPU, we don't
> > > > > need to dma-map. And when migration happens to a pfn with the STICKY
> > > > > flag set, we still need to repopulate this pfn. So you can see, it is
> > > > > not easy to apply this scheme to this use case. At least I can't see
> > > > > an obvious way.
> > > >
> > > > You are already doing this today: you are keeping the sg list around
> > > > until you unmap it.
> > > >
> > > > Instead of keeping the sg list you'd keep a much smaller data structure
> > > > per granule. The sticky bit is simply a convenient way for ODP to manage
> > > > the smaller data structure; you don't have to use it.
> > > >
> > > > But you do need to keep track of which pages in the granule have been
> > > > DMA mapped - the sg list was doing this before. This could be a simple
> > > > bitmap array matching the granule size.
> > > >
> > > > Looking (far) forward, we may be able to have a "replace" API that
> > > > allows installing a new page unconditionally, regardless of what is
> > > > already there.
> > > >
> > > > Jason
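
To make the two-step idea and the partial-unmap benefit discussed above a bit
more concrete, here is a minimal sketch of how a userptr path could use such an
interface. The drv_iova_reserve()/drv_iova_link()/drv_iova_unlink()/
drv_iova_free() names are placeholders invented for illustration only; they are
not the functions from the dma-split series, whose actual API differs.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Placeholder prototypes standing in for a two-step DMA mapping API. */
int drv_iova_reserve(struct device *dev, size_t size, dma_addr_t *iova);
int drv_iova_link(struct device *dev, dma_addr_t iova, struct page *page,
                  enum dma_data_direction dir);
void drv_iova_unlink(struct device *dev, dma_addr_t iova, size_t size);
void drv_iova_free(struct device *dev, dma_addr_t iova, size_t size);

struct drv_userptr {
        unsigned long start;    /* CPU VA of the userptr */
        size_t size;            /* total size in bytes */
        dma_addr_t iova;        /* IOVA range reserved once, up front */
};

/* Step 1: reserve IOVA space for the whole userptr; no pages are mapped yet. */
static int drv_userptr_init(struct device *dev, struct drv_userptr *up)
{
        return drv_iova_reserve(dev, up->size, &up->iova);
}

/* Step 2: link only the pages that have actually been pinned/faulted. */
static int drv_userptr_map(struct device *dev, struct drv_userptr *up,
                           size_t offset, struct page **pages,
                           unsigned long npages)
{
        unsigned long i;
        int ret;

        for (i = 0; i < npages; i++) {
                ret = drv_iova_link(dev, up->iova + offset + i * PAGE_SIZE,
                                    pages[i], DMA_BIDIRECTIONAL);
                if (ret)
                        return ret;
        }
        return 0;
}

/*
 * Partial invalidation: when the user munmaps part of the userptr, only the
 * affected sub-range is unlinked; the rest of the reserved IOVA stays mapped,
 * which is not possible with a single dma_map_sg()/dma_unmap_sg() pair.
 */
static void drv_userptr_invalidate(struct device *dev, struct drv_userptr *up,
                                   size_t offset, size_t len)
{
        drv_iova_unlink(dev, up->iova + offset, len);
}

Whether the link calls are batched per range or issued per page is an
implementation detail; the point is only that IOVA reservation and page mapping
are decoupled, so a sub-range can be unlinked without tearing down the rest.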
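
And for the "much smaller data structure per granule" point, a minimal sketch
of what a driver-side granule tracker could look like, assuming a 512M granule
like RDMA ODP uses and hypothetical drv_iova_link()/drv_iova_unlink() helpers
that stand in for the real per-page link/unlink calls; all names are
illustrative, not taken from any existing driver.

#include <linux/bitmap.h>
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Placeholders standing in for the real per-page link/unlink calls. */
int drv_iova_link(struct device *dev, dma_addr_t iova, struct page *page,
                  enum dma_data_direction dir);
void drv_iova_unlink(struct device *dev, dma_addr_t iova, size_t size);

#define GRANULE_SHIFT   29                      /* 512M, like RDMA ODP */
#define GRANULE_SIZE    (1UL << GRANULE_SHIFT)
#define GRANULE_PAGES   (GRANULE_SIZE >> PAGE_SHIFT)

/* Per-granule state replacing the sg list: one IOVA range plus a bitmap. */
struct gpu_granule {
        unsigned long start;    /* CPU VA covered by this granule */
        dma_addr_t iova;        /* IOVA allocated for the whole granule */
        unsigned long *mapped;  /* from bitmap_zalloc(GRANULE_PAGES, GFP_KERNEL) */
};

/* GPU fault on system memory: map the page unless its slot is already mapped. */
static int granule_fault(struct device *dev, struct gpu_granule *g,
                         unsigned long addr, struct page *page)
{
        unsigned long idx = (addr - g->start) >> PAGE_SHIFT;
        int ret;

        if (test_bit(idx, g->mapped))
                return 0;       /* already DMA mapped, nothing to do */

        ret = drv_iova_link(dev, g->iova + ((dma_addr_t)idx << PAGE_SHIFT),
                            page, DMA_BIDIRECTIONAL);
        if (ret)
                return ret;

        set_bit(idx, g->mapped);
        return 0;
}

/*
 * Migration to device-private memory (or any invalidation): drop the mapping
 * and clear the bit, so a later fault simply repopulates the slot - no sticky
 * flag needed for this flow.
 */
static void granule_unmap_page(struct device *dev, struct gpu_granule *g,
                               unsigned long addr)
{
        unsigned long idx = (addr - g->start) >> PAGE_SHIFT;

        if (test_and_clear_bit(idx, g->mapped))
                drv_iova_unlink(dev, g->iova + ((dma_addr_t)idx << PAGE_SHIFT),
                                PAGE_SIZE);
}

With 4K pages a 512M granule needs 131072 bits, i.e. 16KB of bitmap per
granule, which is the kind of small fixed-cost tracking structure the sticky
bit gives ODP; a driver that migrates pages between system and device memory
can simply clear bits instead of relying on stickiness.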