https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/diff/drivers/infiniband/hw/mlx5/odp.c?h=dma-split-v1&id=a0d719a406133cdc3ef2328dda3ef082a034c45e
>
> Thanks,
> Oak
>
> > -----Original Message-----
> > From: Leon Romanovsky <leon@xxxxxxxxxx>
> > Sent: Monday, June 10, 2024 12:18 PM
> > To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
> >
> > On Mon, Jun 10, 2024 at 03:12:25PM +0000, Zeng, Oak wrote:
> > > Hi Jason, Leon,
> > >
> > > I am coming back to this thread to ask a question. Per the discussion in
> > > another thread, I have integrated the new dma-mapping API (the first 6
> > > patches of this series) into the DRM subsystem. The new API fits our
> > > purpose quite well, better than scatter-gather dma-mapping, so we want to
> > > continue working with you to adopt it.
> >
> > Sounds great, thanks for the feedback.
> >
> > > Did you test the new API in the RDMA subsystem?
> >
> > This version was tested in our regression tests, but there is a chance
> > that you are hitting flows that were not relevant for the RDMA case.
> >
> > > Or was this RFC series just untested code sent out to get people's
> > > design feedback?
> >
> > The RFC was fully tested in the VFIO and RDMA paths, but not the NVMe patch.
> >
> > > Do you have a refined version for us to try? I ask because we are seeing
> > > some issues but are not sure whether they are caused by the new API. We
> > > are debugging, but it would be good to ask here at the same time.
> >
> > Yes, as an outcome of the feedback in this thread, I implemented a new
> > version. Unfortunately, some personal matters are preventing me from
> > sending it right away.
> > https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1
> >
> > There are some differences in the API, but the main idea is the same.
> > This version is not fully tested yet.
> >
> > Thanks
> >
> > > Cc Himal/Krishna, who are also working on/testing the new API.
> > >
> > > Thanks,
> > > Oak
> > >
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe <jgg@xxxxxxxx>
> > > > Sent: Friday, May 3, 2024 12:43 PM
> > > > To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> > > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
> > > >
> > > > On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
> > > >
> > > > > > Instead of teaching DMA to know these specific datatypes, let's
> > > > > > separate the existing DMA mapping routine into two steps and give
> > > > > > advanced callers (subsystems) the option to perform all calculations
> > > > > > internally in advance and map pages later when needed.
> > > > >
> > > > > I looked into how this scheme can be applied to the DRM subsystem and
> > > > > GPU drivers.
> > > > >
> > > > > I figured RDMA can apply this scheme because RDMA can calculate the
> > > > > iova size. Per my limited knowledge of RDMA, the user can register a
> > > > > memory region (the reg_user_mr vfunc), and the memory region's size is
> > > > > used to pre-allocate the iova space. And in the RDMA use case, it seems
> > > > > the user-registered region can be very big, e.g., 512MiB or even GiBs.
> > > >
> > > > In RDMA the iova would be linked to the SVA granule we discussed
> > > > previously.
> > > >
> > > > > In GPU drivers, we have a few use cases where we need dma-mapping.
> > > > > Just to name two:
> > > > >
> > > > > 1) userptr: this is user malloc'ed/mmap'ed memory that is registered
> > > > > to the GPU (in Intel's driver through a vm_bind API, similar to mmap).
> > > > > A userptr can be of any random size, depending on the user's malloc
> > > > > size. Today we use dma-map-sg for this use case. The downside of our
> > > > > approach is that during userptr invalidation, even if the user only
> > > > > munmaps part of a userptr, we invalidate the whole userptr from the
> > > > > GPU page table, because there is no way for us to partially dma-unmap
> > > > > the whole sg list. I think we can try your new API in this case.
> > > > > The main benefit of the new approach is the partial munmap case.
> > > >
> > > > Yes, this is one of the main things it will improve.
> > > >
> > > > > We will have to pre-allocate an iova range for each userptr, and we
> > > > > have many userptrs of random size... So we might not be as efficient
> > > > > as the RDMA case, where I assume the user registers a few big memory
> > > > > regions.
> > > >
> > > > You are already doing this. dma_map_sg() does exactly the same IOVA
> > > > allocation under the covers.
> > > >
> > > > > 2) system allocator: this is malloc'ed/mmap'ed memory used by a GPU
> > > > > program directly, without any extra driver API call. We call this
> > > > > use case the system allocator.
> > > > >
> > > > > For the system allocator, the driver has no knowledge of which virtual
> > > > > address range is valid in advance. So when the GPU accesses a
> > > > > malloc'ed/mmap'ed address, we get a page fault. We then look up the
> > > > > CPU vma which contains the fault address. I guess we can use the CPU
> > > > > vma size to allocate an iova space of the same size?
> > > >
> > > > No. You'd follow what we discussed in the other thread.
> > > >
> > > > If you do full SVA then you'd split your MM space into granules, and
> > > > when a fault hits a granule you'd allocate the IOVA for the whole
> > > > granule. RDMA ODP is using a 512M granule currently.
> > > >
> > > > If you are doing sub-ranges then you'd probably allocate the IOVA for
> > > > the well-defined sub-range (assuming the typical use case isn't huge).
> > > >
> > > > > But there will be a real difficulty in applying your scheme to this
> > > > > use case. It is related to the STICKY flag. As I understand it, the
> > > > > sticky flag is designed for the driver to mark "this page/pfn has
> > > > > been populated, no need to re-populate again", roughly... Unlike the
> > > > > userptr and RDMA use cases, where the backing store of a buffer is
> > > > > always in system memory, in the system allocator use case the backing
> > > > > store can change between system memory and the GPU's device-private
> > > > > memory. Even worse, we have to assume the data migration between
> > > > > system and GPU is dynamic. When data is migrated to the GPU, we don't
> > > > > need to dma-map. And when migration happens to a pfn with the STICKY
> > > > > flag set, we still need to repopulate this pfn. So you can see, it is
> > > > > not easy to apply this scheme to this use case. At least I can't see
> > > > > an obvious way.
> > > >
> > > > You are already doing this today: you are keeping the sg list around
> > > > until you unmap it.
> > > >
> > > > Instead of keeping the sg list you'd keep a much smaller data structure
> > > > per granule. The sticky bit is simply a convenient way for ODP to manage
> > > > the smaller data structure; you don't have to use it.
> > > >
> > > > But you do need to keep track of which pages in the granule have been
> > > > DMA mapped - the sg list was doing this before. This could be a simple
> > > > bitmap array matching the granule size.
> > > >
> > > > Looking (far) forward, we may be able to have a "replace" API that
> > > > allows installing a new page unconditionally, regardless of what is
> > > > already there.
> > > >
> > > > Jason
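
To make the two-step idea and the partial-unmap benefit discussed above a bit
more concrete, here is a minimal sketch of how a userptr path could use such an
interface. The drv_iova_reserve()/drv_iova_link()/drv_iova_unlink()/
drv_iova_free() names are placeholders invented for illustration only; they are
not the functions from the dma-split series, whose actual API differs.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Placeholder prototypes standing in for a two-step DMA mapping API. */
int drv_iova_reserve(struct device *dev, size_t size, dma_addr_t *iova);
int drv_iova_link(struct device *dev, dma_addr_t iova, struct page *page,
                  enum dma_data_direction dir);
void drv_iova_unlink(struct device *dev, dma_addr_t iova, size_t size);
void drv_iova_free(struct device *dev, dma_addr_t iova, size_t size);

struct drv_userptr {
        unsigned long start;    /* CPU VA of the userptr */
        size_t size;            /* total size in bytes */
        dma_addr_t iova;        /* IOVA range reserved once, up front */
};

/* Step 1: reserve IOVA space for the whole userptr; no pages are mapped yet. */
static int drv_userptr_init(struct device *dev, struct drv_userptr *up)
{
        return drv_iova_reserve(dev, up->size, &up->iova);
}

/* Step 2: link only the pages that have actually been pinned/faulted. */
static int drv_userptr_map(struct device *dev, struct drv_userptr *up,
                           size_t offset, struct page **pages,
                           unsigned long npages)
{
        unsigned long i;
        int ret;

        for (i = 0; i < npages; i++) {
                ret = drv_iova_link(dev, up->iova + offset + i * PAGE_SIZE,
                                    pages[i], DMA_BIDIRECTIONAL);
                if (ret)
                        return ret;
        }
        return 0;
}

/*
 * Partial invalidation: when the user munmaps part of the userptr, only the
 * affected sub-range is unlinked; the rest of the reserved IOVA stays mapped,
 * which is not possible with a single dma_map_sg()/dma_unmap_sg() pair.
 */
static void drv_userptr_invalidate(struct device *dev, struct drv_userptr *up,
                                   size_t offset, size_t len)
{
        drv_iova_unlink(dev, up->iova + offset, len);
}

Whether the link calls are batched per range or issued per page is an
implementation detail; the point is only that IOVA reservation and page mapping
are decoupled, so a sub-range can be unlinked without tearing down the rest.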
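
And for the "much smaller data structure per granule" point, a minimal sketch
of what a driver-side granule tracker could look like, assuming a 512M granule
like RDMA ODP uses and hypothetical drv_iova_link()/drv_iova_unlink() helpers
that stand in for the real per-page link/unlink calls; all names are
illustrative, not taken from any existing driver.

#include <linux/bitmap.h>
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Placeholders standing in for the real per-page link/unlink calls. */
int drv_iova_link(struct device *dev, dma_addr_t iova, struct page *page,
                  enum dma_data_direction dir);
void drv_iova_unlink(struct device *dev, dma_addr_t iova, size_t size);

#define GRANULE_SHIFT   29                      /* 512M, like RDMA ODP */
#define GRANULE_SIZE    (1UL << GRANULE_SHIFT)
#define GRANULE_PAGES   (GRANULE_SIZE >> PAGE_SHIFT)

/* Per-granule state replacing the sg list: one IOVA range plus a bitmap. */
struct gpu_granule {
        unsigned long start;    /* CPU VA covered by this granule */
        dma_addr_t iova;        /* IOVA allocated for the whole granule */
        unsigned long *mapped;  /* from bitmap_zalloc(GRANULE_PAGES, GFP_KERNEL) */
};

/* GPU fault on system memory: map the page unless its slot is already mapped. */
static int granule_fault(struct device *dev, struct gpu_granule *g,
                         unsigned long addr, struct page *page)
{
        unsigned long idx = (addr - g->start) >> PAGE_SHIFT;
        int ret;

        if (test_bit(idx, g->mapped))
                return 0;       /* already DMA mapped, nothing to do */

        ret = drv_iova_link(dev, g->iova + ((dma_addr_t)idx << PAGE_SHIFT),
                            page, DMA_BIDIRECTIONAL);
        if (ret)
                return ret;

        set_bit(idx, g->mapped);
        return 0;
}

/*
 * Migration to device-private memory (or any invalidation): drop the mapping
 * and clear the bit, so a later fault simply repopulates the slot - no sticky
 * flag needed for this flow.
 */
static void granule_unmap_page(struct device *dev, struct gpu_granule *g,
                               unsigned long addr)
{
        unsigned long idx = (addr - g->start) >> PAGE_SHIFT;

        if (test_and_clear_bit(idx, g->mapped))
                drv_iova_unlink(dev, g->iova + ((dma_addr_t)idx << PAGE_SHIFT),
                                PAGE_SIZE);
}

With 4K pages a 512M granule needs 131072 bits, i.e. 16KB of bitmap per
granule, which is the kind of small fixed-cost tracking structure the sticky
bit gives ODP; a driver that migrates pages between system and device memory
can simply clear bits instead of relying on stickiness.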