Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

On Tue, Apr 09, 2024 at 10:45:22AM -0600, Zeng, Oak wrote:

Oak - A few drive-by comments...

> Hi Jason
> 
> We are re-spinning this series based on the previous community feedback. I will send out a v2 soon. There are big changes compared to v1, so it would be better to discuss this work based on v2.
> 
> See some reply inline.
> 
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > Sent: Friday, April 5, 2024 2:02 PM
> > To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Brost,
> > Matthew <matthew.brost@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx;
> > Welty, Brian <brian.welty@xxxxxxxxx>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
> > <krishnaiah.bommu@xxxxxxxxx>; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura@xxxxxxxxx>; Leon Romanovsky <leon@xxxxxxxxxx>
> > Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> > hmm range
> > 
> > On Fri, Apr 05, 2024 at 04:42:14PM +0000, Zeng, Oak wrote:
> > > > > The above code deals with a case where a dma map is not needed. As I
> > > > > understand it, whether we need a dma map depends on the device
> > > > > topology. For example, when a device accesses host memory or another
> > > > > device's memory through pcie, we need dma mapping; if the connection
> > > > > between devices is xelink (similar to nvidia's nvlink), all devices'
> > > > > memory can be in the same address space, so no dma mapping is needed.
> > > >
> > > > Then you call dma_map_page to do your DMA side and you avoid it for
> > > > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> > >
> > > When a dma map is needed, we use dma_map_sgtable, a different
> > > flavor of the dma_map_page function.
> > 
> > I saw; I am saying this should not be done. You cannot unmap bits of
> > an sgl mapping if an invalidation comes in.
> 
> You are right: if we register a huge mmu interval notifier to cover the whole address space, then we should use dma map/unmap of individual pages to map bits of the address space. We will explore this approach.
> 
> Right now, in the xe driver, mmu interval notifiers are dynamically registered with small address ranges, and we map/unmap each whole small range every time. So functionally it still works, but it might not be as performant as the method you describe. This is the existing logic for our userptr code. Our system allocator inherits this logic automatically, as our system allocator design is built on top of userptr (will send out v2 soon). We plan to make things work in the first stage, then do performance improvements like the one you suggested here.
>
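
As an aside on the dma_map_sgtable point above: a rough sketch of the
per-page flavour Jason is describing is below (hypothetical helper name
and structure, not the actual Xe code). The idea is that an
invalidation of a sub-range then only needs to dma_unmap_page() the
pages it actually covers, instead of tearing down one sgtable spanning
the whole range.

#include <linux/dma-mapping.h>
#include <linux/hmm.h>
#include <linux/memremap.h>
#include <linux/mm.h>

/*
 * Map each faulted page of an hmm_range individually. addrs must have
 * one slot per page in the range.
 */
static int map_hmm_range_pages(struct device *dev, struct hmm_range *range,
			       dma_addr_t *addrs)
{
	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
	unsigned long i;
	int err;

	for (i = 0; i < npages; i++) {
		struct page *page;

		if (!(range->hmm_pfns[i] & HMM_PFN_VALID)) {
			err = -EFAULT;
			goto err_unmap;
		}

		page = hmm_pfn_to_page(range->hmm_pfns[i]);

		/* DEVICE_PRIVATE pages are never dma-mapped; the GPU
		 * physical address is resolved elsewhere. 0 is used as
		 * a "not mapped" marker here for simplicity. */
		if (is_device_private_page(page)) {
			addrs[i] = 0;
			continue;
		}

		addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, addrs[i])) {
			err = -EIO;
			goto err_unmap;
		}
	}

	return 0;

err_unmap:
	while (i--)
		if (addrs[i])
			dma_unmap_page(dev, addrs[i], PAGE_SIZE,
				       DMA_BIDIRECTIONAL);
	return err;
}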

Agree that reworking the notifier design for the initial phase is
probably out of scope, as that would be a fairly large rework. IMO, if /
when we switch from a 1:1 relationship between VMA and PT state to 1:N,
that is likely when it would make sense to redesign our notifier too.

A 1:N relationship is basically a prerequisite for a single notifier to
work, or at least a new locking structure (maybe ref counting too?) to
be able to safely go from a large notifier -> invalidation chunk.

> > 
> > > The reason we also used (mis-used) the sg list for the non-dma-map
> > > cases is that we want to re-use some data structure. Our goal here
> > > is, for an hmm_range, to build a list of device physical addresses
> > > (which may be dma-mapped or not) that will be used later on to
> > > program the device page table. We re-used the sg list structure for
> > > the non-dma-map cases so those two cases can share the same page
> > > table programming code. Since the sg list was originally designed
> > > for dma-map, it does look like it is mis-used here.
> > 
> > Please don't use sg list at all for this.
> 
> As explained, we use an sg list for device private pages so we can re-use the gpu page table update code. The input of the gpu page table update code in this case is a list of dma addresses (in the case of system memory) or device physical addresses (in the case of device private pages). The gpu page table update code in the xe driver is pretty complicated, so re-using that code is preferable for us. If we introduced a different data structure, we would have to re-write part of the gpu page table update code.
> 

Use the buddy blocks for the xe_res_cursor... We basically already have
this in place; see xe_res_first, which takes a struct ttm_resource *res
as an argument and underneath uses buddy blocks for the cursor.

> I don't see an obvious problem with this approach, but if you see it as a problem, I am open to changing it.
>

This should be trivial to change assuming the buddy blocks are stored
somewhere (they must be, right?), so I'd do this right away.
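
For illustration, walking the buddy blocks behind a VRAM ttm_resource
with the cursor could look roughly like the sketch below (emit_ptes()
is a hypothetical stand-in for the existing PT update path, not a real
xe function):

#include "xe_res_cursor.h"

/* Hypothetical stand-in for the existing xe PT update code. */
static void emit_ptes(u64 gpu_va, u64 vram_offset, u64 size)
{
	/* program GPU PTEs mapping [gpu_va, gpu_va + size) to VRAM */
}

static void program_vram_range(struct ttm_resource *res, u64 offset,
			       u64 size, u64 gpu_va)
{
	struct xe_res_cursor cur;

	xe_res_first(res, offset, size, &cur);
	while (cur.remaining) {
		/*
		 * cur.start is the VRAM offset of the current contiguous
		 * buddy chunk and cur.size its length within the request;
		 * hand it straight to the PT update path.
		 */
		emit_ptes(gpu_va, cur.start, cur.size);
		gpu_va += cur.size;
		xe_res_next(&cur, cur.size);
	}
}

The same loop shape should work for the dma-mapped system memory side
too (the cursor also has an sg flavour, if I recall correctly), which is
what would let the PT programming code stay common without an sg list
for device private pages.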

Only got to here; I may reply a bit more later to the comments below...

Matt
 
> > 
> > > Need to mention that even for some DEVICE_PRIVATE memory, we also
> > > need dma-mapping. For example, if you have two devices connected to
> > > each other through PCIe, both devices' memory is registered as
> > > DEVICE_PRIVATE to hmm.
> > 
> > Yes, but you don't ever dma map DEVICE_PRIVATE.
> > 
> > > I believe we need a dma-map when one device accesses another
> > > device's memory. The two devices' memory doesn't belong to the same
> > > address space in this case. For modern GPUs with
> > > xeLink/nvLink/XGMI, this is not needed.
> > 
> > Review my emails here:
> > 
> > https://lore.kernel.org/dri-devel/20240403125712.GA1744080@xxxxxxxxxx/
> > 
> > Which explain how it should work.
> 
> You are right. A dma map is not needed for device private pages.
> 
> > 
> > > > A 1:1 SVA mapping is a special case of this where there is a single
> > > > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > > > offset).
> > >
> > > From an implementation perspective, we can have one device page
> > > table per process for such a 1:1 va mapping, but it is not
> > > necessary to have a single gpu vma. We can have many gpu vmas,
> > > each covering a segment of this address space.
> > 
> > This is not what I'm talking about. The GPU VMA is bound to a specific
> > MM VA, it should not be created on demand.
> 
> Today we have two places where we create a gpu vma: 1) during a vm_bind ioctl, and 2) during a page fault on a system allocator range (this will be in v2 of this series).
> 
> So for case 2), we do create gpu vmas on demand. We also plan to explore doing this differently, such as keeping only a gigantic gpu vma covering the whole address space for the system allocator while creating just the gpu page table state during page fault handling. This is planned for stage 2.
> 
> > 
> > If you want the full 1:1 SVA case to optimize invalidations you don't
> > need something like a VMA, 
> 
> A gpu vma (the xe_vma struct in the xe driver code) is a very fundamental concept in our driver. Leveraging vmas lets us re-use a lot of existing driver code, such as the page table update code.
> 
> But you are right, strictly speaking we don't need a vma. Actually, in the v1 version I sent out, we don't have a gpu vma concept for the system allocator, but we do have an svm range concept. We created a temporary vma for page table update purposes. Again, this design will be obsolete in v2 - in v2 the system allocator leverages the userptr code, which incorporates the gpu vma.
> 
> 
> > a simple bitmap reducing the address space
> > into 1024 faulted in chunks or something would be much cheaper than
> > some dynamic VMA ranges.
> 
> 
> I suspect something dynamic is still necessary, either a vma or page table state (1 vma but many page table states created dynamically, as planned in our stage 2). The reason is that we still need some corresponding gpu structure to match the cpu vm_area_struct. For example, when a gpu page fault happens, you look up the cpu vm_area_struct for the fault address and create a corresponding state/struct. And people can have as many cpu vm_area_structs as they want.
> 
> Oak  
> 
> > 
> > Jason


