RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

"Zeng, Oak" <oak.zeng@xxxxxxxxx> · Fri, 5 Apr 2024 16:42:14 +0000

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Jason
> Gunthorpe
> Sent: Friday, April 5, 2024 8:37 AM
> To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Brost,
> Matthew <matthew.brost@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx;
> Welty, Brian <brian.welty@xxxxxxxxx>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
> <krishnaiah.bommu@xxxxxxxxx>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@xxxxxxxxx>; Leon Romanovsky <leon@xxxxxxxxxx>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Fri, Apr 05, 2024 at 03:33:10AM +0000, Zeng, Oak wrote:
> > >
> > > I didn't look at this series a lot but I wanted to make a few
> > > remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> > > with hmm_range_fault is pretty bad, but it should not be hacked
> > > around like this. Leon is working on a series to improve it:
> > >
> > > https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@xxxxxxxxxx/
> >
> >
> > I completely agree above codes are really ugly. We definitely want
> > to adapt to a better way. I will spend some time on above series.
> >
> > >
> > > Please participate there too. In the mean time you should just call
> > > dma_map_page for every single page like ODP does.
> >
> > Above codes deal with a case where dma map is not needed. As I
> > understand it, whether we need a dma map depends on the devices
> > topology. For example, when device access host memory or another
> > device's memory through pcie, we need dma mapping; if the connection
> > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > memory can be in same address space, so no dma mapping is needed.
> 
> Then you call dma_map_page to do your DMA side and you avoid it for
> the DEVICE_PRIVATE side. SG list doesn't help this anyhow.

When dma map is needed, we used dma_map_sgtable, a different flavor of the dma-map-page function.

The reason we also used (mis-used) sg list for non-dma-map cases, is because we want to re-use some data structure. Our goal here is, for a hmm_range, build a list of device physical address (can be dma-mapped or not), which will be used later on to program the device page table. We re-used the sg list structure for the non-dma-map cases so those two cases can share the same page table programming codes. Since sg list was originally designed for dma-map, it does look like this is mis-used here.

Need to mention, even for some DEVICE_PRIVATE memory, we also need dma-mapping. For example, if you have two devices connected to each other through PCIe, both devices memory are registered as DEVICE_PRIVATE to hmm. I believe we need a dma-map when one device access another device's memory. Two devices' memory doesn't belong to same address space in this case. For modern GPU with xeLink/nvLink/XGMI, this is not needed.

> 
> > > Also, I tried to follow the large discussion in the end but it was
> > > quite hard to read the text in Lore for some reason.
> >
> > Did you mean this discussion: https://lore.kernel.org/dri-
> devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works good
> for me with chrome browser.
> 
> That is the one I am referring to
> 
> > > I would just opine some general points on how I see hmm_range_fault
> > > being used by drivers.
> > >
> > > First of all, the device should have a private page table. At least
> > > one, but ideally many. Obviously it should work, so I found it a bit
> > > puzzling the talk about problems with virtualization. Either the
> > > private page table works virtualized, including faults, or it should
> > > not be available..
> >
> > To be very honest, I was also very confused. In this series, I had
> > one very fundamental assumption that with hmm any valid cpu virtual
> > address is also a valid gpu virtual address. But Christian had a
> > very strong opinion that the gpu va can have an offset to cpu va. He
> > mentioned a failed use case with amdkfd and claimed an offset can
> > solve their problem.
> 
> Offset is something different, I said the VM's view of the page table
> should fully work. You shouldn't get into a weird situation where the
> VM is populating the page table and can't handle faults or something.
> 

We don't have such weird situation. There are two layers of translations when run under virtualized environment. From guest VM's perspective, the first level page table is in the guest device physical address space. It is nothing different from bare-metal situation. Our driver doesn't need to know it runs under virtualized or bare-metal for first level page table programming and page fault handling. 

> If the VMM has a weird design where there is only one page table and
> it needs to partition space based on slicing it into regions then
> fine, but the delegated region to the guest OS should still work
> fully.

Agree.

> 
> > > Or it is a selective mirror where it copies part of the mm page table
> > > into a "vma" of a possibly shared device page table. The
> > > hmm_range_fault bit would exclusively own it's bit of VMA.
> >
> > Can you explain what is "hmm_range_fault bit"?
> 
> I mean if you setup a mirror VMA in a device private page table then that
> range of VA will be owned by the hmm_range_fault code and will mirror
> a subset of a mm into that VMA. This is the offset you mention
> above. The MM's VA and the device private page table VA do not have to
> be the same (eg we implement this option in RDMA's ODP)

Ok, it is great to hear a use case where cpu va != gpu va

> 
> A 1:1 SVA mapping is a special case of this where there is a single
> GPU VMA that spans the entire process address space with a 1:1 VA (no
> offset).

>From implementation perspective, we can have one device page table for one process for such 1:1 va mapping, but it is not necessary to have a single gpu vma. We can have many gpu vma each cover a segment of this address space. We have this structure before this svm/system allocator work. This is also true for core mm vma. But as said, we are also open/exploring a single gigantic gpu vma to cover the whole address space.

> 
> > Do you think register a huge mmu notifier to cover the whole address
> > space would be good? I don't know here but there would be much more
> > unnecessary callbacks from mmu to device driver.
> 
> Yes. IMHO you should try to optimize the invalidations away in driver
> logic not through dynamic mmu notifiers. Installing and removing a
> notifier is very expensive.

Ok, we will compare the performance of those two methods.

> 
> > Similarly, we create range only the fly for those range that is
> > accessed by gpu. But we have some idea to keep one gigantic
> > range/VMA representing the whole address space while creating only
> > some "gpu page table state range" on the fly. This idea requires
> > some refactor to our xe driver and we will evaluate it more to
> > decide whether we want to go this way.
> 
> This is a better direction.
> 
> > > I also agree with the general spirit of the remarks that there should
> > > not be a process binding or any kind of "global" character
> > > device.
> >
> > Even though a global pseudo device looks bad, it does come with some
> > benefit. For example, if you want to set a memory attributes to a
> > shared virtual address range b/t all devices, you can set such
> > attributes through a ioctl of the global device. We have agreed to
> > remove our global character device and we will repeat the memory
> > attributes setting on all devices one by one.
> 
> That implies you have a global shared device private page table which
> is sort of impossible because of how the DMA API works.

In the case of multiple devices on one machine, we don't have a shared global device page table. Each device can still have its own page table, even though page tables of all device mirror the same address space (ie., same virtual address pointing to same physical location in cpu page table and all device page tables).

The reason we had the global device is mainly for setting vma memory attributes purpose, for example, for one vma, user can set the preferred backing store placement to be on specific device. Without  a global pseudo device, we would have to perform such settings on each devices through each device's ioctl, and we would not be able to tell device 0 that the preferred placement is on device 1, because device 0 even doesn't know the existing of device 1...

Anyway let's continue multiple devices discussion under thread "Cross-device and cross-driver HMM support"

Oak

> 
> Having the kernel iterate over all the private page tables vs having
> the userspace iterate over all the private page tables doesn't seem
> like a worthwile difference to justify a global cdev.
> 
> > Is /dev/nvidia-uvm a global character device for uvm purpose?
> 
> No idea, I wouldn't assume anything the nvidia drivers do is aligned
> with what upstream expects.
> 
> Jason