RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, April 4, 2024 8:39 PM
> To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Brost,
> Matthew <matthew.brost@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx;
> Welty, Brian <brian.welty@xxxxxxxxx>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
> <krishnaiah.bommu@xxxxxxxxx>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@xxxxxxxxx>; Leon Romanovsky <leon@xxxxxxxxxx>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> > +/**
> > + * xe_svm_build_sg() - build a scatter gather table for all the physical pages/pfn
> > + * in a hmm_range.
> > + *
> > + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +int xe_svm_build_sg(struct hmm_range *range,
> > +			     struct sg_table *st)
> > +{
> > +	struct scatterlist *sg;
> > +	u64 i, npages;
> > +
> > +	sg = NULL;
> > +	st->nents = 0;
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> > +
> > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		unsigned long addr = range->hmm_pfns[i];
> > +
> > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > +			sg->length += PAGE_SIZE;
> > +			sg_dma_len(sg) += PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		sg =  sg ? sg_next(sg) : st->sgl;
> > +		sg_dma_address(sg) = addr;
> > +		sg_dma_len(sg) = PAGE_SIZE;
> > +		sg->length = PAGE_SIZE;
> > +		st->nents++;
> > +	}
> > +
> > +	sg_mark_end(sg);
> > +	return 0;
> > +}
> 
> I didn't look at this series a lot but I wanted to make a few
> remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> with hmm_range_fault is pretty bad, but it should not be hacked
> around like this. Leon is working on a series to improve it:
> 
> https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@xxxxxxxxxx/


I completely agree the above code is ugly. We definitely want to adopt a better approach, and I will spend some time on the above series.

> 
> Please participate there too. In the mean time you should just call
> dma_map_page for every single page like ODP does.

The above code handles a case where a dma mapping is not needed. As I understand it, whether we need a dma mapping depends on the device topology. For example, when a device accesses host memory or another device's memory through PCIe, we need a dma mapping; if the connection between devices is xelink (similar to nvidia's nvlink), all devices' memory can live in the same address space, so no dma mapping is needed.
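If we take the PCIe path in the interim, a per-page mapping along the lines of what ODP does would look roughly like the sketch below ("dev", "dma_addrs" and the "unwind" label are placeholders; error handling trimmed):

	/* Sketch: dma-map each page of the hmm_range individually,
	 * instead of collapsing pfns into a scatter gather table.
	 */
	for (i = 0; i < npages; i++) {
		struct page *page = hmm_pfn_to_page(range->hmm_pfns[i]);
		dma_addr_t dma;

		dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma))
			goto unwind;
		dma_addrs[i] = dma; /* consumed later when binding the GPU page table */
	}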


> 
> Also, I tried to follow the large discussion in the end but it was
> quite hard to read the text in Lore for some reason.

Did you mean this discussion: https://lore.kernel.org/dri-devel/?q=Making+drm_gpuvm+work+across+gpu+devices? The link works fine for me in the Chrome browser.


> 
> I would just opine some general points on how I see hmm_range_fault
> being used by drivers.
> 
> First of all, the device should have a private page table. At least
> one, but ideally many. Obviously it should work, so I found it a bit
> puzzling the talk about problems with virtualization. Either the
> private page table works virtualized, including faults, or it should
> not be available..

To be very honest, I was also confused. In this series, I made one fundamental assumption: with hmm, any valid cpu virtual address is also a valid gpu virtual address. But Christian had a very strong opinion that the gpu va can have an offset from the cpu va. He mentioned a failed use case with amdkfd and claimed an offset could solve their problem.

For all our known use cases, gpu va == cpu va. But we agreed to make the uAPI flexible so we can introduce an offset if a use case comes up in the future.
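For illustration only, a hypothetical bind uAPI could carry an explicit offset field that stays zero for all current use cases (every name below is made up, not the actual xe uAPI):

	struct drm_xe_svm_bind {
		__u64 cpu_va_start;  /* start of the CPU virtual address range */
		__u64 range_size;    /* size of the range in bytes */
		__u64 gpu_va_offset; /* 0 today: gpu va == cpu va */
	};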

> 
> Second, I see hmm_range_fault as having two main design patterns
> interactions. Either it is the entire exclusive owner of a single
> device private page table and fully mirrors the mm page table into the
> device table.
> 
> Or it is a selective mirror where it copies part of the mm page table
> into a "vma" of a possibly shared device page table. The
> hmm_range_fault bit would exclusively own it's bit of VMA.

Can you explain what the "hmm_range_fault bit" is?


The whole page table (process address space) is mirrored, but we don't have to copy the whole CPU page table into the device page table. We only need to copy the page table entries of the address ranges that are accessed by the GPU; for ranges the GPU never accesses, there is no need to set up GPU page table entries.
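In other words, hmm_range_fault is invoked only over the sub-range the GPU actually faulted on, roughly following the pattern in Documentation/mm/hmm.rst (gpu_fault_start/end are placeholders for the faulting range):

	struct hmm_range range = {
		.notifier = &notifier,        /* interval notifier for this range */
		.start = gpu_fault_start,     /* only the range the GPU accessed */
		.end = gpu_fault_end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};

	range.notifier_seq = mmu_interval_read_begin(&notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	/* then take the driver lock, check mmu_interval_read_retry(),
	 * and program GPU page table entries for [start, end) only
	 */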

> 
> So I find it a quite strange that this RFC is creating VMA's,
> notifiers and ranges on the fly. That should happen when userspace
> indicates it wants some/all of the MM to be bound to a specific device
> private pagetable/address space.

We register notifiers on the fly because the GPU doesn't access all valid CPU virtual addresses; it only accesses a subset of them.

Do you think registering one huge mmu notifier to cover the whole address space would be better? I am not sure, but it would generate many more unnecessary callbacks from the mmu to the device driver.

Similarly, we create ranges on the fly only for the ranges that are accessed by the gpu. But we have an idea to keep one gigantic range/VMA representing the whole address space while creating only some "gpu page table state ranges" on the fly. This idea requires some refactoring of our xe driver, and we will evaluate it further to decide whether we want to go this way.
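Concretely, on a GPU page fault the driver would register an interval notifier covering just the faulted range, something like the sketch below (svm_range, fault_start/end and xe_svm_notifier_ops are placeholders):

	/* Register an interval notifier for only the range the GPU faulted on */
	ret = mmu_interval_notifier_insert(&svm_range->notifier, current->mm,
					   fault_start, fault_end - fault_start,
					   &xe_svm_notifier_ops);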


> 
> I also agree with the general spirit of the remarks that there should
> not be a process binding or any kind of "global" character
> device. 

Even though a global pseudo device looks bad, it does come with some benefits. For example, if you want to set a memory attribute on a shared virtual address range across all devices, you can set it through one ioctl on the global device. We have agreed to remove our global character device, and we will repeat the memory attribute setting on all devices one by one.

Is /dev/nvidia-uvm a global character device for uvm purposes?
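With per-device char devs, userspace would then repeat the attribute setting per device fd, along these lines (DRM_IOCTL_XE_MADVISE and the struct are hypothetical, just to illustrate the loop):

	/* Hypothetical userspace sketch: apply the attribute on each device */
	for (int i = 0; i < num_devices; i++) {
		struct drm_xe_madvise args = {
			.start = va,
			.size = size,
			.attr = DRM_XE_ATTR_PREFERRED_LOC,
		};
		ioctl(device_fd[i], DRM_IOCTL_XE_MADVISE, &args);
	}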

> Device private page tables are by their very nature private to
> the device and should be created through a device specific char
> dev. If you have a VMA concept for these page tables then a HMM
> mirroring one is simply a different type of VMA along with all the
> others.
> 
> I was also looking at the mmu notifier register/unregister with some
> suspicion, it seems wrong to have a patch talking about "process
> exit". Notifiers should be destroyed when the device private page
> table is destroyed, or the VMA is destroyed. 

Right. I have dropped the concept of process exit and will soon send out the new series for review.

Oak

> Attempting to connect it
> to a process beyond tying the lifetime of a page table to a FD is
> nonsensical.
> 
> Jason



