On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> +/**
> + * xe_svm_build_sg() - build a scatter gather table for all the physical pages/pfn
> + * in a hmm_range.
> + *
> + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> + * has the pfn numbers of pages that back up this hmm address range.
> + * @st: pointer to the sg table.
> + *
> + * All the contiguous pfns will be collapsed into one entry in
> + * the scatter gather table. This is for the convenience of
> + * later on operations to bind address range to GPU page table.
> + *
> + * This function allocates the storage of the sg table. It is
> + * caller's responsibility to free it calling sg_free_table.
> + *
> + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> + */
> +int xe_svm_build_sg(struct hmm_range *range,
> +		    struct sg_table *st)
> +{
> +	struct scatterlist *sg;
> +	u64 i, npages;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> +
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		unsigned long addr = range->hmm_pfns[i];
> +
> +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg = sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}

I didn't look at this series closely, but I wanted to make a few
remarks.

This part I don't like at all. Yes, the DMA API interaction with
hmm_range_fault is pretty bad, but it should not be hacked around like
this. Leon is working on a series to improve it:

https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@xxxxxxxxxx/

Please participate there too. In the meantime you should just call
dma_map_page() for every single page, like ODP does.
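For reference, the per-page approach would look roughly like the sketch
below. This is only an illustration of the suggestion, not code from the
series: the function name xe_svm_map_pages and its signature are made up
for the example, and error unwinding on the caller's side is elided. Only
hmm_pfn_to_page(), HMM_PFN_VALID, dma_map_page() and dma_mapping_error()
are real kernel APIs here.

```c
/*
 * Sketch (hypothetical helper): map each page faulted in by
 * hmm_range_fault() individually with dma_map_page(), the way ODP
 * does, instead of hand-building a collapsed scatterlist.
 */
static int xe_svm_map_pages(struct device *dev, struct hmm_range *range,
			    dma_addr_t *dma_addrs, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		struct page *page;

		/* hmm_pfns[] entries are flags+pfn, not addresses */
		if (!(range->hmm_pfns[i] & HMM_PFN_VALID))
			return -EFAULT;

		page = hmm_pfn_to_page(range->hmm_pfns[i]);
		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -EFAULT;	/* caller unmaps pages 0..i-1 */
	}
	return 0;
}
```

Note this also sidesteps a bug in the quoted patch, which stores raw
hmm_pfns[] values into sg_dma_address() as if they were DMA addresses.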
Also, I tried to follow the large discussion at the end, but it was
quite hard to read the text in Lore for some reason, so I will just
offer some general points on how I see hmm_range_fault being used by
drivers.

First of all, the device should have a private page table. At least
one, but ideally many. Obviously it should work, so I found the talk
about problems with virtualization a bit puzzling. Either the private
page table works virtualized, including faults, or it should not be
available.

Second, I see hmm_range_fault as having two main patterns of
interaction. Either it is the entire exclusive owner of a single
device private page table and fully mirrors the mm page table into the
device table, or it is a selective mirror where it copies part of the
mm page table into a "vma" of a possibly shared device page table. In
the latter case the hmm_range_fault bit would exclusively own its bit
of the VMA.

So I find it quite strange that this RFC is creating VMAs, notifiers
and ranges on the fly. That should happen when userspace indicates it
wants some/all of the MM to be bound to a specific device private
pagetable/address space.

I also agree with the general spirit of the remarks that there should
not be a process binding or any kind of "global" character device.
Device private page tables are by their very nature private to the
device and should be created through a device-specific char dev. If
you have a VMA concept for these page tables, then a HMM-mirroring one
is simply a different type of VMA along with all the others.

I was also looking at the mmu notifier register/unregister with some
suspicion; it seems wrong to have a patch talking about "process
exit". Notifiers should be destroyed when the device private page
table is destroyed, or when the VMA is destroyed. Attempting to
connect them to a process, beyond tying the lifetime of a page table
to an FD, is nonsensical.

Jason