On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> +/**
> + * xe_svm_build_sg() - build a scatter gather table for all the physical pages/pfn
> + * in a hmm_range.
> + *
> + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> + * has the pfn numbers of pages that back up this hmm address range.
> + * @st: pointer to the sg table.
> + *
> + * All the contiguous pfns will be collapsed into one entry in
> + * the scatter gather table. This is for the convenience of
> + * later on operations to bind address range to GPU page table.
> + *
> + * This function allocates the storage of the sg table. It is
> + * caller's responsibility to free it calling sg_free_table.
> + *
> + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> + */
> +int xe_svm_build_sg(struct hmm_range *range,
> +		    struct sg_table *st)
> +{
> +	struct scatterlist *sg;
> +	u64 i, npages;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> +
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		unsigned long addr = range->hmm_pfns[i];
> +
> +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg = sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}

I didn't look at this series closely, but I wanted to make a few
remarks.

This part I don't like at all. Yes, the DMA API interaction with
hmm_range_fault is pretty bad, but it should not be hacked around like
this. Leon is working on a series to improve it:

https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@xxxxxxxxxx/

Please participate there too. In the meantime you should just call
dma_map_page() for every single page, like ODP does.
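For reference, the per-page approach would look roughly like the sketch
below. This is only an illustration of the suggestion, not code from the
series: the function name xe_svm_map_pages and its signature are made up
for the example, and error unwinding on the caller's side is elided. Only
hmm_pfn_to_page(), HMM_PFN_VALID, dma_map_page() and dma_mapping_error()
are real kernel APIs here.

```c
/*
 * Sketch (hypothetical helper): map each page faulted in by
 * hmm_range_fault() individually with dma_map_page(), the way ODP
 * does, instead of hand-building a collapsed scatterlist.
 */
static int xe_svm_map_pages(struct device *dev, struct hmm_range *range,
			    dma_addr_t *dma_addrs, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		struct page *page;

		/* hmm_pfns[] entries are flags+pfn, not addresses */
		if (!(range->hmm_pfns[i] & HMM_PFN_VALID))
			return -EFAULT;

		page = hmm_pfn_to_page(range->hmm_pfns[i]);
		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -EFAULT;	/* caller unmaps pages 0..i-1 */
	}
	return 0;
}
```

Note this also sidesteps a bug in the quoted patch, which stores raw
hmm_pfns[] values into sg_dma_address() as if they were DMA addresses.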
Also, I tried to follow the large discussion at the end, but it was
quite hard to read the text in Lore for some reason, so I will just
offer some general points on how I see hmm_range_fault being used by
drivers.

First of all, the device should have a private page table. At least
one, but ideally many. Obviously it should work, so I found the talk
about problems with virtualization a bit puzzling. Either the private
page table works virtualized, including faults, or it should not be
available.

Second, I see hmm_range_fault as having two main patterns of
interaction. Either it is the entire exclusive owner of a single
device private page table and fully mirrors the mm page table into the
device table, or it is a selective mirror where it copies part of the
mm page table into a "vma" of a possibly shared device page table. In
the latter case the hmm_range_fault bit would exclusively own its bit
of the VMA.

So I find it quite strange that this RFC is creating VMAs, notifiers
and ranges on the fly. That should happen when userspace indicates it
wants some/all of the MM to be bound to a specific device private
pagetable/address space.

I also agree with the general spirit of the remarks that there should
not be a process binding or any kind of "global" character device.
Device private page tables are by their very nature private to the
device and should be created through a device-specific char dev. If
you have a VMA concept for these page tables, then a HMM-mirroring one
is simply a different type of VMA along with all the others.

I was also looking at the mmu notifier register/unregister with some
suspicion; it seems wrong to have a patch talking about "process
exit". Notifiers should be destroyed when the device private page
table is destroyed, or when the VMA is destroyed. Attempting to
connect them to a process, beyond tying the lifetime of a page table
to an FD, is nonsensical.

Jason