Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

Jason Gunthorpe <jgg@xxxxxxxxxx> · Fri, 5 Apr 2024 09:37:25 -0300

On Fri, Apr 05, 2024 at 03:33:10AM +0000, Zeng, Oak wrote:
> > 
> > I didn't look at this series a lot but I wanted to make a few
> > remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> > with hmm_range_fault is pretty bad, but it should not be hacked
> > around like this. Leon is working on a series to improve it:
> > 
> > https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@xxxxxxxxxx/
> 
> 
> I completely agree the code above is really ugly. We definitely want
> to adapt to a better way. I will spend some time on the above series.
> 
> > 
> > Please participate there too. In the meantime you should just call
> > dma_map_page for every single page like ODP does.
> 
> The code above deals with a case where a dma map is not needed. As I
> understand it, whether we need a dma map depends on the device
> topology. For example, when a device accesses host memory or another
> device's memory through pcie, we need dma mapping; if the connection
> b/t devices is xelink (similar to nvidia's nvlink), all devices'
> memory can be in the same address space, so no dma mapping is needed.

Then you call dma_map_page to do your DMA side and you avoid it for
the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
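
As a rough illustration of that split, here is a userspace sketch, not
kernel code. struct fake_page, fake_dma_map_page() and build_dev_addrs()
are hypothetical stand-ins invented for this example; a real driver
would work on struct page, test is_device_private_page() and call
dma_map_page() (and check the result with dma_mapping_error()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page, for illustration only. */
struct fake_page {
	uint64_t phys;        /* system memory physical address */
	bool device_private;  /* true if backed by DEVICE_PRIVATE memory */
	uint64_t dev_local;   /* address in the device's own address space */
};

/* Pretend DMA mapping: the bus address is phys plus a fixed offset. */
static uint64_t fake_dma_map_page(const struct fake_page *p)
{
	return p->phys + 0x100000000ULL;
}

/*
 * Fill out[i] with the address the device should use for pages[i].
 * System pages go through the (fake) DMA API, DEVICE_PRIVATE pages
 * bypass it. Returns how many pages were DMA mapped.
 */
static size_t build_dev_addrs(const struct fake_page *pages, size_t n,
			      uint64_t *out)
{
	size_t mapped = 0;

	assert(pages && out);
	for (size_t i = 0; i < n; i++) {
		if (pages[i].device_private) {
			out[i] = pages[i].dev_local;
		} else {
			out[i] = fake_dma_map_page(&pages[i]);
			mapped++;
		}
	}
	return mapped;
}
```

The decision is made per page, so a flat array of device addresses is
enough for this sketch and no SG list is involved.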

> > Also, I tried to follow the large discussion in the end but it was
> > quite hard to read the text in Lore for some reason.
> 
> Did you mean this discussion?
> https://lore.kernel.org/dri-devel/?q=Making+drm_gpuvm+work+across+gpu+devices
> This link works well for me with the chrome browser.

That is the one I am referring to

> > I would just opine some general points on how I see hmm_range_fault
> > being used by drivers.
> > 
> > First of all, the device should have a private page table. At least
> > one, but ideally many. Obviously it should work, so I found the talk
> > about problems with virtualization a bit puzzling. Either the
> > private page table works virtualized, including faults, or it should
> > not be available..
>
> To be very honest, I was also very confused. In this series, I had
> one very fundamental assumption that with hmm any valid cpu virtual
> address is also a valid gpu virtual address. But Christian had a
> very strong opinion that the gpu va can have an offset to cpu va. He
> mentioned a failed use case with amdkfd and claimed an offset can
> solve their problem.

Offset is something different, I said the VM's view of the page table
should fully work. You shouldn't get into a weird situation where the
VM is populating the page table and can't handle faults or something.

If the VMM has a weird design where there is only one page table and
it needs to partition space based on slicing it into regions then
fine, but the delegated region to the guest OS should still work
fully.

> > Or it is a selective mirror where it copies part of the mm page table
> > into a "vma" of a possibly shared device page table. The
> > hmm_range_fault bit would exclusively own its bit of VMA.
> 
> Can you explain what the "hmm_range_fault bit" is?

I mean if you set up a mirror VMA in a device private page table then
that range of VA will be owned by the hmm_range_fault code and will
mirror a subset of a mm into that VMA. This is the offset you mention
above. The MM's VA and the device private page table VA do not have to
be the same (e.g. we implement this option in RDMA's ODP).

A 1:1 SVA mapping is a special case of this where there is a single
GPU VMA that spans the entire process address space with a 1:1 VA (no
offset).
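
To make the relationship concrete, here is a small userspace sketch.
struct mirror_vma and mirror_dev_to_mm() are hypothetical names made up
for this example and are not taken from any driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one mirror VMA in a device page table. */
struct mirror_vma {
	uint64_t dev_start;  /* first device VA covered by the mirror */
	uint64_t length;     /* size of the mirrored window in bytes */
	uint64_t mm_start;   /* first mm (CPU) VA being mirrored */
};

/*
 * Translate a device VA into the mm VA that would be handed to
 * hmm_range_fault(). Returns false if dev_va is outside the mirror.
 */
static bool mirror_dev_to_mm(const struct mirror_vma *m, uint64_t dev_va,
			     uint64_t *mm_va)
{
	assert(m && mm_va);
	if (dev_va < m->dev_start || dev_va - m->dev_start >= m->length)
		return false;
	*mm_va = m->mm_start + (dev_va - m->dev_start);
	return true;
}
```

With dev_start == mm_start == 0 and a length covering the process
address space this degenerates into the 1:1 SVA case; any other choice
gives the offset form.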

> Do you think registering a huge mmu notifier to cover the whole
> address space would be good? I don't know here, but there would be
> many more unnecessary callbacks from the mmu to the device driver.

Yes. IMHO you should try to optimize the invalidations away in driver
logic not through dynamic mmu notifiers. Installing and removing a
notifier is very expensive.
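
One way to picture "optimizing the invalidations away in driver logic"
is the userspace sketch below. It assumes a single notifier covering
the whole address space and a driver-side list of ranges that actually
have device page table state. struct gpu_range and
driver_invalidate() are hypothetical names, and a real driver would
likely use an interval tree plus locking rather than a linear scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of device page table state for [start, end). */
struct gpu_range {
	uint64_t start;
	uint64_t end;
	bool mapped;  /* true while the device page table has entries */
};

/*
 * Called for every invalidation of [start, end) reported by the one
 * big notifier. Ranges that are unmapped or do not overlap are
 * skipped cheaply; only overlapping mapped ranges are zapped.
 * Returns the number of ranges that needed real work.
 */
static size_t driver_invalidate(struct gpu_range *ranges, size_t n,
				uint64_t start, uint64_t end)
{
	size_t zapped = 0;

	assert(ranges && start < end);
	for (size_t i = 0; i < n; i++) {
		if (!ranges[i].mapped)
			continue;
		if (ranges[i].end <= start || ranges[i].start >= end)
			continue;
		ranges[i].mapped = false;  /* stands in for a page table zap */
		zapped++;
	}
	return zapped;
}
```

Callbacks that touch nothing the GPU has mapped return without doing
any device work, and no notifier is installed or removed along the way.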

> Similarly, we create ranges only on the fly for those ranges that
> are accessed by the gpu. But we have some idea to keep one gigantic
> range/VMA representing the whole address space while creating only
> some "gpu page table state range" on the fly. This idea requires
> some refactoring of our xe driver and we will evaluate it more to
> decide whether we want to go this way.

This is a better direction.

> > I also agree with the general spirit of the remarks that there should
> > not be a process binding or any kind of "global" character
> > device. 
> 
> Even though a global pseudo device looks bad, it does come with some
> benefit. For example, if you want to set a memory attribute on a
> shared virtual address range b/t all devices, you can set such
> attributes through an ioctl of the global device. We have agreed to
> remove our global character device and we will repeat the memory
> attributes setting on all devices one by one.

That implies you have a global shared device private page table which
is sort of impossible because of how the DMA API works.

Having the kernel iterate over all the private page tables vs having
the userspace iterate over all the private page tables doesn't seem
like a worthwhile difference to justify a global cdev.

> Is /dev/nvidia-uvm a global character device for uvm purpose?

No idea, I wouldn't assume anything the nvidia drivers do is aligned
with what upstream expects.

Jason