On Tue, Apr 30, 2024 at 09:09:15PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> > > > > Yes there is another common scheme where you bind a window of
> > > > > CPU to a window on the device and mirror a fixed range, but
> > > > > this is a quite different thing. It is not SVA, it has a fixed
> > > > > range, and it is probably bound to a single GPU VMA in a
> > > > > multi-VMA device page table.
> > > >
> > > > And this above here is exactly what we're implementing, and the
> > > > GPU page-tables are populated using device faults. Regions
> > > > (large) of the mirrored CPU mm need to coexist in the same GPU
> > > > vm as traditional GPU buffer objects.
> > >
> > > Well, not really, if that was the case you'd have a single VMA
> > > over the entire bound range, not dynamically create them.
> > >
> > > A single VMA that uses hmm_range_fault() to populate the VM is
> > > completely logical.
> > >
> > > Having a hidden range of mm binding and then creating/destroying
> > > 2M VMAs dynamically is the thing that doesn't make a lot of sense.
> >
> > I only noticed this thread now but fyi I did dig around in the
> > implementation and in summary it's an absolute no-go imo, for
> > multiple reasons. It starts with this approach of trying to mirror
> > cpu vmas (which I think originated from amdkfd) leading to all
> > kinds of locking fun, and then it gets substantially worse when you
> > dig into the details.
>
> :(
>
> Why does the DRM side struggle so much with hmm_range_fault? I would
> have thought it should have a fairly straightforward and logical
> connection to the GPU page table.

Short summary is that traditionally gpu memory was managed with buffer
objects, and each individual buffer object owns the page tables for its
va range. For hmm you don't have that buffer object, and you want the
pagetables to be fairly independent from any mapping/backing storage,
maybe even with their own locking like linux cpu pagetables have.

Getting to that world is a lot of reshuffling, and so thus far all the
code went with the quick-hack route of creating ad-hoc ranges that look
like buffer objects to the rest of the driver code. This includes the
merged amdkfd hmm code, and if you dig around in that it results in
some really annoying locking inversions, because that middle layer of
fake buffer-object lookalikes only gets in the way and creates a fairly
fundamental impedance mismatch with core linux mm locking.
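For reference, the page-centric flow without any of that middle layer
is roughly just the retry loop from Documentation/mm/hmm.rst. Rough
sketch below, the my_gpu_* names are made up for illustration, the rest
is the stock kernel API:

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>

struct my_gpu_vm {                      /* hypothetical driver state */
        struct mmu_interval_notifier notifier;
        struct mm_struct *mm;
        struct mutex pt_lock;           /* protects the gpu pagetable */
        /* ... gpu pagetable ... */
};

/* hypothetical: dma_map_page()s each faulted pfn and writes the gpu pte */
int my_gpu_pt_update(struct my_gpu_vm *gvm, struct hmm_range *range);

static int my_gpu_populate_range(struct my_gpu_vm *gvm,
                                 unsigned long start, unsigned long end)
{
        struct hmm_range range = {
                .notifier = &gvm->notifier,
                .start = start,
                .end = end,
                .default_flags = HMM_PFN_REQ_FAULT,
                .dev_private_owner = gvm,
        };
        int ret;

        range.hmm_pfns = kvcalloc((end - start) >> PAGE_SHIFT,
                                  sizeof(*range.hmm_pfns), GFP_KERNEL);
        if (!range.hmm_pfns)
                return -ENOMEM;

again:
        /* snapshot the notifier sequence before faulting */
        range.notifier_seq = mmu_interval_read_begin(&gvm->notifier);

        mmap_read_lock(gvm->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(gvm->mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;     /* raced with an invalidation */
                goto out;
        }

        /*
         * Only touch the gpu pagetable if no invalidation ran between
         * read_begin and here. This sequence check is also what would
         * move to a per-pgtable basis in the fancier scheme mentioned
         * further down.
         */
        mutex_lock(&gvm->pt_lock);
        if (mmu_interval_read_retry(&gvm->notifier, range.notifier_seq)) {
                mutex_unlock(&gvm->pt_lock);
                goto again;
        }
        ret = my_gpu_pt_update(gvm, &range);
        mutex_unlock(&gvm->pt_lock);
out:
        kvfree(range.hmm_pfns);
        return ret;
}

The point being that the only state here is the pfn array and the gpu
pagetable itself, there's no vma-lookalike object anywhere that could
invert lock ordering against core mm.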
> FWIW, it does make sense to have both a window and a full MM option
> for hmm_range_fault. ODP does both and it is fine..
>
> > I think until something more solid shows up you can just ignore
> > this. I do fully agree that for sva the main mirroring primitive
> > needs to be page centric, so dma_map_sg.
>                                ^^^^^^^^^^
>
> dma_map_page

Oops yes.

> > There's a bit of a question around how to make the necessary
> > batching efficient and the locking/mmu_interval_notifier scale
> > enough, but I had some long chats with Thomas and I think there are
> > enough options to reach pretty much any possible upstream
> > consensus. So I'm not worried.
>
> Sure, the new DMA API will bring some more considerations to this as
> well. ODP uses a 512M granular scheme and it seems to be OK. By far
> the worst part of all this is the faulting performance. I've yet to
> hear any complaints about mmu notifier performance..

Yeah I don't expect there to be any need for performance improvements
on the mmu notifier side of things. All the concerns I've heard felt
rather theoretical, or were just fallout of that fake buffer object
layer in the middle.

At worst I guess the gpu pagetables need per-pgtable locking like the
cpu pagetables have, and then maybe keep track of mmu notifier sequence
numbers on a per-pgtable basis, so that invalidates and faults on
different va ranges have no impact on each other. But even that is most
likely way, way down the road.

> > But first this needs to be page-centric in the fundamental
> > mirroring approach.
>
> Yes

Ok, clarifying consensus on this was the main reason I replied, it felt
a bit like the thread was derailing into details that don't yet matter.

Thanks, Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch