Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

On Wed, Apr 24, 2024 at 10:57:54AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:
> 
> > AMD seems to register notifiers on demand for parts of the address space
> > [1], I think Nvidia's open source driver does this too (can look this up
> > if needed). We (Intel) also do this in Xe and the i915 for userptrs
> > (explicitly binding a user address via IOCTL) too and it seems to work
> > quite well.
> 
> I always thought AMD's implementation of this stuff was bad..
> 

No comment on the quality of AMD's implementation.

But in general the view among my team members is that registering
notifiers on demand for sub-ranges is an accepted practice. If we find
the performance of this is terrible, it would be fairly easy to switch
to one large notifier mirroring the CPU's entire address space. This is
a small design detail IMO.
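
For reference, the on-demand registration I'm describing is just the mmu
interval notifier API used per VA range. A rough, untested sketch (struct
and callback names below are made up, not actual Xe code):

struct xe_svm_range {
        struct mmu_interval_notifier notifier;
        /* driver metadata for this VA range lives here */
};

/* .invalidate callback (xe_svm_invalidate) is hypothetical */
static const struct mmu_interval_notifier_ops xe_svm_notifier_ops = {
        .invalidate = xe_svm_invalidate,
};

static int xe_svm_range_register(struct xe_svm_range *range,
                                 struct mm_struct *mm,
                                 unsigned long start, unsigned long length)
{
        /* register a notifier for just this sub range of the mm */
        return mmu_interval_notifier_insert(&range->notifier, mm, start,
                                            length, &xe_svm_notifier_ops);
}

Switching to one large notifier would just mean a single insert covering
the whole address space rather than one per range.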

> > > > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > > > MM VA, it should not be created on demand.
> > > > >
> > > > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > > > system allocator range (this will be in v2 of this series).
> > > > 
> > > > Don't do 2.
> > 
> > You have to create something, actually 2 things, on a GPU page fault.
> > Something to track the page table state and something to track VRAM
> > memory allocation. Both AMD and Nvidia's open source drivers do this.
> 
> VRAM memory allocation should be tracked by the mm side, under the
> covers of hmm_range_fault (or migration prior to invoking
> hmm_range_fault).
> 

Yes, the above step I describe is optionally done *before* calling
hmm_range_fault.

> VRAM memory allocation or management has nothing to do with SVA.
> 
> From there the only need is to copy hmm_range_fault results into GPU
> PTEs. You definitely do not *need* some other data structure.
> 

You do not *need* some other data structure as you could always just
walk the page tables, but in practice a data structure exists in a tree
of sorts with the key being a VA range. The data structure has metadata
about the mapping; all GPU drivers seem to have this. This data
structure, along with the pages returned from hmm_range_fault, is used
to program the GPU PTEs.

Again, the allocation of this data structure happens *before* calling
hmm_range_fault on the first GPU fault within an unmapped range.
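
To make the ordering concrete, the fault flow I have in mind looks
roughly like the below (untested sketch; gpu_vma_find_or_create,
gpu_bind_range, CHUNK_SIZE, etc. are made-up names, and the notifier
seqno retry loop / driver locking is elided for brevity):

static int gpu_fault_handler(struct gpu_vm *vm, unsigned long addr)
{
        unsigned long start = ALIGN_DOWN(addr, CHUNK_SIZE);
        unsigned long pfns[CHUNK_SIZE >> PAGE_SHIFT];
        struct hmm_range range = {
                .notifier = &vm->notifier,
                .start = start,
                .end = start + CHUNK_SIZE,
                .hmm_pfns = pfns,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        struct gpu_vma *vma;
        int ret;

        /* 1) allocate / look up driver tracking for this VA range,
         * *before* hmm_range_fault is involved */
        vma = gpu_vma_find_or_create(vm, range.start, range.end);
        if (IS_ERR(vma))
                return PTR_ERR(vma);

        /* 2) optionally migrate the range to VRAM here (separate step) */

        /* 3) mirror the CPU page tables */
        range.notifier_seq = mmu_interval_read_begin(range.notifier);
        mmap_read_lock(vm->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(vm->mm);
        if (ret)
                return ret;

        /* 4) program GPU PTEs from pfns[] plus the metadata in vma */
        return gpu_bind_range(vm, vma, pfns);
}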

> > > > > The reason is, we still need some gpu corresponding structure to
> > > > > match the cpu vm_area_struct.
> > > > 
> > > > Definitely not.
> > > 
> > > See explanation above.
> > 
> > Agree GPU doesn't need to match vm_area_struct but the allocation must
> > be a subset of (or equal to) a vm_area_struct. Again other drivers do
> > this too.
> 
> No, absolutely not. There can be no linking of CPU vma_area_struct to
> how a driver operates hmm_range_fault().
>

Agree. Once we are calling hmm_range_fault, the vm_area_struct is out of
the picture.

We also will never store a vm_area_struct in our driver; it is only
looked up on demand when required for migration.
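
i.e. something along these lines (untested sketch, names illustrative;
the src/dst pfn array allocation and the actual copy step are omitted):

static int gpu_migrate_to_vram(struct mm_struct *mm,
                               unsigned long start, unsigned long end)
{
        struct vm_area_struct *vma;
        struct migrate_vma args = {};
        int ret = 0;

        mmap_read_lock(mm);
        /* looked up on demand, only for the migration step */
        vma = find_vma_intersection(mm, start, end);
        if (!vma) {
                mmap_read_unlock(mm);
                return -EFAULT;
        }

        /* clamp the migration window to the CPU VMA */
        args.vma = vma;
        args.start = max(start, vma->vm_start);
        args.end = min(end, vma->vm_end);

        /*
         * Fill in args.src / args.dst / args.pgmap_owner / args.flags,
         * then migrate_vma_setup(&args), copy the pages,
         * migrate_vma_pages(&args), migrate_vma_finalize(&args).
         */

        mmap_read_unlock(mm);
        return ret;
}

The vma pointer never leaves the scope of the migration; nothing about it
is stored for the hmm_range_fault / bind path.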
 
> You probably need to do something like this for your migration logic,
> but that is seperate.
> 

Yes.

> > > > You call hmm_range_fault() and it does everything for you. A driver
> > > > should never touch CPU VMAs and must not be aware of them in any way.
> > 
> > struct vm_area_struct is an argument to the migrate_vma* functions [4], so
> > yes drivers need to be aware of CPU VMAs.
> 
> That is something else. If you want to mess with migration during your
> GPU fault path then fine that is some "migration module", but it
> should have NOTHING to do with how hmm_range_fault() is invoked or how
> the *SVA* flow operates.
>

Agree. hmm_range_fault is invoked in an opaque way whether the backing
store is SRAM or VRAM and the flow is uniform. The only difference is how
we resolve an hmm pfn to a value in the GPU page tables (device private
pages resolve to device physical addresses, while CPU pages are dma
mapped).
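
In code terms, something like the below, where the pte encoding helpers
are made up (and error handling such as dma_mapping_error() is skipped):

static u64 gpu_pte_from_hmm_pfn(struct device *dev, unsigned long hmm_pfn)
{
        struct page *page = hmm_pfn_to_page(hmm_pfn);

        if (is_device_private_page(page))
                /* VRAM: device private page -> device physical address */
                return gpu_encode_vram_pte(page);

        /* SRAM: dma map the CPU page and use the bus address */
        return gpu_encode_sram_pte(dma_map_page(dev, page, 0, PAGE_SIZE,
                                                DMA_BIDIRECTIONAL));
}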

> You are mixing things up here, this thread is talking about
> hmm_range_fault() and SVA.
>

I think there was some confusion earlier in this thread. I believe we are
now aligned?

Thanks for your time.
Matt

> migration is something that happens before doing the SVA mirroring
> flows.
> 
> Jason
