Re: [PATCH v6 32/32] drm/doc: gpusvm: Add GPU SVM documentation

On Fri, Feb 28, 2025 at 01:34:42PM +1100, Alistair Popple wrote:
> On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> > Add documentation for agreed upon GPU SVM design principles, current
> > status, and future plans.
> 
> Thanks for writing this up. In general I didn't see anything too controversial
> but added a couple of comments below.
> 
> > 
> > v4:
> >  - Address Thomas's feedback
> > v5:
> >  - s/Current/Basline (Thomas)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > Reviewed-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > ---
> >  Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
> >  Documentation/gpu/rfc/index.rst  |  4 ++
> >  2 files changed, 88 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > 
> > diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> > new file mode 100644
> > index 000000000000..063412160685
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > @@ -0,0 +1,84 @@
> > +===============
> > +GPU SVM Section
> > +===============
> > +
> > +Agreed upon design principles
> > +=============================
> 
> As a general comment I think it would be nice if we could add some rationale/
> reasons for these design principles. Things inevitably change, and if/when
> we need to violate or update these principles it would be good to have some
> documented rationale for why we decided on them in the first place, because
> the reasoning may have become invalid by then.
> 

Let me try to add some rationale for the various cases.

> > +* migrate_to_ram path
> > +	* Rely only on core MM concepts (migration PTEs, page references, and
> > +	  page locking).
> > +	* No driver specific locks other than locks for hardware interaction in
> > +	  this path. These are not required, and it is generally a bad
> > +	  idea to invent driver defined locks to seal core MM races.
> 
> In principle I agree. The problem I think you will run into is the analogue of
> what adding a trylock_page() to do_swap_page() fixes, which is that a concurrent
> GPU fault (which is highly likely after handling a CPU fault, due to the GPU
> PTEs becoming invalid) may, depending on your design, kick off a migration of
> the page to the GPU via migrate_vma_setup().
> 
> The problem with that is migrate_vma_setup() will temporarily raise the folio
> refcount, which can cause the migrate_to_ram() callback to fail, but the
> elevated refcount from migrate_to_ram() can also cause the GPU migration to
> fail, thus leading to a live-lock when both CPU and GPU fault handlers just
> keep retrying.
> 
> This was particularly problematic for us on multi-GPU setups, and our solution
> was to introduce a migration critical section in the form of a mutex to ensure
> only one thread was calling migrate_vma_setup() at a time.
> 
> And now that I've looked at UVM development history, and remembered more
> context, this is why I had a vague recollection that adding a migration entry
> in do_swap_page() would be better than taking a page lock. Doing so fixes the
> issue with concurrent GPU faults blocking migrate_to_ram() because it makes
> migrate_vma_setup() ignore the page.
> 

OK, this is something to keep an eye on. In the current Xe code, we try to
migrate a chunk of memory from the CPU to the GPU in our GPU fault handler
once per fault. If that fails due to racing CPU access, we simply leave the
chunk in CPU memory and move on. We don't have any real migration policies
in Xe yet; that is being worked on as a follow-up to my series. However, if
we had a policy requiring that a memory region must be in GPU memory, this
could conceivably lead to a livelock with concurrent CPU and GPU access. I'm
still not fully convinced that a driver-side lock is the solution here, but
without encountering the issue on our side, I can't be completely certain
what the solution is.
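
To make that concrete, here is a minimal sketch of the best-effort flow I
mean, assuming one chunk per fault. The types and helpers (struct gpu_vm,
xe_svm_migrate_to_vram(), xe_svm_map_pages()) are hypothetical placeholders,
not the actual Xe functions; the point is only that a failed migration
degrades to mapping system memory rather than retrying the migration.

	/* Hedged sketch, not the real Xe fault handler. */
	static int gpu_fault_handle_chunk(struct gpu_vm *vm, u64 addr)
	{
		/*
		 * Try once to move the faulting chunk to VRAM. A failure
		 * (racing CPU access, elevated refcounts, ...) is not an
		 * error: the pages simply stay in system memory.
		 */
		if (xe_svm_migrate_to_vram(vm, addr))	/* hypothetical */
			pr_debug("migration raced, pages stay in SRAM\n");

		/* Map whatever backing the pages ended up with. */
		return xe_svm_map_pages(vm, addr);	/* hypothetical */
	}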

> > +	* Partial migration is supported (i.e., a subset of pages attempting to
> > +	  migrate can actually migrate, with only the faulting page guaranteed
> > +	  to migrate).
> > +	* Driver handles mixed migrations via retry loops rather than locking.
> >
> > +* Eviction
> 
> This is a term that seems to be somewhat overloaded depending on context, so a
> definition would be nice. Is your view of eviction migrating data from GPU back
> to CPU, without a virtual address, to free up GPU memory? (That's what I think
> of, but it would be good to make sure we're in sync.)
> 

Yes. When GPU memory is oversubscribed, we pick the physical backing to evict
from an LRU list; in Xe, that backing is a TTM BO.
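
A hedged sketch of what I mean, where all helper names are hypothetical
placeholders rather than actual Xe/TTM functions:

	static int evict_one_bo(struct xe_device *xe)
	{
		struct ttm_buffer_object *bo;

		/* Pick a victim purely from the physical-memory LRU. */
		bo = pick_lru_victim(xe);		/* hypothetical */
		if (!bo)
			return -ENOMEM;

		/*
		 * Move the device pages backing the BO to system memory. No
		 * virtual addresses or vma locks are consulted; the device
		 * pages carry what is needed via page->zone_device_data.
		 */
		return migrate_bo_pages_to_sram(bo);	/* hypothetical */
	}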

> > +	* Only looking at physical memory data structures and locks as opposed to
> > +	  looking at virtual memory data structures and locks.
> > +	* No looking at mm/vma structs or relying on those being locked.
> 
> Agree with the above points.
> 
> > +* GPU fault side
> > +	* mmap_read only used around core MM functions which require this lock
> > +	  and should strive to take mmap_read lock only in GPU SVM layer.
> > +	* Big retry loop to handle all races with the mmu notifier under the gpu
> > +	  pagetable locks/mmu notifier range lock/whatever we end up calling
> > +          those.
> 
> Again, one of the issues here (particularly with multi-GPU setups) is that it's
> very easy to live-lock with retry loops, because even attempting a migration
> that fails can cause migration/fault handling in other threads to fail, either
> by calling mmu_notifiers or by taking a page reference.
> 
> Those are probably things that we should fix on the MM side, but for now UVM at
> least uses a lock to ensure forward progress.
>

Again, see above. Right now, migration in Xe is best-effort rather than
mandatory, and perhaps this is masking an issue.

Maybe I should add a comment here noting your concerns and stating that Xe
will be implementing real migration policies and multi-GPU support soon. If
this issue arises, we can revisit the locking guidelines or perhaps help
contribute the necessary core changes to make this work properly.
 
> > +	* Races (especially against concurrent eviction or migrate_to_ram)
> > +	  should not be handled on the fault side by trying to hold locks;
> > +	  rather, they should be handled using retry loops. One possible
> > +	  exception is holding a BO's dma-resv lock during the initial migration
> > +	  to VRAM, as this is a well-defined lock that can be taken underneath
> > +	  the mmap_read lock.
> 
> See my earlier comments. Although note I agree with this in principle, and we
> do just retry if taking the lock fails.
> 
> > +* Physical memory to virtual backpointer
> > +	* Does not work, no pointers from physical memory to virtual should
> > +	  exist.
> 
> Agree. And my rationale is that core-MM can update the virtual address for a
> page without notifying the driver of the new address, for example with
> mremap(). So it's impossible to keep any backpointer to a virtual address up
> to date.
> 

Yep, this is a good example and I will include it in my next rev.

> > +	* Physical memory backpointer (page->zone_device_data) should be stable
> > +	  from allocation to page free.
> 
> Agree. And presumably the rationale is that it is very difficult to safely
> update page->zone_device_data and ensure there aren't concurrent users of it
> unless the page is free (i.e. has a refcount of 0)?
> 

Yes, exactly.
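
For illustration, a hedged sketch of that rule, assuming the backing object
is an xe_bo; the helpers here are hypothetical, only page->zone_device_data
and the dev_pagemap_ops page_free() callback are real:

	/* Set the backpointer once at allocation... */
	static struct page *alloc_one_device_page(struct xe_bo *bo)
	{
		struct page *page = take_free_device_page(bo);	/* hypothetical */

		if (page)
			page->zone_device_data = bo;	/* stable until free */
		return page;
	}

	/* ...and only tear it down when the page is freed. */
	static void xe_devm_page_free(struct page *page)
	{
		struct xe_bo *bo = page->zone_device_data;

		return_device_page(bo, page);		/* hypothetical */
	}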

> > +* GPU pagetable locking
> > +	* Notifier lock only protects range tree, pages valid state for a range
> > +	  (rather than seqno due to wider notifiers), pagetable entries, and
> > +	  mmu notifier seqno tracking, it is not a global lock to protect
> > +          against races.
> > +	* All races handled with big retry as mentioned above.
> 
> Sounds reasonable.
> 
> > +Overview of current design
> > +==========================
> > +
> > +Baseline design is as simple as possible to get a working baseline which can
> > +be built upon.
> > +
> > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > +   :doc: Overview
> > +   :doc: Locking
> > +   :doc: Migrataion
> > +   :doc: Partial Unmapping of Ranges
> > +   :doc: Examples
> > +
> > +Possible future design features
> > +===============================
> > +
> > +* Concurrent GPU faults
> > +	* CPU faults are concurrent so makes sense to have concurrent GPU
> > +	  faults.
> > +	* Should be possible with fine-grained locking in the driver GPU
> > +	  fault handler.
> > +	* No expected GPU SVM changes required.
> > +* Ranges with mixed system and device pages
> > +	* Can be added if required to drm_gpusvm_get_pages fairly easily.
> 
> I don't see much of a need, but also don't see any barriers on the MM side for
> doing that.
>

I don't see any barriers either; I think it would work in Xe with a slight
tweak to our VM bind code.

I'm unsure of the use case, though.

> > +* Multi-GPU support
> > +	* Work in progress and patches expected after initially landing on GPU
> > +	  SVM.
> > +	* Ideally can be done with little to no changes to GPU SVM.
> 
> See above, but I mostly agree.
> 
> > +* Drop ranges in favor of radix tree
> > +	* May be desirable for faster notifiers.
> > +* Compound device pages
> > +	* Nvidia, AMD, and Intel all have agreed expensive core MM functions in
> > +	  migrate device layer are a performance bottleneck, having compound
> > +	  device pages should help increase performance by reducing the number
> > +	  of these expensive calls.
> 
> I'm hoping my patch series[1] to allow for compound device pages will land in v6.15

Cool! I was not aware of this ongoing series. Let me look.

> as well. The next steps are extending that to DEVICE_PRIVATE pages with
> migrate_vma_setup() and migrate_to_ram() and we have been making some good
> progress on this internally. I hope to have something posted before LSFMM.
> 

Also cool. If you think you have something working, let me know and I will
pull in the changes to test them out.

> The other thing we have been looking at here is being able to migrate
> file-backed pages to GPU memory.

I can also test this one out.

Matt

> 
> [1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@xxxxxxxxxx/
> 
> > +* Higher order dma mapping for migration
> > +	* 4k dma mapping adversely affects migration performance on Intel
> > +	  hardware, higher order (2M) dma mapping should help here.
> > +* Build common userptr implementation on top of GPU SVM
> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > index 476719771eef..396e535377fb 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -16,6 +16,10 @@ host such documentation:
> >  * Once the code has landed move all the documentation to the right places in
> >    the main core, helper or driver sections.
> >  
> > +.. toctree::
> > +
> > +    gpusvm.rst
> > +
> >  .. toctree::
> >  
> >      i915_gem_lmem.rst
> > -- 
> > 2.34.1
> > 


