On Fri, Feb 28, 2025 at 04:53:11PM +1100, Alistair Popple wrote:
> On Thu, Feb 27, 2025 at 08:36:35PM -0800, Matthew Brost wrote:
> > On Fri, Feb 28, 2025 at 01:34:42PM +1100, Alistair Popple wrote:
> > > On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> > > > Add documentation for agreed upon GPU SVM design principles, current
> > > > status, and future plans.
> > >
> > > Thanks for writing this up. In general I didn't see anything too controversial but added a couple of comments below.
> > >
> > > > v4:
> > > >  - Address Thomas's feedback
> > > > v5:
> > > >  - s/Current/Baseline (Thomas)
> > > >
> > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > Reviewed-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
> > > >  Documentation/gpu/rfc/index.rst  |  4 ++
> > > >  2 files changed, 88 insertions(+)
> > > >  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > > >
> > > > diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> > > > new file mode 100644
> > > > index 000000000000..063412160685
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > > > @@ -0,0 +1,84 @@
> > > > +===============
> > > > +GPU SVM Section
> > > > +===============
> > > > +
> > > > +Agreed upon design principles
> > > > +=============================
> > >
> > > As a general comment I think it would be nice if we could add some rationale/reasons for these design principles. Things inevitably change, and if/when we need to violate or update these principles it would be good to have some documented rationale for why we decided on them in the first place, because the reasoning may have become invalid by then.
> > >
> > Let me try to add some things to the various cases.
>
> Thanks!
>
> > > > +* migrate_to_ram path
> > > > +  * Rely only on core MM concepts (migration PTEs, page references, and page locking).
> > > > +  * No driver specific locks other than locks for hardware interaction in this path. These are not required, and it is generally a bad idea to invent driver-defined locks to seal core MM races.
> > >
> > > In principle I agree. The problem I think you will run into is the analogue of what adding a trylock_page() to do_swap_page() fixes, which is that a concurrent GPU fault (which is highly likely after handling a CPU fault, due to the GPU PTEs becoming invalid) may, depending on your design, kick off a migration of the page to the GPU via migrate_vma_setup().
> > >
> > > The problem with that is migrate_vma_setup() will temporarily raise the folio refcount, which can cause the migrate_to_ram() callback to fail, but the elevated refcount from migrate_to_ram() can also cause the GPU migration to fail, thus leading to a live-lock when both CPU and GPU fault handlers just keep retrying.
> > >
> > > This was particularly problematic for us on multi-GPU setups, and our solution was to introduce a migration critical section in the form of a mutex to ensure only one thread was calling migrate_vma_setup() at a time.
> > >
> > > And now that I've looked at UVM development history and remembered more context, this is why I had a vague recollection that adding a migration entry in do_swap_page() would be better than taking a page lock. Doing so fixes the issue with concurrent GPU faults blocking migrate_to_ram() because it makes migrate_vma_setup() ignore the page.
> > >
> > Ok, this is something to keep an eye on. In the current Xe code, we try to migrate a chunk of memory from the CPU to the GPU in our GPU fault handler once per fault. If it fails due to racing CPU access, we simply leave it in CPU memory and move on. We don't have any real migration policies in Xe yet - that is being worked on as a follow-up to my series. However, if we had a policy requiring that a memory region 'must be in GPU', this could conceivably lead to a livelock with concurrent CPU and GPU access. I'm still not fully convinced that a driver-side lock is the solution here, but without encountering the issue on our side, I can't be completely certain what the solution is.
>
> Right - we have migration policies that can cause us to try harder to migrate. Also, I agree with you that a driver-side lock might not be the best solution here. It's what we did due to various limitations we have, but they are unimportant for this discussion.
>
> I agree the ideal solution wouldn't involve locks and would instead be to fix the migration interfaces up such that one thread attempting to migrate doesn't cause another thread which has started a migration to fail. The solution to that isn't obvious, but I don't think it would be impossible either.
>
I agree this would be a good possible solution. Will keep this in mind when we start implementing on the Xe side.
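
For reference, the UVM-style 'migration critical section' described above boils down to something like the sketch below. This is not what GPU SVM/Xe does, and the xxx_* names are made up; it is only meant to make the serialization being discussed concrete (src/dst are caller-allocated arrays with one entry per page in the range):

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/mutex.h>

/* Hypothetical per-device state; only the mutex matters for this sketch. */
struct xxx_device {
        struct mutex migrate_lock;      /* the "migration critical section" */
};

static int xxx_migrate_range_to_gpu(struct xxx_device *xdev,
                                    struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end,
                                    unsigned long *src, unsigned long *dst)
{
        struct migrate_vma args = {
                .vma    = vma,
                .src    = src,
                .dst    = dst,
                .start  = start,
                .end    = end,
                .flags  = MIGRATE_VMA_SELECT_SYSTEM,
        };
        int err;

        /*
         * Allow only one thread between migrate_vma_setup() and
         * migrate_vma_finalize() at a time, so a racing attempt cannot keep
         * raising folio refcounts and starve a concurrent migrate_to_ram()
         * (the live-lock described above).
         */
        mutex_lock(&xdev->migrate_lock);

        err = migrate_vma_setup(&args);
        if (err || !args.cpages)
                goto out_unlock;

        /* Allocate device pages into dst[] and copy the data (elided). */

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);

out_unlock:
        mutex_unlock(&xdev->migrate_lock);
        return err;
}

As discussed, fixing the migration interfaces so that concurrent attempts don't knock each other over would be preferable to this kind of driver-side serialization.
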
> > > > +  * Partial migration is supported (i.e., a subset of pages attempting to migrate can actually migrate, with only the faulting page guaranteed to migrate).
> > > > +  * Driver handles mixed migrations via retry loops rather than locking.
> > > >
> > > > +* Eviction
> > >
> > > This is a term that seems to be somewhat overloaded depending on context, so a definition would be nice. Is your view of eviction migrating data from GPU back to CPU without a virtual address to free up GPU memory? (That's what I think of, but it would be good to make sure we're in sync.)
> > >
> > Yes. When GPU memory is oversubscribed, we find the physical backing in an LRU list to evict. In Xe, this is a TTM BO.
>
> Sounds good. So eviction is just migration of physical memory.
>
> > > > +  * Only looking at physical memory data structures and locks as opposed to looking at virtual memory data structures and locks.
>
> Except of course whatever virtual memory data structures the core-MM needs to touch in order to do the migration, right? Agree that the driver shouldn't be touching any driver data structures that concern themselves with virtual memory addresses though. Except what about any data structures that are required as part of GPU PTE/TLB invalidation?
>
The eviction is triggered via the new function I added in this series, migrate_device_pfns(); this triggers the notifiers, which invalidate the virtual GPU address page tables. So the eviction code itself is only looking at the physical memory. I can add something here to make this a bit clearer.
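
Roughly, that eviction path looks like the sketch below - purely physical pages, with no mm or vma anywhere in sight. The xxx_* names are placeholders and I am assuming the migrate_device_pfns() interface from this series (a caller pre-populated array of device pfns), so treat it as illustrative only:

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Sketch: evict a set of device (VRAM) pages back to system memory. */
static int xxx_evict_pages(struct page **pages, unsigned long npages)
{
        unsigned long *src, *dst;
        unsigned long i;
        int err = -ENOMEM;

        src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
        dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
        if (!src || !dst)
                goto out_free;

        /* Pre-populate src[] with the device pfns to migrate (assumed interface). */
        for (i = 0; i < npages; i++)
                src[i] = page_to_pfn(pages[i]);

        /*
         * Unmaps any currently mapped device pages, which is what fires the
         * mmu notifiers that zap the GPU PTEs. Only physical pages are
         * looked at here.
         */
        err = migrate_device_pfns(src, npages);
        if (err)
                goto out_free;

        /* Allocate system pages into dst[] and copy VRAM -> RAM (elided). */

        migrate_device_pages(src, dst, npages);
        migrate_device_finalize(src, dst, npages);
        err = 0;

out_free:
        kvfree(src);
        kvfree(dst);
        return err;
}
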
> > > > +  * No looking at mm/vma structs or relying on those being locked.
> > >
> > > Agree with the above points.
> > >
> > > > +* GPU fault side
> > > > +  * mmap_read only used around core MM functions which require this lock, and drivers should strive to take the mmap_read lock only in the GPU SVM layer.
> > > > +  * Big retry loop to handle all races with the mmu notifier under the gpu pagetable locks/mmu notifier range lock/whatever we end up calling those.
> > >
> > > Again, one of the issues here (particularly with multi-GPU setups) is that it's very easy to live-lock with retry loops, because even attempting a migration that fails can cause migration/fault handling in other threads to fail, either by calling mmu_notifiers or by taking a page reference.
> > >
> > > Those are probably things that we should fix on the MM side, but for now UVM at least uses a lock to ensure forward progress.
> > >
> > Again, see above. Right now, migration in Xe is more of a best-effort operation than a mandatory process, and perhaps this is masking an issue.
> >
> > Maybe I should add a comment here stating your possible concerns and that Xe will be implementing real migration policies and multi-GPU support soon. If this issue arises, we can revisit the locking guidelines or perhaps help contribute to the necessary core changes to make this work properly.
>
> Yeah, that could be good. Something along the lines of: core-MM code may need fixing in the way I described above (one thread attempting a migration shouldn't cause another thread that's already started one to fail).
>
+1
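
For the doc, the shape of that retry loop is roughly the pattern below, following what Documentation/mm/hmm.rst already describes; the xxx_* names and the exact lock are placeholders rather than the actual Xe code:

#include <linux/hmm.h>
#include <linux/mmap_lock.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Hypothetical driver VM; the notifier lock guards GPU pagetable updates. */
struct xxx_vm {
        struct mutex notifier_lock;
};

static void xxx_program_gpu_ptes(struct xxx_vm *vm, struct hmm_range *range);

static int xxx_gpu_fault(struct xxx_vm *vm, struct mm_struct *mm,
                         struct hmm_range *range)
{
        int err;

again:
        range->notifier_seq = mmu_interval_read_begin(range->notifier);

        /* mmap_read lock only around the core MM call that requires it. */
        mmap_read_lock(mm);
        err = hmm_range_fault(range);
        mmap_read_unlock(mm);
        if (err) {
                if (err == -EBUSY)
                        goto again;
                return err;
        }

        /*
         * The notifier lock only guards the collected pages' valid state and
         * the GPU pagetable update; a race with an invalidation is handled
         * by retrying, not by holding extra locks across the fault.
         */
        mutex_lock(&vm->notifier_lock);
        if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
                mutex_unlock(&vm->notifier_lock);
                goto again;
        }

        xxx_program_gpu_ptes(vm, range);        /* bind the faulted pages */
        mutex_unlock(&vm->notifier_lock);

        return 0;
}
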
Matt

> > > > +  * Races (especially against concurrent eviction or migrate_to_ram) should not be handled on the fault side by trying to hold locks; rather, they should be handled using retry loops. One possible exception is holding a BO's dma-resv lock during the initial migration to VRAM, as this is a well-defined lock that can be taken underneath the mmap_read lock.
> > >
> > > See my earlier comments. Although note I agree with this in principle, and we do just retry if taking the lock fails.
> > >
> > > > +* Physical memory to virtual backpointer
> > > > +  * Does not work; no pointers from physical memory to virtual should exist.
> > >
> > > Agree. And my rationale is that core-MM can update the virtual address for a page without notifying a driver of the new address, for example with mremap(). So it's impossible to keep any backpointer to a virtual address up to date.
> > >
> > Yep, this is a good example and I will include it in my next rev.
> >
> > > > +  * Physical memory backpointer (page->zone_device_data) should be stable from allocation to page free.
> > >
> > > Agree. And presumably the rationale is that it is very difficult to safely update page->zone_device_data and ensure there aren't concurrent users of it unless the page is free (i.e. has a 0 refcount)?
> > >
> > Yes, exactly.
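
To spell out what "stable from allocation to page free" means in practice: the backpointer is written exactly once, while the device page is still free (refcount == 0), and is only consumed again from the ->page_free() callback. A rough sketch, with hypothetical xxx_* names:

#include <linux/memremap.h>
#include <linux/mm.h>

struct xxx_allocation;  /* hypothetical driver-side backing object */

static void xxx_recycle(struct xxx_allocation *alloc, struct page *page);

static struct page *xxx_alloc_device_page(struct xxx_allocation *alloc,
                                          struct page *free_page)
{
        /*
         * free_page has a zero refcount and no other users here, so storing
         * the backpointer is race free; it is not written again until the
         * page is free once more.
         */
        free_page->zone_device_data = alloc;
        zone_device_page_init(free_page);       /* refcount 0 -> 1 */

        return free_page;
}

/* dev_pagemap_ops.page_free() callback: the last reference is gone. */
static void xxx_page_free(struct page *page)
{
        struct xxx_allocation *alloc = page->zone_device_data;

        xxx_recycle(alloc, page);
}
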
> > > > +* GPU pagetable locking
> > > > +  * Notifier lock only protects the range tree, pages valid state for a range (rather than seqno due to wider notifiers), pagetable entries, and mmu notifier seqno tracking; it is not a global lock to protect against races.
> > > > +  * All races handled with big retry as mentioned above.
> > >
> > > Sounds reasonable.
> > >
> > > > +Overview of current design
> > > > +==========================
> > > > +
> > > > +Baseline design is as simple as possible to get a working baseline which can be built upon.
> > > > +
> > > > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > > > +   :doc: Overview
> > > > +   :doc: Locking
> > > > +   :doc: Migration
> > > > +   :doc: Partial Unmapping of Ranges
> > > > +   :doc: Examples
> > > > +
> > > > +Possible future design features
> > > > +===============================
> > > > +
> > > > +* Concurrent GPU faults
> > > > +  * CPU faults are concurrent, so it makes sense to have concurrent GPU faults.
> > > > +  * Should be possible with fine-grained locking in the driver GPU fault handler.
> > > > +  * No expected GPU SVM changes required.
> > > > +* Ranges with mixed system and device pages
> > > > +  * Can be added to drm_gpusvm_get_pages fairly easily if required.
> > >
> > > I don't see much of a need, but also don't see any barriers on the MM side for doing that.
> > >
> > I don't see any barriers either; I think it would work in Xe with a slight tweak to our VM bind code.
> >
> > I'm unsure of the use case, though.
> >
> > > > +* Multi-GPU support
> > > > +  * Work in progress and patches expected after GPU SVM initially lands.
> > > > +  * Ideally can be done with little to no changes to GPU SVM.
> > >
> > > See above, but I mostly agree.
> > >
> > > > +* Drop ranges in favor of radix tree
> > > > +  * May be desirable for faster notifiers.
> > > > +* Compound device pages
> > > > +  * Nvidia, AMD, and Intel have all agreed that expensive core MM functions in the migrate device layer are a performance bottleneck; having compound device pages should help increase performance by reducing the number of these expensive calls.
> > >
> > > I'm hoping my patch series[1] to allow for compound device pages will land in v6.15
> >
> > Cool! I was not aware of this ongoing series. Let me look.
>
> There's probably not much of direct interest to you there at the moment. It's a prerequisite in that it allows current (and therefore future) users of compound ZONE_DEVICE pages to have them ref-counted normally, instead of the funky scheme DAX uses at the moment. Our changes will build on top of that.
>
> > > as well. The next steps are extending that to DEVICE_PRIVATE pages with migrate_vma_setup() and migrate_to_ram(), and we have been making some good progress on this internally. I hope to have something posted before LSFMM.
> > >
> > Also cool. If you think you have something working, let me know and I will pull in the changes to test them out.
>
> Will do!
>
> > > The other thing we have been looking at here is being able to migrate file-backed pages to GPU memory.
> >
> > I can test this one out too.
>
> Thanks.
>
> - Alistair
>
> > Matt
> >
> > > [1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@xxxxxxxxxx/
> > >
> > > > +* Higher order dma mapping for migration
> > > > +  * 4k dma mapping adversely affects migration performance on Intel hardware; higher order (2M) dma mapping should help here.
> > > > +* Build common userptr implementation on top of GPU SVM
> > > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > > index 476719771eef..396e535377fb 100644
> > > > --- a/Documentation/gpu/rfc/index.rst
> > > > +++ b/Documentation/gpu/rfc/index.rst
> > > > @@ -16,6 +16,10 @@ host such documentation:
> > > >  * Once the code has landed move all the documentation to the right places in
> > > >    the main core, helper or driver sections.
> > > >
> > > > +.. toctree::
> > > > +
> > > > +   gpusvm.rst
> > > > +
> > > >  .. toctree::
> > > >
> > > >     i915_gem_lmem.rst
> > > > --
> > > > 2.34.1
> > > >