On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> Add documentation for agree upon GPU SVM design principles, current
> status, and future plans.

Thanks for writing this up. In general I didn't see anything too
controversial but added a couple of comments below.

>
> v4:
> - Address Thomas's feedback
> v5:
> - s/Current/Basline (Thomas)
>
> Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> Reviewed-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> ---
> Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
> Documentation/gpu/rfc/index.rst | 4 ++
> 2 files changed, 88 insertions(+)
> create mode 100644 Documentation/gpu/rfc/gpusvm.rst
>
> diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> new file mode 100644
> index 000000000000..063412160685
> --- /dev/null
> +++ b/Documentation/gpu/rfc/gpusvm.rst
> @@ -0,0 +1,84 @@
> +===============
> +GPU SVM Section
> +===============
> +
> +Agreed upon design principles
> +=============================

As a general comment I think it would be nice if we could add some
rationale/reasons for these design principles. Things inevitably change,
and if/when we need to violate or update these principles it would be
good to have some documented rationale for why we decided on them in the
first place, because the reasoning may have become invalid by then.

> +* migrate_to_ram path
> +  * Rely only on core MM concepts (migration PTEs, page references, and
> +    page locking).
> +  * No driver specific locks other than locks for hardware interaction in
> +    this path. These are not required and generally a bad idea to
> +    invent driver defined locks to seal core MM races.

In principle I agree. The problem I think you will run into is the
analogue of what adding a trylock_page() to do_swap_page() fixes: a
concurrent GPU fault (which is highly likely after handling a CPU fault,
because the GPU PTEs have become invalid) may, depending on your design,
kick off a migration of the page to the GPU via migrate_vma_setup(). The
problem with that is migrate_vma_setup() will temporarily raise the folio
refcount, which can cause the migrate_to_ram() callback to fail, but the
elevated refcount from migrate_to_ram() can also cause the GPU migration
to fail, thus leading to a live-lock when both CPU and GPU fault handlers
just keep retrying.

This was particularly problematic for us on multi-GPU setups, and our
solution was to introduce a migration critical section in the form of a
mutex to ensure only one thread was calling migrate_vma_setup() at a time
(rough sketch of what I mean further below, after the eviction comment).

And now that I've looked at UVM development history and remembered more
context, this is why I had a vague recollection that adding a migration
entry in do_swap_page() would be better than taking a page lock. Doing so
fixes the issue with concurrent GPU faults blocking migrate_to_ram()
because it makes migrate_vma_setup() ignore the page.

> +  * Partial migration is supported (i.e., a subset of pages attempting to
> +    migrate can actually migrate, with only the faulting page guaranteed
> +    to migrate).
> +  * Driver handles mixed migrations via retry loops rather than locking.
>
> +* Eviction

This is a term that seems to be somewhat overloaded depending on context,
so a definition would be nice. Is your view of eviction migrating data
from GPU back to CPU without a virtual address, to free up GPU memory?
(That's what I think of, but it would be good to make sure we're in
sync.)
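Going back to the migrate_to_ram() discussion above, here is the rough
sketch I promised of the migration critical section. This is purely
illustrative and not code from any real driver -- the my_gpu naming, the
single-page granularity, and doing the copy/finalize steps under the same
mutex are all assumptions on my part; the point is only that
migrate_vma_setup() callers are serialized:

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/mutex.h>

struct my_gpu {
	struct mutex migrate_lock;	/* migration critical section */
};

/* Caller holds mmap_read_lock() of the faulting mm. */
static int my_gpu_migrate_page_to_vram(struct my_gpu *gpu,
				       struct vm_area_struct *vma,
				       unsigned long addr)
{
	unsigned long src = 0, dst = 0;
	struct migrate_vma args = {
		.vma	= vma,
		.start	= addr,
		.end	= addr + PAGE_SIZE,
		.src	= &src,
		.dst	= &dst,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	/*
	 * Serialize migrate_vma_setup() callers: its temporarily elevated
	 * page refcounts can otherwise make a concurrent migrate_to_ram()
	 * (or another GPU fault) fail and retry forever.
	 */
	mutex_lock(&gpu->migrate_lock);
	ret = migrate_vma_setup(&args);
	if (!ret && args.cpages) {
		/*
		 * Allocate a VRAM page, fill dst, issue and wait for the
		 * copy ... (elided)
		 */
		migrate_vma_pages(&args);
		migrate_vma_finalize(&args);
	}
	mutex_unlock(&gpu->migrate_lock);

	return ret;
}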
> +  * Only looking at physical memory data structures and locks as opposed to
> +    looking at virtual memory data structures and locks.
> +  * No looking at mm/vma structs or relying on those being locked.

Agree with the above points.

> +* GPU fault side
> +  * mmap_read only used around core MM functions which require this lock
> +    and should strive to take mmap_read lock only in GPU SVM layer.
> +  * Big retry loop to handle all races with the mmu notifier under the gpu
> +    pagetable locks/mmu notifier range lock/whatever we end up calling
> +    those.

Again, one of the issues here (particularly with multi-GPU setups) is
that it's very easy to live-lock with retry loops, because even
attempting a migration that fails can cause migration/fault handling in
other threads to fail, either by calling mmu_notifiers or by taking a
page reference. Those are probably things that we should fix on the MM
side, but for now UVM at least uses a lock to ensure forward progress.
(I've put a rough sketch of the retry pattern I have in mind further
down, after the multi-GPU comment.)

> +  * Races (especially against concurrent eviction or migrate_to_ram)
> +    should not be handled on the fault side by trying to hold locks;
> +    rather, they should be handled using retry loops. One possible
> +    exception is holding a BO's dma-resv lock during the initial migration
> +    to VRAM, as this is a well-defined lock that can be taken underneath
> +    the mmap_read lock.

See my earlier comments. Although note I agree with this in principle,
and we do just retry if taking the lock fails.

> +* Physical memory to virtual backpointer
> +  * Does not work, no pointers from physical memory to virtual should
> +    exist.

Agree. And my rationale is that core MM can update the virtual address
for a page without notifying a driver of the new address, for example
with mremap(). So it's impossible to keep any backpointer to a virtual
address up to date.

> +  * Physical memory backpointer (page->zone_device_data) should be stable
> +    from allocation to page free.

Agree. And presumably the rationale is that it is very difficult to
safely update page->zone_device_data and ensure there aren't concurrent
users of it unless the page is free (i.e. has a 0 refcount)?

> +* GPU pagetable locking
> +  * Notifier lock only protects range tree, pages valid state for a range
> +    (rather than seqno due to wider notifiers), pagetable entries, and
> +    mmu notifier seqno tracking, it is not a global lock to protect
> +    against races.
> +  * All races handled with big retry as mentioned above.

Sounds reasonable.

> +Overview of current design
> +==========================
> +
> +Baseline design is simple as possible to get a working basline in which can be
> +built upon.
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> +   :doc: Overview
> +   :doc: Locking
> +   :doc: Migrataion
> +   :doc: Partial Unmapping of Ranges
> +   :doc: Examples
> +
> +Possible future design features
> +===============================
> +
> +* Concurrent GPU faults
> +  * CPU faults are concurrent so makes sense to have concurrent GPU
> +    faults.
> +  * Should be possible with fined grained locking in the driver GPU
> +    fault handler.
> +  * No expected GPU SVM changes required.
> +* Ranges with mixed system and device pages
> +  * Can be added if required to drm_gpusvm_get_pages fairly easily.

I don't see much of a need, but also don't see any barriers on the MM
side for doing that.

> +* Multi-GPU support
> +  * Work in progress and patches expected after initially landing on GPU
> +    SVM.
> +  * Ideally can be done with little to no changes to GPU SVM.

See above, but I mostly agree.
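As promised, here is the retry pattern I have in mind on the GPU fault
side, i.e. roughly what the hmm_range_fault() documentation describes,
with the notifier seqno checked under the GPU pagetable lock. Again just
a sketch -- my_gpu is the made-up struct from the earlier snippet, here
assumed to also carry a pt_lock mutex, and all error handling beyond the
retry is elided:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Caller guarantees the mm stays alive (e.g. via mmget()). */
static int my_gpu_fault(struct my_gpu *gpu, struct mm_struct *mm,
			struct mmu_interval_notifier *notifier,
			unsigned long start, unsigned long end,
			unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier		= notifier,
		.start			= start,
		.end			= end,
		.hmm_pfns		= pfns,
		.default_flags		= HMM_PFN_REQ_FAULT,
		.dev_private_owner	= gpu,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);

	/* mmap_read lock only around the core MM call, as the doc says */
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	/*
	 * The GPU pagetable lock doubles as the notifier lock: if an
	 * invalidation ran since mmu_interval_read_begin(), drop it and
	 * retry instead of taking any additional driver locks.
	 */
	mutex_lock(&gpu->pt_lock);
	if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
		mutex_unlock(&gpu->pt_lock);
		goto again;
	}
	/* ... program the GPU PTEs from pfns[] ... (elided) */
	mutex_unlock(&gpu->pt_lock);

	return 0;
}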
> +* Drop ranges in favor of radix tree
> +  * May be desirable for faster notifiers.
> +* Compound device pages
> +  * Nvidia, AMD, and Intel all have agreed expensive core MM functions in
> +    migrate device layer are a performance bottleneck, having compound
> +    device pages should help increase performance by reducing the number
> +    of these expensive calls.

I'm hoping my patch series[1] to allow for compound device pages will
land in v6.15 as well. The next steps are extending that to
DEVICE_PRIVATE pages with migrate_vma_setup() and migrate_to_ram(), and
we have been making some good progress on this internally. I hope to
have something posted before LSFMM.

The other thing we have been looking at here is being able to migrate
file-backed pages to GPU memory.

[1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@xxxxxxxxxx/

> +* Higher order dma mapping for migration
> +  * 4k dma mapping adversely affects migration performance on Intel
> +    hardware, higher order (2M) dma mapping should help here.
> +* Build common userptr implementation on top of GPU SVM
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 476719771eef..396e535377fb 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -16,6 +16,10 @@ host such documentation:
>  * Once the code has landed move all the documentation to the right places in
>    the main core, helper or driver sections.
>
> +.. toctree::
> +
> +   gpusvm.rst
> +
>  .. toctree::
>
>    i915_gem_lmem.rst
> --
> 2.34.1
>
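One last thought on the higher order dma mapping bullet above: once
compound device/system pages are in place, the migration copy can map a
whole 2M folio with a single dma_map_page() call instead of 512 separate
4k mappings. A hypothetical snippet (assuming a physically contiguous 2M
folio and an invented helper name) just to show what I mean:

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/* One mapping for the whole 2M folio instead of 512 4k mappings. */
static int my_map_folio_for_copy(struct device *dev, struct folio *folio,
				 dma_addr_t *dma)
{
	*dma = dma_map_page(dev, folio_page(folio, 0), 0, SZ_2M,
			    DMA_TO_DEVICE);
	return dma_mapping_error(dev, *dma) ? -EFAULT : 0;
}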