On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> Continuation of SVM work by Oak Zeng [1][2] based on community feedback.
> Introduces GPU SVM layer and new Xe uAPI. Supports GPU page faults for
> system allocations (e.g., malloc), runtime allocations (e.g., binding a
> BO), migration to and from VRAM, and unified eviction (BO and SVM VRAM
> allocations can evict each other). Fully tested; more on this below.
>
> The patch breakdown is as follows:
>
> 1. Preparation patches already on the list [3].
>    - Patches 1-3.
>    - Please refrain from reviewing these here.
> 2. New migrate layer functionality.
>    - Patch 4.
>    - Required for eviction to avoid locking inversion between
>      dma-resv and mmap lock.
> 3. GPU SVM.
>    - Patch 5.
>    - This is what needs community review.
>    - Inspired by GPUVM.
>    - Kernel doc should explain design principles.
>    - There is certainly room for optimization of the implementation
>      and improvements with existing core MM interaction. Pulling in
>      pending DMA mapping work [4] and additional core MM support
>      for SVM is also likely desired. However, this serves as a good
>      starting point for any SVM discussions and could be used as a
>      stepping stone to future core MM work.
> 4. Basic SVM support in Xe (i.e., SRAM backing only).
>    - Patches 6-15.
>    - The uAPI in the patch could benefit from community input.
> 5. SVM VRAM migration support in Xe.
>    - Patches 16-23.
>    - Using TTM BOs for SVM VRAM allocations could use community
>      input. Patch 23 has a detailed explanation of this design
>      choice in the commit message.
> 6. SVM eviction support in Xe.
>    - Patch 24.
>    - Should work with exhaustive eviction [5] when it merges.
> 7. Xe SVM debug / tuning.
>    - Patches 25-28.
>
> Kernel documentation and commit messages are relatively light, aside
> from the GPU SVM and uAPI patches, as this is an RFC.
>
> Testing has been conducted quite thoroughly with a new IGT [6]. Various
> system allocation types (malloc, mmap, mmap flags, huge pages, different
> sizes, different alignments), mixing runtime allocations, unmapping
> corners, invalid faults, and eviction have been tested. Testing scales
> from a single thread to multiple threads and multiple processes. Tests
> pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.
>
> 1. Multiple GPU support.
>    - This is likely to follow or occur in parallel to this work.
> 2. Userptr unification with GPU SVM.
>    - This is essentially designed in my head (likely involving a
>      few new GPU SVM layer functions) but would require some fairly
>      invasive changes to Xe KMD to test out. Therefore, I would
>      like GPU SVM to be reviewed first before proceeding with these
>      changes.
> 3. Madvise and prefetch IOCTLs.
>    - This is likely to follow or occur in parallel to this work.
>
> Given the size of the series, I have pushed a GitLab branch for
> reference [7].
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/128910/
> [2] https://patchwork.freedesktop.org/series/132229/
> [3] https://patchwork.freedesktop.org/series/137805/
> [4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@xxxxxxxxxx/
> [5] https://patchwork.freedesktop.org/series/133643/
> [6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
> [7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads

Ok, rather late, I wanted to type this up 2 weeks ago or so, but alas here it
is finally.
I think with all the experiments and discussions I have a fairly clear
understanding of the really tricky parts of svm (thanks a lot to Matt for all
the work done). From my side the key points, sorted roughly by how important I
think they are:

1. migrate_to_ram path

I think this is the absolute centerpiece of making sure we're aligned with
core mm and don't suffer from deadlocks or livelocks fundamental to the gpusvm
library design. So this part imo needs to be solid for the first version we
merge. But of course any core mm changes Matt prototyped shouldn't gate
merging the drm side since they're nicely decoupled, we only need those to
validate the design in testing.

I think the key points are:

- We rely on the migration pte, temporary page references and the page lock
  only, which with the core mm changes Matt worked on gives us guaranteed
  reliable migration back to system memory. And we need that, or svm
  essentially becomes unusable as a concept.

- We need to support partial migration, including the worst case fallback of
  only migrating that single page core mm managed to trylock for us while
  holding the pagetable lock.

  Since we have guaranteed migration back to system memory we can make the
  assumption on the gpu fault handling side that we will only ever handle
  ranges that are entirely in vram (by throwing any partial migrations out).
  That needs a retry loop on the gpu fault side, but otherwise I no longer see
  an issue with that assumption, so partial vram support on the fault side
  isn't needed for merging, or even later, until we have a driver that
  requires it.

- No other driver locks related to that memory range in any way are allowed,
  and ideally we also test with the forced fallback to mmap_read lock in
  do_swap_page removed, i.e. calling migrate_to_ram while only holding the
  read vma lock. Of course driver locks for blitting are allowed, it's only
  locks related to managing physical memory which are problematic and could
  result in deadlocks.

- The drm side must uphold the guarantee of not having elevated page
  references without holding the page lock. Otherwise there's a race and we
  cannot guarantee migration to sram.

- Also through the try_to_migrate maze we'll hit our own gpu pte invalidate
  paths, so there are some requirements there too. But I've put the discussion
  for that at the very bottom, since most of the gpu pte locking questions are
  imo not that important, and definitely not important for the first version
  we merge.

Everything else below I think we can sort out post merge and just need rough
alignment on the design.

2. eviction

Requirements much like migrate_to_ram, because otherwise we break the
migration guarantee:

- Only look at physical memory datastructures and locks, no looking at mm/vma
  structs or relying on those being locked. We rely entirely on the reverse
  maps from try_to_migrate to find all the mappings, on both the cpu and gpu
  side (on the cpu side only zone device swap or migration pte entries ofc).

- Partial migration needs to work to make sure we can get out of any low
  memory bind.

3. gpu fault side

- We can only rely on mmap_read for hmm_range_fault, and ideally we should
  assume it's not held anywhere else, since with per-vma locking I'd expect
  the mm/vma locking will move into hmm_range_fault. This also means no
  looking at the vma beyond just passing it around as currently needed by core
  mm functions.

- Big retry loop to handle all races with the mmu notifier under the gpu
  pagetable locks/mmu notifier range lock/whatever we end up calling those;
  rough sketch of the usual pattern right below.
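To make that retry loop concrete, here's roughly the pattern I have in mind.
This is just a minimal sketch of the standard hmm_range_fault flow; my_range,
pt_lock and my_bind_pages are made-up illustration names, not the gpusvm/Xe
structures:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

/* Purely illustrative driver-side range tracking, not the gpusvm structs. */
struct my_range {
	struct mmu_interval_notifier notifier;
	spinlock_t pt_lock;	/* gpu pagetable + notifier seqno lock */
	unsigned long start, end;
	unsigned long *pfns;	/* HMM_PFN_* output array */
};

/* Hypothetical: write the gpu ptes from the hmm pfn array. */
static void my_bind_pages(struct my_range *r, unsigned long *pfns);

static int my_gpu_fault(struct my_range *r, struct mm_struct *mm)
{
	struct hmm_range range = {
		.notifier = &r->notifier,
		.start = r->start,
		.end = r->end,
		.hmm_pfns = r->pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(&r->notifier);

	/* Only mmap_read around hmm_range_fault, nothing else. */
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;

	/*
	 * The pagetable lock only protects the gpu ptes plus the seqno
	 * check; any race with invalidation is handled by retrying, not
	 * by holding more locks.
	 */
	spin_lock(&r->pt_lock);
	if (mmu_interval_read_retry(&r->notifier, range.notifier_seq)) {
		spin_unlock(&r->pt_lock);
		goto again;
	}
	my_bind_pages(r, range.hmm_pfns);
	spin_unlock(&r->pt_lock);

	return 0;
}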
Races (especially against concurrent eviction/migrate_to_ram) should _not_ be
handled on the fault side by trying to hold locks instead.

- Long term I think we need to be able to handle concurrent faults, even on hw
  where there's only one gpu fault handling queue. For performance we
  absolutely want to prefault aggressively, and that likely happens through
  cpu ioctls that are entirely independent from the gpu fault handling. Short
  term, enough (driver-side) locking to make sure this doesn't go boom is
  enough; I think just some design goal documentation here on how to achieve
  that is all we need.

4. physical memory to virtual backpointer

No. Doesn't work :-) Also it's only used in the eviction/migrate_to_ram path
and I think Matt already fixed this all anyway.

5. gpu pagetable locking

Or mmu notifier range locking, or whatever you want to call it. Like on the
cpu side this should _only_ protect the pagetable entries and, additionally
for us, the mmu notifier seqno tracking, nothing else. Any races due to
concurrent eviction/migrate_to_ram/gpu fault/prefault need to be handled by
retrying outside of holding the pagetable locks. If we try to impose
additional consistency guarantees we'll fall over and have a livelock/deadlock
fight with core mm in migrate_to_ram. There's a rough sketch of the invalidate
side of this at the very bottom of this mail.

This part is required I think for the first version, but we already need that
anyway to make migrate_to_ram work properly.

For the actual data structure/locking design I think anything on the design
line between a single global lock and the over-the-top scalable radix tree
with per-pagetable (spin)locks of the core mm is fine. The design here with 3
levels (mmu notifier, range, struct page) wouldn't be my first choice, but it
clearly fits on that line, so imo it is fine for initial merging. We might
want to make sure that the range locking (I guess mostly relevant for the
invalidate side, drivers don't see much else) is somewhat abstracted so we can
easily change that post-merge, but that's not required imo at all.

For consensus documentation I'd recommend a todo or design documentation
patch, where we put down both the current design and why it's like that, and
some of the longer term goals. Then get that acked (imo it needs at least one
other driver that's seriously interested in this, plus I think an ack from
Danilo for the gpuvm interactions), then merge that. SVM is tricky enough that
I think this would be really useful to make sure we're not unnecessarily stuck
in limbo.
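As promised above, a minimal sketch of the invalidate side for point 5,
reusing the made-up my_range from the earlier sketch (my_zap_ptes is likewise
just an illustration name, not the Xe code):

#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

/* Hypothetical: clear the gpu ptes covering [start, end). */
static void my_zap_ptes(struct my_range *r, unsigned long start,
			unsigned long end);

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	struct my_range *r = container_of(mni, struct my_range, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	/*
	 * Only the pagetable lock here: bump the seqno and zap the gpu
	 * ptes covered by the range. No physical memory management locks,
	 * nothing that could feed back into migrate_to_ram/try_to_migrate.
	 */
	spin_lock(&r->pt_lock);
	mmu_interval_set_seq(mni, cur_seq);
	my_zap_ptes(r, range->start, range->end);
	spin_unlock(&r->pt_lock);

	return true;
}

static const struct mmu_interval_notifier_ops my_notifier_ops = {
	.invalidate = my_invalidate,
};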