On Fri, Feb 28, 2025 at 04:53:11PM +1100, Alistair Popple wrote:
> On Thu, Feb 27, 2025 at 08:36:35PM -0800, Matthew Brost wrote:
> > On Fri, Feb 28, 2025 at 01:34:42PM +1100, Alistair Popple wrote:
> > > On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> > > > Add documentation for agreed upon GPU SVM design principles, current
> > > > status, and future plans.
> > >
> > > Thanks for writing this up. In general I didn't see anything too controversial but added a couple of comments below.
> > >
> > > > v4:
> > > >  - Address Thomas's feedback
> > > > v5:
> > > >  - s/Current/Baseline (Thomas)
> > > >
> > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > Reviewed-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
> > > >  Documentation/gpu/rfc/index.rst  |  4 ++
> > > >  2 files changed, 88 insertions(+)
> > > >  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > > >
> > > > diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> > > > new file mode 100644
> > > > index 000000000000..063412160685
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > > > @@ -0,0 +1,84 @@
> > > > +===============
> > > > +GPU SVM Section
> > > > +===============
> > > > +
> > > > +Agreed upon design principles
> > > > +=============================
> > >
> > > As a general comment I think it would be nice if we could add some rationale/reasons for these design principles. Things inevitably change, and if/when we need to violate or update these principles it would be good to have some documented rationale for why we decided on them in the first place, because the reasoning may have become invalid by then.
> > >
> > Let me try to add some things to the various cases.
>
> Thanks!
>
> > > > +* migrate_to_ram path
> > > > +  * Rely only on core MM concepts (migration PTEs, page references, and page locking).
> > > > +  * No driver specific locks other than locks for hardware interaction in this path. These are not required, and it is generally a bad idea to invent driver-defined locks to seal core MM races.
> > >
> > > In principle I agree. The problem I think you will run into is the analogue of what adding a trylock_page() to do_swap_page() fixes, which is that a concurrent GPU fault (which is highly likely after handling a CPU fault, due to the GPU PTEs becoming invalid) may, depending on your design, kick off a migration of the page to the GPU via migrate_vma_setup().
> > >
> > > The problem with that is migrate_vma_setup() will temporarily raise the folio refcount, which can cause the migrate_to_ram() callback to fail, but the elevated refcount from migrate_to_ram() can also cause the GPU migration to fail, thus leading to a live-lock when both CPU and GPU fault handlers just keep retrying.
> > >
> > > This was particularly problematic for us on multi-GPU setups, and our solution was to introduce a migration critical section in the form of a mutex to ensure only one thread was calling migrate_vma_setup() at a time.
> > >
> > > And now that I've looked at UVM development history and remembered more context, this is why I had a vague recollection that adding a migration entry in do_swap_page() would be better than taking a page lock. Doing so fixes the issue with concurrent GPU faults blocking migrate_to_ram() because it makes migrate_vma_setup() ignore the page.
> > >
> > Ok, this is something to keep an eye on. In the current Xe code, we try to migrate a chunk of memory from the CPU to the GPU in our GPU fault handler once per fault. If it fails due to racing CPU access, we simply leave it in CPU memory and move on. We don't have any real migration policies in Xe yet - that is being worked on as a follow-up to my series. However, if we had a policy requiring that a memory region 'must be in GPU', this could conceivably lead to a livelock with concurrent CPU and GPU access. I'm still not fully convinced that a driver-side lock is the solution here, but without encountering the issue on our side, I can't be completely certain what the solution is.
>
> Right - we have migration policies that can cause us to try harder to migrate. Also, I agree with you that a driver-side lock might not be the best solution here. It's what we did due to various limitations we have, but they are unimportant for this discussion.
>
> I agree the ideal solution wouldn't involve locks and would instead be to fix the migration interfaces up such that one thread attempting to migrate doesn't cause another thread which has started a migration to fail. The solution to that isn't obvious, but I don't think it would be impossible either.
>
I agree this would be a good possible solution. Will keep this in mind when we start implementing on the Xe side.
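
For reference, the UVM-style 'migration critical section' described above boils down to something like the sketch below. This is not what GPU SVM/Xe does, and the xxx_* names are made up; it is only meant to make the serialization being discussed concrete (src/dst are caller-allocated arrays with one entry per page in the range):

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/mutex.h>

/* Hypothetical per-device state; only the mutex matters for this sketch. */
struct xxx_device {
        struct mutex migrate_lock;      /* the "migration critical section" */
};

static int xxx_migrate_range_to_gpu(struct xxx_device *xdev,
                                    struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end,
                                    unsigned long *src, unsigned long *dst)
{
        struct migrate_vma args = {
                .vma    = vma,
                .src    = src,
                .dst    = dst,
                .start  = start,
                .end    = end,
                .flags  = MIGRATE_VMA_SELECT_SYSTEM,
        };
        int err;

        /*
         * Allow only one thread between migrate_vma_setup() and
         * migrate_vma_finalize() at a time, so a racing attempt cannot keep
         * raising folio refcounts and starve a concurrent migrate_to_ram()
         * (the live-lock described above).
         */
        mutex_lock(&xdev->migrate_lock);

        err = migrate_vma_setup(&args);
        if (err || !args.cpages)
                goto out_unlock;

        /* Allocate device pages into dst[] and copy the data (elided). */

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);

out_unlock:
        mutex_unlock(&xdev->migrate_lock);
        return err;
}

As discussed, fixing the migration interfaces so that concurrent attempts don't knock each other over would be preferable to this kind of driver-side serialization.
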
> > > > +  * Partial migration is supported (i.e., a subset of pages attempting to migrate can actually migrate, with only the faulting page guaranteed to migrate).
> > > > +  * Driver handles mixed migrations via retry loops rather than locking.
> > > >
> > > > +* Eviction
> > >
> > > This is a term that seems to be somewhat overloaded depending on context, so a definition would be nice. Is your view of eviction migrating data from GPU back to CPU without a virtual address to free up GPU memory? (That's what I think of, but it would be good to make sure we're in sync.)
> > >
> > Yes. When GPU memory is oversubscribed, we find the physical backing in an LRU list to evict. In Xe, this is a TTM BO.
>
> Sounds good. So eviction is just migration of physical memory.
>
> > > > +  * Only looking at physical memory data structures and locks as opposed to looking at virtual memory data structures and locks.
>
> Except of course whatever virtual memory data structures the core-MM needs to touch in order to do the migration, right? Agree that the driver shouldn't be touching any driver data structures that concern themselves with virtual memory addresses though. Except what about any data structures that are required as part of GPU PTE/TLB invalidation?
>
The eviction is triggered via the new function I added in this series, migrate_device_pfns(); this triggers the notifiers, which invalidate the virtual GPU address page tables. So the eviction code itself is only looking at the physical memory. I can add something here to make this a bit clearer.
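
Roughly, that eviction path looks like the sketch below - purely physical pages, with no mm or vma anywhere in sight. The xxx_* names are placeholders and I am assuming the migrate_device_pfns() interface from this series (a caller pre-populated array of device pfns), so treat it as illustrative only:

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Sketch: evict a set of device (VRAM) pages back to system memory. */
static int xxx_evict_pages(struct page **pages, unsigned long npages)
{
        unsigned long *src, *dst;
        unsigned long i;
        int err = -ENOMEM;

        src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
        dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
        if (!src || !dst)
                goto out_free;

        /* Pre-populate src[] with the device pfns to migrate (assumed interface). */
        for (i = 0; i < npages; i++)
                src[i] = page_to_pfn(pages[i]);

        /*
         * Unmaps any currently mapped device pages, which is what fires the
         * mmu notifiers that zap the GPU PTEs. Only physical pages are
         * looked at here.
         */
        err = migrate_device_pfns(src, npages);
        if (err)
                goto out_free;

        /* Allocate system pages into dst[] and copy VRAM -> RAM (elided). */

        migrate_device_pages(src, dst, npages);
        migrate_device_finalize(src, dst, npages);
        err = 0;

out_free:
        kvfree(src);
        kvfree(dst);
        return err;
}
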
> > > > +  * No looking at mm/vma structs or relying on those being locked.
> > >
> > > Agree with the above points.
> > >
> > > > +* GPU fault side
> > > > +  * mmap_read only used around core MM functions which require this lock, and drivers should strive to take the mmap_read lock only in the GPU SVM layer.
> > > > +  * Big retry loop to handle all races with the mmu notifier under the gpu pagetable locks/mmu notifier range lock/whatever we end up calling those.
> > >
> > > Again, one of the issues here (particularly with multi-GPU setups) is that it's very easy to live-lock with retry loops, because even attempting a migration that fails can cause migration/fault handling in other threads to fail, either by calling mmu_notifiers or by taking a page reference.
> > >
> > > Those are probably things that we should fix on the MM side, but for now UVM at least uses a lock to ensure forward progress.
> > >
> > Again, see above. Right now, migration in Xe is more of a best-effort operation than a mandatory process, and perhaps this is masking an issue.
> >
> > Maybe I should add a comment here stating your possible concerns and that Xe will be implementing real migration policies and multi-GPU support soon. If this issue arises, we can revisit the locking guidelines or perhaps help contribute to the necessary core changes to make this work properly.
>
> Yeah, that could be good. Something along the lines of: core-MM code may need fixing in the way I described above (one thread attempting a migration shouldn't cause another thread that's already started one to fail).
>
+1
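
For the doc, the shape of that retry loop is roughly the pattern below, following what Documentation/mm/hmm.rst already describes; the xxx_* names and the exact lock are placeholders rather than the actual Xe code:

#include <linux/hmm.h>
#include <linux/mmap_lock.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Hypothetical driver VM; the notifier lock guards GPU pagetable updates. */
struct xxx_vm {
        struct mutex notifier_lock;
};

static void xxx_program_gpu_ptes(struct xxx_vm *vm, struct hmm_range *range);

static int xxx_gpu_fault(struct xxx_vm *vm, struct mm_struct *mm,
                         struct hmm_range *range)
{
        int err;

again:
        range->notifier_seq = mmu_interval_read_begin(range->notifier);

        /* mmap_read lock only around the core MM call that requires it. */
        mmap_read_lock(mm);
        err = hmm_range_fault(range);
        mmap_read_unlock(mm);
        if (err) {
                if (err == -EBUSY)
                        goto again;
                return err;
        }

        /*
         * The notifier lock only guards the collected pages' valid state and
         * the GPU pagetable update; a race with an invalidation is handled
         * by retrying, not by holding extra locks across the fault.
         */
        mutex_lock(&vm->notifier_lock);
        if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
                mutex_unlock(&vm->notifier_lock);
                goto again;
        }

        xxx_program_gpu_ptes(vm, range);        /* bind the faulted pages */
        mutex_unlock(&vm->notifier_lock);

        return 0;
}
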
Matt

> > > > +  * Races (especially against concurrent eviction or migrate_to_ram) should not be handled on the fault side by trying to hold locks; rather, they should be handled using retry loops. One possible exception is holding a BO's dma-resv lock during the initial migration to VRAM, as this is a well-defined lock that can be taken underneath the mmap_read lock.
> > >
> > > See my earlier comments. Although note I agree with this in principle, and we do just retry if taking the lock fails.
> > >
> > > > +* Physical memory to virtual backpointer
> > > > +  * Does not work; no pointers from physical memory to virtual should exist.
> > >
> > > Agree. And my rationale is that core-MM can update the virtual address for a page without notifying a driver of the new address, for example with mremap(). So it's impossible to keep any backpointer to a virtual address up to date.
> > >
> > Yep, this is a good example and I will include it in my next rev.
> >
> > > > +  * Physical memory backpointer (page->zone_device_data) should be stable from allocation to page free.
> > >
> > > Agree. And presumably the rationale is that it is very difficult to safely update page->zone_device_data and ensure there aren't concurrent users of it unless the page is free (i.e. has a 0 refcount)?
> > >
> > Yes, exactly.
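
To spell out what "stable from allocation to page free" means in practice: the backpointer is written exactly once, while the device page is still free (refcount == 0), and is only consumed again from the ->page_free() callback. A rough sketch, with hypothetical xxx_* names:

#include <linux/memremap.h>
#include <linux/mm.h>

struct xxx_allocation;  /* hypothetical driver-side backing object */

static void xxx_recycle(struct xxx_allocation *alloc, struct page *page);

static struct page *xxx_alloc_device_page(struct xxx_allocation *alloc,
                                          struct page *free_page)
{
        /*
         * free_page has a zero refcount and no other users here, so storing
         * the backpointer is race free; it is not written again until the
         * page is free once more.
         */
        free_page->zone_device_data = alloc;
        zone_device_page_init(free_page);       /* refcount 0 -> 1 */

        return free_page;
}

/* dev_pagemap_ops.page_free() callback: the last reference is gone. */
static void xxx_page_free(struct page *page)
{
        struct xxx_allocation *alloc = page->zone_device_data;

        xxx_recycle(alloc, page);
}
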
> > > > +* GPU pagetable locking
> > > > +  * Notifier lock only protects the range tree, pages valid state for a range (rather than seqno due to wider notifiers), pagetable entries, and mmu notifier seqno tracking; it is not a global lock to protect against races.
> > > > +  * All races handled with big retry as mentioned above.
> > >
> > > Sounds reasonable.
> > >
> > > > +Overview of current design
> > > > +==========================
> > > > +
> > > > +Baseline design is as simple as possible to get a working baseline which can be built upon.
> > > > +
> > > > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > > > +   :doc: Overview
> > > > +   :doc: Locking
> > > > +   :doc: Migration
> > > > +   :doc: Partial Unmapping of Ranges
> > > > +   :doc: Examples
> > > > +
> > > > +Possible future design features
> > > > +===============================
> > > > +
> > > > +* Concurrent GPU faults
> > > > +  * CPU faults are concurrent, so it makes sense to have concurrent GPU faults.
> > > > +  * Should be possible with fine-grained locking in the driver GPU fault handler.
> > > > +  * No expected GPU SVM changes required.
> > > > +* Ranges with mixed system and device pages
> > > > +  * Can be added to drm_gpusvm_get_pages fairly easily if required.
> > >
> > > I don't see much of a need, but also don't see any barriers on the MM side for doing that.
> > >
> > I don't see any barriers either; I think it would work in Xe with a slight tweak to our VM bind code.
> >
> > I'm unsure of the use case, though.
> >
> > > > +* Multi-GPU support
> > > > +  * Work in progress and patches expected after GPU SVM initially lands.
> > > > +  * Ideally can be done with little to no changes to GPU SVM.
> > >
> > > See above, but I mostly agree.
> > >
> > > > +* Drop ranges in favor of radix tree
> > > > +  * May be desirable for faster notifiers.
> > > > +* Compound device pages
> > > > +  * Nvidia, AMD, and Intel have all agreed that expensive core MM functions in the migrate device layer are a performance bottleneck; having compound device pages should help increase performance by reducing the number of these expensive calls.
> > >
> > > I'm hoping my patch series[1] to allow for compound device pages will land in v6.15
> >
> > Cool! I was not aware of this ongoing series. Let me look.
>
> There's probably not much of direct interest to you there at the moment. It's a prerequisite in that it allows current (and therefore future) users of compound ZONE_DEVICE pages to have them ref-counted normally, instead of the funky scheme DAX uses at the moment. Our changes will build on top of that.
>
> > > as well. The next steps are extending that to DEVICE_PRIVATE pages with migrate_vma_setup() and migrate_to_ram(), and we have been making some good progress on this internally. I hope to have something posted before LSFMM.
> > >
> > Also cool. If you think you have something working, let me know and I will pull in the changes to test them out.
>
> Will do!
>
> > > The other thing we have been looking at here is being able to migrate file-backed pages to GPU memory.
> >
> > I can test this one out too.
>
> Thanks.
>
> - Alistair
>
> > Matt
> >
> > > [1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@xxxxxxxxxx/
> > >
> > > > +* Higher order dma mapping for migration
> > > > +  * 4k dma mapping adversely affects migration performance on Intel hardware; higher order (2M) dma mapping should help here.
> > > > +* Build common userptr implementation on top of GPU SVM
> > > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > > index 476719771eef..396e535377fb 100644
> > > > --- a/Documentation/gpu/rfc/index.rst
> > > > +++ b/Documentation/gpu/rfc/index.rst
> > > > @@ -16,6 +16,10 @@ host such documentation:
> > > >  * Once the code has landed move all the documentation to the right places in
> > > >    the main core, helper or driver sections.
> > > >
> > > > +.. toctree::
> > > > +
> > > > +   gpusvm.rst
> > > > +
> > > >  .. toctree::
> > > >
> > > >     i915_gem_lmem.rst
> > > > --
> > > > 2.34.1
> > > >