On Mon, Dec 02, 2024 at 02:00:44PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add documentation for agree upon GPU SVM design principles, current
> > status, and future plans.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > ---
> >  Documentation/gpu/rfc/gpusvm.rst | 70 ++++++++++++++++++++++++++++++++
> >  Documentation/gpu/rfc/index.rst  |  4 ++
> >  2 files changed, 74 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > 
> > diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> > new file mode 100644
> > index 000000000000..2d3f79a6c30a
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > @@ -0,0 +1,70 @@
> > +===============
> > +GPU SVM Section
> > +===============
> > +
> > +Agreed upon design principles
> > +=============================
> > +
> > +* migrate_to_ram path
> > +    * Rely on core MM concepts (migration ptes, page refs, and page locking)
> > +      only
> > +    * No driver specific locks other than locks for hardware interaction in
> > +      this path
> 
> We have previously been discussing the bo lock to protect the bo from
> eviction during migrate, if the vram allocation is bo-based. This is a
> cross-driver lock with a well-established locking order and I suggest
> we allow this. Apart from that I think the above statement needs some

Not allowing additional locks was suggested by Sima and I think it makes
sense, but I do agree taking a dma-resv in migrate_to_ram would be safe.
However, the way GPU SVM is structured there are no hooks to enable this.

> elaboration: What is the problem we are trying to avoid with driver-
> specific locks, written so that it's easy to understand it's a bad
> idea.
> 

Sure, will try to elaborate.

> > +    * Partial migration is supported
> 
> Exactly what do we mean by partial migration.
> 

Will clarify.

> > +    * Driver handles mixed migrations via retry loops rather than locking
> > +* Eviction
> > +    * Only looking at physical memory datastructures and locks
> 
> as opposed to...
> 

As opposed to looking at virtual memory data structures. Can elaborate on
why this is a bad idea too - basically mremap + fork fall apart when you
start looking at virtual things.

> > +    * No looking at mm/vma structs or relying on those being locked
> 
> We're violating this with the current implementation, aren't we?
> 

Aside from when calling migrate_vma_* or creating the initial range. Can
elaborate on this and certainly say drivers should not look at CPU VMAs.

> 
> > +* GPU fault side
> > +    * mmap_read only used around core MM functions which require this lock
> > +    * Big retry loop to handle all races with the mmu notifier under the gpu
> > +      pagetable locks/mmu notifier range lock/whatever we end up calling
> > +      those
> > +    * Races (especially against concurrent eviction/migrate_to_ram) should
> > +      not be handled on the fault side by trying to hold locks
> 
> This actually contradicts my comment written above about using the bo
> lock to block eviction here. The alternative would be to pin vram
> allocations during migration until the mm_struct has references on the
> allocation, but it'd be good to clarify exactly why locking here is a
> bad idea, and why we can't rely on lockdep?
> 

I'll try to clarify.
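To make the fault-side picture concrete, below is a rough sketch of the
big retry loop I have in mind. It follows the standard pattern from
Documentation/mm/hmm.rst; struct gpu_fault_ctx, gpu_handle_fault() and
gpu_bind_range() are made-up placeholder names, only the core MM calls
(mmap_read_lock(), hmm_range_fault(), mmu_interval_read_begin()/
mmu_interval_read_retry()) are real. The point is that races against
eviction / migrate_to_ram are resolved by retrying, not by holding extra
locks across the core MM calls.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Placeholder driver state - names are made up for illustration only */
struct gpu_fault_ctx {
	struct mmu_interval_notifier notifier;
	struct mm_struct *mm;
	struct mutex pt_lock;	/* GPU pagetable / notifier lock */
	unsigned long start, end;
	unsigned long *pfns;	/* sized for (end - start) >> PAGE_SHIFT */
};

/* Stand-in for the driver writing GPU PTEs for the faulted range */
static int gpu_bind_range(struct gpu_fault_ctx *ctx, struct hmm_range *range)
{
	return 0;
}

static int gpu_handle_fault(struct gpu_fault_ctx *ctx)
{
	struct hmm_range range = {
		.notifier = &ctx->notifier,
		.start = ctx->start,
		.end = ctx->end,
		.hmm_pfns = ctx->pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int err;

again:
	/* Snapshot the notifier seqno before collecting pages */
	range.notifier_seq = mmu_interval_read_begin(&ctx->notifier);

	/* mmap_read lock only around the core MM call which requires it */
	mmap_read_lock(ctx->mm);
	err = hmm_range_fault(&range);
	mmap_read_unlock(ctx->mm);
	if (err == -EBUSY)
		goto again;
	if (err)
		return err;

	/*
	 * Publish under the GPU pagetable / notifier lock. If an
	 * invalidation (unmap, eviction, migrate_to_ram, ...) raced with
	 * us, drop the lock and retry instead of trying to block it.
	 */
	mutex_lock(&ctx->pt_lock);
	if (mmu_interval_read_retry(&ctx->notifier, range.notifier_seq)) {
		mutex_unlock(&ctx->pt_lock);
		goto again;
	}
	err = gpu_bind_range(ctx, &range);
	mutex_unlock(&ctx->pt_lock);

	return err;
}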
> > +* Physical memory to virtual backpointer
> > +    * Does not work, no pointers from physical memory to virtual should
> > +      exist
> 
> We actually still have the private zdd structure, but it's strictly not
> to virtual but to the allocation metadata. Is it verified that the
> zone_device_data field is allowed to be modified by the pagemap between
> allocation and migration?
> 

We don't modify the zdd between allocation and migration aside from its
ref count.

> > +* GPU pagetable locking
> > +    * Notifier lock only protects range tree, pages, pagetable entries, and
> > +      mmu notifier seqno tracking, it is not a global lock to protect
> > +      against races
> > +    * All races handled with big retry as mentioned above
> 
> Adding a note here about "pages valid" for subranges rather than
> relying on the wider notifier seqno. I.e. a subrange can be valid even
> if the notifier seqno says otherwise.
> 

Sure.

> Performance considerations:
> Perhaps mention that notifier (core mm) performance is more important
> than gpu fault (driver) performance when considering optimizations that
> improve one at the cost of the other?
> 

Hmm, that is kinda speculation IMO. I have heard that feedback but am
unsure if I agree with it, nor do we have any data to back up that claim.
I'd rather not write down something like this that is based on
speculation. I do agree we should profile the code to really understand
the hot spots and write down our findings once we have done that.
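As an aside, to illustrate what "notifier lock only protects the range
tree / pages / pagetable entries / seqno" means in practice, here is a
rough sketch of the matching invalidation side. Again the driver names
(struct gpu_svm_range, gpu_zap_ptes(), the pages_valid flag) are made up
and this is not the actual drm_gpusvm code; only the stock
mmu_interval_notifier API is real. The driver lock only serializes the
seqno bump, the finer-grained "pages valid" tracking mentioned above, and
the PTE zap - it never tries to block or order itself against core MM
beyond that.

#include <linux/container_of.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Placeholder driver state - names made up for illustration only */
struct gpu_svm_range {
	struct mmu_interval_notifier notifier;
	struct mutex *pt_lock;	/* GPU pagetable / notifier lock */
	bool pages_valid;	/* finer grained than the notifier seqno */
};

/* Stand-in for zapping the GPU PTEs covering the invalidated span */
static void gpu_zap_ptes(struct gpu_svm_range *svm_range,
			 unsigned long start, unsigned long end)
{
}

static bool gpu_svm_invalidate(struct mmu_interval_notifier *mni,
			       const struct mmu_notifier_range *range,
			       unsigned long cur_seq)
{
	struct gpu_svm_range *svm_range =
		container_of(mni, struct gpu_svm_range, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(svm_range->pt_lock);

	/*
	 * Bump the seqno under the same lock the fault side uses for
	 * mmu_interval_read_retry(); that is all the race handling the
	 * notifier does - the fault side simply retries.
	 */
	mmu_interval_set_seq(mni, cur_seq);
	svm_range->pages_valid = false;
	gpu_zap_ptes(svm_range, range->start, range->end);

	mutex_unlock(svm_range->pt_lock);

	return true;
}

static const struct mmu_interval_notifier_ops gpu_svm_notifier_ops = {
	.invalidate = gpu_svm_invalidate,
};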
> > +
> > +Overview of current design
> > +==========================
> > +
> > +Current design is simple as possible to get a working basline in which can built
> 
> can be built
> 

+1

Matt

> > +upon.
> > +
> > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > +   :doc: Overview
> > +   :doc: Locking
> > +   :doc: Migrataion
> > +   :doc: Partial Unmapping of Ranges
> > +   :doc: Examples
> > +
> > +Possible future design features
> > +===============================
> > +
> > +* Concurrent GPU faults
> > +    * CPU faults are concurrent so makes sense to have concurrent GPU faults
> > +    * Should be possible with fined grained locking in the driver GPU
> > +      fault handler
> > +    * No expected GPU SVM changes required
> > +* Ranges with mixed system and device pages
> > +    * Can be added if required to drm_gpusvm_get_pages fairly easily
> > +* Multi-GPU support
> > +    * Work in progress and patches expected after initially landing on GPU
> > +      SVM
> > +    * Ideally can be done with little to no changes to GPU SVM
> > +* Drop ranges in favor of radix tree
> > +    * May be desirable for faster notifiers
> > +* Compound device pages
> > +    * Nvidia, AMD, and Intel all have agreed expensive core MM functions in
> > +      migrate device layer are a performance bottleneck, having compound
> > +      device pages should help increase performance by reducing the number
> > +      of these expensive calls
> > +* Higher order dma mapping for migration
> > +    * 4k dma mapping adversely affects migration performance on Intel
> > +      hardware, higher order (2M) dma mapping should help here
> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > index 476719771eef..396e535377fb 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -16,6 +16,10 @@ host such documentation:
> >  * Once the code has landed move all the documentation to the right places in
> >    the main core, helper or driver sections.
> >  
> > +.. toctree::
> > +
> > +   gpusvm.rst
> > +
> >  .. toctree::
> > 
> >    i915_gem_lmem.rst
> 
> Thanks,
> Thomas
> 