On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> Continuation of SVM work by Oak Zeng [1][2] based on community feedback.
> Introduces GPU SVM layer and new Xe uAPI. Supports GPU page faults for
> system allocations (e.g., malloc), runtime allocations (e.g., binding a
> BO), migration to and from VRAM, and unified eviction (BO and SVM VRAM
> allocations can evict each other). Fully tested; more on this below.
>
> The patch breakdown is as follows:
>
> 1. Preparation patches already on the list [3].
>    - Patches 1-3.
>    - Please refrain from reviewing these here.
> 2. New migrate layer functionality.
>    - Patch 4.
>    - Required for eviction to avoid locking inversion between
>      dma-resv and mmap lock.
> 3. GPU SVM.
>    - Patch 5.
>    - This is what needs community review.
>    - Inspired by GPUVM.
>    - Kernel doc should explain design principles.
>    - There is certainly room for optimization of the implementation
>      and improvements with existing core MM interaction. Pulling in
>      pending DMA mapping work [4] and additional core MM support
>      for SVM is also likely desired. However, this serves as a good
>      starting point for any SVM discussions and could be used as a
>      stepping stone to future core MM work.
> 4. Basic SVM support in Xe (i.e., SRAM backing only).
>    - Patches 6-15.
>    - The uAPI in the patch could benefit from community input.
> 5. SVM VRAM migration support in Xe.
>    - Patches 16-23.
>    - Using TTM BOs for SVM VRAM allocations could use community
>      input. Patch 23 has a detailed explanation of this design
>      choice in the commit message.
> 6. SVM eviction support in Xe.
>    - Patch 24.
>    - Should work with exhaustive eviction [5] when it merges.
> 7. Xe SVM debug / tuning.
>    - Patches 25-28.
>
> Kernel documentation and commit messages are relatively light, aside
> from the GPU SVM and uAPI patches, as this is an RFC.
>
> Testing has been conducted quite thoroughly with a new IGT [6]. Various
> system allocation types (malloc, mmap, mmap flags, huge pages, different
> sizes, different alignments), mixing runtime allocations, unmapping
> corners, invalid faults, and eviction have been tested. Testing scales
> from a single thread to multiple threads and multiple processes. Tests
> pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.
>
> 1. Multiple GPU support.
>    - This is likely to follow or occur in parallel to this work.
> 2. Userptr unification with GPU SVM.
>    - This is essentially designed in my head (likely involving a
>      few new GPU SVM layer functions) but would require some fairly
>      invasive changes to Xe KMD to test out. Therefore, I would
>      like GPU SVM to be reviewed first before proceeding with these
>      changes.
> 3. Madvise and prefetch IOCTLs.
>    - This is likely to follow or occur in parallel to this work.
>
> Given the size of the series, I have pushed a GitLab branch for
> reference [7].
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/128910/
> [2] https://patchwork.freedesktop.org/series/132229/
> [3] https://patchwork.freedesktop.org/series/137805/
> [4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@xxxxxxxxxx/
> [5] https://patchwork.freedesktop.org/series/133643/
> [6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
> [7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads

Ok, rather late, I wanted to type this up 2 weeks ago or so, but alas here it
is finally.
I think with all the experiments and discussions I have a fairly clear
understanding of the really tricky parts of svm (thanks a lot to Matt for all
the work done). From my side the key points, sorted roughly by how important I
think they are:

1. migrate_to_ram path

I think this is the absolute centerpiece of making sure we're aligned with
core mm and don't suffer from deadlocks or livelocks fundamental to the gpusvm
library design. So this part imo needs to be solid for the first version we
merge. But of course any core mm changes Matt prototyped shouldn't gate
merging the drm side since they're nicely decoupled, we only need those to
validate the design in testing.

I think the key points are:

- We rely on the migration pte, temporary page references and the page lock
  only, which with the core mm changes Matt worked on gives us guaranteed
  reliable migration back to system memory. And we need that, or svm
  essentially becomes unusable as a concept.

- We need to support partial migration, including the worst case fallback of
  only migrating that single page core mm managed to trylock for us while
  holding the pagetable lock.

  Since we have guaranteed migration back to system memory we can make the
  assumption on the gpu fault handling side that we will only ever handle
  ranges that are entirely in vram (by throwing any partial migrations out).
  That needs a retry loop on the gpu fault side, but otherwise I no longer see
  an issue with that assumption, so partial vram support on the fault side
  isn't needed for merging, or even later, until we have a driver that
  requires it.

- No other driver locks related to that memory range in any way are allowed,
  and ideally we also test with the forced fallback to mmap_read lock in
  do_swap_page removed, i.e. calling migrate_to_ram while only holding the
  read vma lock. Of course driver locks for blitting are allowed, it's only
  locks related to managing physical memory which are problematic and could
  result in deadlocks.

- The drm side must uphold the guarantee of not having elevated page
  references without holding the page lock. Otherwise there's a race and we
  cannot guarantee migration to sram.

- Also through the try_to_migrate maze we'll hit our own gpu pte invalidate
  paths, so there are some requirements there too. But I've put the discussion
  for that at the very bottom, since most of the gpu pte locking questions are
  imo not that important, and definitely not important for the first version
  we merge.

Everything else below I think we can sort out post merge and just need rough
alignment on the design.

2. eviction

Requirements much like migrate_to_ram, because otherwise we break the
migration guarantee:

- Only look at physical memory datastructures and locks, no looking at mm/vma
  structs or relying on those being locked. We rely entirely on the reverse
  maps from try_to_migrate to find all the mappings, on both the cpu and gpu
  side (on the cpu side only zone device swap or migration pte entries ofc).

- Partial migration needs to work to make sure we can get out of any low
  memory bind.

3. gpu fault side

- We can only rely on mmap_read for hmm_range_fault, and ideally we should
  assume it's not held anywhere else, since with per-vma locking I'd expect
  the mm/vma locking will move into hmm_range_fault. This also means no
  looking at the vma beyond just passing it around as currently needed by core
  mm functions.

- Big retry loop to handle all races with the mmu notifier under the gpu
  pagetable locks/mmu notifier range lock/whatever we end up calling those;
  rough sketch of the usual pattern right below.
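To make that retry loop concrete, here's roughly the pattern I have in mind.
This is just a minimal sketch of the standard hmm_range_fault flow; my_range,
pt_lock and my_bind_pages are made-up illustration names, not the gpusvm/Xe
structures:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

/* Purely illustrative driver-side range tracking, not the gpusvm structs. */
struct my_range {
	struct mmu_interval_notifier notifier;
	spinlock_t pt_lock;	/* gpu pagetable + notifier seqno lock */
	unsigned long start, end;
	unsigned long *pfns;	/* HMM_PFN_* output array */
};

/* Hypothetical: write the gpu ptes from the hmm pfn array. */
static void my_bind_pages(struct my_range *r, unsigned long *pfns);

static int my_gpu_fault(struct my_range *r, struct mm_struct *mm)
{
	struct hmm_range range = {
		.notifier = &r->notifier,
		.start = r->start,
		.end = r->end,
		.hmm_pfns = r->pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(&r->notifier);

	/* Only mmap_read around hmm_range_fault, nothing else. */
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;

	/*
	 * The pagetable lock only protects the gpu ptes plus the seqno
	 * check; any race with invalidation is handled by retrying, not
	 * by holding more locks.
	 */
	spin_lock(&r->pt_lock);
	if (mmu_interval_read_retry(&r->notifier, range.notifier_seq)) {
		spin_unlock(&r->pt_lock);
		goto again;
	}
	my_bind_pages(r, range.hmm_pfns);
	spin_unlock(&r->pt_lock);

	return 0;
}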
Races (especially against concurrent eviction/migrate_to_ram) should _not_ be
handled on the fault side by trying to hold locks instead.

- Long term I think we need to be able to handle concurrent faults, even on hw
  where there's only one gpu fault handling queue. For performance we
  absolutely want to prefault aggressively, and that likely happens through
  cpu ioctls that are entirely independent from the gpu fault handling. Short
  term, enough (driver-side) locking to make sure this doesn't go boom is
  enough; I think just some design goal documentation here on how to achieve
  that is all we need.

4. physical memory to virtual backpointer

No. Doesn't work :-) Also it's only used in the eviction/migrate_to_ram path
and I think Matt already fixed this all anyway.

5. gpu pagetable locking

Or mmu notifier range locking, or whatever you want to call it. Like on the
cpu side this should _only_ protect the pagetable entries and, additionally
for us, the mmu notifier seqno tracking, nothing else. Any races due to
concurrent eviction/migrate_to_ram/gpu fault/prefault need to be handled by
retrying outside of holding the pagetable locks. If we try to impose
additional consistency guarantees we'll fall over and have a livelock/deadlock
fight with core mm in migrate_to_ram. There's a rough sketch of the invalidate
side of this at the very bottom of this mail.

This part is required I think for the first version, but we already need that
anyway to make migrate_to_ram work properly.

For the actual data structure/locking design I think anything on the design
line between a single global lock and the over-the-top scalable radix tree
with per-pagetable (spin)locks of the core mm is fine. The design here with 3
levels (mmu notifier, range, struct page) wouldn't be my first choice, but it
clearly fits on that line, so imo it is fine for initial merging. We might
want to make sure that the range locking (I guess mostly relevant for the
invalidate side, drivers don't see much else) is somewhat abstracted so we can
easily change that post-merge, but that's not required imo at all.

For consensus documentation I'd recommend a todo or design documentation
patch, where we put down both the current design and why it's like that, and
some of the longer term goals. Then get that acked (imo it needs at least one
other driver that's seriously interested in this, plus I think an ack from
Danilo for the gpuvm interactions), then merge that. SVM is tricky enough that
I think this would be really useful to make sure we're not unnecessarily stuck
in limbo.
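As promised above, a minimal sketch of the invalidate side for point 5,
reusing the made-up my_range from the earlier sketch (my_zap_ptes is likewise
just an illustration name, not the Xe code):

#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

/* Hypothetical: clear the gpu ptes covering [start, end). */
static void my_zap_ptes(struct my_range *r, unsigned long start,
			unsigned long end);

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	struct my_range *r = container_of(mni, struct my_range, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	/*
	 * Only the pagetable lock here: bump the seqno and zap the gpu
	 * ptes covered by the range. No physical memory management locks,
	 * nothing that could feed back into migrate_to_ram/try_to_migrate.
	 */
	spin_lock(&r->pt_lock);
	mmu_interval_set_seq(mni, cur_seq);
	my_zap_ptes(r, range->start, range->end);
	spin_unlock(&r->pt_lock);

	return true;
}

static const struct mmu_interval_notifier_ops my_notifier_ops = {
	.invalidate = my_invalidate,
};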