Hi, Sima,

On Mon, 2024-09-02 at 14:33 +0200, Daniel Vetter wrote:
> Jumping in here in the middle, since I think it's a solid place to
> drop my idea of "align with core mm" gpusvm locking ...
>
> On Thu, Aug 29, 2024 at 08:56:23PM +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 09:18:29PM +0200, Thomas Hellström wrote:
> > Issues with removing a SVM range:
> >
> > - Xe bind code stores invalidation / present state in the VMA; this
> >   would need to be moved to the radix tree. I have a Jira open for
> >   that work which I believe other developers are going to own.
> > - Where would the dma mapping / device pages be stored?
> >   - In the radix tree? What if ATS is enabled? We don't have a
> >     driver owned radix tree. How do we reasonably connect a driver
> >     owned radix tree to a common GPUSVM layer?
>
> Yeah this one is really annoying, because the core mm gets away with
> nothing because it can just store the pfn in the pte. And it doesn't
> need anything else. So we probably still need something
> unfortunately ...
>
> >   - In the notifier? What if the notifier is sparsely populated?
> >     We would be wasting huge amounts of memory. What if the
> >     notifier is configured to span the entire virtual address
> >     space?
>
> So if we go with the radix idea, we could model the radix tree to
> exactly match the gpu pagetables. That's essentially what the core mm
> does. Then each pagetable at each level has a spinlock for
> essentially a range lock. The notifier seqno would be stored into
> each pagetable (not the individual entries, that's probably too
> much), which should allow us to very efficiently check whether an
> entire arbitrary va range is still valid on the fault side.

I still wonder whether this should be owned by the driver, though.
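As a concrete illustration of the quoted per-pagetable-seqno scheme (regardless of who ends up owning the tree), here is a minimal single-level sketch. All names and the flat one-node-per-2MiB layout are invented for illustration; a real tree would mirror the GPU pagetable levels and take the per-node spinlock around each seqno update:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One notifier seqno per pagetable node rather than per PTE: the
 * invalidation side bumps a counter on every node it overlaps, and the
 * fault side can validate an arbitrary VA range with one comparison
 * per node instead of one per page. */

#define NODE_SHIFT 21u          /* each node covers a 2MiB aligned chunk */
#define NR_NODES   512u

struct pt_node {
	unsigned long seq;      /* per-node notifier seqno */
};

struct pt_tree {
	struct pt_node node[NR_NODES];
};

/* Invalidation side: bump the seqno of every node the range touches.
 * In real code this would run under that node's (spin)lock. */
static void pt_invalidate(struct pt_tree *t, uint64_t start, uint64_t end)
{
	for (uint64_t i = start >> NODE_SHIFT; i <= (end - 1) >> NODE_SHIFT; i++)
		t->node[i].seq++;
}

/* Fault side, step 1: snapshot the highest seqno covering the range
 * before doing any lockless work. */
static unsigned long pt_read_begin(struct pt_tree *t, uint64_t start,
				   uint64_t end)
{
	unsigned long seq = 0;

	for (uint64_t i = start >> NODE_SHIFT; i <= (end - 1) >> NODE_SHIFT; i++)
		if (t->node[i].seq > seq)
			seq = t->node[i].seq;
	return seq;
}

/* Fault side, step 2: recheck that no node in the range was
 * invalidated since pt_read_begin(); returns true if the fault raced
 * and must be retried. */
static bool pt_read_retry(struct pt_tree *t, uint64_t start, uint64_t end,
			  unsigned long seq)
{
	for (uint64_t i = start >> NODE_SHIFT; i <= (end - 1) >> NODE_SHIFT; i++)
		if (t->node[i].seq > seq)
			return true;
	return false;
}
```

This mirrors the core-mm mmu_interval_read_begin()/mmu_interval_read_retry() pattern, just with the seqno pushed down into the pagetable nodes.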
And if we were optimizing for multiple simultaneous fault processing
with a small granularity, I would agree, but given that gpu pagefaults
are considered so slow they should be avoided, I wonder whether xe's
current approach of a single page-table lock wouldn't suffice, in
addition to a semi-global seqno? For invalidations, I think we
actually currently allow simultaneous overlapping invalidations that
are only protected by the write side of the notifier seqno.

> On the notifier side we can also very efficiently walk arbitrary
> ranges, because the locking is really fine-grained and in an
> adaptive way.
>
> > - How does the garbage collector work? We can't allocate memory in
> >   the notifier so we don't have anything to add to the garbage
> >   collector. We can't directly modify page tables given you need
> >   locks in the path of reclaim.
>
> Probably no more garbage collector, you deal with pages/folios like
> the core mm expects.

Yeah, if the page-table locks are reclaim-safe there is no more need
for a garbage collector, but OTOH, IIRC even in core-mm the
invalidation counterpart, unmap_mapping_range(), can't and doesn't
remove page-table subtrees when called from the address-space side,
whereas zapping when called from the mm side, like
madvise(MADV_DONTNEED), can.

/Thomas

> > - How do we deal with fault storms (e.g. tons of faults hitting
> >   the same SVM range in a row)? Without an SVM range there is no
> >   way to know if a mapping is valid and the GPU page-fault handler
> >   can be short-circuited.
>
> So the core mm sorts this out by allowing faults to be handled in
> parallel, without any lock. Essentially:
> - you get a fault (or prefault)
> - you hold just enough read locks to make sure stuff doesn't
>   disappear. Currently that's mmap_read_lock, but strictly speaking
>   we only need the new-ish per-vma lock.
> - you allocate memory, dma_map, everything else you need
> - you grab that very fine-grained radix tree lock (pagetable locks
>   on the cpu side) and recheck whether you've raced: the mmu
>   notifier seqno and the pte must still be non-present. If that
>   check fails, you bail out and release all the vram/dma_maps you've
>   created.
>
> > - Do we have a notifier seqno for every PTE?
>
> I think per-pagetable, so every node in the radix tree, would make
> sense. If we also go with one lock per pagetable like the cpu mm,
> then tracking a notifier seqno to match makes the most sense imo.
>
> Again, this is entirely aside from the discussion in this subthread
> about understanding the current design and tradeoffs/reasons. Just
> figured this is a good spot to drop this.
> -Sima
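For reference, the quoted fault flow (allocate and dma-map with no lock held, then take the fine-grained pagetable lock only to recheck and either commit or bail) could be sketched as below. All names are invented and the toy is single-threaded, so the pagetable "lock" is implicit; this is an illustration of the pattern, not Xe or GPUSVM code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct pt_leaf {
	unsigned long seq;      /* per-node notifier seqno */
	uint64_t pte;           /* 0 == non-present */
};

/* Before the unlocked work (alloc, dma_map, ...): snapshot the
 * notifier seqno. */
static unsigned long fault_read_begin(const struct pt_leaf *leaf)
{
	return leaf->seq;
}

/* Final step, with the pagetable lock held in real code: commit only
 * if no invalidation ran since fault_read_begin() AND the pte is
 * still non-present. On failure the caller releases the vram /
 * dma_maps it created and retries the fault. */
static bool fault_commit(struct pt_leaf *leaf, unsigned long seq,
			 uint64_t pte)
{
	if (leaf->seq != seq || leaf->pte)
		return false;
	leaf->pte = pte;
	return true;
}

/* Notifier side: zap the pte and bump the seqno so any concurrent
 * fault fails its recheck and bails out. */
static void notifier_invalidate(struct pt_leaf *leaf)
{
	leaf->pte = 0;
	leaf->seq++;
}
```

Note the two-sided check in fault_commit(): the seqno catches an invalidation that raced with this fault, while the pte check catches a parallel fault that won the race and already installed a mapping.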