Hi, Matt.

On Fri, 2024-08-30 at 13:47 +0000, Matthew Brost wrote:
> On Fri, Aug 30, 2024 at 11:57:33AM +0200, Thomas Hellström wrote:
> > Hi, Matthew,
> > 
> > Agreed the below might not be important just now, but some ideas:
> > 
> > On Thu, 2024-08-29 at 20:56 +0000, Matthew Brost wrote:
> > > Issues with removing a SVM range:
> > > 
> > > - Xe bind code stores invalidation / present state in VMA, this
> > >   would need to be moved to the radix tree. I have a Jira open
> > >   for that work which I believe other developers are going to own.
> > 
> > Yeah, although we shouldn't *design* around xe bind-code and
> > page-table code shortcomings.
> 
> I'm thinking this one certainly should be fixed sooner rather than
> later, which would be helpful.
> 
> But let's also consider the case where we get a bunch of individual
> page invalidates serially for an entire range (I can't remember when
> this happens but I have seen it in my testing, will look into this
> more to figure out exactly when). If we invalidate 1 page at a time
> in the radix tree, each invalidation could potentially result in a
> TLB invalidation interaction with the hardware in cases where larger
> GPU pages are not being used. The TLB invalidation is going to be
> vastly slower than any CPU operation (e.g. RB search, radix tree
> walk). If we key on a range and invalidate the entire range at once
> on the first invalidation, this may end up being significantly
> faster.
> 
> Above is pure speculation though, a lot of what both of us are
> saying is... So another reason I'd like to get apps running to do
> profiling. It would be nice to make design decisions based on data,
> not speculation.

Well, nothing would stop you from adding a configurable invalidation
granularity, even with a radix-tree based approach. You'd just pad the
invalidation range to match the granularity.
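Something along these lines is what I'm picturing (completely untested
sketch; the my_svm_* names and fields below are made up for
illustration, not anything that exists in the drm_gpusvm series):

/* Untested sketch; invented names, not actual drm_gpusvm interfaces. */
static void my_svm_notifier_invalidate(struct my_svm *svm,
				       unsigned long start,
				       unsigned long end)
{
	/* Pad to the configured granularity (power of two, in bytes). */
	start = ALIGN_DOWN(start, svm->inval_granularity);
	end = ALIGN(end, svm->inval_granularity);

	/*
	 * Zap the gpu_ptes in the padded range. If a previous per-page
	 * callback already zapped everything in the range, skip the
	 * (expensive) TLB invalidation entirely.
	 */
	if (my_svm_zap_ptes(svm, start, end))
		my_svm_tlb_invalidate(svm, start, end);
}

That way a storm of single-page notifier callbacks over one range
collapses into at most one TLB invalidation per granule, and the
granularity is just a tunable rather than something baked into the
data structure.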
> > > 
> > > - Where would the dma mapping / device pages be stored?
> > >   - In the radix tree? What if ATS is enabled? We don't have a
> > >     driver owned radix tree. How do we reasonably connect a
> > >     driver owned radix tree to a common GPUSVM layer?
> > 
> > With ATS you mean IOMMU SVA, right? I think we could assume that
> > any user of this code also has a gpu page-table, since otherwise
> > they couldn't be using VRAM and a simpler solution would be in
> > place.
> 
> Fair point.
> 
> > But to that specific question, drm_gpusvm state would live in a
> > drm_gpusvm radix tree and driver-specific stuff in the driver tree.
> > A helper based approach would then call drm_gpusvm_unmap_dma(range),
> > whereas a middle layer would just traverse the tree and unmap.
> 
> Let me consider this. Open to all options.
> 
> > >   - In the notifier? What if the notifier is sparsely populated?
> > >     We would be wasting huge amounts of memory. What if the
> > >     notifier is configured to span the entire virtual address
> > >     space?
> > 
> > Let's assume you use a fake page-table like in xe_pt_walk.c as your
> > "radix tree", adapted to relevant page-sizes; sparsity is not a
> > problem.
> 
> Ok, makes sense I think.
> 
> > > - How does the garbage collector work? We can't allocate memory
> > >   in the notifier so we don't have anything to add to the garbage
> > >   collector. We can't directly modify page tables given you need
> > >   a lock in the path of reclaim.
> > 
> > The garbage collector would operate on the whole invalidated range.
> > In the case of xe, upon zapping under reclaim you mark the
> > individual page-table bos that are to be removed as "invalid", and
> > the garbage collector walks the range removing the "invalid"
> > entries. Subsequent (re-)binding avoids the "invalid" entries
> > (perhaps even helps removing them) and can thus race with the
> > garbage collector. Hence, any ranges implied by the page-table code
> > are eliminated.
> 
> This is pretty much what I came up with too if we didn't have a SVM
> range.
> 
> > > - How do we deal with fault storms (e.g. tons of faults hitting
> > >   the same SVM range in a row)? Without a SVM range there is no
> > >   way to know if a mapping is valid and the GPU page handler can
> > >   be short-circuited.
> > 
> > Perhaps look at the page-table tree and check whether the gpu_pte
> > causing the fault is valid.
> 
> Came up with the same thing.
> 
> > > - Do we have notifier seqno for every PTE?
> > 
> > I'd say no. With this approach it makes sense to have a wide
> > notifier. The seqno now only affects binding of new gpu_ptes, so
> > the problem with a wide notifier becomes that if invalidation
> > occurs to *any* part of the notifier while we're in the read
> > section during binding, we need to
> 
> I have avoided this with drm_gpusvm_range_pages_valid. This isn't
> just an optimization, it is actually required for the 2-tile case to
> be able to safely know when dma pages can be unmapped (i.e. you can't
> dma unmap pages if either tile has a valid mapping).

OK, I still need to read up on that.

Thanks,
Thomas

> 
> Matt
> 
> > rerun the binding. Adding more notifiers to mitigate that would be
> > to optimize faulting performance over core invalidation
> > performance, which Jason asked us to avoid.
> > 
> > /Thomas
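
For the fault-storm case above, the check I had in mind when
suggesting looking at the page-table tree is roughly the following
(sketch only, invented names; it assumes a walker in the style of
xe_pt_walk.c):

/* Sketch; invented names, assumes an xe_pt_walk.c-style walker. */
static int my_svm_handle_fault(struct my_svm *svm, unsigned long addr)
{
	/*
	 * Walk the GPU page-table tree for the faulting address. If
	 * the gpu_pte covering it is still valid, an earlier fault in
	 * the same range already (re)bound it, so short-circuit
	 * without taking the full fault path.
	 */
	if (my_svm_gpu_pte_valid(svm, addr))
		return 0;

	return my_svm_fault_and_bind(svm, addr);
}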