Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Fri, 26 Apr 2024 16:49:26 +0200

On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > First, the gpu_vma structure is something that partitions the
> > gpu_vm
> > that holds gpu-related range metadata, like what to mirror, desired
> > gpu
> > caching policies etc. These are managed (created, removed and
> > split)
> > mainly from user-space. These are stored and looked up from an rb-
> > tree.
> 
> Except we are talking about SVA here, so all of this should not be
> exposed to userspace.

I think you are misreading. this is on the level "Mirror this region of
the cpu_vm", "prefer this region placed in VRAM", "GPU will do atomic
accesses on this region", very similar to cpu mmap / munmap and
madvise. What I'm trying to say here is that this does not directly
affect the SVA except whether to do SVA or not, and in that case what
region of the CPU mm will be mirrored, and in addition, any gpu
attributes for the mirrored region.

> 
> > Now, when we hit a fault, we want to use hmm_range_fault() to re-
> > populate the faulting PTE, but also to pre-fault a range. Using a
> > range
> > here (let's call this a prefault range for clarity) rather than to
> > insert a single PTE is for multiple reasons:
> 
> I've never said to do a single range, everyone using
> hmm_range_fault()
> has some kind of prefetch populate around algorithm.
> 
> > This is why we've been using dma_map_sg() for these ranges, since
> > it is
> > assumed the benefits gained from 
> 
> This doesn't logically follow. You need to use dma_map_page page by
> page and batch that into your update mechanism.
> 
> If you use dma_map_sg you get into the world of wrongness where you
> have to track ranges and invalidation has to wipe an entire range -
> because you cannot do a dma unmap of a single page from a dma_map_sg
> mapping. This is all the wrong way to use hmm_range_fault.
> 
> hmm_range_fault() is page table mirroring, it fundamentally must be
> page-by-page. The target page table structure must have similar
> properties to the MM page table - especially page by page
> validate/invalidate. Meaning you cannot use dma_map_sg().

To me this is purely an optimization to make the driver page-table and
hence the GPU TLB benefit from iommu coalescing / large pages and large
driver PTEs. It is true that invalidation will sometimes shoot down
large gpu ptes unnecessarily but it will not put any additional burden
on the core AFAICT. For the dma mappings themselves they aren't touched
on invalidation since zapping the gpu PTEs effectively stops any dma
accesses. The dma mappings are rebuilt on the next gpu pagefault,
which, as you mention, are considered slow anyway, but will probably
still reuse the same prefault region, hence needing to rebuild the dma
mappings anyway.

So as long as we are correct and do not adversely affect core mm, If
the gpu performance (for whatever reason) is severely hampered if
large gpu page-table-entries are not used, couldn't this be considered
left to the driver?

And a related question. What about THP pages? OK to set up a single
dma-mapping to those?

> 
> > Second, when pre-faulting a range like this, the mmu interval
> > notifier
> > seqno comes into play, until the gpu ptes for the prefault range
> > are
> > safely in place. Now if an invalidation happens in a completely
> > separate part of the mirror range, it will bump the seqno and force
> > us
> > to rerun the fault processing unnecessarily. 
> 
> This is how hmm_range_fault() works. Drivers should not do hacky
> things to try to "improve" this. SVA granuals should be large, maybe
> not the entire MM, but still quite large. 2M is far to small.
> 
> There is a tradeoff here of slowing down the entire MM vs risking an
> iteration during fault processing. We want to err toward making fault
> processing slowing because fault procesing is already really slow.
> 
> > Hence, for this purpose we
> > ideally just want to get a seqno bump covering the prefault range.
> 
> Ideally, but this is not something we can get for free.
> 
> > That's why finer-granularity mmu_interval notifiers might be
> > beneficial
> > (and then cached for future re-use of the same prefault range).
> > This
> > leads me to the next question:
> 
> It is not the design, please don't invent crazy special Intel things
> on top of hmm_range_fault.

For the record, this is not a "crazy special Intel" invention. It's the
way all GPU implementations do this so far. We're currently catching
up. If we're going to do this in another way, we fully need to
understand why it's a bad thing to do. That's why these questions are
asked.

> 
> > You mention that mmu_notifiers are expensive to register. From
> > looking
> > at the code it seems *mmu_interval* notifiers are cheap unless
> > there
> > are ongoing invalidations in which case using a gpu_vma-wide
> > notifier
> > would block anyway? Could you clarify a bit more the cost involved
> > here?
> 
> The rb tree insertions become expensive the larger the tree is. If
> you
> have only a couple of notifiers it is reasonable.
> 
> > If we don't register these smaller-range interval notifiers, do
> > you think the seqno bumps from unrelated subranges would be a real
> > problem?
> 
> I don't think it is, you'd need to have a workload which was
> agressively manipulating the CPU mm (which is also pretty slow). If
> the workload is doing that then it also really won't like being
> slowed
> down by the giant rb tree.

OK, this makes sense, and will also simplify implementation.

> 
> You can't win with an argument that collisions are likely due to an
> app pattern that makes alot of stress on the MM so the right response
> is to make the MM slower.
> 
> > Finally the size of the pre-faulting range is something we need to
> > tune. 
> 
> Correct.
> 
> > Currently it is cpu vma - wide. I understand you strongly suggest
> > this should be avoided. Could you elaborate a bit on why this is
> > such a
> > bad choice?
> 
> Why would a prefetch have anything to do with a VMA? Ie your app
> calls
> malloc() and gets a little allocation out of a giant mmap() arena -
> you want to prefault the entire arena? Does that really make any
> sense?

Personally, no it doesn't. I'd rather use some sort of fixed-size
chunk. But to rephrase, the question was more into the strong "drivers
should not be aware of the cpu mm vma structures" comment. While I
fully agree they are probably not very useful for determining the size
of gpu prefault regions, is there anything else we should be aware of
here.

Thanks,
Thomas

> 
> Mirroring is a huge PITA, IMHO it should be discouraged in favour of
> SVA. Sadly too many CPUs still canot implement SVA.
> 
> With mirroring there is no good way for the system to predict what
> the
> access pattern is. The only way to make this actually performant is
> for userspace to explicitly manage the mirroring with some kind of
> prefetching scheme to avoid faulting its accesses except in
> extrodinary cases.
> 
> VMA is emphatically not a hint about what to prefetch. You should
> balance your prefetching based on HW performance and related. If it
> is
> tidy for HW to fault around a 2M granual then just do that.
> 
> Jason