On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > > > I saw, I am saying this should not be done. You cannot unmap
> > > > bits of an sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in the xe driver, the mmu interval notifier is dynamically
> > > registered with a small address range. We map/unmap whole small
> > > address ranges each time, so functionally it still works, but it
> > > might not be as performant as the method you described.
> >
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> >
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definitely should not try to invent such a thing.
>
> I need your help to understand this comment. Our gpu mirrors the
> whole CPU virtual address space. It is the first design pattern in
> your previous reply (entire exclusive owner of a single device
> private page table and fully mirrors the mm page table into the
> device table).
>
> What do you mean by "page by page"/"no API surface available to
> safely try to construct covering ranges"? As I understand it,
> hmm_range_fault takes a virtual address range (defined in the
> hmm_range struct) and walks the cpu page table in this range. It is
> a range-based API.

Yes, but invalidation is not linked to range_fault, so you can get
invalidations of single pages. You are binding range_fault to
dma_map_sg, but then you can't handle invalidation at all sanely.

> From your previous reply ("So I find it a quite strange that this
> RFC is creating VMA's, notifiers and ranges on the fly"), it seems
> you are concerned about why we are creating vmas and registering mmu
> interval notifiers on the fly. Let me try to explain it. Xe_vma is a
> very fundamental concept in the xe driver.

I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA
based, and you do need to ensure the page table manipulation has an
API that is usable. "populate an entire VMA" / "invalidate an entire
VMA" is not a sufficient primitive.

> The gpu page table update and invalidation are all vma-based. This
> concept existed before this svm work. For svm, we create a 2M vma
> (the size is user configurable) during the gpu page fault handler
> and register this 2M range with a mmu interval notifier.

You can create VMAs, but they can't be fully populated and they
should be a lot bigger than 2M. ODP uses a similar 2-level scheme for
its SVA; the "vma" granule is 512M.

The key thing is the VMA is just a container for the notifier and
other data structures; it doesn't insist the range be fully populated
and it must support page-by-page unmap/remap/invalidation.

> Now I am trying to figure out: if we don't create a vma, what can we
> do? We can map one page (which contains the gpu fault address) into
> the gpu page table. But that doesn't work for us because the GPU
> cache and TLB would not be performant with a 4K page each time. One
> way to think of the vma is as a chunk size which is good for GPU HW
> performance.

From this perspective, invalidation is driven by the range the
invalidation callback gets; it could be a single page, but is
probably bigger.
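To be concrete, the invalidation side I have in mind is roughly the
standard mmu_interval_notifier pattern below (see
Documentation/mm/hmm.rst). This is an untested sketch; struct xe_svm,
xe_svm_clear_ptes() and xe_svm_tlb_flush_range() are made-up
stand-ins for the driver's device radix tree and TLB flush plumbing,
not actual xe code:

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/rwsem.h>

/* Stand-in driver state, not the real xe structures */
struct xe_svm {
	struct mmu_interval_notifier notifier;
	struct rw_semaphore lock;	/* serializes PT updates vs invalidation */
	/* device radix tree, TLB flush plumbing, etc. */
};

/* Placeholders for the driver's device page table manipulation */
void xe_svm_clear_ptes(struct xe_svm *svm, unsigned long start,
		       unsigned long end);
void xe_svm_tlb_flush_range(struct xe_svm *svm, unsigned long start,
			    unsigned long end);

static bool xe_svm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct xe_svm *svm = container_of(mni, struct xe_svm, notifier);
	/* Clamp to the overlap of the invalidation and the notifier */
	unsigned long start = max(range->start, mni->interval_tree.start);
	unsigned long end = min(range->end, mni->interval_tree.last + 1);

	if (!mmu_notifier_range_blockable(range))
		return false;

	down_write(&svm->lock);
	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * Zap only the invalidated pages out of the device radix tree,
	 * leave everything else mapped, and do one GPU TLB flush at the
	 * end covering the whole zapped span.
	 */
	xe_svm_clear_ptes(svm, start, end);
	xe_svm_tlb_flush_range(svm, start, end);

	up_write(&svm->lock);
	return true;
}

static const struct mmu_interval_notifier_ops xe_svm_notifier_ops = {
	.invalidate = xe_svm_invalidate,
};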
Mapping is driven by the range passed to hmm_range_fault() during
fault handling, which is entirely based on your driver's prefetch
logic.

GPU TLB invalidation sequences should happen once, at the end, for
any invalidation or range_fault sequence regardless. Nothing about
"gpu vmas" should have anything to do with this.

> And the mmu notifier... if we don't register the mmu notifier on
> the fly, do we register one mmu notifier to cover the whole CPU
> virtual address space (which would be huge, e.g., 0~2^56 on a
> 57-bit machine, if we have a half/half user space / kernel space
> split)? That won't be performant either, because for any address
> range that is unmapped from the cpu program, even if it is never
> touched by the GPU, the gpu driver still gets an invalidation
> callback. In our approach, we only register a mmu notifier for
> address ranges that we know the gpu will touch.

When SVA is used, something, somewhere, has to decide if a CPU VA
intersects with a HW VA.

The mmu notifiers are organized in an interval (red/black) tree, so
if you have a huge number of them the RB search becomes very
expensive.

Conversely, your GPU page table is organized in a radix tree, so
detecting non-presence during invalidation is a simple radix walk.
Indeed, for the obviously unused ranges it is probably a pointer load
and a single dereference to check.

I fully expect the radix walk is much, much faster than a huge number
of 2M notifiers in the red/black tree.

Notifiers for SVA cases should be giant. If not the entire memory
space, then at least something like a 512M/1G kind of size, neatly
aligned with something in your page table structure so the radix
walk can start lower in the tree automatically.

> > > For example, when a gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_structs as they want.
> >
> > No you don't.
>
> See the explanation above. I need your help to understand how we can
> do it without a vma (or chunk). One way the GPU driver is different
> from the RDMA driver is that RDMA doesn't have device private
> memory, so no migration.

I want to be very clear: there should be no interaction of your
hmm_range_fault based code with MM centric VMAs. You MUST NOT look up
the CPU vm_area_struct in your driver.

> It only needs to dma-map the system memory pages and use them to
> fill the RDMA page table, so RDMA doesn't need another memory
> manager such as our buddy allocator. RDMA only deals with system
> memory, which is completely struct page based. Page by page makes
> 100% sense for RDMA.

I don't think this is the issue at hand; you just have some
historical, poorly designed page table manipulation code, from what I
can understand.

Jason
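For completeness, here is an untested sketch of the fault side of the
same pattern, again following Documentation/mm/hmm.rst: no
vm_area_struct lookup, each page handled individually, one retry loop
against the notifier sequence, and a single TLB operation at the end.
It reuses the made-up struct xe_svm from the earlier sketch;
xe_svm_bind_page() and xe_svm_tlb_flush_range() are placeholders, not
real xe functions:

/* Same stand-in struct xe_svm and includes as the previous sketch,
 * plus <linux/slab.h> for kvmalloc_array()/kvfree().
 */

/* Placeholder: dma-map the page and write one device PTE */
void xe_svm_bind_page(struct xe_svm *svm, unsigned long addr,
		      struct page *page);

static int xe_svm_fault_range(struct xe_svm *svm, struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	struct hmm_range range = {
		.notifier = &svm->notifier,
		.start = start,
		.end = end,
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
		.dev_private_owner = svm,
	};
	unsigned long i;
	int ret;

	range.hmm_pfns = kvmalloc_array(npages, sizeof(*range.hmm_pfns),
					GFP_KERNEL);
	if (!range.hmm_pfns)
		return -ENOMEM;

again:
	range.notifier_seq = mmu_interval_read_begin(&svm->notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		goto out;
	}

	down_write(&svm->lock);
	/* Raced with an invalidation since the walk? Start over. */
	if (mmu_interval_read_retry(&svm->notifier, range.notifier_seq)) {
		up_write(&svm->lock);
		goto again;
	}

	/* dma-map and bind each page individually into the device radix tree */
	for (i = 0; i < npages; i++)
		xe_svm_bind_page(svm, start + (i << PAGE_SHIFT),
				 hmm_pfn_to_page(range.hmm_pfns[i]));

	/* one GPU TLB flush at the end, regardless of how many pages */
	xe_svm_tlb_flush_range(svm, start, end);
	up_write(&svm->lock);
out:
	kvfree(range.hmm_pfns);
	return ret;
}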