RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

Hi Jason,

I went through the conversation between you and Matt. I think we are pretty much aligned. Here is what I take away from this thread:

1) hmm range fault size / gpu page table map size: you prefer a bigger gpu vma size, with the vma sparsely mapped to the gpu. Our vma size is configurable through a user madvise api. We do plan to try one gigantic vma with sparse mapping. That requires us to restructure the driver for a 1 vma : N page-table-state mapping. This will be stage 2 work.

2) invalidation: you prefer a giant notifier. We can consider this if it turns out our current implementation is not performant. Right now we don't know.

3) whether the driver can look up the cpu vma. I think we need this for data migration purposes.


See also comments inline.


> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, April 24, 2024 9:49 AM
> To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Brost,
> Matthew <matthew.brost@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx;
> Welty, Brian <brian.welty@xxxxxxxxx>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
> <krishnaiah.bommu@xxxxxxxxx>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@xxxxxxxxx>; Leon Romanovsky
> <leon@xxxxxxxxxx>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > >
> > > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > > a sgl mapping if an invalidation comes in.
> > > >
> > > > You are right, if we register a huge mmu interval notifier to cover
> > > > the whole address space, then we should use dma map/unmap pages
> to
> > > > map bits of the address space. We will explore this approach.
> > > >
> > > > Right now, in xe driver, mmu interval notifier is dynamically
> > > > registered with small address range. We map/unmap the whole small
> > > > address ranges each time. So functionally it still works. But it
> > > > might not be as performant as the method you said.
> > >
> > > Please don't do this, it is not how hmm_range_fault() should be
> > > used.
> > >
> > > It is intended to be page by page and there is no API surface
> > > available to safely try to construct covering ranges. Drivers
> > > definately should not try to invent such a thing.
> >
> > I need your help to understand this comment. Our gpu mirrors the
> > whole CPU virtual address space. It is the first design pattern in
> > your previous reply (entire exclusive owner of a single device
> > private page table and fully mirrors the mm page table into the
> > device table.)
> >
> > What do you mean by "page by page"/" no API surface available to
> > safely try to construct covering ranges"? As I understand it,
> > hmm_range_fault take a virtual address range (defined in hmm_range
> > struct), and walk cpu page table in this range. It is a range based
> > API.
> 
> Yes, but invalidation is not linked to range_fault so you can get
> invalidations of single pages. You are binding range_fault to
> dma_map_sg but then you can't handle invalidation at all sanely.

Ok, I understand your point now.

Yes, strictly speaking we can get invalidation of a single page. This can be triggered by core mm numa balancing or ksm (kernel samepage merging). My current understanding is that single-page (or few-page) invalidation is not a very common case. The more common cases are invalidation triggered by user munmap, or invalidation triggered by hmm migration itself (in migrate_vma_setup). I will experiment with this.

User munmap obviously triggers range-based invalidation.

The invalidation triggered by hmm vma migration is also range based, since we choose to migrate at vma granularity for the performance reasons explained before.

I agree that in the single-page invalidation case the current code is not performant, because we invalidate the whole vma. What I can do is look at the mmu_notifier_range parameter of the invalidation callback and only invalidate that range. That requires our driver to split the vma state from the page table state. It is a big change; we plan to do it in stage 2.
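
A minimal sketch of what that stage-2 invalidation could look like. struct xe_svm_range and xe_pt_invalidate_range() are made-up names for illustration only, not code in this series; the mmu_interval_notifier calls are the documented core mm API:

#include <linux/mmu_notifier.h>

/* Hypothetical per-chunk driver state, for illustration only. */
struct xe_svm_range {
        struct mmu_interval_notifier notifier;
        unsigned long start, end;
};

static bool xe_svm_invalidate(struct mmu_interval_notifier *mni,
                              const struct mmu_notifier_range *range,
                              unsigned long cur_seq)
{
        struct xe_svm_range *r = container_of(mni, struct xe_svm_range, notifier);
        unsigned long start = max(range->start, r->start);
        unsigned long end = min(range->end, r->end);

        if (!mmu_notifier_range_blockable(range))
                return false;

        /* Driver lock that serializes against the fault path omitted. */
        mmu_interval_set_seq(mni, cur_seq);

        /*
         * Hypothetical helper: zap only [start, end) in the gpu page table
         * and issue one gpu TLB invalidation for that sub-range, instead of
         * tearing down the whole xe_vma.
         */
        xe_pt_invalidate_range(r, start, end);

        return true;
}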

> 
> > From your previous reply ("So I find it a quite strange that this
> > RFC is creating VMA's, notifiers and ranges on the fly "), it seems
> > you are concerning why we are creating vma and register mmu interval
> > notifier on the fly. Let me try to explain it. Xe_vma is a very
> > fundamental concept in xe driver.
> 
> I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> and you do need to ensure the page table manipulation has an API that
> is usable. "populate an entire VMA" / "invalidate an entire VMA" is
> not a sufficient primitive.

I understand that invalidating an entire VMA might not be performant. I will improve this as explained above.

I think whether we populate an entire VMA or only one page is still the driver's choice. For us, populating a whole VMA (2 MiB by default, but overridable by the user) is still the performant option. If we populated one page at a time, we would run into continuous gpu page faults when the gpu accesses the following pages. In most of our compute workloads the gpu needs to process big chunks of data, e.g. many MiB or even GiB, and per our measurements the page fault overhead is huge.
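
For reference, here is roughly what our chunk-granular population amounts to, following the standard mmu_interval_read_begin()/hmm_range_fault() retry pattern. struct xe_svm_range is the same hypothetical struct as in the earlier sketch, and xe_bind_pages() is a stand-in for the dma-map plus gpu PTE write step done under the driver's page table lock:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/slab.h>

static int xe_svm_populate_chunk(struct xe_svm_range *r, struct mm_struct *mm)
{
        unsigned long npages = (r->end - r->start) >> PAGE_SHIFT;
        struct hmm_range range = {
                .notifier = &r->notifier,
                .start = r->start,      /* e.g. one 2 MiB-aligned chunk */
                .end = r->end,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

        range.hmm_pfns = kvcalloc(npages, sizeof(*range.hmm_pfns), GFP_KERNEL);
        if (!range.hmm_pfns)
                return -ENOMEM;

again:
        range.notifier_seq = mmu_interval_read_begin(&r->notifier);
        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(mm);
        if (ret == -EBUSY)
                goto again;
        if (!ret)
                ret = xe_bind_pages(r, &range); /* re-checks notifier_seq */

        kvfree(range.hmm_pfns);
        return ret;
}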

Do you suggest per-page population? Or do you suggest populating the entire address space or the entire memory region? I did look at the RDMA ODP code. In function ib_umem_odp_map_dma_and_lock, it is also a range-based population; it seems to populate the whole memory region each time, though I am not sure.

> 
> > The gpu page table update, invalidation are all vma-based. This
> > concept exists before this svm work. For svm, we create a 2M (the
> > size is user configurable) vma during gpu page fault handler and
> > register this 2M range to mmu interval notifier.
> 
> You can create VMAs, but they can't be fully populated and they should
> be alot bigger than 2M. ODP uses a similar 2 level scheme for it's
> SVA, the "vma" granual is 512M.

Oh, I see. So you are suggesting a much bigger granularity. That makes sense to me. Our design actually supports a much bigger granularity: the migration/population granularity is configurable through a memory advise API, and one of the advises is called "MIGRATION_GRANULARITY". This part of the code is not in my series yet as it is being worked on by Himal, who is also on this email list. We will publish that work soon for review.

> 
> The key thing is the VMA is just a container for the notifier and
> other data structures, it doesn't insist the range be fully populated
> and it must support page-by-page unmap/remap/invalidateion.

Agreed, and I don't see a hard conflict between our implementation and this concept. The linux core mm vma (struct vm_area_struct) represents an address range in the process's address space. The xe driver creates xe_vmas to cover sub-regions of a core mm vma. For example, if the core mm vma is 1 GiB, we might create xe_vmas of 512 MiB or 2 MiB, depending on our MIGRATION_GRANULARITY setting.

As explained, we can support page-by-page map/unmap. Our design makes sure we can map/unmap at any page boundary if we want to; the granularity setting all depends on performance data.


> 
> > Now I try to figure out if we don't create vma, what can we do? We
> > can map one page (which contains the gpu fault address) to gpu page
> > table. But that doesn't work for us because the GPU cache and TLB
> > would not be performant for 4K page each time. One way to think of
> > the vma is a chunk size which is good for GPU HW performance.
> 
> From this perspective invalidation is driven by the range the
> invalidation callback gets, it could be a single page, but probably
> bigger.

Agreed.


> 
> mapping is driven by the range passed to hmm_range_fault() during
> fault handling, which is entirely based on your drivers prefetch
> logic.

In our driver, mapping can be triggered by either prefetch or fault. 

Prefetch is a user API, so the user decides the range.

The range used in the fault path is decided by the MIGRATION_GRANULARITY user setting. The default value is 2 MiB, as mentioned.
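
As a concrete illustration (helper name hypothetical), the fault path simply aligns the faulting address down to that granularity to pick the chunk it faults and maps:

/* Hypothetical helper: pick the chunk to fault around a gpu fault address,
 * aligned to the per-vm MIGRATION_GRANULARITY setting (default SZ_2M). */
static void xe_svm_fault_range(unsigned long fault_addr, unsigned long granularity,
                               unsigned long *start, unsigned long *end)
{
        *start = ALIGN_DOWN(fault_addr, granularity);
        *end = *start + granularity;
}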


> 
> GPU TLB invalidation sequences should happen once, at the end, for any
> invalidation or range_fault sequence regardless. Nothing about "gpu
> vmas" should have anything to do with this.
> 
> > And the mmu notifier... if we don't register the mmu notifier on the
> > fly, do we register one mmu notifier to cover the whole CPU virtual
> > address space (which would be huge, e.g., 0~2^56 on a 57 bit
> > machine, if we have half half user space kernel space split)? That
> > won't be performant as well because for any address range that is
> > unmapped from cpu program, but even if they are never touched by
> > GPU, gpu driver still got a invalidation callback. In our approach,
> > we only register a mmu notifier for address range that we know gpu
> > would touch it.
> 
> When SVA is used something, somewhere, has to decide if a CPU VA
> intersects with a HW VA.
> 
> The mmu notifiers are orginized in an interval (red/black) tree, so if
> you have a huge number of them the RB search becomes very expensive.
> 
> Conversly your GPU page table is organized in a radix tree, so
> detecting no-presence during invalidation is a simple radix
> walk. Indeed for the obviously unused ranges it is probably a pointer
> load and single de-ref to check it.
> 
> I fully expect the radix walk is much, much faster than a huge number
> of 2M notifiers in the red/black tree.
> 
> Notifiers for SVA cases should be giant. If not the entire memory
> space, then at least something like 512M/1G kind of size, neatly
> aligned with something in your page table structure so the radix walk
> can start lower in the tree automatically.

In our implementation, the page table is not organized as a radix tree. Maybe this is an area we can improve. For invalidation, we don't need to walk the page table to figure out which ranges are present, because we only register an mmu notifier when a range is actually mapped to the gpu page table. So every notifier callback is for a range that is valid in the gpu page table.

I agree that many small 2M notifiers can slow down the red/black tree walk on the core mm side. But with a giant notifier, core mm calls back into the driver much more often than with small notifiers - the driver would be called back even when a range is not mapped to the gpu page table.

So I am not sure which approach is faster, but I can experiment with this.
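
For reference, the per-chunk registration that the experiment would compare against is just one mmu_interval_notifier_insert() per range actually mapped to the gpu (struct and ops names hypothetical, invalidate callback as in the earlier sketch):

static const struct mmu_interval_notifier_ops xe_svm_notifier_ops = {
        .invalidate = xe_svm_invalidate,        /* as sketched earlier */
};

static int xe_svm_register_range(struct xe_svm_range *r, struct mm_struct *mm)
{
        /*
         * One notifier per chunk that is actually mapped to the gpu. The
         * giant-notifier alternative would insert a single notifier covering
         * the whole VA space and rely on a gpu page table walk to skip
         * non-present ranges in the invalidate callback.
         */
        return mmu_interval_notifier_insert(&r->notifier, mm, r->start,
                                            r->end - r->start,
                                            &xe_svm_notifier_ops);
}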


> 
> > > > For example, when gpu page fault happens, you look
> > > > up the cpu vm_area_struct for the fault address and create a
> > > > corresponding state/struct. And people can have as many cpu
> > > > vm_area_struct as they want.
> > >
> > > No you don't.
> >
> > See explains above. I need your help to understand how we can do it
> > without a vma (or chunk). One thing GPU driver is different from
> > RDMA driver is, RDMA doesn't have device private memory so no
> > migration.
> 
> I want to be very clear, there should be no interaction of your
> hmm_range_fault based code with MM centric VMAs. You MUST NOT look
> up
> the CPU vma_area_struct in your driver.

Without looking up the cpu vma, we can't even decide whether the gpu accessed a valid address or not.

When the GPU accesses an address, valid or not (e.g., an out-of-bound access), the hardware generates a page fault. If we don't look up the cpu vma, how do we determine whether the gpu made an out-of-bound access?

Furthermore, for migration we call the hmm helper migrate_vma_setup, which takes a struct migrate_vma parameter. struct migrate_vma has a vma field. If we don't look up the cpu vma, how do we fill in this field?
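
To make the question concrete, here is the kind of lookup I mean (a sketch only, with hypothetical xe_ naming): without the cpu vma we can neither reject an out-of-bound access nor fill the vma field required by migrate_vma_setup:

#include <linux/migrate.h>
#include <linux/mm.h>

static int xe_svm_migrate_chunk(struct mm_struct *mm, unsigned long start,
                                unsigned long end, unsigned long *src,
                                unsigned long *dst, void *pgmap_owner)
{
        struct migrate_vma args = {};
        struct vm_area_struct *vma;
        int ret;

        mmap_read_lock(mm);
        vma = vma_lookup(mm, start);
        if (!vma || end > vma->vm_end) {
                /* No backing cpu vma: treat as an out-of-bound gpu access. */
                mmap_read_unlock(mm);
                return -EFAULT;
        }

        args.vma = vma;                 /* the field in question */
        args.start = start;
        args.end = end;
        args.src = src;
        args.dst = dst;
        args.pgmap_owner = pgmap_owner;
        args.flags = MIGRATE_VMA_SELECT_SYSTEM;

        ret = migrate_vma_setup(&args);
        if (ret)
                goto out_unlock;

        /* ... allocate device pages, copy the data, fill args.dst ... */

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);
out_unlock:
        mmap_read_unlock(mm);
        return ret;
}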

Oak


> 
> > It only need to dma-map the system memory pages and use
> > them to fill RDMA page table. so RDMA don't need another memory
> > manager such as our buddy. RDMA only deal with system memory which
> > is completely struct page based management. Page by page make 100 %
> > sense for RDMA.
> 
> I don't think this is the issue at hand, you just have some historical
> poorly designed page table manipulation code from what I can
> understand..
> 
> Jason



