RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, May 3, 2024 9:38 AM
> To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>; Daniel Vetter
> <daniel@xxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-
> xe@xxxxxxxxxxxxxxxxxxxxx; Brost, Matthew <matthew.brost@xxxxxxxxx>;
> Welty, Brian <brian.welty@xxxxxxxxx>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
> <krishnaiah.bommu@xxxxxxxxx>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@xxxxxxxxx>; Leon Romanovsky
> <leon@xxxxxxxxxx>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Thu, May 02, 2024 at 07:25:50PM +0000, Zeng, Oak wrote:
> > Hi Jason,
> >
> > I tried to understand how you supposed us to use hmm range fault... it
> seems you want us to call hmm range fault two times on each gpu page fault:
> 
> > 1.
> > Call Hmm_range_fault first time, pfn of the fault address is set with
> HMM_PFN_REQ_FAULT
> > Other pfns in the PREFAULT_SIZE range will be set as 0
> > Hmm_range fault returns:
> > 	Pfn with 0 flag or HMM_PFN_VALID flag, means a valid pfn
> > 	Pfn with HMM_PFN_ERROR flag, means invalid pfn
> >
> > 2.
> > Then call hmm_range_fault a second time
> > Setting the hmm_range start/end only to cover valid pfns
> > With all valid pfns, set the REQ_FAULT flag
> 
> Why would you do this? The first already did the faults you needed and
> returned all the easy pfns that don't require faulting.

But we have use cases where we want to fault in pages other than the one containing the GPU fault address. For example, the user malloc'ed or mmap'ed an 8MiB buffer and the CPU never touched it before the GPU accessed it. Say the GPU access causes a GPU page fault at the 2MiB offset. The first hmm_range_fault would only fault in the page at that offset, because in the first call we set REQ_FAULT only on the pfn at the 2MiB offset.

In such a case, we would walk all the pfns returned from the first hmm_range_fault to learn which pfns are faultable but not yet faulted in (pfn flag == 0) and which pfns can never be faulted in (pfn flag == HMM_PFN_ERROR), and then call hmm_range_fault again with REQ_FAULT set on all faultable pages.
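
To make this concrete, below is a rough sketch of the two-pass scheme described above. The function name, the fixed 2MiB window and the bare -EBUSY retry loop are made up for illustration; locking against concurrent invalidation and proper error unwinding are omitted.

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/string.h>

/* Illustrative only: not the actual xekmd code. */
static int two_pass_range_fault(struct mmu_interval_notifier *notifier,
                                unsigned long fault_addr,
                                unsigned long *pfns)
{
        unsigned long start = ALIGN_DOWN(fault_addr, SZ_2M);
        unsigned long npages = SZ_2M >> PAGE_SHIFT;
        struct hmm_range range = {
                .notifier = notifier,
                .start = start,
                .end = start + SZ_2M,
                .hmm_pfns = pfns,
                /* honour per-pfn HMM_PFN_REQ_FAULT requests */
                .pfn_flags_mask = HMM_PFN_REQ_FAULT,
        };
        unsigned long i;
        int ret, pass;

        /* Pass 1: request faulting only for the pfn that actually faulted. */
        memset(pfns, 0, npages * sizeof(*pfns));
        pfns[(fault_addr - start) >> PAGE_SHIFT] = HMM_PFN_REQ_FAULT;

        for (pass = 0; pass < 2; pass++) {
                do {
                        range.notifier_seq = mmu_interval_read_begin(notifier);
                        mmap_read_lock(notifier->mm);
                        ret = hmm_range_fault(&range);
                        mmap_read_unlock(notifier->mm);
                } while (ret == -EBUSY);
                if (ret || pass)
                        break;

                /*
                 * Pass 2: request faulting for every pfn that is faultable
                 * but not populated yet; skip pfns that can never fault in.
                 */
                for (i = 0; i < npages; i++) {
                        if (pfns[i] & HMM_PFN_ERROR)
                                pfns[i] = 0;
                        else if (!(pfns[i] & HMM_PFN_VALID))
                                pfns[i] = HMM_PFN_REQ_FAULT;
                }
        }
        return ret;
}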

> 
> > Basically use hmm_range_fault to figure out the valid address range
> > in the first round; then really fault (e.g., trigger cpu fault to
> > allocate system pages) in the second call the hmm range fault.
> 
> You don't fault on prefetch. Prefetch is about mirroring already
> populated pages, it should not be causing new faults.

Maybe this is a difference in wording. We have the following scenario, which we call prefetch (or whatever you prefer to call it):

On a GPU page fault at address A, we want to map an address range (e.g., 2MiB, or whatever size the settings dictate) around A into the GPU page table. The range around A may have no backing pages when the GPU page fault happens, and we want to populate that 2MiB range. We call it prefetch because most of the pages in this range have not been accessed by the GPU yet, but we expect the GPU to access them soon.

You mentioned "already populated pages". Who populated those pages then? Is it a CPU access populated them? If CPU access those pages first, it is true pages can be already populated. But it is also a valid use case where GPU access address before CPU so there is no "already populated pages" on GPU page fault. Please let us know what is the picture in your head. We seem picture it completely differently.



> 
> > Do I understand it correctly?
> 
> No
> 
> > This is strange to me. We should already know the valid address
> > range before we call hmm range fault, because the migration codes
> > need to look up cpu vma anyway. what is the point of the first
> > hmm_range fault?
> 
> I don't really understand why the GPU driver would drive migration off
> of faulting. It doesn't make alot of sense, especially if you are
> prefetching CPU pages into the GPU and thus won't get faults for them.
> 

Migration on GPU fault is definitely something we want to do. Unlike the RDMA case, a GPU has its own device memory, whose size is comparable to, and sometimes larger than, CPU system memory. We rely heavily on device memory for performance; this is why HMM introduced the MEMORY_DEVICE_PRIVATE memory type.

On a GPU page fault, the driver decides whether to migrate pages to device memory based on a number of factors, such as user hints and atomic-correctness requirements. We may migrate, or we may leave the pages in CPU system memory, all tuned for performance.
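
Purely as an illustration of the kind of policy decision meant here (these names are hypothetical and not from the patch series):

#include <linux/types.h>

/* Hypothetical placement policy on GPU fault; illustration only. */
enum fault_placement {
        PLACE_SYSMEM,           /* leave pages in CPU system memory */
        PLACE_VRAM,             /* migrate pages to device memory */
};

struct fault_hints {
        bool prefer_vram;       /* user hint, e.g. from a madvise-style uapi */
        bool needs_atomics;     /* GPU atomics may require device-local pages */
};

static enum fault_placement
decide_placement(const struct fault_hints *hints, bool vram_has_room)
{
        if (hints->needs_atomics)
                return PLACE_VRAM;      /* correctness requirement wins */
        if (hints->prefer_vram && vram_has_room)
                return PLACE_VRAM;      /* performance hint, best effort */
        return PLACE_SYSMEM;
}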


> If your plan is to leave the GPU page tables unpopulated and then
> migrate on every fault to try to achieve some kind of locality then
> you'd want to drive the hmm prefetch on the migration window (so you
> don't populate unmigrated pages) and hope for the best.


That is exactly what we do. We decide the migration window by:

1) looking up the CPU VMA that contains the GPU fault address;
2) deciding a migration window according to the migration granularity setting (e.g., 2MiB), inside the CPU VMA. If the CPU VMA is smaller than the migration granularity, the migration window is the whole CPU VMA range; otherwise, only part of the VMA range is migrated.

We then prefetch the migration window. If migration happened, it is true that all pages are already populated. But there are use cases where migration is skipped and we want to fault in through hmm_range_fault; see the explanation above.
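
As a sketch of the window selection in the two steps above (the helper name and the fixed 2MiB granularity are illustrative; the caller is assumed to hold the mmap read lock):

#include <linux/mm.h>
#include <linux/sizes.h>

static int pick_migration_window(struct mm_struct *mm,
                                 unsigned long fault_addr,
                                 unsigned long *start, unsigned long *end)
{
        struct vm_area_struct *vma;

        /* 1) Look up the CPU VMA containing the GPU fault address. */
        vma = find_vma_intersection(mm, fault_addr, fault_addr + 1);
        if (!vma)
                return -EFAULT;

        /* 2) VMA smaller than the granularity: take the whole VMA ... */
        if (vma->vm_end - vma->vm_start <= SZ_2M) {
                *start = vma->vm_start;
                *end = vma->vm_end;
                return 0;
        }

        /* ... otherwise a granularity-aligned window around the fault. */
        *start = max(vma->vm_start, ALIGN_DOWN(fault_addr, SZ_2M));
        *end = min(vma->vm_end, *start + SZ_2M);
        return 0;
}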

> 
> However, the migration stuff should really not be in the driver
> either. That should be core DRM logic to manage that. It is so
> convoluted and full of policy that all the drivers should be working
> in the same way.

Completely agreed. Moving the migration infrastructure to DRM is part of our plan. We want to first prove the concept in the xekmd driver, then move the helpers and infrastructure to DRM. A driver should only need to implement a few callback functions for device-specific page table programming and device migration, and call some DRM common functions during GPU page fault handling.
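
Purely as an illustration of that split (these structures and names are hypothetical; nothing like this exists today), DRM core would own the common policy and a driver would only fill in a small ops table such as:

#include <linux/mm_types.h>

struct drm_svm;                 /* hypothetical core-managed context */

struct drm_svm_ops {
        /* Mirror a CPU address range into the device page table. */
        int (*bind_pages)(struct drm_svm *svm, unsigned long start,
                          unsigned long end, const unsigned long *hmm_pfns);
        /* Tear down device PTEs on mmu-notifier invalidation. */
        void (*invalidate)(struct drm_svm *svm, unsigned long start,
                           unsigned long end);
        /* Copy pages between system memory and device memory. */
        int (*migrate_copy)(struct drm_svm *svm, struct page **src,
                            struct page **dst, unsigned long npages);
};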
 

> 
> The GPU fault handler should indicate to some core DRM function that a
> GPU memory access occured and get back a prefetch window to pass into
> hmm_range_fault. The driver will mirror what the core code tells it.

No objections to this approach.
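
For concreteness, the core helper you describe might look roughly like the following; the name and signature are hypothetical:

struct drm_svm;                 /* hypothetical core-managed context */

struct drm_svm_window {
        unsigned long start;
        unsigned long end;
};

/*
 * Hypothetical: the GPU fault handler reports the faulting access and gets
 * back a window to pass into hmm_range_fault().
 */
int drm_svm_fault_to_window(struct drm_svm *svm, unsigned long fault_addr,
                            struct drm_svm_window *window);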

Oak

> 
> Jason



