RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, May 3, 2024 12:28 PM
> To: Zeng, Oak <oak.zeng@xxxxxxxxx>
> Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>; Daniel Vetter
> <daniel@xxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-
> xe@xxxxxxxxxxxxxxxxxxxxx; Brost, Matthew <matthew.brost@xxxxxxxxx>;
> Welty, Brian <brian.welty@xxxxxxxxx>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
> <krishnaiah.bommu@xxxxxxxxx>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@xxxxxxxxx>; Leon Romanovsky
> <leon@xxxxxxxxxx>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Fri, May 03, 2024 at 02:43:19PM +0000, Zeng, Oak wrote:
> > > > 2.
> > > > Then call hmm_range_fault a second time
> > > > Setting the hmm_range start/end only to cover valid pfns
> > > > With all valid pfns, set the REQ_FAULT flag
> > >
> > > Why would you do this? The first already did the faults you needed and
> > > returned all the easy pfns that don't require faulting.
> >
> > But we have use case where we want to fault-in pages other than the
> > page which contains the GPU fault address, e.g., user malloc'ed or
> > mmap'ed 8MiB buffer, and no CPU touching of this buffer before GPU
> > access it. Let's say GPU access caused a GPU page fault a 2MiB
> > place. The first hmm-range-fault would only fault-in the page at
> > 2MiB place, because in the first call we only set REQ_FAULT to the
> > pfn at 2MiB place.
> 
> Honestly, that doesn't make a lot of sense to me, but if you really
> want that you should add some new flag and have hmm_range_fault do
> this kind of speculative faulting. I think you will end up
> significantly over faulting.

The two-step hmm_range_fault approach above was just my guess at what you were suggesting. Since you didn't like the CPU VMA lookup, we came up with this two-step hmm_range_fault scheme; the first step serves the same purpose as a CPU VMA lookup.

I also think this approach doesn't make sense.

In our original approach, we look up the CPU VMA before migration and call hmm_range_fault in a non-speculative way. There is no over-faulting, because we only call hmm_range_fault within a valid range obtained from the CPU VMA.
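
Roughly, the flow looks like the sketch below. This is illustrative only (made-up names, the usual hmm_range_fault calling convention), not the actual xekmd code:

#include <linux/hmm.h>
#include <linux/mm.h>

/* Illustrative only: fault pages strictly within the CPU VMA. */
static int svm_populate_in_vma(struct mm_struct *mm,
			       struct mmu_interval_notifier *notifier,
			       unsigned long start, unsigned long end,
			       unsigned long *pfns)
{
	struct vm_area_struct *vma;
	struct hmm_range range = {
		.notifier = notifier,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);

	vma = find_vma_intersection(mm, start, end);
	if (!vma) {
		mmap_read_unlock(mm);
		return -EFAULT;
	}

	/* Clamp to the VMA: no speculation, no over-faulting. */
	range.start = max(start, vma->vm_start);
	range.end = min(end, vma->vm_end);

	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);

	/* Caller retries on -EBUSY after re-checking the notifier sequence. */
	return ret;
}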

> 
> It also doesn't make sense to do faulting in hmm prefetch if you are
> going to do migration to force the fault anyhow.

What do you mean by hmm prefetch?

As explained, we call hmm-range-fault in two scenarios:

1) We call hmm_range_fault to get the current status of the CPU page table without causing a CPU fault. We run into this scenario when the address range has already been accessed by the CPU before the GPU, or when we migrate such a range.

2) When the CPU has never accessed the range and the driver has determined there is no need to migrate, we call hmm_range_fault to trigger a CPU fault and allocate system pages for this range. A small sketch of how the two scenarios differ in the hmm_range setup follows below.
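
Assuming the standard default_flags/pfn_flags_mask convention of hmm_range_fault (the helper name below is made up for illustration):

#include <linux/hmm.h>

/*
 * Scenario 1 (snapshot): default_flags == 0, hmm_range_fault() only
 * reports what is already present in the CPU page table and never
 * triggers a CPU fault.
 *
 * Scenario 2 (fault in): HMM_PFN_REQ_FAULT makes hmm_range_fault()
 * fault in missing pages and allocate system memory, just as a CPU
 * access would.
 */
static void svm_setup_hmm_flags(struct hmm_range *range, bool fault_in)
{
	range->default_flags = fault_in ? HMM_PFN_REQ_FAULT : 0;
	range->pfn_flags_mask = 0;	/* default_flags applies to every pfn */
}

The rest of the call sequence (mmu_interval_read_begin(), mmap read lock, retry on -EBUSY) is the same in both scenarios.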

> 
> 
> > > > Basically use hmm_range_fault to figure out the valid address range
> > > > in the first round; then really fault (e.g., trigger cpu fault to
> > > > allocate system pages) in the second call the hmm range fault.
> > >
> > > You don't fault on prefetch. Prefetch is about mirroring already
> > > populated pages, it should not be causing new faults.
> >
> > Maybe there is different wording here. We have this scenario we call
> > it prefetch, or whatever you call it:
> >
> > GPU page fault at an address A, we want to map an address range
> > (e.g., 2MiB, or whatever size depending on setting) around address A
> > to GPU page table. The range around A could have no backing pages
> > when GPU page fault happens. We want to populate the 2MiB range. We
> > can call it prefetch because most of pages in this range is not
> > accessed by GPU yet, but we expect GPU to access it soon.
> 
> This isn't prefetch, that is prefaulting.

Sure, prefaulting is a better name. 

We do have a separate prefetch API that can be called from user space to prefetch memory before GPU job submission.


> 
> > You mentioned "already populated pages". Who populated those pages
> > then? Is it a CPU access populated them? If CPU access those pages
> > first, it is true pages can be already populated.
> 
> Yes, I would think that is a pretty common case
> 
> > But it is also a valid use case where GPU access address before CPU
> > so there is no "already populated pages" on GPU page fault. Please
> > let us know what is the picture in your head. We seem picture it
> > completely differently.
> 
> And sure, this could happen too, but I feel like it is an application
> issue to be not prefaulting the buffers it knows the GPU is going to
> touch.
> 
> Again, our experiments have shown that taking the fault path is so
> slow that sane applications must explicitly prefault and prefetch as
> much as possible to avoid the faults in the first place.

I agree the fault path has a huge overhead; we are in agreement there.


> 
> I'm not sure I fully agree there is a real need to aggressively optimize
> the faulting path like you are describing when it shouldn't really be
> used in a performant application :\

As a driver, we need to support all possible scenarios. Our way of using hmm_range_fault is general enough to handle both situations: when the application is smart enough to prefetch/prefault, hmm_range_fault simply returns the existing pfns; otherwise it falls back to the slow faulting path.

It is not an aggressive optimization. The code is written for the fast path, but it also works for the slow path.


> 
> > 2) decide a migration window per migration granularity (e.g., 2MiB)
> > settings, inside the CPU VMA. If CPU VMA is smaller than the
> > migration granular, migration window is the whole CPU vma range;
> > otherwise, partially of the VMA range is migrated.
> 
> Seems rather arbitrary to me. You are quite likely to capture some
> memory that is CPU memory and cause thrashing. As I said before in
> common cases the heap will be large single VMAs, so this kind of
> scheme is just going to fault a whole bunch of unrelated malloc
> objects over to the GPU.

I would like to hear more here.

Here is my understanding: for small malloc sizes, such as less than one page or only a few pages, memory is allocated from the heap.

When the malloc size is much more than one page, glibc's behavior is to mmap it directly from the OS instead of allocating it from the heap.

In glibc the threshold is defined by MMAP_THRESHOLD; the default value is 128 KiB: https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html

So the heap consists of small VMAs, each containing a few pages, normally one page per VMA. In the worst case, a heap VMA shouldn't be bigger than MMAP_THRESHOLD.
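
For reference, a tiny user-space illustration of that threshold behavior (assuming default glibc tuning; glibc may also adjust the threshold dynamically):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Below the default 128 KiB threshold: served from the heap (brk). */
	void *small = malloc(64 * 1024);

	/* Above the threshold: glibc mmap()s it, giving it its own VMA. */
	void *large = malloc(8 * 1024 * 1024);

	/* The threshold itself is tunable at run time. */
	mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

	printf("small=%p large=%p\n", small, large);
	free(small);
	free(large);
	return 0;
}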

In a reasonable GPU application, people use the GPU for compute, which usually involves large amounts of data, often many MiB and sometimes even many GiB.

Now going back to our scheme: I picture that in most applications, the CPU VMA lookup ends up with a big VMA, MiB or GiB in size.

If we end up with a VMA that is only a few pages, we fault in the whole VMA. It is true that in this case we fault in unrelated malloc objects. Maybe we can fine-tune here to only fault in one page (the minimum fault size) in such a case; admittedly one page can also contain a bunch of unrelated objects, but overall we think this should not be a common case.
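
Something along these lines (purely illustrative names, not the actual patch) is the fine-tuning I have in mind:

#include <linux/mm.h>
#include <linux/sizes.h>

/* Illustrative: pick the fault/migration window around the fault address. */
static void svm_pick_window(struct vm_area_struct *vma,
			    unsigned long fault_addr,
			    unsigned long *start, unsigned long *end)
{
	unsigned long granularity = SZ_2M;

	if (vma->vm_end - vma->vm_start < granularity) {
		/* Small VMA (e.g. heap objects): fault in a single page. */
		*start = ALIGN_DOWN(fault_addr, PAGE_SIZE);
		*end = *start + PAGE_SIZE;
		return;
	}

	/* Large VMA: a granularity-sized window clamped to the VMA. */
	*start = max(vma->vm_start, ALIGN_DOWN(fault_addr, granularity));
	*end = min(vma->vm_end, *start + granularity);
}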

Let me know if this understanding is correct.

Or what would you like to do in such situation?

> 
> Not sure how it is really a good idea.
> 
> Adaptive locality of memory is still an unsolved problem in Linux,
> sadly.
> 
> > > However, the migration stuff should really not be in the driver
> > > either. That should be core DRM logic to manage that. It is so
> > > convoluted and full of policy that all the drivers should be working
> > > in the same way.
> >
> > Completely agreed. Moving migration infrastructures to DRM is part
> > of our plan. We want to first prove of concept with xekmd driver,
> > then move helpers, infrastructures to DRM. Driver should be as easy
> > as implementing a few callback functions for device-specific page
> > table programming and device migration, and calling some DRM common
> > functions during gpu page fault.
> 
> You'd be better to start out this way so people can look at and
> understand the core code on its own merits.

This two-step plan (prove the concept in xekmd first, then move the helpers and infrastructure to DRM) was agreed with the DRM maintainers; see bullet 4) here: https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/


Oak

> 
> Jason



