On Fri, May 03, 2024 at 08:29:39PM +0000, Zeng, Oak wrote:

> > > But we have use case where we want to fault-in pages other than the
> > > page which contains the GPU fault address, e.g., user malloc'ed or
> > > mmap'ed 8MiB buffer, and no CPU touching of this buffer before GPU
> > > access it. Let's say GPU access caused a GPU page fault a 2MiB
> > > place. The first hmm-range-fault would only fault-in the page at
> > > 2MiB place, because in the first call we only set REQ_FAULT to the
> > > pfn at 2MiB place.
> >
> > Honestly, that doesn't make alot of sense to me, but if you really
> > want that you should add some new flag and have hmm_range_fault do
> > this kind of speculative faulting. I think you will end up
> > significantly over faulting.
>
> Above 2 steps hmm-range-fault approach is just my guess of what you
> were suggesting. Since you don't like the CPU vma look up, so we
> come out this 2 steps hmm-range-fault thing. The first step has the
> same functionality of a CPU vma lookup.

If you want to retain the GPU fault flag as a signal for changing
locality then you have to correct the locality and resolve all faults
before calling hmm_range_fault(). hmm_range_fault() will never do
faulting. It will always just read in the already resolved pages.

> > It also doesn't make sense to do faulting in hmm prefetch if you are
> > going to do migration to force the fault anyhow.
>
> What do you mean by hmm prefetch?

I mean the pages that are not part of the critical fault resolution.
The pages you are preloading into the GPU page table without an
immediate need.

> As explained, we call hmm-range-fault in two scenarios:
>
> 1) call hmm-range-fault to get the current status of cpu page table
> without causing CPU fault. When address range is already accessed by
> CPU before GPU, or when we migrate for such range, we run into this
> scenario

This is because you are trying to keep locality management outside of
the core code - it is creating this problem.
As I said below locality management should be core code, not in
drivers. It may be hmm core code, not drm, but regardless.

> We do have another prefetch API which can be called from user space
> to prefetch before GPU job submission.

This API seems like it would break the use of faulting as a mechanism
to manage locality...

> > I'm not sure I full agree there is a real need to agressively optimize
> > the faulting path like you are describing when it shouldn't really be
> > used in a performant application :\
>
> As a driver, we need to support all possible scenarios.

Functional support is different from micro-optimizing it.

> > > 2) decide a migration window per migration granularity (e.g., 2MiB)
> > > settings, inside the CPU VMA. If CPU VMA is smaller than the
> > > migration granular, migration window is the whole CPU vma range;
> > > otherwise, partially of the VMA range is migrated.
> >
> > Seems rather arbitary to me. You are quite likely to capture some
> > memory that is CPU memory and cause thrashing. As I said before in
> > common cases the heap will be large single VMAs, so this kind of
> > scheme is just going to fault a whole bunch of unrelated malloc
> > objects over to the GPU.
>
> I want to listen more here.
>
> Here is my understanding. Malloc of small size such as less than one
> page, or a few pages, memory is allocated from heap.
>
> When malloc is much more than one pages, the GlibC's behavior is
> mmap it directly from OS, vs from heap

Yes "much more", there is some cross over where very large allocations
may get their own arena.

> In glibC the threshold is defined by MMAP_THRESHOLD. The default
> value is 128K:
> https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html

Sure

> So on the heap, it is some small VMAs each contains a few pages,
> normally one page per VMA. In the worst case, VMA on pages shouldn't
> be bigger than MMAP_THRESHOLD.

Huh? That isn't quite how it works.
The glibc arenas for < 128K allocations can be quite big, they often
will come from the brk heap which is a single large VMA. The above
only says that allocations over 128K will get their own VMAs. It
doesn't say small allocations get small VMAs.

Of course there are many allocator libraries with different schemes
and tunables.

> In a reasonable GPU application, people use GPU for compute which
> usually involves large amount of data which can be many MiB,
> sometimes it can even be many GiB of data

Then the application can also prefault the whole thing.

> Now going back our scheme. I picture in most application, the CPU
> vma search end up big vma, MiB, GiB etc

I'm not sure. Some may, but not all, and not all memory touched by the
GPU will necessarily come from the giant allocation even in the apps
that do work that way.

> If we end up with a vma that is only a few pages, we fault in the
> whole vma. It is true that for this case we fault in unrelated
> malloc objects. Maybe we can fine tune here to only fault in one
> page (which is minimum fault size) for such case. Admittedly one
> page can also have bunch of unrelated objects. But overall we think
> this should not be common case.

This is the obvious solution, without some kind of special knowledge
the kernel possibly shouldn't attempt to optimize by speculating how
to resolve the fault - or minimally the speculation needs to be a
tunable (ugh)

Broadly, I think using fault indication to indicate locality of pages
that haven't been faulted is pretty bad. Locality indications need to
come from some way that reliably indicates if the device is touching
the pages at all. Arguably this can never be performant, so I'd argue
you should focus on making things simply work (i.e. single fault, no
prefault, basic prefetch) and do not expect to achieve a high quality
dynamic locality.

Application must specify, application must prefault & prefetch.

Jason
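
For reference, the migration window scheme quoted above can be sketched
in a few lines of C. This is only an illustration of the arithmetic as I
read it, not code from any driver: struct addr_range and
migration_window() are hypothetical names, the granule is assumed to be
a power of two, and the fault address is assumed to lie inside the VMA.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address range [start, end). Hypothetical type for this sketch. */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper, not a real kernel or driver API.
 *
 * Picks the range to migrate for a GPU fault at fault_addr:
 *  - if the CPU VMA is no larger than one granule, take the whole VMA;
 *  - otherwise take the granule-aligned block containing the fault,
 *    clamped so that it stays inside the VMA.
 */
struct addr_range migration_window(uint64_t fault_addr,
				   struct addr_range vma,
				   uint64_t granule)
{
	struct addr_range win;
	uint64_t aligned;

	/* Assumptions of this sketch: power-of-two granule, fault inside VMA. */
	assert(granule != 0 && (granule & (granule - 1)) == 0);
	assert(fault_addr >= vma.start && fault_addr < vma.end);

	if (vma.end - vma.start <= granule)
		return vma;

	/* Round the fault address down to a granule boundary. */
	aligned = fault_addr & ~(granule - 1);

	/* Clamp the end first; written this way to avoid overflow. */
	win.end = (vma.end - aligned > granule) ? aligned + granule : vma.end;
	win.start = (aligned < vma.start) ? vma.start : aligned;
	return win;
}
```

With a 64MiB brk heap as a single VMA and a 2MiB granule, a fault on
one small object 5MiB into the heap selects the whole [4MiB, 6MiB) block
of the heap, which is the "whole bunch of unrelated malloc objects" case.
A four page VMA is taken whole.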