On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > > > I saw, I am saying this should not be done. You cannot unmap
> > > > bits of an sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in the xe driver, the mmu interval notifier is dynamically
> > > registered with a small address range. We map/unmap whole small
> > > address ranges each time, so functionally it still works, but it
> > > might not be as performant as the method you described.
> >
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> >
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definitely should not try to invent such a thing.
>
> I need your help to understand this comment. Our gpu mirrors the
> whole CPU virtual address space. It is the first design pattern in
> your previous reply (entire exclusive owner of a single device
> private page table and fully mirrors the mm page table into the
> device table).
>
> What do you mean by "page by page"/"no API surface available to
> safely try to construct covering ranges"? As I understand it,
> hmm_range_fault takes a virtual address range (defined in the
> hmm_range struct) and walks the cpu page table in this range. It is
> a range-based API.

Yes, but invalidation is not linked to range_fault, so you can get
invalidations of single pages. You are binding range_fault to
dma_map_sg, but then you can't handle invalidation at all sanely.

> From your previous reply ("So I find it a quite strange that this
> RFC is creating VMA's, notifiers and ranges on the fly"), it seems
> you are concerned about why we are creating vmas and registering mmu
> interval notifiers on the fly. Let me try to explain it. Xe_vma is a
> very fundamental concept in the xe driver.

I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA
based, and you do need to ensure the page table manipulation has an
API that is usable. "populate an entire VMA" / "invalidate an entire
VMA" is not a sufficient primitive.

> The gpu page table update and invalidation are all vma-based. This
> concept existed before this svm work. For svm, we create a 2M vma
> (the size is user configurable) during the gpu page fault handler
> and register this 2M range with a mmu interval notifier.

You can create VMAs, but they can't be fully populated and they
should be a lot bigger than 2M. ODP uses a similar 2-level scheme for
its SVA; the "vma" granule is 512M.

The key thing is the VMA is just a container for the notifier and
other data structures; it doesn't insist the range be fully populated
and it must support page-by-page unmap/remap/invalidation.

> Now I am trying to figure out: if we don't create a vma, what can we
> do? We can map one page (which contains the gpu fault address) into
> the gpu page table. But that doesn't work for us because the GPU
> cache and TLB would not be performant with a 4K page each time. One
> way to think of the vma is as a chunk size which is good for GPU HW
> performance.

From this perspective, invalidation is driven by the range the
invalidation callback gets; it could be a single page, but is
probably bigger.
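To be concrete, the invalidation side I have in mind is roughly the
standard mmu_interval_notifier pattern below (see
Documentation/mm/hmm.rst). This is an untested sketch; struct xe_svm,
xe_svm_clear_ptes() and xe_svm_tlb_flush_range() are made-up
stand-ins for the driver's device radix tree and TLB flush plumbing,
not actual xe code:

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/rwsem.h>

/* Stand-in driver state, not the real xe structures */
struct xe_svm {
	struct mmu_interval_notifier notifier;
	struct rw_semaphore lock;	/* serializes PT updates vs invalidation */
	/* device radix tree, TLB flush plumbing, etc. */
};

/* Placeholders for the driver's device page table manipulation */
void xe_svm_clear_ptes(struct xe_svm *svm, unsigned long start,
		       unsigned long end);
void xe_svm_tlb_flush_range(struct xe_svm *svm, unsigned long start,
			    unsigned long end);

static bool xe_svm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct xe_svm *svm = container_of(mni, struct xe_svm, notifier);
	/* Clamp to the overlap of the invalidation and the notifier */
	unsigned long start = max(range->start, mni->interval_tree.start);
	unsigned long end = min(range->end, mni->interval_tree.last + 1);

	if (!mmu_notifier_range_blockable(range))
		return false;

	down_write(&svm->lock);
	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * Zap only the invalidated pages out of the device radix tree,
	 * leave everything else mapped, and do one GPU TLB flush at the
	 * end covering the whole zapped span.
	 */
	xe_svm_clear_ptes(svm, start, end);
	xe_svm_tlb_flush_range(svm, start, end);

	up_write(&svm->lock);
	return true;
}

static const struct mmu_interval_notifier_ops xe_svm_notifier_ops = {
	.invalidate = xe_svm_invalidate,
};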
Mapping is driven by the range passed to hmm_range_fault() during
fault handling, which is entirely based on your driver's prefetch
logic.

GPU TLB invalidation sequences should happen once, at the end, for
any invalidation or range_fault sequence regardless. Nothing about
"gpu vmas" should have anything to do with this.

> And the mmu notifier... if we don't register the mmu notifier on
> the fly, do we register one mmu notifier to cover the whole CPU
> virtual address space (which would be huge, e.g., 0~2^56 on a
> 57-bit machine, if we have a half/half user space / kernel space
> split)? That won't be performant either, because for any address
> range that is unmapped from the cpu program, even if it is never
> touched by the GPU, the gpu driver still gets an invalidation
> callback. In our approach, we only register a mmu notifier for
> address ranges that we know the gpu will touch.

When SVA is used, something, somewhere, has to decide if a CPU VA
intersects with a HW VA.

The mmu notifiers are organized in an interval (red/black) tree, so
if you have a huge number of them the RB search becomes very
expensive.

Conversely, your GPU page table is organized in a radix tree, so
detecting non-presence during invalidation is a simple radix walk.
Indeed, for the obviously unused ranges it is probably a pointer load
and a single dereference to check.

I fully expect the radix walk is much, much faster than a huge number
of 2M notifiers in the red/black tree.

Notifiers for SVA cases should be giant. If not the entire memory
space, then at least something like a 512M/1G kind of size, neatly
aligned with something in your page table structure so the radix
walk can start lower in the tree automatically.

> > > For example, when a gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_structs as they want.
> >
> > No you don't.
>
> See the explanation above. I need your help to understand how we can
> do it without a vma (or chunk). One way the GPU driver is different
> from the RDMA driver is that RDMA doesn't have device private
> memory, so no migration.

I want to be very clear: there should be no interaction of your
hmm_range_fault based code with MM centric VMAs. You MUST NOT look up
the CPU vm_area_struct in your driver.

> It only needs to dma-map the system memory pages and use them to
> fill the RDMA page table, so RDMA doesn't need another memory
> manager such as our buddy allocator. RDMA only deals with system
> memory, which is completely struct page based. Page by page makes
> 100% sense for RDMA.

I don't think this is the issue at hand; you just have some
historical, poorly designed page table manipulation code, from what I
can understand.

Jason
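For completeness, here is an untested sketch of the fault side of the
same pattern, again following Documentation/mm/hmm.rst: no
vm_area_struct lookup, each page handled individually, one retry loop
against the notifier sequence, and a single TLB operation at the end.
It reuses the made-up struct xe_svm from the earlier sketch;
xe_svm_bind_page() and xe_svm_tlb_flush_range() are placeholders, not
real xe functions:

/* Same stand-in struct xe_svm and includes as the previous sketch,
 * plus <linux/slab.h> for kvmalloc_array()/kvfree().
 */

/* Placeholder: dma-map the page and write one device PTE */
void xe_svm_bind_page(struct xe_svm *svm, unsigned long addr,
		      struct page *page);

static int xe_svm_fault_range(struct xe_svm *svm, struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	struct hmm_range range = {
		.notifier = &svm->notifier,
		.start = start,
		.end = end,
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
		.dev_private_owner = svm,
	};
	unsigned long i;
	int ret;

	range.hmm_pfns = kvmalloc_array(npages, sizeof(*range.hmm_pfns),
					GFP_KERNEL);
	if (!range.hmm_pfns)
		return -ENOMEM;

again:
	range.notifier_seq = mmu_interval_read_begin(&svm->notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		goto out;
	}

	down_write(&svm->lock);
	/* Raced with an invalidation since the walk? Start over. */
	if (mmu_interval_read_retry(&svm->notifier, range.notifier_seq)) {
		up_write(&svm->lock);
		goto again;
	}

	/* dma-map and bind each page individually into the device radix tree */
	for (i = 0; i < npages; i++)
		xe_svm_bind_page(svm, start + (i << PAGE_SHIFT),
				 hmm_pfn_to_page(range.hmm_pfns[i]));

	/* one GPU TLB flush at the end, regardless of how many pages */
	xe_svm_tlb_flush_range(svm, start, end);
	up_write(&svm->lock);
out:
	kvfree(range.hmm_pfns);
	return ret;
}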