On Fri, 2024-04-26 at 13:35 -0300, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 04:49:26PM +0200, Thomas Hellström wrote:
> > On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> > > On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > > > First, the gpu_vma structure is something that partitions the
> > > > gpu_vm that holds gpu-related range metadata, like what to
> > > > mirror, desired gpu caching policies etc. These are managed
> > > > (created, removed and split) mainly from user-space. These are
> > > > stored and looked up from an rb-tree.
> > > 
> > > Except we are talking about SVA here, so all of this should not
> > > be exposed to userspace.
> > 
> > I think you are misreading. This is on the level "Mirror this
> > region of the cpu_vm", "prefer this region placed in VRAM", "GPU
> > will do atomic accesses on this region", very similar to cpu
> > mmap / munmap and madvise. What I'm trying to say here is that
> > this does not directly affect the SVA except whether to do SVA or
> > not, and in that case what region of the CPU mm will be mirrored,
> > and in addition, any gpu attributes for the mirrored region.
> 
> SVA is where you bind the whole MM and device faults dynamically
> populate the mirror page table. There aren't non-SVA regions.
> Metadata, like you describe, is metadata for the
> allocation/migration mechanism, not for the page table, and it has
> nothing to do with the SVA mirror operation.
> 
> Yes, there is another common scheme where you bind a window of CPU
> to a window on the device and mirror a fixed range, but this is a
> quite different thing. It is not SVA, it has a fixed range, and it
> is probably bound to a single GPU VMA in a multi-VMA device page
> table.

And this above is exactly what we're implementing, and the GPU
page-tables are populated using device faults. Large regions of the
mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
buffer objects.

> SVA is not just a whole bunch of windows being dynamically created
> by the OS, that is entirely the wrong mental model. It would be
> horrible to expose to userspace something like that as uAPI. Any
> hidden SVA granules and other implementation-specific artifacts
> must not be made visible to userspace!!

Implementation-specific artifacts are not to be made visible to
user-space.

> > > If you use dma_map_sg you get into the world of wrongness where
> > > you have to track ranges and invalidation has to wipe an entire
> > > range - because you cannot do a dma unmap of a single page from
> > > a dma_map_sg mapping. This is all the wrong way to use
> > > hmm_range_fault.
> > > 
> > > hmm_range_fault() is page table mirroring, it fundamentally must
> > > be page-by-page. The target page table structure must have
> > > similar properties to the MM page table - especially page by
> > > page validate/invalidate. Meaning you cannot use dma_map_sg().
> > 
> > To me this is purely an optimization to make the driver page-table
> > and hence the GPU TLB benefit from iommu coalescing / large pages
> > and large driver PTEs.
> 
> This is a different topic. Leon is working on improving the DMA API
> to get these kinds of benefits for HMM users. dma_map_sg is not the
> path to get this. Leon's work should be significantly better in
> terms of optimizing IOVA contiguity for a GPU use case. You can get
> guaranteed DMA contiguity at your chosen granularity, even up to
> something like 512M.
> 
> > It is true that invalidation will sometimes shoot down large gpu
> > ptes unnecessarily but it will not put any additional burden on
> > the core AFAICT.
> 
> In my experience people doing performance workloads don't enable
> the IOMMU due to the high performance cost, so while optimizing
> iommu coalescing is sort of interesting, it is not as important as
> using the APIs properly and not harming the much more common
> situation when there is no iommu and there is no artificial
> contiguity.
> 
> > on invalidation since zapping the gpu PTEs effectively stops any
> > dma accesses. The dma mappings are rebuilt on the next gpu
> > pagefault, which, as you mention, is considered slow anyway, but
> > will probably still reuse the same prefault region, hence needing
> > to rebuild the dma mappings anyway.
> 
> This is bad too. The DMA should not remain mapped after pages have
> been freed, it completely destroys the concept of IOMMU-enforced
> DMA security and the ACPI notion of untrusted external devices.

Hmm. Yes, good point.
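
So to make sure I draw the right conclusion from this: the
invalidation path should both zap the GPU PTEs and drop the per-page
DMA mappings before it returns, so that nothing stays mapped once the
core mm frees the pages. Roughly along the lines of the sketch below,
where struct my_mirror and the my_gpu_*() helpers are made-up
placeholder names rather than actual driver code, and only order-0
mappings are handled to keep it short:

#include <linux/dma-mapping.h>
#include <linux/mmu_notifier.h>

/* Made-up bookkeeping for one mirrored CPU VA range. */
struct my_mirror {
	struct mmu_interval_notifier notifier;
	struct mutex lock;		/* protects dma_addr[] and GPU PTEs */
	struct device *dev;
	unsigned long start, end;	/* mirrored CPU VA range */
	dma_addr_t *dma_addr;		/* one entry per page, 0 == unmapped */
};

/* Placeholders for the driver-side PTE zap and TLB invalidation. */
static void my_gpu_zap_ptes(struct my_mirror *m, unsigned long start,
			    unsigned long end);
static void my_gpu_tlb_invalidate(struct my_mirror *m, unsigned long start,
				  unsigned long end);

static bool my_mirror_invalidate(struct mmu_interval_notifier *mni,
				 const struct mmu_notifier_range *range,
				 unsigned long cur_seq)
{
	struct my_mirror *m = container_of(mni, struct my_mirror, notifier);
	unsigned long start = max(range->start, m->start);
	unsigned long end = min(range->end, m->end);
	unsigned long addr;

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&m->lock);
	else if (!mutex_trylock(&m->lock))
		return false;

	mmu_interval_set_seq(mni, cur_seq);

	/* Zap the GPU PTEs first so the device can no longer access
	 * the range ... */
	my_gpu_zap_ptes(m, start, end);
	my_gpu_tlb_invalidate(m, start, end);

	/* ... then drop the DMA mappings page by page, so nothing
	 * remains mapped once the pages are freed. A real version
	 * would record the order each entry was mapped with and
	 * unmap with that same size. */
	for (addr = start; addr < end; addr += PAGE_SIZE) {
		unsigned long i = (addr - m->start) >> PAGE_SHIFT;

		if (m->dma_addr[i]) {
			dma_unmap_page(m->dev, m->dma_addr[i], PAGE_SIZE,
				       DMA_BIDIRECTIONAL);
			m->dma_addr[i] = 0;
		}
	}

	mutex_unlock(&m->lock);
	return true;
}

The point being that the unmap happens inside the notifier callback,
before the core mm frees the pages, rather than being deferred to the
next GPU pagefault as I described above.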

> > So as long as we are correct and do not adversely affect core mm,
> > if the gpu performance (for whatever reason) is severely hampered
> > if large gpu page-table entries are not used, couldn't this be
> > considered something left to the driver?
> 
> Please use the APIs properly. We are trying to improve the DMA API
> to better support HMM users, and doing unnecessary things like this
> in drivers is only harmful to that kind of consolidation.
> 
> There is nothing stopping you from getting large GPU page table
> entries for large CPU page table entries.
> 
> > And a related question: What about THP pages? Is it OK to set up
> > a single dma-mapping to those?
> 
> Yes, THP is still a page and dma_map_page() will map it.

OK, great. This is probably sufficient for the performance concern
for now.

Thanks,
Thomas

> > > > That's why finer-granularity mmu_interval notifiers might be
> > > > beneficial (and then cached for future re-use of the same
> > > > prefault range). This leads me to the next question:
> > > 
> > > It is not the design; please don't invent crazy special Intel
> > > things on top of hmm_range_fault.
> > 
> > For the record, this is not a "crazy special Intel" invention.
> > It's the way all GPU implementations do this so far.
> 
> By "all GPU implementations" you mean AMD, and AMD predates a lot
> of the modern versions of this infrastructure, IIRC.
> 
> > > Why would a prefetch have anything to do with a VMA? I.e. your
> > > app calls malloc() and gets a little allocation out of a giant
> > > mmap() arena - you want to prefault the entire arena? Does that
> > > really make any sense?
> > 
> > Personally, no, it doesn't. I'd rather use some sort of fixed-size
> > chunk. But to rephrase, the question was more about the strong
> > "drivers should not be aware of the cpu mm vma structures"
> > comment.
> 
> But this is essentially why - there is nothing useful the driver
> can possibly learn from the CPU VMA to drive hmm_range_fault().
> hmm_range_fault() already has to walk the VMAs; if someday
> something is actually needed, it needs to be integrated in a
> general way, not by having the driver touch vmas directly.
> 
> Jason
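
PS: And to spell out the mapping-side conclusion for myself: the
fault path does page-by-page dma_map_page() on the hmm_range_fault()
output, with hmm_pfn_to_map_order() letting a THP still end up as a
single large mapping and hence a large driver PTE. A rough sketch
only, reusing the made-up struct my_mirror from the earlier sketch
plus an equally made-up dma_order[] array (so the invalidation side
can unmap with the same size), and assuming the caller wraps this in
the usual mmu_interval_read_begin()/mmu_interval_read_retry()
sequence with HMM_PFN_REQ_FAULT set in range->default_flags:

#include <linux/hmm.h>

static int my_mirror_map_pages(struct my_mirror *m,
			       struct hmm_range *range)
{
	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
	unsigned long i;

	for (i = 0; i < npages;) {
		unsigned long hmm_pfn = range->hmm_pfns[i];
		unsigned long order = hmm_pfn_to_map_order(hmm_pfn);
		unsigned long addr = range->start + (i << PAGE_SHIFT);
		dma_addr_t dma;

		if (!(hmm_pfn & HMM_PFN_VALID))
			return -EFAULT;

		/* Only use the reported CPU map order when this entry
		 * sits at an order-aligned VA and the rest of the
		 * range covers it; otherwise fall back to mapping a
		 * single page. */
		if (!IS_ALIGNED(addr, PAGE_SIZE << order) ||
		    i + (1UL << order) > npages)
			order = 0;

		dma = dma_map_page(m->dev, hmm_pfn_to_page(hmm_pfn), 0,
				   PAGE_SIZE << order, DMA_BIDIRECTIONAL);
		if (dma_mapping_error(m->dev, dma))
			return -EIO;

		m->dma_addr[i] = dma;
		m->dma_order[i] = order;	/* made-up bookkeeping */
		i += 1UL << order;
	}

	return 0;
}

No dma_map_sg() anywhere, and nothing in the loop needs to look at
the CPU VMAs.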