On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:

> First, the gpu_vma structure is something that partitions the gpu_vm
> that holds gpu-related range metadata, like what to mirror, desired gpu
> caching policies etc. These are managed (created, removed and split)
> mainly from user-space. These are stored and looked up from an rb-tree.

Except we are talking about SVA here, so all of this should not be
exposed to userspace.

> Now, when we hit a fault, we want to use hmm_range_fault() to re-
> populate the faulting PTE, but also to pre-fault a range. Using a range
> here (let's call this a prefault range for clarity) rather than to
> insert a single PTE is for multiple reasons:

I've never said to do a single range; everyone using hmm_range_fault()
has some kind of prefetch populate-around algorithm.

> This is why we've been using dma_map_sg() for these ranges, since it is
> assumed the benefits gained from

This doesn't logically follow. You need to use dma_map_page() page by
page and batch that into your update mechanism.

If you use dma_map_sg() you get into the world of wrongness where you
have to track ranges, and invalidation has to wipe an entire range -
because you cannot do a DMA unmap of a single page from a dma_map_sg()
mapping. This is all the wrong way to use hmm_range_fault().

hmm_range_fault() is page table mirroring; it fundamentally must be
page-by-page. The target page table structure must have similar
properties to the MM page table - especially page-by-page
validate/invalidate. Meaning you cannot use dma_map_sg().

> Second, when pre-faulting a range like this, the mmu interval notifier
> seqno comes into play, until the gpu ptes for the prefault range are
> safely in place. Now if an invalidation happens in a completely
> separate part of the mirror range, it will bump the seqno and force us
> to rerun the fault processing unnecessarily.

This is how hmm_range_fault() works. Drivers should not do hacky
things to try to "improve" this.
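For reference, "how hmm_range_fault() works" is the seqno retry loop
described in Documentation/mm/hmm.rst: any invalidation inside the
notifier's range bumps the seqno and simply sends the fault path around
the loop again. A kernel-style sketch of that pattern (not compilable
standalone; the gpusvm structure and its pt_lock are hypothetical
driver-side names, and error handling is trimmed):

```c
/* Canonical hmm_range_fault() retry pattern, per
 * Documentation/mm/hmm.rst. Sketch only; gpusvm names are
 * hypothetical driver structures. */
struct hmm_range range = {
        .notifier = &gpusvm->notifier,  /* one wide interval notifier */
        .start = fault_start,
        .end = fault_end,
        .hmm_pfns = pfns,
        .default_flags = HMM_PFN_REQ_FAULT,
};

again:
        range.notifier_seq = mmu_interval_read_begin(range.notifier);
        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(mm);
        if (ret == -EBUSY)
                goto again;

        /* Take the same driver lock the invalidation callback takes,
         * then check for a concurrent seqno bump before committing. */
        spin_lock(&gpusvm->pt_lock);
        if (mmu_interval_read_retry(range.notifier,
                                    range.notifier_seq)) {
                spin_unlock(&gpusvm->pt_lock);
                goto again;
        }
        /* ... program GPU PTEs from pfns ... */
        spin_unlock(&gpusvm->pt_lock);
```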
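And the page-by-page dma_map_page() batching suggested above could look
roughly like this. Again only an illustrative sketch, not compilable
outside a kernel tree; gpu_install_pte() is a hypothetical stand-in for
the driver's own batched PTE update mechanism:

```c
/* Sketch only: map each PFN returned by hmm_range_fault()
 * individually, so invalidation can later dma_unmap_page() exactly
 * one entry. gpu_install_pte() is a hypothetical placeholder. */
for (i = 0; i < npages; i++) {
        struct page *page = hmm_pfn_to_page(range->hmm_pfns[i]);
        dma_addr_t dma;

        dma = dma_map_page(dev, page, 0, PAGE_SIZE,
                           DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, dma))
                goto err_unwind;

        /* One GPU PTE per CPU page -- no range-wide sg tracking. */
        gpu_install_pte(vm, range->start + i * PAGE_SIZE, dma);
}
```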
SVA granules should be large, maybe not the entire MM, but still quite
large. 2M is far too small.

There is a tradeoff here of slowing down the entire MM vs risking an
iteration during fault processing. We want to err toward making fault
processing slower, because fault processing is already really slow.

> Hence, for this purpose we
> ideally just want to get a seqno bump covering the prefault range.

Ideally, but this is not something we can get for free.

> That's why finer-granularity mmu_interval notifiers might be beneficial
> (and then cached for future re-use of the same prefault range). This
> leads me to the next question:

It is not the design; please don't invent crazy special Intel things
on top of hmm_range_fault().

> You mention that mmu_notifiers are expensive to register. From looking
> at the code it seems *mmu_interval* notifiers are cheap unless there
> are ongoing invalidations in which case using a gpu_vma-wide notifier
> would block anyway? Could you clarify a bit more the cost involved
> here?

The rb tree insertions become expensive the larger the tree is. If you
have only a couple of notifiers it is reasonable.

> If we don't register these smaller-range interval notifiers, do
> you think the seqno bumps from unrelated subranges would be a real
> problem?

I don't think it is; you'd need to have a workload which was
aggressively manipulating the CPU mm (which is also pretty slow). If
the workload is doing that then it also really won't like being slowed
down by the giant rb tree.

You can't win with an argument that collisions are likely due to an
app pattern that puts a lot of stress on the MM, so the right response
is to make the MM slower.

> Finally the size of the pre-faulting range is something we need to
> tune.

Correct.

> Currently it is cpu vma - wide. I understand you strongly suggest
> this should be avoided. Could you elaborate a bit on why this is such a
> bad choice?

Why would a prefetch have anything to do with a VMA?
I.e. your app calls malloc() and gets a little allocation out of a
giant mmap() arena - you want to prefault the entire arena? Does that
really make any sense?

Mirroring is a huge PITA; IMHO it should be discouraged in favour of
SVA. Sadly too many CPUs still cannot implement SVA.

With mirroring there is no good way for the system to predict what the
access pattern is. The only way to make this actually performant is
for userspace to explicitly manage the mirroring with some kind of
prefetching scheme to avoid faulting its accesses except in
extraordinary cases.

VMA is emphatically not a hint about what to prefetch. You should
balance your prefetching based on HW performance and related costs. If
it is tidy for HW to fault around a 2M granule then just do that.

Jason