Hi, Jason.

I've quickly read through the discussion here and have a couple of questions and clarifications to hopefully help move forward on aligning on an approach. For simplicity, let's initially ignore migration and assume this is on integrated hardware, since the disconnect seems to be around the hmm_range_fault() usage.

First, the gpu_vma structure is something that partitions the gpu_vm and holds gpu-related range metadata: what to mirror, desired gpu caching policies, etc. These are managed (created, removed and split) mainly from user-space, and are stored in and looked up from an rb-tree. Each such mirroring gpu_vma holds an mmu_interval notifier.

For invalidation-only purposes, the mmu_interval seqno is not tracked; an invalidation only zaps page-table entries, causing subsequent accesses to fault. Hence, for this purpose, having a single notifier that covers a huge range is desirable and does not become a problem.

Now, when we hit a fault, we want to use hmm_range_fault() to re-populate the faulting PTE, but also to pre-fault a range. Using a range here (let's call it a prefault range for clarity) rather than inserting a single PTE has multiple motivations:

1) Avoiding subsequent adjacent faults.
2a) Using huge GPU page-table entries.
2b) Updating the GPU page-table (tree-based and multi-level) becomes more efficient when done over a range.

Depending on hardware, 2a may be more or less crucial for GPU performance. 2b ties somewhat into 2a but otherwise does not affect gpu performance. This is why we've been using dma_map_sg() for these ranges: we assume the benefits gained from 2) above by far outweigh any benefit from finer-granularity dma-mappings on the rarer occasion of faults. Are there other benefits from single-page dma mappings that you think we need to consider here?
Second, when pre-faulting a range like this, the mmu_interval notifier seqno comes into play until the gpu ptes for the prefault range are safely in place. If an invalidation now happens in a completely separate part of the mirror range, it will bump the seqno and force us to re-run the fault processing unnecessarily. For this purpose we ideally want a seqno bump covering only the prefault range, which is why finer-granularity mmu_interval notifiers might be beneficial (and could then be cached for future re-use of the same prefault range).

This leads me to the next question: you mention that mmu_notifiers are expensive to register. From looking at the code it seems *mmu_interval* notifiers are cheap unless there are ongoing invalidations, in which case a gpu_vma-wide notifier would block anyway? Could you clarify the cost involved here a bit more? And if we don't register these smaller-range interval notifiers, do you think the seqno bumps from unrelated subranges would be a real problem?

Finally, the size of the pre-faulting range is something we need to tune. Currently it is CPU-vma-wide, and I understand you strongly suggest this should be avoided. Could you elaborate a bit on why it is such a bad choice?

Thanks,
Thomas