On Fri, May 24, 2024 at 4:31 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> > Hi,
>
> Hi, Axel,
>
> >
> > I'm interested in extending remap_pfn_range to allow it to map the
> > range hugely (using PUDs or PMDs). The initial user I have in mind is
> > vfio-pci; I'm thinking when we're mapping large ranges for GPUs, we
> > can get both a performance and host overhead win by doing this hugely.
> >
> > Another thing I have in the back of my mind is adding something KVM
> > can re-use to simplify its whole host_pfn_mapping_level /
> > hva_to_pfn_remapped / get_user_page_fast_only thing.
>
> IIUC kvm should be prepared for it, as host_pfn_mapping_level() can detect
> any huge mappings using the *_leaf() apis.

Right, the KVM code works as-is. Sean had been suggesting, though, that if
follow_pte() (or its replacement) returned the level and had an option to
work locklessly, KVM could just re-use it and delete some code. I think we
could also avoid doing two page table walks (once for follow_pte, and once
to determine the level).

Then again, it is somewhat debatable what exactly such an API would look
like, or whether it would be too KVM-specific to expose generally.

> >
> > I know Peter and David are working on some related things (hugetlbfs
> > unification and follow_pte et al improvements, respectively). Although
> > I have a hacky proof of concept that works, I thought it best to get
> > some consensus on the design before I post something, so I don't
> > conflict with this existing / upcoming work.
>
> Yes we're working on that, mostly with Alex. There's a testing branch but
> half baked so far:
>
> https://github.com/xzpeter/linux/commits/huge-pfnmap/

Ah, I hadn't been aware of this; it looks like you're already well on your
way to implementing exactly what I was thinking of. :) In that case I'll
mostly plan on trying out this branch and offering any feedback / fixes I
find; it would be counterproductive to spend time building my own
implementation.

> >
> > Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> > The hairy part is the fault / follow side of things:
>
> I'm surprised you thought about the fault() path, even if Alex just
> officially proposed it yesterday. Maybe you followed the previous
> discussions. It's here:
>
> https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@xxxxxxxxxx
>
> >
> > 1. follow_pte clearly doesn't work for this, since the leaf might be a
> > PUD or PMD instead. Most callers don't care about the PTE itself, they
> > care about the pgprot or flags it has set, so my idea was to add a new
> > interface which just yields those bits, instead of the actual PTE.
>
> See:
>
> https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1

Ah! Thanks for the pointer. This is relatively close to what I had in mind.
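To make that concrete, here is roughly the shape I had been picturing
before seeing your commit. To be clear, this is only a sketch: none of
these names exist upstream, and it isn't what your branch actually does.
The idea is a single walk that reports the leaf's pgprot plus the level it
was found at, leaving room for a lockless variant for KVM's fast path:

```c
#include <linux/mm.h>

/* Sketch only: hypothetical names, not an existing kernel API. */
enum pfnmap_level {
	PFNMAP_LEVEL_PTE,
	PFNMAP_LEVEL_PMD,
	PFNMAP_LEVEL_PUD,
};

struct pfnmap_info {
	unsigned long pfn;	/* pfn the leaf entry maps for @addr */
	pgprot_t pgprot;	/* protection/cacheability bits of that entry */
	bool writable;
	enum pfnmap_level level;
};

/*
 * Walk vma->vm_mm at @addr and fill @info if a present leaf entry
 * (pte, pmd, or pud) maps it; return 0 on success, -EFAULT otherwise.
 * A locked variant would take and return the page table lock the way
 * follow_pte() does today; a lockless variant could let KVM reuse it
 * in hva_to_pfn_remapped() and drop its second walk for the level.
 */
int follow_pfnmap_info(struct vm_area_struct *vma, unsigned long addr,
		       struct pfnmap_info *info);
```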
> >
> > Peter, I think hugetlbfs unification may run into similar issues, do
> > you have some plan already to deal with PUD/PMD/PTE being different
> > types?
>
> Exactly. There'll be some shared work between the two projects on fork(),
> mprotect, etc. And yes I plan to cover them all but I'll start with the
> pfnmap thing, paving way for hugetlb, while we have Oscar (from SUSE kernel
> team) working concurrently on other paths of hugetlb.
>
> >
> > 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> > normal fault handler path doesn't call this until after it has walked
> > down to the PTE level, installing PUDs/PMDs along the way. I have only
> > gross ideas for how to deal with this:
> >
> > - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
> >   called earlier in __handle_mm_fault
> > - Add a vm_ops->hugepfn_fault (name not important) which should be
> >   called earlier in __handle_mm_fault
> > - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDS
>
> I actually don't know what exactly you meant here, but Alex already worked
> on that with huge_fault(). See:
>
> https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942
>
> So far I don't yet understand why we need a new vma flag.

Ah, I had discounted huge_fault(), thinking it was specific to hugetlbfs or
THPs. I should have spent more time reading that code; I agree it looks
like it avoids all of what I'm talking about here. :)

> >
> > I wonder which of these folks find least offensive? Or is there a
> > better way I haven't thought of?
> >
> > 3. That's also an issue for CoW faults, but I don't know of any real
> > use case for CoW huge pfn mappings, so I thought we can just keep the
> > existing small mapping behavior for CoW VMAs. Any objections?
>
> I think we should keep the pud/pmd transparent, so that the old pte
> behavior needs to be maintained. E.g., I think we'll need to be able to
> split a pud/pmd mapping if mprotect() partially.

I had been thinking of ensuring we never have pud/pmds in CoW mappings at
all, but using huge_fault() might make that worry go away entirely.

I completely agree we should allow vfio mappings to be mixed-size, though:
in case things aren't quite aligned (due to an mprotect split or any other
reason), we can still have a mostly-huge mapping with some ptes on the
end(s).

> Thanks,
>
> --
> Peter Xu
>
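P.S. For the archives, here is my rough mental model of how a pfnmap driver
ends up hooking huge_fault(), after reading Alex's commit above. This is
only a sketch from memory, not his actual patch: the my_vfio_* functions
and the my_base_pfn() helper are made-up stand-ins, I'm assuming the
order-based huge_fault() signature current kernels use, and real code would
need to validate the range and take whatever locks the device state needs.

```c
#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/pfn_t.h>

/* Made-up stand-in: returns the pfn backing vma->vm_start (not defined here). */
static unsigned long my_base_pfn(struct vm_area_struct *vma);

static vm_fault_t my_vfio_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pfn = my_base_pfn(vma) +
			    ((vmf->address - vma->vm_start) >> PAGE_SHIFT);

	/* Small-page path: install a single pte for the faulting address. */
	return vmf_insert_pfn(vma, vmf->address, pfn);
}

static vm_fault_t my_vfio_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long addr = ALIGN_DOWN(vmf->address, PMD_SIZE);
	unsigned long pfn;

	/*
	 * Only install a PMD leaf when the aligned range sits fully inside
	 * the VMA; PUD-sized requests simply fall back in this sketch.
	 */
	if (order != PMD_ORDER || addr < vma->vm_start ||
	    addr + PMD_SIZE > vma->vm_end)
		return VM_FAULT_FALLBACK;

	pfn = my_base_pfn(vma) + ((addr - vma->vm_start) >> PAGE_SHIFT);
	if (!IS_ALIGNED(PFN_PHYS(pfn), PMD_SIZE))
		return VM_FAULT_FALLBACK;

	return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
				  vmf->flags & FAULT_FLAG_WRITE);
}

static const struct vm_operations_struct my_vfio_vm_ops = {
	.fault		= my_vfio_fault,
	.huge_fault	= my_vfio_huge_fault,
};
```

If I understand it right, this is also what makes the mixed-size case above
work out: returning VM_FAULT_FALLBACK just makes the core fault path drop
down and install ptes for that address, so a mostly-huge mapping can still
end up with small mappings on any unaligned ends.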