Re: [RFC] Huge remap_pfn_range for vfio-pci

On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> Hi,

Hi, Axel,

> 
> I'm interested in extending remap_pfn_range to allow it to map the
> range hugely (using PUDs or PMDs). The initial user I have in mind is
> vfio-pci; I'm thinking when we're mapping large ranges for GPUs, we
> can get both a performance and host overhead win by doing this hugely.
> 
> Another thing I have in the back of my mind is adding something KVM
> can re-use to simplify its whole host_pfn_mapping_level /
> hva_to_pfn_remapped / get_user_page_fast_only thing.

IIUC KVM should already be prepared for it, as host_pfn_mapping_level() can
detect any huge mappings using the *_leaf() APIs.
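
To illustrate the leaf-detection idea, here is a minimal userspace sketch.
It is a toy model, not kernel code: every toy_* name is hypothetical, and
the 1G/2M/4K level sizes assume x86-64 with 4K base pages.  The walk stops
at the first entry that is marked as a leaf and reports that level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_level { TOY_LEVEL_NONE = 0, TOY_LEVEL_4K, TOY_LEVEL_2M, TOY_LEVEL_1G };

#define TOY_PRESENT (1u << 0)
#define TOY_LEAF    (1u << 1)   /* entry maps memory directly */

struct toy_entry {
    uint32_t flags;
    const struct toy_entry *table;  /* next-level table when not a leaf */
};

static bool toy_present(const struct toy_entry *e)
{
    return (e->flags & TOY_PRESENT) != 0;
}

static bool toy_leaf(const struct toy_entry *e)
{
    return (e->flags & TOY_LEAF) != 0;
}

/*
 * idx[0], idx[1], idx[2] are the PUD, PMD and PTE indexes for one address.
 * Returns the level of the first leaf found, or TOY_LEVEL_NONE if the
 * address is not mapped.
 */
enum toy_level toy_mapping_level(const struct toy_entry *pud_table,
                                 const size_t idx[3])
{
    static const enum toy_level leaf_level[3] = {
        TOY_LEVEL_1G, TOY_LEVEL_2M, TOY_LEVEL_4K
    };
    const struct toy_entry *table = pud_table;

    assert(pud_table != NULL && idx != NULL);

    for (int depth = 0; depth < 3; depth++) {
        const struct toy_entry *e = &table[idx[depth]];

        if (!toy_present(e))
            return TOY_LEVEL_NONE;
        /* The last level is always a leaf; upper levels only if flagged. */
        if (depth == 2 || toy_leaf(e))
            return leaf_level[depth];
        if (e->table == NULL)
            return TOY_LEVEL_NONE;
        table = e->table;
    }
    return TOY_LEVEL_NONE;
}
```

A caller that only wants to know how large a mapping backs an address never
has to look at a PTE when a PUD or PMD leaf is found first.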

> 
> I know Peter and David are working on some related things (hugetlbfs
> unification and follow_pte et al improvements, respectively). Although
> I have a hacky proof of concept that works, I thought it best to get
> some consensus on the design before I post something, so I don't
> conflict with this existing / upcoming work.

Yes, we're working on that, mostly with Alex.  There's a testing branch, but
it's half-baked so far:

https://github.com/xzpeter/linux/commits/huge-pfnmap/

> 
> Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> The hairy part is the fault / follow side of things:

I'm surprised you already thought about the fault() path, even though Alex
only officially proposed it yesterday.  Maybe you followed the previous
discussions.  It's here:

https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@xxxxxxxxxx

> 
> 1. follow_pte clearly doesn't work for this, since the leaf might be a
> PUD or PMD instead. Most callers don't care about the PTE itself, they
> care about the pgprot or flags it has set, so my idea was to add a new
> interface which just yields those bits, instead of the actual PTE.

See:

https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1
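
As a rough illustration of an interface that yields the interesting bits
rather than the PTE itself, here is a userspace sketch.  It is not taken
from the commit above: the toy_* names, the protection bit and the struct
layout are all assumptions made up for the example.  The point is that
once the leaf's size is known, the PFN and protection for any address
inside it can be computed the same way at every level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not a kernel interface. */
#define TOY_PAGE_SHIFT 12u
#define TOY_PROT_WRITE (1u << 1)

/* A leaf at any level: shift is 12 (PTE), 21 (PMD) or 30 (PUD) here. */
struct toy_leaf {
    uint64_t base_pfn;   /* PFN of the first page the leaf maps */
    uint32_t prot;       /* protection bits stored in the leaf */
    unsigned int shift;  /* log2 of the size the leaf covers */
};

/* What a caller gets back: no PTE/PMD/PUD type is exposed. */
struct toy_follow_result {
    uint64_t pfn;
    uint32_t prot;
    bool writable;
};

void toy_follow(const struct toy_leaf *leaf, uint64_t addr,
                struct toy_follow_result *out)
{
    assert(leaf->shift >= TOY_PAGE_SHIFT && leaf->shift < 64);

    /* Byte offset of addr inside the region covered by this leaf. */
    uint64_t offset = addr & ((UINT64_C(1) << leaf->shift) - 1);

    out->pfn = leaf->base_pfn + (offset >> TOY_PAGE_SHIFT);
    out->prot = leaf->prot;
    out->writable = (leaf->prot & TOY_PROT_WRITE) != 0;
}
```

For example, with a 2M leaf whose base_pfn is 0x100000, an address that
lies 0x3000 bytes into the leaf resolves to pfn 0x100003.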

> 
> Peter, I think hugetlbfs unification may run into similar issues, do
> you have some plan already to deal with PUD/PMD/PTE being different
> types?

Exactly.  There'll be some shared work between the two projects on fork(),
mprotect, etc.  And yes, I plan to cover them all, but I'll start with the
pfnmap thing, paving the way for hugetlb, while we have Oscar (from the SUSE
kernel team) working concurrently on other paths of hugetlb.

> 
> 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> normal fault handler path doesn't call this until after it has walked
> down to the PTE level, installing PUDs/PMDs along the way. I have only
> gross ideas for how to deal with this:
> 
> - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
> called earlier in __handle_mm_fault
> - Add a vm_ops->hugepfn_fault (name not important) which should be
> called earlier in __handle_mm_fault
> - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDs

I actually don't know what exactly you meant here, but Alex already worked
on that with huge_fault().  See:

https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942

So far I don't understand why we need a new vma flag.
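
To make the ordering concrete, here is a userspace sketch of how a
huge_fault()-style hook can be tried before the small-page fault handler.
It is a toy model under stated assumptions: the toy_* names are
hypothetical, the callbacks take a bare address instead of a struct
vm_fault, and the orders 18 and 9 assume x86-64 with 4K pages.  The hook
is offered the largest size first and can ask to fall back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: simplified stand-ins, not the kernel's types. */
enum toy_fault_ret { TOY_FAULT_DONE, TOY_FAULT_FALLBACK };

#define TOY_PMD_ORDER 9u    /* 2M with 4K pages (assumed) */
#define TOY_PUD_ORDER 18u   /* 1G with 4K pages (assumed) */

struct toy_vm_ops {
    enum toy_fault_ret (*fault)(uint64_t addr);
    enum toy_fault_ret (*huge_fault)(uint64_t addr, unsigned int order);
};

/* Example driver hook: can only install PMD-sized mappings. */
enum toy_fault_ret toy_pmd_only_huge_fault(uint64_t addr, unsigned int order)
{
    (void)addr;
    return order == TOY_PMD_ORDER ? TOY_FAULT_DONE : TOY_FAULT_FALLBACK;
}

/* Example small-page handler: always succeeds. */
enum toy_fault_ret toy_small_fault(uint64_t addr)
{
    (void)addr;
    return TOY_FAULT_DONE;
}

/* Returns the order at which the fault was resolved (0 = small page). */
unsigned int toy_handle_fault(const struct toy_vm_ops *ops, uint64_t addr)
{
    static const unsigned int orders[] = { TOY_PUD_ORDER, TOY_PMD_ORDER };

    assert(ops != NULL && ops->fault != NULL);

    if (ops->huge_fault) {
        /* Offer the largest mapping first, then the next size down. */
        for (size_t i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
            if (ops->huge_fault(addr, orders[i]) == TOY_FAULT_DONE)
                return orders[i];
        }
    }

    /* Nobody took the huge path: resolve it as a small page. */
    ops->fault(addr);
    return 0;
}
```

In this model the choice between huge and small handling comes from
whether the hook is set and what it returns, without consulting any
per-VMA flag.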

> 
> I wonder which of these folks find least offensive? Or is there a
> better way I haven't thought of?
> 
> 3. That's also an issue for CoW faults, but I don't know of any real
> use case for CoW huge pfn mappings, so I thought we can just keep the
> existing small mapping behavior for CoW VMAs. Any objections?

I think we should keep the pud/pmd mappings transparent, which means the old
pte behavior needs to be maintained.  E.g., I think we'll need to be able to
split a pud/pmd mapping if mprotect() covers only part of it.
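
A userspace sketch of what such a split could look like, as a toy model
only: the toy_* names and protection bits are hypothetical, and 512
entries per PMD assumes x86-64 with 4K pages.  The huge entry is first
expanded into small entries that map exactly the same PFNs with the same
protection, and only then is the protection changed on the sub-range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical names and bit values. */
#define TOY_PTRS_PER_PMD 512u
#define TOY_PROT_READ    (1u << 0)
#define TOY_PROT_WRITE   (1u << 1)

struct toy_pte {
    uint64_t pfn;
    uint32_t prot;
};

/*
 * Expand one PMD-sized leaf into TOY_PTRS_PER_PMD small entries.  Entry i
 * maps base_pfn + i, so the address-to-PFN mapping is unchanged.
 */
void toy_split_pmd(uint64_t base_pfn, uint32_t prot,
                   struct toy_pte ptes[TOY_PTRS_PER_PMD])
{
    for (size_t i = 0; i < TOY_PTRS_PER_PMD; i++) {
        ptes[i].pfn = base_pfn + i;
        ptes[i].prot = prot;
    }
}

/* Apply a new protection to entries [first, first + count). */
void toy_protect_range(struct toy_pte *ptes, size_t first, size_t count,
                       uint32_t new_prot)
{
    assert(first <= TOY_PTRS_PER_PMD && count <= TOY_PTRS_PER_PMD - first);

    for (size_t i = first; i < first + count; i++)
        ptes[i].prot = new_prot;
}
```

Pages outside the protected sub-range keep their original protection,
which is the behavior a caller of mprotect() on small mappings already
expects.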

Thanks,

-- 
Peter Xu




