On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> +Marc and Oliver
>
> On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > > Overview
> > > > > ========
> > > > >
> > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > > > plus dax 1g fix [1]. Note that this series should also apply without the
> > > > > dax 1g fix series, but in that case mprotect() will trigger similar errors
> > > > > on PUD mappings.
> > > > >
> > > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > > > as large as 8GB or even bigger.
> > > >
> > > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > > context with VMs.
> > > >
> > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > > set up the IOMMU, but KVM is not able to do it so reliably.
> > >
> > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never
> > > create a huge page unless the mapping is huge in the primary MMU. And that's
> > > very much by design, as KVM has no knowledge of what actually resides at a
> > > given PFN, and thus can't determine whether or not it's safe to create a huge
> > > page if KVM happens to realize the VM has access to a contiguous range of
> > > memory.
> >
> > Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> > ranges?
>
> Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's
> only a win in very select use cases, and is overall a non-trivial loss.

Ah, that ARM behavior was probably what was being mentioned then! So take
my original remark as applying to this :)

> > I don't quite understand your safety argument: if the VMA has 1G of
> > contiguous physical memory described with 4K it is definitely safe for
> > KVM to reassemble that same memory and represent it as 1G.
>
> That would require taking mmap_lock to get the VMA, which would be a net
> negative, especially for workloads that are latency sensitive.

You can aggregate if the read and aggregating logic are protected by mmu
notifiers, I think. An invalidation would still have enough information
to clear the aggregate shadow entry. If you get a sequence number
collision then you'd throw away the aggregation.

But yes, I also think it would be slow to have aggregation logic in
KVM. Doing it in the main MMU is much better.

Jason
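
P.S. In case it helps, below is a rough, purely illustrative C sketch of the
notifier-protected aggregation I'm describing above. mmu_invalidate_seq and
mmu_invalidate_retry() are the existing KVM sequence-count mechanism; the
helpers host_pfn_for() and install_mapping() are made-up stand-ins for
whatever the real fault path would use, so treat this as a sketch of the
pattern rather than proposed code.

#include <linux/kvm_host.h>

struct agg_ctx {
	unsigned long mmu_seq;	/* snapshot of the invalidation sequence   */
	u64 base_pfn;		/* first host PFN of the candidate 1G range */
	bool contiguous;	/* did the 4K PFNs form one aligned block?  */
};

/*
 * Step 1: outside mmu_lock, snapshot the invalidation sequence and walk
 * the 4K host PFNs to see whether they happen to be physically contiguous
 * and aligned.  host_pfn_for() is a hypothetical stand-in for the lookup
 * the fault path already does against the primary MMU.
 */
static void agg_probe(struct kvm *kvm, gfn_t gfn, struct agg_ctx *ctx)
{
	unsigned long i, npages = PUD_SIZE >> PAGE_SHIFT;
	gfn_t base_gfn = gfn & ~((gfn_t)npages - 1);

	ctx->mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();	/* pairs with the notifier bumping the sequence */

	ctx->base_pfn = host_pfn_for(kvm, base_gfn);	/* hypothetical */
	ctx->contiguous = true;
	for (i = 1; i < npages; i++) {
		if (host_pfn_for(kvm, base_gfn + i) != ctx->base_pfn + i) {
			ctx->contiguous = false;
			break;
		}
	}
}

/*
 * Step 2: with kvm->mmu_lock held, only install the aggregated (PUD-sized)
 * shadow entry if no invalidation raced with the probe; on a sequence
 * collision, throw the aggregation away and fall back to a 4K mapping.
 * A later notifier invalidation touching any part of the range just zaps
 * the whole huge entry, so correctness doesn't depend on the aggregation
 * staying valid forever.
 */
static int agg_map(struct kvm *kvm, gfn_t gfn, struct agg_ctx *ctx)
{
	bool huge = ctx->contiguous &&
		    !mmu_invalidate_retry(kvm, ctx->mmu_seq);

	/* install_mapping() is hypothetical; level 3 == PUD, 1 == PTE */
	return install_mapping(kvm, gfn, ctx->base_pfn, huge ? 3 : 1);
}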