On Fri, Mar 15, 2024 at 10:59 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > +Alex, who is looking at the huge-VM_PFNMAP angle in particular.
>
> Oof, *Axel*. Sorry Axel.

No worries. Believe it or not this happens frequently. :) I'm well past
being bothered about it.

> > On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > > -Christ{oph,ian} to avoid creating more noise...
> > >
> > > On Thu, Mar 14, 2024, David Stevens wrote:
> > > > Because of that, the specific type of pfns that don't work right now are
> > > > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > > > non-refcounted pages in a bad choice of words. If that's correct, then
> > > > perhaps this series should go a little bit further in modifying
> > > > hva_to_pfn_remapped, but it isn't fundamentally wrong.
> > >
> > > Loosely related to all of this, I have a mildly ambitious idea. Well, one
> > > mildly ambitious idea, and one crazy ambitious idea. Crazy ambitious idea
> > > first...
> > >
> > > Something we (GCE side of Google) have been eyeballing is adding support for
> > > huge VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU)
> > > memory into guests using hugepages. One of the hiccups is that follow_pte()
> > > doesn't play nice with hugepages, at all, e.g. even has a
> > > "VM_BUG_ON(pmd_trans_huge(*pmd))". Teaching follow_pte() to play nice with
> > > hugepages is probably doable, but making sure all existing users are aware,
> > > maybe not so much.

Right. The really basic idea I had was to modify remap_pfn_range so that,
where possible (i.e. when it sees a (sub)range which is aligned and big
enough), it just installs a leaf pud or pmd instead of always going down to
the pte level. follow_pte is problematic though, because it returns a pte
specifically, so it's unclear to me how to have it "just work" for existing
callers with an area mapped this way. So I think the idea would be to change
follow_pte to detect this case and bail out (-EINVAL, I guess), and then add
some new function which can properly support these mappings.
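Roughly, I'm imagining the remap_pfn_range side looking something like the
below (just a sketch; remap_try_huge_pmd() is a made-up name, and this
ignores the PUD case and all error handling):

/*
 * Hypothetical helper the remap_pfn_range() PMD loop could try before
 * falling back to installing PTEs: if the virtual range, the physical
 * range, and the remaining size are all PMD-aligned, install one leaf
 * PMD covering the whole chunk.
 */
static bool remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd,
			       unsigned long addr, unsigned long end,
			       unsigned long pfn, pgprot_t prot)
{
	if ((end - addr) < PMD_SIZE ||
	    !IS_ALIGNED(addr, PMD_SIZE) ||
	    !IS_ALIGNED(pfn << PAGE_SHIFT, PMD_SIZE))
		return false;

	/* One leaf entry instead of a full page table of PTEs; the
	 * caller skips the PTE loop for this range. */
	set_pmd_at(mm, addr, pmd, pmd_mkhuge(pfn_pmd(pfn, prot)));
	return true;
}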
> > > My first (half baked, crazy ambitious) idea is to move away from follow_pte()
> > > and get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't
> > > need to grab references, and replace them with a new converged API that
> > > locklessly walks host userspace page tables, and grabs the hugepage size along
> > > the way, e.g. so that arch code wouldn't have to do a second walk of the page
> > > tables just to get the hugepage size.
> > >
> > > In other words, for the common case (mmu_notifier integration, no reference
> > > needed), route hva_to_pfn_fast() into the new API and walk the userspace page
> > > tables (probably only for write faults, to avoid CoW complications) before
> > > doing anything else.
> > >
> > > Uses of hva_to_pfn() that need to get a reference to the struct page couldn't
> > > be converted, e.g. when stuffing physical addresses into the VMCS for nested
> > > virtualization. But for everything else, grabbing a reference is a non-goal,
> > > i.e. actually "getting" a user page is wasted effort and actively gets in
> > > the way.
> > >
> > > I was initially hoping we could go super simple and use something like x86's
> > > host_pfn_mapping_level(), but there are too many edge cases in gup() that
> > > need to be respected, e.g. to avoid mapping memfd_secret pages into KVM
> > > guests. I.e. the API would need to be a formal mm-owned thing, not some
> > > homebrewed KVM implementation.
> > >
> > > I can't tell if the payoff would be big enough to justify the effort
> > > involved, i.e. having a single unified API for grabbing PFNs from the primary
> > > MMU might just be a pie-in-the-sky type idea.

Yeah, I have been thinking about this. One thing is, KVM is not the only
caller of follow_pte today. At least naively, it would be nice if all
existing callers could benefit from this new support, not just KVM.

Another thing I noticed is, most callers don't need much: they don't need a
reference to the page, and they don't even really need the pte itself. Most
callers just want the ptl held, and they want to know the pgprot flags or
whether the pte is writable, and the pfn for the address. IOW, follow_pte
is sort of overkill for most callers. KVM is a bit different, in that it
does call GUP to get the page.

One other thing is, KVM at least on x86 also cares about the "level" of the
mapping, in host_pfn_mapping_level(). That code is all fairly x86-specific,
so I'm not sure how to generalize it. Also, I haven't looked closely at
what's going on for other architectures.

I'm not sure yet where the right place to end up is, but I at least think
it's worth trying to build some general API under mm/ which supports these
various things. In general I'm thinking of proceeding in two steps. First,
enlighten remap_pfn_range, update follow_pte to return -EINVAL for such
mappings, and add some new function which tries to be somewhat close to a
drop-in replacement for existing follow_pte callers (a rough sketch of what
I have in mind is below). Once I get that proof of concept working / tested,
I think the next step is figuring out how we can extend it a bit to build
some more general solution like Sean is describing here.
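To make that a bit more concrete, the shape I have in mind is something like
this (every name here is made up; it's a sketch of the direction, not a real
proposal):

/*
 * Report what most follow_pte() callers actually consume - the pfn, the
 * protections/writability, and the mapping level - instead of a pte, so
 * that PTE-, PMD-, and PUD-level mappings can all be handled uniformly.
 * The page table lock is held from _start() until the paired _end(), so
 * the outputs stay stable while the caller uses them.
 */
struct follow_pfn_args {
	/* Inputs. */
	struct vm_area_struct	*vma;
	unsigned long		address;
	/* Outputs, valid until follow_pfn_end(). */
	unsigned long		pfn;
	pgprot_t		pgprot;
	bool			writable;
	int			level;		/* PTE/PMD/PUD */
	spinlock_t		*ptl;		/* held on success */
};

int follow_pfn_start(struct follow_pfn_args *args);
void follow_pfn_end(struct follow_pfn_args *args);

A typical caller would then do something like the below, instead of the
usual follow_pte() + pte_pfn() + pte_write() dance:

	unsigned long pfn;
	bool writable;
	struct follow_pfn_args args = { .vma = vma, .address = addr };

	if (follow_pfn_start(&args))
		return -EFAULT;
	pfn = args.pfn;
	writable = args.writable;
	follow_pfn_end(&args);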
> > > My second, less ambitious idea: the previously linked LWN[*] article about
> > > the writeback issues reminded me of something that has bugged me for a long
> > > time. IIUC, getting a writable mapping from the primary MMU marks the
> > > page/folio dirty, and that page/folio stays dirty until the data is written
> > > back and the mapping is made read-only. And because KVM is tapped into the
> > > mmu_notifiers, KVM will be notified *before* the RW=>RO conversion
> > > completes, i.e. before the page/folio is marked clean.
> > >
> > > I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE
> > > (or dropping any mmu_notifier-aware mapping) is completely unnecessary. If
> > > that is the case, _and_ we can weasel our way out of calling
> > > kvm_set_page_accessed() too, then with FOLL_GET plumbed into hva_to_pfn(),
> > > we can:
> > >
> > >  - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of
> > >    hva_to_pfn() that aren't tied into mmu_notifiers, i.e. aren't guaranteed
> > >    to drop mappings before the page/folio is cleaned, will *know* that they
> > >    hold a refcounted struct page.
> > >
> > >  - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely,
> > >    because KVM never needs to know if a SPTE points at a refcounted page.
> > >
> > > In other words, double down on immediately doing put_page() after gup() if
> > > FOLL_GET isn't specified, and naturally make all KVM MMUs compatible with
> > > pfn_valid() PFNs that are acquired by follow_pte().
> > >
> > > I suspect we can simply mark pages as accessed when a page is retrieved
> > > from the primary MMU, as marking a page accessed when it's *removed* from
> > > the guest is rather nonsensical. E.g. if a page is mapped into the guest
> > > for a long time and it gets swapped out, marking the page accessed when KVM
> > > drops its SPTEs in response to the swap adds no value. And through the
> > > mmu_notifiers, KVM already plays nice with setups that use idle page
> > > tracking to make reclaim decisions.
> > >
> > > [*] https://lwn.net/Articles/930667
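For what it's worth, if I'm reading the second idea right, the end state is
that the zap side stops touching the page entirely. Something like this
(illustration only; the spte helpers are stand-ins for the real accessors):

/* Today, a KVM MMU does roughly this while zapping a SPTE: */
static void zap_spte_today(u64 old_spte)
{
	kvm_pfn_t pfn = spte_to_pfn(old_spte);

	if (spte_is_writable(old_spte))
		kvm_set_pfn_dirty(pfn);	/* unnecessary if writeback always
					 * zaps via mmu_notifiers before the
					 * folio is marked clean */
	kvm_set_pfn_accessed(pfn);	/* adds no value on removal, e.g.
					 * when zapping in response to swap */
}

/*
 * With the proposal, both calls go away: "dirty" is already covered by the
 * primary MMU's writeback ordering, and "accessed" is recorded once, up
 * front, when the pfn is retrieved from the primary MMU (e.g. in
 * hva_to_pfn()) rather than when the SPTE is dropped.
 */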