On Fri, Aug 16, 2024, David Matlack wrote: > On Fri, Aug 16, 2024 at 3:47 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > The proposed approach will also ignore nx_huge_page_disallowed, and just always > > create a huge NX page. On the plus side, "free" NX hugepage recovery! The > > downside is that it means KVM is pretty much guaranteed to force the guest to > > re-fault all of its code pages, and zap a non-trivial number of huge pages that > > were just created. IIRC, we deliberately did that for the zapping case, e.g. to > > use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit > > different than zap+create (if the guest is still using the region for code). > > I'm ok with skipping nx_huge_page_disallowed pages during disable-dirty-log. > > But why is recovering in-place is worse/different than zapping? They > both incur the same TLB flushes. And recovering might even result in > less vCPU faults, since vCPU faults use tdp_mmu_split_huge_page() to > install a fully populated lower level page table (vs faulting on a > zapped mapping will just install a 4K SPTE). Doh, never mind, I was thinking zapping collapsible SPTEs zapped leafs to induce faults, but it does the opposite and zaps at the level KVM thinks can be huge. > > So rather than looking for a present leaf SPTE, what about "stopping" as soon as > > KVM find a SP that can be replaced with a huge SPTE, pre-checking > > nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new > > SPTE? Or maybe even use kvm_tdp_map_page()? Though it might be more work to > > massage kvm_tdp_map_page() into a usable form. > > My intuition is that going through the full page fault flow would be > more expensive than just stepping down+up in 99.99999% of cases. Hmm, yeah, doing the full fault flow isn't the right place to hook in. Ugh, right, and it's the whole problem with not having a vCPU for tdp_mmu_map_handle_target_level(). But that's solvable as it's really just is_rsvd_spte(), which I would be a-ok skipping. Ah, no, host_writable is also problematic. Blech. That's solvable too, e.g. host_pfn_mapping_level() could get the protection, but that would require checking for an ongoing mmu_notifier invalidation. So again, probably not worth it. Double blech. > And will require more code churn. I'm not terribly concerned with code churn. I care much more about long-term maintenance, and especially about having multiple ways of doing the same thing (installing a shadow-present leaf SPTE). But I agree that trying to remedy that last point (similar flows) is probably a fool's errand in this case, as creating a new SPTE from scratch really is different than up-leveling an existing SPTE. I still have concerns about the step-up code, but I'll respond to those in the context of the patch I think is problematic.