On Mon, Aug 05, 2024, David Matlack wrote:
> This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.
>
> Bring back the logic that walks down to leafs when zapping collapsible
> SPTEs. Stepping down to leafs is technically unnecessary when zapping,
> but the leaf SPTE will be used in a subsequent commit to construct a
> huge SPTE and recover the huge mapping in place.

Please no, getting rid of the step-up code made me so happy. :-D

It's also suboptimal, e.g. in the worst case scenario (which is
comically unlikely, but theoretically possible), if the first present
leaf SPTE in a 1GiB region is the last SPTE in the last 2MiB range, and
all non-leaf SPTEs are somehow present (this is the super unlikely
part), then KVM will read 512*512 SPTEs before encountering a
shadow-present leaf SPTE.

The proposed approach will also ignore nx_huge_page_disallowed, and
just always create a huge NX page. On the plus side, "free" NX hugepage
recovery! The downside is that it means KVM is pretty much guaranteed
to force the guest to re-fault all of its code pages, and zap a
non-trivial number of huge pages that were just created. IIRC, we
deliberately did that for the zapping case, e.g. to use the opportunity
to recover NX huge pages, but zap+create+zap+create is a bit different
than zap+create (if the guest is still using the region for code).

So rather than looking for a present leaf SPTE, what about "stopping"
as soon as KVM finds a SP that can be replaced with a huge SPTE,
pre-checking nx_huge_page_disallowed, and invoking
kvm_mmu_do_page_fault() to install a new SPTE? Or maybe even use
kvm_tdp_map_page()? Though it might be more work to massage
kvm_tdp_map_page() into a usable form.

By virtue of there being a present non-leaf SPTE, KVM knows the guest
accessed the region at some point. And now that MTRR virtualization is
gone, the only time KVM zaps _only_ leafs is for mmu_notifiers and the
APIC access page, i.e.
the odds of stepping down and NOT finding a present SPTE somewhere in
the region are very small.

Lastly, kvm_mmu_max_mapping_level() has verified there is a valid
mapping in the primary MMU, else the max level would be PG_LEVEL_4K.
So the odds of getting a false positive and effectively pre-faulting
memory the guest isn't using are quite small.