On Tue, Jul 12, 2022, Sean Christopherson wrote:
> On Mon, Mar 28, 2022, Ben Gardon wrote:
> > On Mon, Mar 28, 2022 at 10:45 AM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> > > If iter.old_spte is not a leaf, the old loop would always continue to
> > > the next SPTE. Now we try to promote it and if that fails we run through
> > > the rest of the loop. This seems broken. For example, in the next line
> > > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
> > > of the TDP MMU page table?) and treat that as the PFN backing this GFN,
> > > which is wrong.
> > >
> > > In the worst case we end up zapping an SPTE that we didn't need to, but
> > > we should still fix up this code.
>
> My thought to remedy this is to drop the @pfn argument to kvm_mmu_max_mapping_level().
> It's used only for historical reasons, where KVM didn't walk the host page tables
> to get the max mapping level and instead pulled THP information out of struct page.
> I.e. KVM needed the pfn to get the page.
>
> That would also allow KVM to use huge pages for things that aren't backed by
> struct page (I know of at least one potential use case).
>
> I _think_ we can do the below. It's compile tested only at this point, and I
> want to split some of the changes into separate patches, e.g. the WARN on the step-up
> not going out-of-bounds. I'll put this on the backburner for now, it's too late
> for 5.20, and too many people are OOO :-)

Heh, that was a bit of a lie. _Now_ it's going on the backburner.

Thinking about the pfn coming from the old leaf SPTE made me realize that all
of the information we need to use __make_spte() during promotion is available
in the existing leaf SPTE.

If KVM first retrieves a PRESENT leaf SPTE, then the pfn _can't_ be different,
because that would mean KVM done messed up and didn't zap existing entries in
response to a MMU notifier invalidation, and holding mmu_lock prevents new
invalidations. And because the primary MMU must invalidate before splitting a
huge page, having a valid leaf SPTE means the host mapping level can't become
stale until mmu_lock is dropped. In other words, KVM can compute the huge pfn
by taking the smaller pfn and adjusting it based on the target mapping level.

As for the EPT memtype, that can also come from the existing leaf SPTE. KVM
only forces the memtype for host MMIO pfns, and if the primary MMU maps a huge
page that straddles host MMIO (UC) and regular memory (WB), then the kernel is
already hosed. If the VM doesn't have non-coherent DMA, then the EPT memtype
will be WB regardless of the page size. That means KVM just needs to reject
promotion if the VM has non-coherent DMA and the target pfn is not host MMIO,
else KVM can use the leaf's memtype as-is.

Using the pfn avoids gup() (fast-only, but still), and using the memtype avoids
having to split vmx_get_mt_mask() and add another kvm_x86_ops hook.

And digging into all of that yielded another optimization. kvm_tdp_page_fault()
needs to restrict the host mapping level if and only if it may consume the
guest MTRRs. If KVM ignores the guest MTRRs, then the fact that they're
inconsistent across a TDP page is irrelevant, because the _guest_ MTRRs are
completely virtual and are not consumed by either EPT or NPT. I doubt this
meaningfully affects whether or not KVM can create huge pages for real world
VMs, but it does avoid having to walk the guest's variable MTRRs when faulting
in a huge page.

Compile tested only at this point, but I'm mostly certain my logic is sound.
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	/*
	 * If the guest's MTRRs may be used, restrict the mapping level to
	 * ensure KVM uses a consistent memtype across the entire mapping.
	 */
	if (kvm_may_need_guest_mtrrs(vcpu->kvm)) {
		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
			gfn_t base = (fault->addr >> PAGE_SHIFT) & ~(page_num - 1);

			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
				break;
		}
	}

	return direct_page_fault(vcpu, fault);
}

static int try_promote_to_huge_page(struct kvm *kvm,
				    struct rsvd_bits_validate *rsvd_bits,
				    const struct kvm_memory_slot *slot,
				    u64 leaf_spte, struct tdp_iter *iter)
{
	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
	kvm_pfn_t pfn;
	u64 new_spte;
	u8 mt_mask;

	if (WARN_ON_ONCE(slot->flags & KVM_MEM_READONLY))
		return -EINVAL;

	pfn = spte_to_pfn(leaf_spte) & ~(KVM_PAGES_PER_HPAGE(iter->level) - 1);
	mt_mask = leaf_spte & shadow_memtype_mask;

	/*
	 * Bail if KVM needs guest MTRRs to compute the memtype and will not
	 * force the memtype (host MMIO).  There is no guarantee the guest uses
	 * a consistent MTRR memtype for the entire huge page, and MTRRs are
	 * tracked per vCPU, not per VM.
	 */
	if (kvm_may_need_guest_mtrrs(kvm) && !kvm_is_mmio_pfn(pfn))
		return -EIO;

	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
		    true, mt_mask, rsvd_bits, &new_spte);

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
}
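For reference, a tiny illustration of the pfn computation above (this snippet
is not part of the patch, and leaf_pfn/huge_pfn are just local names for the
example): the adjustment is a plain power-of-two alignment of the leaf's 4KiB
pfn down to the first pfn of the would-be huge page.  For a 2MiB promotion,
KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) is 512, so the mask clears pfn bits [8:0].

	/*
	 * Illustration only: align the 4KiB pfn from the leaf SPTE down to
	 * the start of the 2MiB region, e.g. a leaf SPTE pointing at pfn
	 * 0x12345 yields huge pfn 0x12200.
	 */
	kvm_pfn_t leaf_pfn = spte_to_pfn(leaf_spte);
	kvm_pfn_t huge_pfn = leaf_pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);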