On Wed, Mar 16, 2022 at 3:27 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:22AM +0000, David Matlack wrote:
> > Extend KVM's eager page splitting to also split huge pages that are
> > mapped by the shadow MMU. Specifically, walk through the rmap splitting
> > all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> > pages.
> >
> > Splitting huge pages mapped by the shadow MMU requires dealing with some
> > extra complexity beyond that of the TDP MMU:
> >
> > (1) The shadow MMU has a limit on the number of shadow pages that are
> >     allowed to be allocated. So, as a policy, Eager Page Splitting
> >     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> >     pages available.
> >
> > (2) Huge pages may be mapped by indirect shadow pages which have the
> >     possibility of being unsync. As a policy we opt not to split such
> >     pages as their translation may no longer be valid.
> >
> > (3) Splitting a huge page may end up re-using an existing lower level
> >     shadow page table. This is unlike the TDP MMU which always allocates
> >     new shadow page tables when splitting. This commit does *not*
> >     handle such aliasing and opts not to split such huge pages.
> >
> > (4) When installing the lower level SPTEs, they must be added to the
> >     rmap which may require allocating additional pte_list_desc structs.
> >     This commit does *not* handle such cases and instead opts to leave
> >     such lower-level SPTEs non-present. In this situation TLBs must be
> >     flushed before dropping the MMU lock as a portion of the huge page
> >     region is being unmapped.
> >
> > Suggested-by: Peter Feiner <pfeiner@xxxxxxxxxx>
> > [ This commit is based off of the original implementation of Eager Page
> >   Splitting from Peter in Google's kernel from 2016. ]
> > Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> > ---
> >  .../admin-guide/kernel-parameters.txt |   3 -
> >  arch/x86/kvm/mmu/mmu.c                | 307 ++++++++++++++++++
> >  2 files changed, 307 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 05161afd7642..495f6ac53801 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2360,9 +2360,6 @@
> >  		the KVM_CLEAR_DIRTY ioctl, and only for the pages being
> >  		cleared.
> >
> > -		Eager page splitting currently only supports splitting
> > -		huge pages mapped by the TDP MMU.
> > -
> >  		Default is Y (on).
> >
> >  	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 926ddfaa9e1a..dd56b5b9624f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >
> >  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
> >  {
> > +	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> > +
> > +	if (WARN_ON_ONCE(!cache))
> > +		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> > +
>
> I also think this is not proper to be added into this patch.  Maybe it'll
> be more suitable for the rmap_add() rework patch previously, or maybe it
> can be dropped directly if it should never trigger at all.  Then we die hard
> at below when referencing it.
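
(For illustration only, not text from the thread: if the WARN_ON_ONCE() fallback is
dropped as suggested above, the helper simply keeps its pre-patch shape, visible in
the diff context below, and a NULL cache faults immediately inside
kvm_mmu_memory_cache_alloc(), i.e. the "die hard" behaviour being described.)

static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
{
	/* No GFP_ATOMIC fallback: callers must have topped up @cache. */
	return kvm_mmu_memory_cache_alloc(cache);
}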
>
> >  	return kvm_mmu_memory_cache_alloc(cache);
> >  }
> >
> > @@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> >  	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
> >  }
> >
> > +static gfn_t sptep_to_gfn(u64 *sptep)
> > +{
> > +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +	return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > +}
> > +
> > +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> > +{
> > +	if (!sp->role.direct)
> > +		return sp->shadowed_translation[index].access;
> > +
> > +	return sp->role.access;
> > +}
> > +
> > +static unsigned int sptep_to_access(u64 *sptep)
> > +{
> > +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +	return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> > +}
> > +
> >  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
> >  					gfn_t gfn, u32 access)
> >  {
> > @@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
> >  	return count;
> >  }
> >
> > +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> > +					 const struct kvm_memory_slot *slot);
> > +
> >  static void
> >  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
> >  			   struct pte_list_desc *desc, int i,
> > @@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
> >  	return sp;
> >  }
> >
> > +static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
> > +						   union kvm_mmu_page_role role)
> > +{
> > +	struct kvm_mmu_page *sp;
> > +	LIST_HEAD(invalid_list);
> > +
> > +	BUG_ON(!role.direct);
> > +
> > +	sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
> > +
> > +	/* Direct SPs are never unsync. */
> > +	WARN_ON_ONCE(sp && sp->unsync);
> > +
> > +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +	return sp;
> > +}
> > +
> >  /*
> >   * Looks up an existing SP for the given gfn and role if one exists. The
> >   * return SP is guaranteed to be synced.
> > @@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >  	kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +static int prepare_to_split_huge_page(struct kvm *kvm,
> > +				      const struct kvm_memory_slot *slot,
> > +				      u64 *huge_sptep,
> > +				      struct kvm_mmu_page **spp,
> > +				      bool *flush,
> > +				      bool *dropped_lock)
> > +{
> > +	int r = 0;
> > +
> > +	*dropped_lock = false;
> > +
> > +	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> > +		return -ENOSPC;
> > +
> > +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > +		goto drop_lock;
> > +
>
> Not immediately clear on whether there'll be case that *spp is set within
> the current function.  Some sanity check might be nice?
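
(A possible shape for the sanity check being suggested here, purely illustrative
and not from the thread; the names match the quoted patch. If *spp genuinely can
never be set on entry, making that assumption explicit is cheap:)

	/* Hypothetical: catch a leftover SP from a previous iteration. */
	WARN_ON_ONCE(*spp);

(If a leftover SP is in fact possible, e.g. one allocated earlier but never
consumed because that split was skipped, reusing it via "if (*spp) return 0;"
would be the friendlier alternative to warning.)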
> > +	*spp = kvm_mmu_alloc_direct_sp_for_split(true);
> > +	if (r)
> > +		goto drop_lock;
> > +
> > +	return 0;
> > +
> > +drop_lock:
> > +	if (*flush)
> > +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +
> > +	*flush = false;
> > +	*dropped_lock = true;
> > +
> > +	write_unlock(&kvm->mmu_lock);
> > +	cond_resched();
> > +	*spp = kvm_mmu_alloc_direct_sp_for_split(false);
> > +	if (!*spp)
> > +		r = -ENOMEM;
> > +	write_lock(&kvm->mmu_lock);
> > +
> > +	return r;
> > +}
> > +
> > +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> > +						     const struct kvm_memory_slot *slot,
> > +						     u64 *huge_sptep,
> > +						     struct kvm_mmu_page **spp)
> > +{
> > +	struct kvm_mmu_page *split_sp;
> > +	union kvm_mmu_page_role role;
> > +	unsigned int access;
> > +	gfn_t gfn;
> > +
> > +	gfn = sptep_to_gfn(huge_sptep);
> > +	access = sptep_to_access(huge_sptep);
> > +
> > +	/*
> > +	 * Huge page splitting always uses direct shadow pages since we are
> > +	 * directly mapping the huge page GFN region with smaller pages.
> > +	 */
> > +	role = kvm_mmu_child_role(huge_sptep, true, access);
> > +	split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
> > +
> > +	/*
> > +	 * Opt not to split if the lower-level SP already exists. This requires
> > +	 * more complex handling as the SP may be already partially filled in
> > +	 * and may need extra pte_list_desc structs to update parent_ptes.
> > +	 */
> > +	if (split_sp)
> > +		return NULL;
>
> This smells tricky..
>
> Firstly we're trying to lookup the existing SPs that has shadowed this huge
> page in split way, with the access bits fetched from the shadow cache (so
> without hugepage nx effect).  However could the pages be mapped with
> different permissions from the currently hugely mapped page?
>
> IIUC all these is for the fact that we can't allocate pte_list_desc and we
> want to make sure we won't make the pte list to be >1.
>
> But I also see that the pte_list check below...
>
> > +
> > +	swap(split_sp, *spp);
> > +	init_shadow_page(kvm, split_sp, slot, gfn, role);
> > +	trace_kvm_mmu_get_page(split_sp, true);
> > +
> > +	return split_sp;
> > +}
> > +
> > +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > +				   const struct kvm_memory_slot *slot,
> > +				   u64 *huge_sptep, struct kvm_mmu_page **spp,
> > +				   bool *flush)
> > +
> > +{
> > +	struct kvm_mmu_page *split_sp;
> > +	u64 huge_spte, split_spte;
> > +	int split_level, index;
> > +	unsigned int access;
> > +	u64 *split_sptep;
> > +	gfn_t split_gfn;
> > +
> > +	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> > +	if (!split_sp)
> > +		return -EOPNOTSUPP;
> > +
> > +	/*
> > +	 * Since we did not allocate pte_list_desc_structs for the split, we
> > +	 * cannot add a new parent SPTE to parent_ptes. This should never happen
> > +	 * in practice though since this is a fresh SP.
> > +	 *
> > +	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> > +	 */
> > +	if (WARN_ON_ONCE(split_sp->parent_ptes.val))
> > +		return -EINVAL;
> > +
> > +	huge_spte = READ_ONCE(*huge_sptep);
> > +
> > +	split_level = split_sp->role.level;
> > +	access = split_sp->role.access;
> > +
> > +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > +		split_sptep = &split_sp->spt[index];
> > +		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> > +
> > +		BUG_ON(is_shadow_present_pte(*split_sptep));
> > +
> > +		/*
> > +		 * Since we did not allocate pte_list_desc structs for the
> > +		 * split, we can't add a new SPTE that maps this GFN.
> > +		 * Skipping this SPTE means we're only partially mapping the
> > +		 * huge page, which means we'll need to flush TLBs before
> > +		 * dropping the MMU lock.
> > +		 *
> > +		 * Note, this makes it safe to pass NULL to __rmap_add() below.
> > +		 */
> > +		if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
> > +			*flush = true;
> > +			continue;
> > +		}
>
> ... here.
>
> IIUC this check should already be able to cover all the cases and it's
> accurate on the fact that we don't want to grow any rmap to >1 len.
>
> > +
> > +		split_spte = make_huge_page_split_spte(
> > +				huge_spte, split_level + 1, index, access);
> > +
> > +		mmu_spte_set(split_sptep, split_spte);
> > +		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
>
> __rmap_add() with a NULL cache pointer is weird.. same as
> __link_shadow_page() below.
>
> I'll stop here for now I guess..  Have you considered having rmap allocation
> ready altogether, rather than making this intermediate step but only add
> that later?  Because all these look hackish to me..  It's also possible
> that I missed something important, if so please shoot.

I'd be happy to do it that way. The reasons I broke it up into the
intermediate steps are:

 - At Google we only support up to and including this patch. We don't
   handle the cases where the rmap or parent_ptes list need to grow.

 - It seemed like a good way to break up the support into smaller
   patches. But I think this backfired since the intermediate steps
   introduce their own complexity such as passing in NULL to
   __rmap_add().

>
> Thanks,
>
> > +	}
> > +
> > +	/*
> > +	 * Replace the huge spte with a pointer to the populated lower level
> > +	 * page table. Since we are making this change without a TLB flush vCPUs
> > +	 * will see a mix of the split mappings and the original huge mapping,
> > +	 * depending on what's currently in their TLB. This is fine from a
> > +	 * correctness standpoint since the translation will either be identical
> > +	 * or non-present. To account for non-present mappings, the TLB will be
> > +	 * flushed prior to dropping the MMU lock.
> > +	 */
> > +	__drop_large_spte(kvm, huge_sptep, false);
> > +	__link_shadow_page(NULL, huge_sptep, split_sp);
> > +
> > +	return 0;
> > +}
>
> --
> Peter Xu
>
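
(To make the "have rmap allocation ready altogether" idea above concrete, a rough
sketch of one way to do it: top up a dedicated pte_list_desc cache outside the MMU
lock and hand that cache, rather than NULL, to __rmap_add() and __link_shadow_page()
while splitting. The cache field and helper name below are assumptions for
illustration, not code from this series.)

/*
 * Hypothetical sketch: reserve pte_list_desc objects up front so the rmap
 * is always allowed to grow while splitting a huge page.
 */
static int topup_split_desc_cache(struct kvm *kvm)
{
	/*
	 * Conservative bound: a split installs at most PT64_ENT_PER_PAGE
	 * new lower-level SPTEs, each of which may need a pte_list_desc.
	 */
	return kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
					  PT64_ENT_PER_PAGE);
}

(With such a cache guaranteed to be topped up before taking the MMU lock, the
"leave the SPTE non-present and flush" fallback and the NULL cache arguments
both become unnecessary.)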