On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>
> Level pormotion occurs when mmu->shadow_root_level > mmu->root_level.

s/pormotion/promotion (in multiple places)

That said, I strongly prefer "passthrough" or maybe "nop" over
"level_promoted". Promoted in the context of page level usually refers
to making a hugepage, which is not the case here.

>
> There are several cases that can cuase level pormotion:
>
> shadow mmu (shadow paging for 32 bit guest):
>   case1: gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0
>
> shadow nested NPT (for 32bit L1 hypervisor):
>   case2: gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
>   case3: gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1
>
> shadow nested NPT (for 64bit L1 hypervisor):
>   case4: gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1
>
> When level pormotion occurs (32bit guest, case1-3), special roots are
> often used. But case4 is not using special roots. It uses shadow page
> without fully aware of the specialty. It might work accidentally:
> 1) The root_page (root_sp->spt) is allocated with level = 5,
>    and root_sp->spt[0] is allocated with the same gfn and the
>    same role except role.level = 4. Luckly that they are
>    different shadow pages.
> 2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and
>    walker->pt_access[4], which are normally unused when
>    mmu->shadow_root_level == mmu->root_level == 4, so that
>    FNAME(fetch) can use them to allocate shadow page for
>    root_sp->spt[0] and link them when shadow_root_level == 5.
>
> But it has problems.
> If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
> and usees the same gfn as the root for the nNPT before and after
> switching gCR4_LA57. The host (hCR4_LA57=1) wold use the same root_sp
> for guest even guest switches gCR4_LA57. The guest will see unexpected
> page mapped and L2 can hurts L1. It is lucky the the problem can't
> hurt L0.
>
> The root_sp should be like role.direct=1 sometimes: its contents are
> not backed by gptes, root_sp->gfns is meaningless. For a normal high
> level sp, sp->gfns is often unused and kept zero, but it could be
> relevant and meaningful when sp->gfns is used because they are back
> by concret gptes. For level-promoted root_sp described before, root_sp
> is just a portal to contribute root_sp->spt[0], and root_sp should not
> have root_sp->gfns and root_sp->spt[0] should not be dropped if gpte[0]
> of the root gfn is changed.
>
> This patch adds role.level_promoted to address the two problems.
> role.level_promoted is set when shadow paging and
> role.level > gMMU.level.
>
> An alternative way to fix the problem of case4 is that: also using the
> special root pml5_root for it. But it would required to change many
> other places because it is assumption that special roots is only used
> for 32bit guests.
>
> This patch also paves the way to use level promoted shadow page for
> case1-3, but that requires the special handling or PAE paging, so the
> extensive usage of it is not included.
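
For case4, it's worth spelling out why the same root_sp gets reused:
nothing in the base page role captures the guest's CR4.LA57, and the
root level is taken from the host, so both guest configurations produce
an identical (gfn, role) pair for the hash lookup. A toy userspace
sketch of the collision, with the role layout heavily simplified and
the names mine, not KVM's:

	/* Toy model of the case4 collision; not kernel code. */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	union toy_role {
		uint32_t word;
		struct {
			unsigned level:4;  /* host shadow level, 5 when hCR4_LA57=1 */
			unsigned direct:1;
			/* no bit reflects the guest's CR4.LA57 */
		};
	};

	static union toy_role calc_root_role(bool guest_la57)
	{
		union toy_role role = { .word = 0 };

		/* the level comes from the host; guest_la57 is ignored */
		(void)guest_la57;
		role.level = 5;
		role.direct = 0;
		return role;
	}

	int main(void)
	{
		union toy_role la57_off = calc_root_role(false);
		union toy_role la57_on  = calc_root_role(true);

		/*
		 * Identical role words; with the guest reusing the same
		 * root gfn, the (gfn, role) hash lookup returns the same
		 * root_sp for both configurations.
		 */
		printf("roles collide: %s\n",
		       la57_off.word == la57_on.word ? "yes" : "no");
		return 0;
	}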
>
> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ++-
>  arch/x86/kvm/mmu/mmu.c          | 15 +++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
>  3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88ecf53f0d2b..6465c83794fc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -334,7 +334,8 @@ union kvm_mmu_page_role {
>  		unsigned smap_andnot_wp:1;
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
> -		unsigned :6;
> +		unsigned level_promoted:1;
> +		unsigned :5;

The massive comment above this union needs to be updated.

>
>  		/*
>  		 * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 54e7cbc15380..4769253e9024 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -767,6 +767,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>
>  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  {
> +	if (sp->role.level_promoted)
> +		return sp->gfn;
> +
>  	if (!sp->role.direct)
>  		return sp->gfns[index];
>
> @@ -776,6 +779,8 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
>  {
>  	if (!sp->role.direct) {
> +		if (WARN_ON_ONCE(sp->role.level_promoted && gfn != sp->gfn))
> +			return;
>  		sp->gfns[index] = gfn;

This is wrong: sp->gfns is NULL when sp->role.level_promoted is true.
I believe you want something similar to the "get" flow, e.g.

	if (sp->role.passthrough) {
		WARN_ON_ONCE(gfn != sp->gfn);
		return;
	}

	if (!sp->role.direct) {
		...
	}

Alternatively, should we mark passthrough shadow pages as direct=1?
That would naturally handle this code, and for things like
reexecute_instruction()'s usage of kvm_mmu_unprotect_page(), I don't
think passthrough shadow pages should be considered indirect, e.g.
zapping them won't help and the shadow page can't become unsync.

>  		return;
>  	}
> @@ -1702,7 +1707,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  	hlist_del(&sp->hash_link);
>  	list_del(&sp->link);
>  	free_page((unsigned long)sp->spt);
> -	if (!sp->role.direct)
> +	if (!sp->role.direct && !sp->role.level_promoted)

Hmm, at this point, it may be better to just omit the check entirely.
@sp is zero allocated and free_page() plays nice with NULL "pointers".

I believe TDX and maybe SNP support will need to do more work here,
i.e. may add back a similar check, but if so then checking sp->gfns
directly for !NULL is preferable.
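
E.g. the whole thing can collapse down to (untested; relies on
free_page(0) being a nop):

	hlist_del(&sp->hash_link);
	list_del(&sp->link);
	free_page((unsigned long)sp->spt);
	/* sp->gfns is NULL for direct (and passthrough) pages */
	free_page((unsigned long)sp->gfns);
	kmem_cache_free(mmu_page_header_cache, sp);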
>  		free_page((unsigned long)sp->gfns);
>  	kmem_cache_free(mmu_page_header_cache, sp);
>  }
> @@ -1740,7 +1745,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>
>  	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>  	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> -	if (!role.direct)
> +	if (!(role.direct || role.level_promoted))

I personally prefer the style of the previous check, mainly because
I'm horrible at reading !(x || y).

	if (!role.direct && !role.passthrough)

>  		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> @@ -2084,6 +2089,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
>  		role.quadrant = quadrant;
>  	}
> +	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
> +		role.level_promoted = 0;
>
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -4836,6 +4843,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>
>  	role.base.direct = false;
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
> +	if (role.base.level > role_regs_to_root_level(regs))
> +		role.base.level_promoted = 1;
>
>  	return role;
>  }
> @@ -5228,6 +5237,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  	kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
>
>  	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> +		if (sp->role.level_promoted)
> +			continue;
>  		if (detect_write_misaligned(sp, gpa, bytes) ||
>  		      detect_write_flooding(sp)) {
>  			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5c78300fc7d9..16ac276d342a 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		.level = 0xf,
>  		.access = 0x7,
>  		.quadrant = 0x3,
> +		.level_promoted = 0x1,
>  	};
>
>  	/*
> --
> 2.19.1.6.gb485710b
>
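
To close the loop on the "set" comment above, something like this is
what I'm thinking of (untested, using the passthrough name; the
role.direct tail of the function is unchanged and omitted here):

	static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
	{
		/* sp->gfns is NULL; the only legal gfn is the root gfn itself */
		if (sp->role.passthrough) {
			WARN_ON_ONCE(gfn != sp->gfn);
			return;
		}

		if (!sp->role.direct) {
			sp->gfns[index] = gfn;
			return;
		}

		/* existing role.direct handling goes here, unchanged */
	}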