On 11/12/2011 11:37 PM, Nadav Har'El wrote:
> On Sat, Nov 12, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
>> host may write-protect a page. Second, the shadow and guest ptes may be
>> in different formats (ept vs ia32).
>
> I'm afraid I've lost you here... The shadow table and the to-be-shadowed
> table are both ia32 (this is the normal shadow table code), or both ept
> (the nested tdp code). When are they supposed to be in different
> formats (ept vs ia32)?
>
> I'm also puzzled in what situation the host will write-protect an EPT02
> (shadow EPT) page?
>
>> In fact that happens to accidentally work, no? Intermediate ptes are
>> always present/write/user, which translates to read/write/execute in EPT.
>
> It didn't work because it also used to set the "accessed" bit, bit 5,
> which on EPT is reserved and caused EPT misconfiguration. So I had to
> fix link_shadow_page, or nested EPT would not work at all.
>
>> Don't optimize for least changes, optimize for best result afterwards.
>
> As I'm sure you remember, two years ago, on September 6, 2009, you wrote in
> your blog about the newly contributed nested VMX patch set, and in
> particular its nested EPT (which predated the nested NPT contribution).
>
> Nested EPT was, for some workloads, a huge performance improvement, but
> you (if I understand correctly) did not want that code in KVM because
> it, basically, optimized for getting the job done in the most correct
> and most efficient manner - but without regard to how cleanly this fit with
> other types of shadowing (normal shadow page tables, and nested NPT),
> or how much of the code was being duplicated or circumvented.
>
> So this time around, I couldn't really "not optimize for least changes".
> This time, the nested EPT had to fit (like a square peg in a round hole
> ;-)) into the preexisting MMU and NPT shadowing. I couldn't really just write
> the most correct and most efficient code (which Orit Wasserman already
> did, two years earlier). This time I needed to figure out the least obtrusive
> way of changing the existing code. The hardest thing about doing this
> was trying to understand all the complexities and subtleties of the existing
> MMU code in KVM, which already handles 101 different cases in one
> overloaded piece of code, which is not commented or documented.
> And of course, add to that all the complexities (some might even say "cruft")
> which the underlying x86 architecture itself has accrued over the years.
> So it's not surprising I've missed some of the important subtleties which
> didn't have any effect in the typical case I've tried. Like I said, in my
> tests nested EPT *did* work. And even getting to that point was hard enough :-)
>
>> We need a third variant of walk_addr_generic that parses EPT format
>> PTEs. Whether that's best done by writing paging_ept.h or modifying
>> paging_tmpl.h, I don't know.
>
> Thanks. I'll think about everything you've said in this thread (I'm still
> not convinced I understood all your points, so just understanding them
> will be the first step). I'll see what I can do to improve the patch.
>
> But I have to be honest - I'm not sure how quickly I can finish this.
> I really appreciate all your comments about nested VMX in the last two
> years - most of them have been spot-on, 100% correct, and really helpful
> for making me understand things which I had previously misunderstood.
> However, since you are (of course) extremely familiar with every nook and
> cranny of KVM, what normally happens is that every comment which took you
> 5 minutes to figure out takes me 5 days to fully understand, and to actually
> write, debug and test the fixed code. Every review that takes you two days
> to go through (and is very much appreciated!) takes me several months to fix
> each and every thing you asked for.
>
> Don't get me wrong, I *am* planning to continue working (part-time) on nested
> VMX, and nested EPT in particular. But if you want it to pick up the pace,
> I could use some help with actual coding from people who have much more
> intimate knowledge of the non-nested-VMX parts of KVM than I have.
>
> In the meantime, if anybody wants to experiment with a much faster
> nested VMX than we had before, you can try my current patch. It may not
> be perfect, but in many ways it is better than the old shadow-on-ept code.
> And in simple (64-bit, 4k page) kvm-over-kvm configurations like I tried, it
> works well.
>
> Nadav.
>

Maybe this patch can help; it is roughly what Avi wants (I hope), done very
quickly. I'm sorry, I don't have a setup to run nested VMX at the moment, so
I can't test it.

Orit
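For reference, the link_shadow_page issue Nadav mentions above comes from the
intermediate shadow PTEs being built as present/write/user plus the accessed
bit: present/write/user happen to read back as read/write/execute in EPT, but
bit 5 (accessed) lands in a field that is reserved in EPT paging-structure
entries and triggers an EPT misconfiguration. A minimal sketch of the kind of
guard that is needed, assuming the current shape of link_shadow_page() and a
caller-supplied flag (both the flag and the exact mask set are assumptions
here, not Nadav's actual fix):

static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool ept)
{
	/* present/write/user double as read/write/execute in EPT format */
	u64 spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
		   PT_USER_MASK;

	/* bit 5 (accessed) is reserved in EPT paging-structure entries */
	if (!ept)
		spte |= PT_ACCESSED_MASK;

	mmu_spte_set(sptep, spte);
}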
>From 032ba0221ea690f61ecc4f6d946090d18d6f4dfb Mon Sep 17 00:00:00 2001
From: Orit Wasserman <owasserm@xxxxxxxxxxx>
Date: Sun, 13 Nov 2011 12:57:17 +0200
Subject: [PATCH] Add EPT_walk_addr_generic

---
 arch/x86/kvm/mmu.c         |   13 +++++++++++++
 arch/x86/kvm/mmu.h         |    5 +++++
 arch/x86/kvm/paging_tmpl.h |   44 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9335e1b..bbe212f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3180,6 +3180,10 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
 #include "paging_tmpl.h"
 #undef PTTYPE
 
+#define PTTYPE EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 32
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -3381,6 +3385,15 @@ static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 	return r;
 }
 
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
+
+	return r;
+}
+
 static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e374db9..545e2ff 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -48,6 +48,11 @@
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
+#define EPT_WRITABLE_MASK 2
+#define EPT_EXEC_MASK 4
+
+#define EPT 18
+
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 507e2b8..70d4cfd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -39,6 +39,21 @@
 	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
+#elif PTTYPE == EPT
+	#define pt_element_t u64
+	#define FNAME(name) EPT_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#ifdef CONFIG_X86_64
+	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
+	#else
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 2
+	#endif
 #elif PTTYPE == 32
 	#define pt_element_t u32
 	#define guest_walker guest_walker32
@@ -106,14 +121,19 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
 {
 	unsigned access;
 
+#if PTTYPE == EPT
+	access = (gpte & EPT_WRITABLE_MASK) | EPT_EXEC_MASK;
+#else
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 	if (last && !is_dirty_gpte(gpte))
 		access &= ~ACC_WRITE_MASK;
+#endif
 
 #if PTTYPE == 64
 	if (vcpu->arch.mmu.nx)
 		access &= ~(gpte >> PT64_NX_SHIFT);
-#endif
+#endif
+
 	return access;
 }
 
@@ -149,9 +169,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	gpa_t pte_gpa;
 	bool eperm;
 	int offset;
+#if PTTYPE == EPT
+	const int write_fault = access & EPT_WRITABLE_MASK;
+	const int user_fault = 0;
+	const int fetch_fault = 0;
+#else
 	const int write_fault = access & PFERR_WRITE_MASK;
 	const int user_fault = access & PFERR_USER_MASK;
 	const int fetch_fault = access & PFERR_FETCH_MASK;
+#endif
 	u16 errcode = 0;
 
 	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
@@ -174,6 +200,9 @@ retry_walk:
 	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
 
 	pt_access = ACC_ALL;
+#if PTTYPE == EPT
+	pt_access = PT_PRESENT_MASK | EPT_WRITABLE_MASK | EPT_EXEC_MASK;
+#endif
 
 	for (;;) {
 		gfn_t real_gfn;
@@ -186,9 +215,14 @@ retry_walk:
 		pte_gpa = gfn_to_gpa(table_gfn) + offset;
 		walker->table_gfn[walker->level - 1] = table_gfn;
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
-
+#if PTTYPE == EPT
+		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
+					      EPT_WRITABLE_MASK);
+#else
 		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 					      PFERR_USER_MASK|PFERR_WRITE_MASK);
+#endif
+
 		if (unlikely(real_gfn == UNMAPPED_GVA))
 			goto error;
 		real_gfn = gpa_to_gfn(real_gfn);
@@ -221,6 +255,7 @@ retry_walk:
 			eperm = true;
 #endif
 
+#if PTTYPE != EPT
 		if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
 			int ret;
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index,
@@ -235,7 +270,7 @@ retry_walk:
 			mark_page_dirty(vcpu->kvm, table_gfn);
 			pte |= PT_ACCESSED_MASK;
 		}
-
+#endif
 		walker->ptes[walker->level - 1] = pte;
 
 		if (FNAME(is_last_gpte)(walker, vcpu, mmu, pte)) {
@@ -244,12 +279,13 @@ retry_walk:
 			gfn_t gfn;
 			u32 ac;
 
+#if PTTYPE != EPT
 			/* check if the kernel is fetching from user page */
 			if (unlikely(pte_access & PT_USER_MASK) &&
 			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
 				if (fetch_fault && !user_fault)
 					eperm = true;
-
+#endif
 			gfn = gpte_to_gfn_lvl(pte, lvl);
 			gfn += (addr & PT_LVL_OFFSET_MASK(lvl)) >> PAGE_SHIFT;
-- 
1.7.6.4
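One gap worth noting: the patch adds nested_ept_init_mmu_context() but no
caller, and the function is static in mmu.c, so using it from the nested-VMX
code would also need a declaration in mmu.h. Presumably the vmcs02 setup path
would switch to it only when vmcs12 enables EPT for L2. A rough, untested
sketch of what that call site might look like (the function name and its
placement are assumptions, not part of the patch):

/* Hypothetical caller in vmx.c, e.g. from the vmcs02 setup path. */
static void example_setup_nested_mmu(struct kvm_vcpu *vcpu,
				     struct vmcs12 *vmcs12)
{
	/*
	 * Use the EPT-format walker only when L1 enabled EPT for L2;
	 * otherwise the existing shadow-on-EPT setup stays in place.
	 */
	if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT)
		nested_ept_init_mmu_context(vcpu);
}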