On 11/12/2011 11:37 PM, Nadav Har'El wrote:
> On Sat, Nov 12, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
>> host may write-protect a page. Second, the shadow and guest ptes may be
>> in different formats (ept vs ia32).
>
> I'm afraid I've lost you here... The shadow table and the to-be-shadowed
> table are both ia32 (this is the normal shadow table code), or both ept
> (the nested tdp code). When are they supposed to be in different
> formats (ept vs ia32)?
>
> I'm also puzzled in what situation the host will write-protect an EPT02
> (shadow EPT) page?
>
>> In fact that happens to accidentally work, no? Intermediate ptes are
>> always present/write/user, which translates to read/write/execute in EPT.
>
> It didn't work because it also used to set the "accessed" bit, bit 5,
> which on EPT is reserved and caused EPT misconfiguration. So I had to
> fix link_shadow_page, or nested EPT would not work at all.
>
>> Don't optimize for least changes, optimize for best result afterwards.
>
> As I'm sure you remember, two years ago, on September 6, 2009, you wrote in
> your blog about the newly contributed nested VMX patch set, and in
> particular its nested EPT (which predated the nested NPT contribution).
>
> Nested EPT was, for some workloads, a huge performance improvement, but
> you (if I understand correctly) did not want that code in KVM because
> it, basically, optimized for getting the job done in the most correct
> and most efficient manner - but without regard to how cleanly this fit with
> other types of shadowing (normal shadow page tables, and nested NPT),
> or how much of the code was being duplicated or circumvented.
>
> So this time around, I couldn't really "not optimize for least changes".
> This time, the nested EPT had to fit (like a square peg in a round hole
> ;-)) into the preexisting MMU and NPT shadowing. I couldn't really just write
> the most correct and most efficient code (which Orit Wasserman already
> did, two years earlier). This time I needed to figure out the least obtrusive
> way of changing the existing code. The hardest thing about doing this
> was trying to understand all the complexities and subtleties of the existing
> MMU code in KVM, which already handles 101 different cases in one
> overloaded piece of code, which is not commented or documented.
> And of course, add to that all the complexities (some might even say "cruft")
> which the underlying x86 architecture itself has accrued over the years.
> So it's not surprising I've missed some of the important subtleties which
> didn't have any effect in the typical case I've tried. Like I said, in my
> tests nested EPT *did* work. And even getting to that point was hard enough :-)
>
>> We need a third variant of walk_addr_generic that parses EPT format
>> PTEs. Whether that's best done by writing paging_ept.h or modifying
>> paging_tmpl.h, I don't know.
>
> Thanks. I'll think about everything you've said in this thread (I'm still
> not convinced I understood all your points, so just understanding them
> will be the first step). I'll see what I can do to improve the patch.
>
> But I have to be honest - I'm not sure how quickly I can finish this.
> I really appreciate all your comments about nested VMX in the last two
> years - most of them have been spot-on, 100% correct, and really helpful
> for making me understand things which I had previously misunderstood.
> However, since you are (of course) extremely familiar with every nook and
> cranny of KVM, what normally happens is that every comment which took you
> 5 minutes to figure out takes me 5 days to fully understand, and to actually
> write, debug and test the fixed code. Every review that takes you two days
> to go through (and is very much appreciated!) takes me several months to fix
> each and every thing you asked for.
>
> Don't get me wrong, I *am* planning to continue working (part-time) on nested
> VMX, and nested EPT in particular. But if you want it to pick up the pace,
> I could use some help with actual coding from people who have much more
> intimate knowledge of the non-nested-VMX parts of KVM than I have.
>
> In the meantime, if anybody wants to experiment with a much faster
> nested VMX than we had before, you can try my current patch. It may not
> be perfect, but in many ways it is better than the old shadow-on-ept code.
> And in simple (64-bit, 4k page) kvm-over-kvm configurations like I tried, it
> works well.
>
> Nadav.
>

Maybe this patch can help; it is roughly what Avi wants (I hope), done very
quickly. I'm sorry, I don't have a setup to run nested VMX at the moment, so
I can't test it.

Orit
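For reference, the link_shadow_page issue Nadav mentions above comes from the
intermediate shadow PTEs being built as present/write/user plus the accessed
bit: present/write/user happen to read back as read/write/execute in EPT, but
bit 5 (accessed) lands in a field that is reserved in EPT paging-structure
entries and triggers an EPT misconfiguration. A minimal sketch of the kind of
guard that is needed, assuming the current shape of link_shadow_page() and a
caller-supplied flag (both the flag and the exact mask set are assumptions
here, not Nadav's actual fix):

static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool ept)
{
	/* present/write/user double as read/write/execute in EPT format */
	u64 spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
		   PT_USER_MASK;

	/* bit 5 (accessed) is reserved in EPT paging-structure entries */
	if (!ept)
		spte |= PT_ACCESSED_MASK;

	mmu_spte_set(sptep, spte);
}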
>From 032ba0221ea690f61ecc4f6d946090d18d6f4dfb Mon Sep 17 00:00:00 2001
From: Orit Wasserman <owasserm@xxxxxxxxxxx>
Date: Sun, 13 Nov 2011 12:57:17 +0200
Subject: [PATCH] Add EPT_walk_addr_generic

---
 arch/x86/kvm/mmu.c         |   13 +++++++++++++
 arch/x86/kvm/mmu.h         |    5 +++++
 arch/x86/kvm/paging_tmpl.h |   44 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9335e1b..bbe212f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3180,6 +3180,10 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
 #include "paging_tmpl.h"
 #undef PTTYPE
 
+#define PTTYPE EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 32
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -3381,6 +3385,15 @@ static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 	return r;
 }
 
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
+
+	return r;
+}
+
 static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e374db9..545e2ff 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -48,6 +48,11 @@
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
+#define EPT_WRITABLE_MASK 2
+#define EPT_EXEC_MASK 4
+
+#define EPT 18
+
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 507e2b8..70d4cfd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -39,6 +39,21 @@
 	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
+#elif PTTYPE == EPT
+	#define pt_element_t u64
+	#define FNAME(name) EPT_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#ifdef CONFIG_X86_64
+	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
+	#else
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 2
+	#endif
 #elif PTTYPE == 32
 	#define pt_element_t u32
 	#define guest_walker guest_walker32
@@ -106,14 +121,19 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
 {
 	unsigned access;
 
+#if PTTYPE == EPT
+	access = (gpte & EPT_WRITABLE_MASK) | EPT_EXEC_MASK;
+#else
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 	if (last && !is_dirty_gpte(gpte))
 		access &= ~ACC_WRITE_MASK;
+#endif
 
 #if PTTYPE == 64
 	if (vcpu->arch.mmu.nx)
 		access &= ~(gpte >> PT64_NX_SHIFT);
-#endif
+#endif
+
 	return access;
 }
 
@@ -149,9 +169,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	gpa_t pte_gpa;
 	bool eperm;
 	int offset;
+#if PTTYPE == EPT
+	const int write_fault = access & EPT_WRITABLE_MASK;
+	const int user_fault = 0;
+	const int fetch_fault = 0;
+#else
 	const int write_fault = access & PFERR_WRITE_MASK;
 	const int user_fault = access & PFERR_USER_MASK;
 	const int fetch_fault = access & PFERR_FETCH_MASK;
+#endif
 	u16 errcode = 0;
 
 	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
@@ -174,6 +200,9 @@ retry_walk:
 	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
 
 	pt_access = ACC_ALL;
+#if PTTYPE == EPT
+	pt_access = PT_PRESENT_MASK | EPT_WRITABLE_MASK | EPT_EXEC_MASK;
+#endif
 
 	for (;;) {
 		gfn_t real_gfn;
@@ -186,9 +215,14 @@ retry_walk:
 		pte_gpa = gfn_to_gpa(table_gfn) + offset;
 		walker->table_gfn[walker->level - 1] = table_gfn;
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
-
+#if PTTYPE == EPT
+		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
+					      EPT_WRITABLE_MASK);
+#else
 		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 					      PFERR_USER_MASK|PFERR_WRITE_MASK);
+#endif
+
 		if (unlikely(real_gfn == UNMAPPED_GVA))
 			goto error;
 		real_gfn = gpa_to_gfn(real_gfn);
@@ -221,6 +255,7 @@ retry_walk:
 			eperm = true;
 #endif
 
+#if PTTYPE != EPT
 		if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
 			int ret;
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index,
@@ -235,7 +270,7 @@ retry_walk:
 			mark_page_dirty(vcpu->kvm, table_gfn);
 			pte |= PT_ACCESSED_MASK;
 		}
-
+#endif
 		walker->ptes[walker->level - 1] = pte;
 
 		if (FNAME(is_last_gpte)(walker, vcpu, mmu, pte)) {
@@ -244,12 +279,13 @@ retry_walk:
 			gfn_t gfn;
 			u32 ac;
 
+#if PTTYPE != EPT
 			/* check if the kernel is fetching from user page */
 			if (unlikely(pte_access & PT_USER_MASK) &&
 			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
 				if (fetch_fault && !user_fault)
 					eperm = true;
-
+#endif
 			gfn = gpte_to_gfn_lvl(pte, lvl);
 			gfn += (addr & PT_LVL_OFFSET_MASK(lvl)) >> PAGE_SHIFT;
-- 
1.7.6.4
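One gap worth noting: the patch adds nested_ept_init_mmu_context() but no
caller, and the function is static in mmu.c, so using it from the nested-VMX
code would also need a declaration in mmu.h. Presumably the vmcs02 setup path
would switch to it only when vmcs12 enables EPT for L2. A rough, untested
sketch of what that call site might look like (the function name and its
placement are assumptions, not part of the patch):

/* Hypothetical caller in vmx.c, e.g. from the vmcs02 setup path. */
static void example_setup_nested_mmu(struct kvm_vcpu *vcpu,
				     struct vmcs12 *vmcs12)
{
	/*
	 * Use the EPT-format walker only when L1 enabled EPT for L2;
	 * otherwise the existing shadow-on-EPT setup stays in place.
	 */
	if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT)
		nested_ept_init_mmu_context(vcpu);
}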