On Mon, Jul 29, 2013 at 01:55:46PM +0200, Paolo Bonzini wrote:
> On 29/07/2013 13:33, Gleb Natapov wrote:
> > On Mon, Jul 29, 2013 at 11:48:06AM +0200, Paolo Bonzini wrote:
> >> On 25/07/2013 12:59, Gleb Natapov wrote:
> >>> From: Nadav Har'El <nyh@xxxxxxxxxx>
> >>>
> >>> This is the first patch in a series which adds nested EPT support to KVM's
> >>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> >>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> >>> to set its own cr3 and take its own page faults without either of L0 or L1
> >>> getting involved. This often significantly improves L2's performance over the
> >>> previous two alternatives (shadow page tables over EPT, and shadow page
> >>> tables over shadow page tables).
> >>>
> >>> This patch adds EPT support to paging_tmpl.h.
> >>>
> >>> paging_tmpl.h contains the code for reading and writing page tables. The code
> >>> for 32-bit and 64-bit tables is very similar, but not identical, so
> >>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> >>> with PTTYPE=64, and this generates the two sets of similar functions.
> >>>
> >>> There are subtle but important differences between the format of EPT tables
> >>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> >>> third set of functions to read the guest EPT table and to write the shadow
> >>> EPT table.
> >>>
> >>> So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> >>> with "EPT") which correctly read and write EPT tables.
> >>>
> >>> Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
> >>> Signed-off-by: Jun Nakajima <jun.nakajima@xxxxxxxxx>
> >>> Signed-off-by: Xinhao Xu <xinhao.xu@xxxxxxxxx>
> >>> Signed-off-by: Yang Zhang <yang.z.zhang@xxxxxxxxx>
> >>> Signed-off-by: Gleb Natapov <gleb@xxxxxxxxxx>
> >>> ---
> >>>  arch/x86/kvm/mmu.c         |  5 +++++
> >>>  arch/x86/kvm/paging_tmpl.h | 43 +++++++++++++++++++++++++++++++++++++++----
> >>>  2 files changed, 44 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>> index 4c4274d..b5273c3 100644
> >>> --- a/arch/x86/kvm/mmu.c
> >>> +++ b/arch/x86/kvm/mmu.c
> >>> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >>>  	return mmu->last_pte_bitmap & (1 << index);
> >>>  }
> >>>
> >>> +#define PTTYPE_EPT 18 /* arbitrary */
> >>> +#define PTTYPE PTTYPE_EPT
> >>> +#include "paging_tmpl.h"
> >>> +#undef PTTYPE
> >>> +
> >>>  #define PTTYPE 64
> >>>  #include "paging_tmpl.h"
> >>>  #undef PTTYPE
> >>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >>> index 7581395..e38b3c0 100644
> >>> --- a/arch/x86/kvm/paging_tmpl.h
> >>> +++ b/arch/x86/kvm/paging_tmpl.h
> >>> @@ -58,6 +58,21 @@
> >>>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> >>>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> >>>  	#define CMPXCHG cmpxchg
> >>> +#elif PTTYPE == PTTYPE_EPT
> >>> +	#define pt_element_t u64
> >>> +	#define guest_walker guest_walkerEPT
> >>> +	#define FNAME(name) ept_##name
> >>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> >>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> >>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> >>> +	#define PT_GUEST_ACCESSED_MASK 0
> >>> +	#define PT_GUEST_DIRTY_MASK 0
> >>> +	#define PT_GUEST_DIRTY_SHIFT 0
> >>> +	#define PT_GUEST_ACCESSED_SHIFT 0
> >>> +	#define CMPXCHG cmpxchg64
> >>> +	#define PT_MAX_FULL_LEVELS 4
> >>>  #else
> >>>  	#error Invalid PTTYPE value
> >>>  #endif
> >>> @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> >>>
> >>>  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >>>  {
> >>> +#if PT_GUEST_DIRTY_MASK == 0
> >>> +	/* dirty bit is not supported, so no need to track it */
> >>> +	return;
> >>> +#else
> >>>  	unsigned mask;
> >>>
> >>>  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> >>> @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >>>  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
> >>>  		PT_WRITABLE_MASK;
> >>>  	*access &= mask;
> >>> +#endif
> >>
> >> Please put this #if/#else/#endif in the previous patch. (See also
> >> below on leaving out protect_clean_gpte altogether).
> >>
> > Why? This change does not make much sense before EPT is introduced. The
> > previous patch is just a rename that should be easily verifiable by any
> > reviewer to be a NOP.
> 
> My initial impression of this patch was "everything's ready after the
> previous patch, you just have to set the mask to 0". Which is not quite
> true. Maybe you need three patches instead of two.
> 
Or change the commit message of patch 5 to make it clearer that it is a
preparation patch?

> >> 	if (!write_fault)
> >> 		protect_clean_gpte(&pte_access, pte);
> >> 	else
> >> 		/*
> >> 		 * On a write fault, fold the dirty bit into accessed_dirty by
> >> 		 * shifting it one place right.
> >> 		 */
> >> 		accessed_dirty &= pte >>
> >> 			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> >>
> >> should be compiled out (in the previous patch) if dirty bits are not in use.
> >> The "then" branch does nothing in that case, and the "else" branch is dead
> >> code that makes no sense.
> >
> > I disagree, the ifdef was there before and it was ugly. protect_clean_gpte
> > and update_accessed_dirty_bits had to be ifdefed too.
> 
> protect_clean_gpte is still #ifdef'ed, so we're talking of a net gain of
> 2 lines.
> 
Please, no need to count lines :) protect_clean_gpte() _has_ an ifdef,
it is not ifdefed out as a whole.

> Something like this:
> 
> +	/* if dirty bit is not supported, no need to track it */
> +#if PT_GUEST_DIRTY_MASK == 0
> 	if (!write_fault)
> 		protect_clean_gpte(&pte_access, pte);
> 	...
> 	if (unlikely(!accessed_dirty)) {
> 		...
> 	}
> +#endif
> 
I would have to do the same for update_accessed_dirty_bits(). The
problem with ifdefs is that they spread around.

> doesn't look bad at all. With the old check on EPT it looked ugly, but
> with the new check on PT_GUEST_DIRTY_MASK it is quite natural. Also
> because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
> If I see
> 
> 	if (!write_fault)
> 		protect_clean_gpte(&pte_access, pte);
> 	else
> 		/*
> 		 * On a write fault, fold the dirty bit into
> 		 * accessed_dirty by shifting it one place right.
> 		 */
> 		accessed_dirty &=
> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
> 
> 	if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
> 
> the obvious reaction is "what, is there a case where I'm using
> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?" Of course it makes sense
In this case accessed_dirty has the correct value of 0 :) The if() below
just tells you that, since A/D is not supported, there is nothing to be
done about the zero value of accessed_dirty, but the value itself is
correct!

> to leave the otherwise dead assignments to accessed_dirty in the loop
> (the compiler removes them anyway); but here the assignment is too close
> to the user to not raise such questions.
> 
I would think that since the assignment is close to the loop it is
_easier_ to see that it is dead, not harder.

> Perhaps it's obvious to you because you're more familiar with this code.
> 
> > The compiler should be
> > smart enough to get rid of all of this code when PT_GUEST_DIRTY_MASK is 0.
> > Doing it like that was Xiao's idea and it looks much nicer.
> >
> >> Once you do this, you can add a
> >>
> >> 	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_GUEST_ACCESSED_SHIFT);
> >>
> >> before the shift.
> >>
> >> Please check if, with these changes, you can avoid defining
> >> PT_GUEST_{DIRTY,ACCESSED}_SHIFT altogether in the EPT case.
> >> This is safer because you are sure you left no undefined
> >> behaviors when a bit is being folded onto another.
> >
> > You are basically asking me to go back to how the patch was before I
> > addressed Xiao's comment, and to add some more ifdefs, because
> > previously not all places where A/D bits were used were protected by
> > them. IMO this would be a step backward, especially as the method in
> > this patch series is a preparation for A/D support for EPT. When those
> > bits are supported with EPT they are different than in regular page
> > tables.
> 
> Yes, and if they put the dirty bit below the accessed bit (rather
> than above), you will silently get undefined behavior from a shift by
> a negative value. Adding BUILD_BUG_ONs for this, and ensuring
> PT_GUEST_{DIRTY,ACCESSED}_SHIFT are never used in the EPT case, is
> important in my opinion.
> 
The format is defined. The Dirty bit is above the Accessed bit.

> I very much like what you did in the previous patch, allowing
> customization of the masks instead of using EPT ifdefs all over the
> place. On the other hand, once you did that, the benefits of Xiao's
> proposed change pretty much go away.
> 
> >> In principle, with these changes you could leave protect_clean_gpte in mmu.c.
> >
> > Only if I ifdef all other uses of it in the file.
> 
> Yeah, scrap that.
> 
> Paolo

--
			Gleb.