The following patches add nested EPT support to Nested VMX.

This is the second version of this patch set. Most of the issues from the previous reviews were handled, and in particular there is now a new variant of paging_tmpl for EPT page tables. However, while this version does work in my tests, there are still some known problems/bugs and unhandled issues from the previous review:

1. 32-bit *PAE* L2s currently don't work. Non-PAE 32-bit L2s do work (and so do, of course, 64-bit L2s).

2. nested_ept_inject_page_fault() assumes vm_exit_reason is already set to EPT_VIOLATION. However, it is conceivable that L0 emulates some L2 instruction, and during this emulation we read some L2 memory, causing a need to exit (from L2 to L1) with an EPT violation.

3. Moreover, nested_ept_inject_page_fault() currently always causes an EPT_VIOLATION, with vmcs12->exit_qualification = fault->error_code. This is wrong in two ways: first, fault->error_code is not in exit-qualification format but in PFERR_* format; second, PFERR_RSVD_MASK would mean we need to cause an EPT_MISCONFIG, not an EPT_VIOLATION.

   Instead of trying to fix this by translating PFERR_* bits into an exit qualification, we should calculate and remember in walk_addr() the exit qualification (and an additional bit: whether it's an EPT VIOLATION or MISCONFIGURATION). These would be remembered in new fields in x86_exception (a rough sketch of this direction appears below, after the "About nested EPT" section). Avi suggested:

   "[add to x86_exception] another bool, to distinguish between EPT VIOLATION and EPT MISCONFIGURATION. The error_code field should be extended to 64 bits for EXIT_QUALIFICATION (though only bits 0-12 are defined). You need another field for the guest linear address. EXIT_QUALIFICATION has to be calculated, it cannot be derived from the original exit. Look at kvm_propagate_fault()."

   He also added: "If we're injecting an EPT VIOLATION to L1 (because we weren't able to resolve it; say L1 write-protected the page), then we need to compute EXIT_QUALIFICATION. Bits 3-5 of EXIT_QUALIFICATION are computed from EPT12 paging structure entries (easy to derive them from pt_access/pte_access)."

4. Also, nested_ept_inject_page_fault() doesn't set the guest linear address.

5. There are several "TODO"s left in the code.

If there's any volunteer willing to help me with some of these issues, it would be great :-)

About nested EPT:
-----------------

Nested EPT means emulating EPT for an L1 guest, allowing it to use EPT when running a nested guest L2. When L1 uses EPT, the L2 guest can set its own cr3 and take its own page faults without either L0 or L1 getting involved. In many workloads this significantly improves L2's performance over the previous two alternatives (shadow page tables over EPT, and shadow page tables over shadow page tables). Our paper [1] described these three options, and the advantages of nested EPT ("multidimensional paging" in the paper).

Nested EPT is enabled by default (if the hardware supports EPT), so users do not have to do anything special to enjoy the performance improvement that this patch gives to L2 guests. L1 may of course choose not to use nested EPT, by simply not using EPT (e.g., a KVM in L1 may use the "ept=0" option).
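As a starting point for issue 3 above, here is a rough, untested sketch of the direction Avi suggested. The added x86_exception fields and the nested_ept_exit_qualification() helper are illustrative names only (they are not part of this patch set), and the mapping of the EPT12 read/write/execute permissions onto ACC_USER_MASK/ACC_WRITE_MASK/ACC_EXEC_MASK in pte_access is an assumption about how the EPT variant of paging_tmpl might encode them:

	/*
	 * Sketch only: field names below are illustrative, not actual KVM
	 * code.  Following Avi's suggestion, error_code is widened to 64 bits
	 * so it can hold either a PFERR_* error code or, for nested EPT, an
	 * EXIT_QUALIFICATION value (only bits 0-12 currently defined).
	 */
	struct x86_exception {
		u8 vector;
		bool error_code_valid;
		u64 error_code;
		bool nested_page_fault;
		u64 address;			/* cr2 or nested page fault gpa */
		/* proposed additions: */
		bool ept_misconfig;		/* inject EPT_MISCONFIG, not EPT_VIOLATION */
		u64 guest_linear_address;
	};

	/*
	 * Hypothetical helper, called from the EPT variant of walk_addr()
	 * while the EPT12 paging-structure permissions are still at hand.
	 * Bits 0-2 describe the access that faulted; bits 3-5 describe what
	 * the EPT12 entries would have allowed.
	 */
	static u64 nested_ept_exit_qualification(u32 error_code, unsigned pte_access)
	{
		u64 eq;

		if (error_code & PFERR_FETCH_MASK)
			eq = 1ULL << 2;		/* instruction fetch */
		else if (error_code & PFERR_WRITE_MASK)
			eq = 1ULL << 1;		/* data write */
		else
			eq = 1ULL << 0;		/* data read */

		if (pte_access & ACC_USER_MASK)	/* assumed encoding of the EPT read bit */
			eq |= 1ULL << 3;	/* guest-physical address was readable */
		if (pte_access & ACC_WRITE_MASK)
			eq |= 1ULL << 4;	/* ... was writable */
		if (pte_access & ACC_EXEC_MASK)
			eq |= 1ULL << 5;	/* ... was executable */

		return eq;
	}

nested_ept_inject_page_fault() would then just copy these fields into vmcs12, choosing EPT_MISCONFIG when ept_misconfig is set (in which case the violation-style qualification bits are not needed) and also filling in the guest linear address, which would address issue 4 as well.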
Just as a non-scientific, non-representative indication of the kind of dramatic performance improvement you may see in workloads that have a lot of context switches and page faults, here is a measurement of the time an example single-threaded "make" took in L2 (kvm over kvm):

  shadow over shadow: 105 seconds ("ept=0" in L0 forces this)
  shadow over EPT:     87 seconds (the previous default; can be forced with "ept=0" in L1)
  EPT over EPT:        29 seconds (the default after this patch)

Note that the same test on L1 (with EPT) took 25 seconds, so for this example workload, performance of nested virtualization is now very close to that of single-level virtualization.

[1] "The Turtles Project: Design and Implementation of Nested Virtualization",
    http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

Patch statistics:
-----------------

 Documentation/virtual/kvm/nested-vmx.txt |    4
 arch/x86/include/asm/vmx.h               |    2
 arch/x86/kvm/mmu.c                       |   52 +++-
 arch/x86/kvm/mmu.h                       |    1
 arch/x86/kvm/paging_tmpl.h               |   98 ++++++++-
 arch/x86/kvm/vmx.c                       |  227 +++++++++++++++++++--
 arch/x86/kvm/x86.c                       |   11 -
 7 files changed, 354 insertions(+), 41 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab