This patch series adds Page Modification Logging (PML) support in VMX.

1) Introduction

PML is a new feature on Intel's Broadwell server platform, aimed at reducing
the overhead of the dirty logging mechanism. The specification can be found
at:

http://www.intel.com/content/www/us/en/processors/page-modification-logging-vmm-white-paper.html

Currently, dirty logging is done by write protection: guest memory is write
protected, and each dirty GFN is marked in dirty_bitmap in the subsequent
write fault. This works fine, except for the overhead of an additional write
fault for logging each dirty GFN. The overhead can be large if the guest's
write operations are intensive.

PML is a hardware-assisted, efficient way of dirty logging. PML logs dirty
GPAs automatically to a 4K PML memory buffer when the CPU changes an EPT
entry's D-bit from 0 to 1. To support this, a new 4K PML buffer base address
and a PML index were added to the VMCS. Initially the PML index is set to 512
(each entry is 8 bytes), the CPU decrements the index after logging each GPA,
and eventually a PML buffer full VMEXIT occurs when the buffer is completely
filled.

With PML, we don't have to use write protection, so the intensive write-fault
EPT violations can be avoided, at the cost of one additional PML buffer full
VMEXIT per 512 dirty GPAs. Theoretically, this reduces hypervisor overhead
when the guest is in dirty logging mode, so more CPU cycles can be given to
the guest, and benchmarks in the guest are expected to perform better than
without PML.

2) Design

a. Enable/Disable PML

PML is per-vcpu (per-VMCS), while the EPT table can be shared by vcpus, so we
need to enable/disable PML for all vcpus of the guest. A dedicated 4K page is
allocated for each vcpu when PML is enabled for that vcpu.

Currently we choose to always enable PML for the guest: we enable PML when
creating the VCPU and never disable it during the guest's lifetime. This
avoids the complicated logic of enabling PML on demand while the guest is
running. To eliminate potential unnecessary GPA logging in non-dirty-logging
mode, we set the D-bit manually for the slots with dirty logging disabled.

b. Flush PML buffer

When userspace queries dirty_bitmap, it is possible that GPAs have been
logged in a vcpu's PML buffer, but since the buffer is not full, no VMEXIT
has happened. In this case, we should manually flush the PML buffer for all
vcpus and transfer the dirty GPAs to dirty_bitmap.

We do the PML buffer flush at the beginning of each VMEXIT. This keeps
dirty_bitmap more up to date, and also makes the logic of flushing the PML
buffer for all vcpus easier -- we only need to kick all vcpus out of the
guest, and each vcpu's PML buffer will be flushed automatically.
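For illustration, below is a minimal sketch of what such a per-vcpu flush
could look like, based only on the description above. PML_ENTITY_NUM, the
GUEST_PML_INDEX VMCS field and the per-vcpu page vmx->pml_pg are assumed
names for this sketch (not necessarily those used in the patches);
mark_page_dirty() is the existing generic KVM helper.

#define PML_ENTITY_NUM	512	/* 4K buffer / 8 bytes per GPA */

/* Sketch only: drain the per-vcpu PML buffer into the dirty log. */
static void flush_pml_buffer(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u64 *pml_buf;
	u16 pml_idx;

	pml_idx = vmcs_read16(GUEST_PML_INDEX);

	/* Assuming the index starts at the last entry: nothing logged yet. */
	if (pml_idx == PML_ENTITY_NUM - 1)
		return;

	/*
	 * The CPU decrements the index after each logged GPA, so entries
	 * pml_idx+1 .. 511 are valid; an out-of-range index means the
	 * whole buffer has been filled.
	 */
	if (pml_idx >= PML_ENTITY_NUM)
		pml_idx = 0;
	else
		pml_idx++;

	pml_buf = page_address(vmx->pml_pg);
	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
		u64 gpa = pml_buf[pml_idx];

		/* Hardware logs page-aligned GPAs; mark the GFN dirty. */
		mark_page_dirty(vcpu->kvm, gpa >> PAGE_SHIFT);
	}

	/* Reset the index so the whole buffer is available again. */
	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}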
3) Tests and benchmark results

I used the specjbb benchmark, which is memory intensive, to measure PML. All
tests were done with the following configuration:

Machine (Broadwell server): 16 CPUs (1.4GHz) + 4G memory
Host kernel: KVM queue branch. Transparent Hugepage disabled. C-state,
             P-state, S-state disabled. Swap disabled.
Guest: Ubuntu 14.04 with kernel 3.13.0-36-generic
Guest: 4 vcpus + 1G memory. All vcpus are pinned.

a. Compare score with and without PML enabled.

This is to make sure PML won't bring any performance regression, as it is
always enabled for the guest.

Booting guest with graphic window (no --nographic)

                NOPML       PML
                109755      109379
                108786      109300
                109234      109663
                109257      107471
                108514      108904
                109740      107623

        avg:    109214      108723

performance regression: (109214 - 108723) / 109214 = 0.45%

Booting guest without graphic window (--nographic)

                NOPML       PML
                109090      109686
                109461      110533
                110523      108550
                109960      110775
                109090      109802
                110787      109192

        avg:    109818      109756

performance regression: (109818 - 109756) / 109818 = 0.06%

So there's no noticeable performance regression from leaving PML always
enabled.

b. Compare specjbb score between PML and Write Protection.

This is used to see how much performance gain PML can bring when the guest is
in dirty logging mode. I modified qemu by adding an additional "monitoring
thread" that queries dirty_bitmap periodically (once per second). With this
thread, we can measure the performance gain of PML by comparing the specjbb
score under the PML code path and the write protection code path. Again, I
took scores both with and without the guest's graphic window.

Booting guest with graphic window (no --nographic)

                    PML         WP          No monitoring thread
                    104748      101358
                    102934      99895
                    103525      98832
                    105331      100678
                    106038      99476
                    104776      99851

        avg:        104558      100015      108723 (== PML score in test a)
        percent:    96.17%      91.99%      100%

performance gain: 96.17% - 91.99% = 4.18%

Booting guest without graphic window (--nographic)

                    PML         WP          No monitoring thread
                    104778      98967
                    104856      99380
                    103783      99406
                    105210      100638
                    106218      99763
                    105475      99287

        avg:        105053      99573       109756 (== PML score in test a)
        percent:    95.72%      90.72%      100%

performance gain: 95.72% - 90.72% = 5%

So there's a noticeable performance gain (around 4%~5%) with PML compared to
Write Protection.

Kai Huang (6):
  KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for
    log dirty
  KVM: MMU: Add mmu help functions to support PML
  KVM: MMU: Explicitly set D-bit for writable spte.
  KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
  KVM: x86: Add new dirty logging kvm_x86_ops for PML
  KVM: VMX: Add PML support in VMX

 arch/arm/kvm/mmu.c              |  18 ++-
 arch/x86/include/asm/kvm_host.h |  37 +++++-
 arch/x86/include/asm/vmx.h      |   4 +
 arch/x86/include/uapi/asm/vmx.h |   1 +
 arch/x86/kvm/mmu.c              | 243 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/trace.h            |  18 +++
 arch/x86/kvm/vmx.c              | 195 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  78 +++++++++++--
 include/linux/kvm_host.h        |   2 +-
 virt/kvm/kvm_main.c             |   2 +-
 10 files changed, 577 insertions(+), 21 deletions(-)

-- 
2.1.0