This patch series adds Page Modification Logging (PML) support in VMX.

1) Introduction

PML is a new feature on Intel's Broadwell server platform, aimed at reducing
the overhead of the dirty logging mechanism. The specification can be found
at:

http://www.intel.com/content/www/us/en/processors/page-modification-logging-vmm-white-paper.html

Currently, dirty logging is done by write protection: guest memory is write
protected, and each dirty GFN is marked in dirty_bitmap in the subsequent
write fault. This works fine, except for the overhead of an additional write
fault for logging each dirty GFN. The overhead can be large if the guest's
write operations are intensive.

PML is a hardware-assisted, efficient way of dirty logging. PML logs dirty
GPAs automatically to a 4K PML memory buffer when the CPU changes an EPT
entry's D-bit from 0 to 1. To support this, a new 4K PML buffer base address
and a PML index were added to the VMCS. Initially the PML index is set to 512
(each entry is 8 bytes), the CPU decrements the index after logging each GPA,
and eventually a PML buffer full VMEXIT occurs when the buffer is completely
filled.

With PML, we don't have to use write protection, so the intensive write-fault
EPT violations can be avoided, at the cost of one additional PML buffer full
VMEXIT per 512 dirty GPAs. Theoretically, this reduces hypervisor overhead
when the guest is in dirty logging mode, so more CPU cycles can be given to
the guest, and benchmarks in the guest are expected to perform better than
without PML.

2) Design

a. Enable/Disable PML

PML is per-vcpu (per-VMCS), while the EPT table can be shared by vcpus, so we
need to enable/disable PML for all vcpus of the guest. A dedicated 4K page is
allocated for each vcpu when PML is enabled for that vcpu.

Currently we choose to always enable PML for the guest: we enable PML when
creating the VCPU and never disable it during the guest's lifetime. This
avoids the complicated logic of enabling PML on demand while the guest is
running. To eliminate potential unnecessary GPA logging in non-dirty-logging
mode, we set the D-bit manually for the slots with dirty logging disabled.

b. Flush PML buffer

When userspace queries dirty_bitmap, it is possible that GPAs have been
logged in a vcpu's PML buffer, but since the buffer is not full, no VMEXIT
has happened. In this case, we should manually flush the PML buffer for all
vcpus and transfer the dirty GPAs to dirty_bitmap.

We do the PML buffer flush at the beginning of each VMEXIT. This keeps
dirty_bitmap more up to date, and also makes the logic of flushing the PML
buffer for all vcpus easier -- we only need to kick all vcpus out of the
guest, and each vcpu's PML buffer will be flushed automatically.
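For illustration, below is a minimal sketch of what such a per-vcpu flush
could look like, based only on the description above. PML_ENTITY_NUM, the
GUEST_PML_INDEX VMCS field and the per-vcpu page vmx->pml_pg are assumed
names for this sketch (not necessarily those used in the patches);
mark_page_dirty() is the existing generic KVM helper.

#define PML_ENTITY_NUM	512	/* 4K buffer / 8 bytes per GPA */

/* Sketch only: drain the per-vcpu PML buffer into the dirty log. */
static void flush_pml_buffer(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u64 *pml_buf;
	u16 pml_idx;

	pml_idx = vmcs_read16(GUEST_PML_INDEX);

	/* Assuming the index starts at the last entry: nothing logged yet. */
	if (pml_idx == PML_ENTITY_NUM - 1)
		return;

	/*
	 * The CPU decrements the index after each logged GPA, so entries
	 * pml_idx+1 .. 511 are valid; an out-of-range index means the
	 * whole buffer has been filled.
	 */
	if (pml_idx >= PML_ENTITY_NUM)
		pml_idx = 0;
	else
		pml_idx++;

	pml_buf = page_address(vmx->pml_pg);
	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
		u64 gpa = pml_buf[pml_idx];

		/* Hardware logs page-aligned GPAs; mark the GFN dirty. */
		mark_page_dirty(vcpu->kvm, gpa >> PAGE_SHIFT);
	}

	/* Reset the index so the whole buffer is available again. */
	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}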
3) Tests and benchmark results

I used the specjbb benchmark, which is memory intensive, to measure PML. All
tests were done with the following configuration:

Machine (Broadwell server): 16 CPUs (1.4GHz) + 4G memory
Host kernel: KVM queue branch. Transparent Hugepage disabled. C-state,
             P-state, S-state disabled. Swap disabled.
Guest: Ubuntu 14.04 with kernel 3.13.0-36-generic
Guest: 4 vcpus + 1G memory. All vcpus are pinned.

a. Compare score with and without PML enabled.

This is to make sure PML won't bring any performance regression, as it is
always enabled for the guest.

Booting guest with graphic window (no --nographic)

                NOPML       PML
                109755      109379
                108786      109300
                109234      109663
                109257      107471
                108514      108904
                109740      107623

        avg:    109214      108723

performance regression: (109214 - 108723) / 109214 = 0.45%

Booting guest without graphic window (--nographic)

                NOPML       PML
                109090      109686
                109461      110533
                110523      108550
                109960      110775
                109090      109802
                110787      109192

        avg:    109818      109756

performance regression: (109818 - 109756) / 109818 = 0.06%

So there's no noticeable performance regression from leaving PML always
enabled.

b. Compare specjbb score between PML and Write Protection.

This is used to see how much performance gain PML can bring when the guest is
in dirty logging mode. I modified qemu by adding an additional "monitoring
thread" that queries dirty_bitmap periodically (once per second). With this
thread, we can measure the performance gain of PML by comparing the specjbb
score under the PML code path and the write protection code path. Again, I
took scores both with and without the guest's graphic window.

Booting guest with graphic window (no --nographic)

                    PML         WP          No monitoring thread
                    104748      101358
                    102934      99895
                    103525      98832
                    105331      100678
                    106038      99476
                    104776      99851

        avg:        104558      100015      108723 (== PML score in test a)
        percent:    96.17%      91.99%      100%

performance gain: 96.17% - 91.99% = 4.18%

Booting guest without graphic window (--nographic)

                    PML         WP          No monitoring thread
                    104778      98967
                    104856      99380
                    103783      99406
                    105210      100638
                    106218      99763
                    105475      99287

        avg:        105053      99573       109756 (== PML score in test a)
        percent:    95.72%      90.72%      100%

performance gain: 95.72% - 90.72% = 5%

So there's a noticeable performance gain (around 4%~5%) with PML compared to
Write Protection.

Kai Huang (6):
  KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for
    log dirty
  KVM: MMU: Add mmu help functions to support PML
  KVM: MMU: Explicitly set D-bit for writable spte.
  KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
  KVM: x86: Add new dirty logging kvm_x86_ops for PML
  KVM: VMX: Add PML support in VMX

 arch/arm/kvm/mmu.c              |  18 ++-
 arch/x86/include/asm/kvm_host.h |  37 +++++-
 arch/x86/include/asm/vmx.h      |   4 +
 arch/x86/include/uapi/asm/vmx.h |   1 +
 arch/x86/kvm/mmu.c              | 243 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/trace.h            |  18 +++
 arch/x86/kvm/vmx.c              | 195 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  78 +++++++++++--
 include/linux/kvm_host.h        |   2 +-
 virt/kvm/kvm_main.c             |   2 +-
 10 files changed, 577 insertions(+), 21 deletions(-)

-- 
2.1.0