Ignore this please. Something bad happened to the From: header.

On Thu, Oct 14, 2010 at 11:16:58AM +0200, y@xxxxxxxxxx wrote:
> From: Gleb Natapov <gleb@xxxxxxxxxx>
>
> KVM virtualizes guest memory by means of shadow pages or HW assistance
> like NPT/EPT. Not all memory used by a guest is mapped into the guest
> address space or even present in host memory at any given time. When a
> vcpu tries to access a memory page that is not mapped into the guest
> address space, KVM is notified about it. KVM maps the page into the
> guest address space and resumes vcpu execution. If the page is swapped
> out from host memory, vcpu execution is suspended until the page is
> swapped in again. This is inefficient, since the vcpu could do other
> work (run another task or serve interrupts) while the page is being
> swapped in.
>
> The patch series tries to mitigate this problem by introducing two
> mechanisms. The first one is used with non-PV guests and works like
> this: when a vcpu tries to access a swapped-out page it is halted, and
> the requested page is swapped in by another thread. That way the vcpu
> can still process interrupts while the io happens in parallel and,
> with any luck, an interrupt will cause the guest to schedule another
> task on the vcpu, so it has work to do instead of waiting for the page
> to be swapped in.
>
> The second mechanism introduces PV notification about swapped page
> state to a guest (asynchronous page fault). Instead of halting the
> vcpu upon access to a swapped-out page and hoping that some interrupt
> will cause a reschedule, we immediately inject an asynchronous page
> fault into the vcpu. A PV-aware guest knows that upon receiving such
> an exception it should schedule another task to run on the vcpu. The
> current task is put to sleep until another kind of asynchronous page
> fault is received, notifying the guest that the page is now in host
> memory, so the task that waits for it can run again.
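>
> To make the guest side concrete, here is a minimal, self-contained
> sketch of the token bookkeeping such a guest needs: "page not present"
> records a token and puts the faulting task to sleep, "page ready"
> looks the token up and wakes the task. All names here are illustrative
> only; the real handler in this series lives in arch/x86/kernel/kvm.c
> and of course sleeps and wakes real tasks instead of printing.
>
> === apf_token_sketch.c (illustration only) ===
>
> #include <stdio.h>
> #include <stdlib.h>
>
> #define APF_HASH_SIZE 32
>
> struct apf_wait {
> 	unsigned int token;	/* token the host sent with the fault */
> 	struct apf_wait *next;
> };
>
> static struct apf_wait *apf_hash[APF_HASH_SIZE];
>
> /* "Page not present": remember the token; a real guest would now put
>  * the current task to sleep and schedule another one on this vcpu. */
> void apf_page_not_present(unsigned int token)
> {
> 	struct apf_wait *w = calloc(1, sizeof(*w));
>
> 	if (!w)
> 		abort();
> 	w->token = token;
> 	w->next = apf_hash[token % APF_HASH_SIZE];
> 	apf_hash[token % APF_HASH_SIZE] = w;
> 	printf("token %u: task sleeps, vcpu runs something else\n", token);
> }
>
> /* "Page ready": forget the token and wake the task waiting on it. */
> void apf_page_ready(unsigned int token)
> {
> 	struct apf_wait **p = &apf_hash[token % APF_HASH_SIZE];
>
> 	for (; *p; p = &(*p)->next) {
> 		if ((*p)->token == token) {
> 			struct apf_wait *w = *p;
>
> 			*p = w->next;
> 			free(w);
> 			printf("token %u: page is in, task wakes\n", token);
> 			return;
> 		}
> 	}
> 	/* "ready" can race with "not present"; the real code keeps a
> 	 * marker for a token that becomes ready before anyone sleeps. */
> 	printf("token %u: no sleeper yet\n", token);
> }
>
> int main(void)
> {
> 	apf_page_not_present(42);	/* guest task blocks on token 42 */
> 	apf_page_ready(42);		/* host swapped the page back in */
> 	return 0;
> }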
>
> To measure the performance benefits I use a simple benchmark program
> (below) that starts a number of threads. Some of them do work
> (increment a counter); others access a huge array at random locations,
> trying to generate host page faults. The size of the array is smaller
> than guest memory but bigger than host memory, so we are guaranteed
> that the host will swap out part of the array.
>
> I ran the benchmark on three setups: with current kvm.git (master),
> with my patch series + non-pv guest (nonpv), and with my patch series
> + pv guest (pv).
>
> Each guest had 4 cpus and 2G memory and was launched inside a 512M
> memory container. The command line was "./bm -f 4 -w 4 -t 60" (run 4
> faulting threads and 4 working threads for a minute).
>
> Below is the total amount of "work" each guest managed to do
> (average of 10 runs):
>
>             total work      std error
>  master:  122789420615    (3818565029)
>  nonpv:   138455939001     (773774299)
>  pv:      234351846135   (10461117116)
>
> Changes:
>  v1->v2
>   Use MSR instead of hypercall.
>   Move most of the code into an arch-independent place.
>   Halt inside a guest instead of doing a "wait for page" hypercall if
>   preemption is disabled.
>  v2->v3
>   Use MSR from range 0x4b564dxx.
>   Add slot version tracking.
>   Support migration by restarting all guest processes after migration.
>   Drop the patch that tracked preemptability for non-preemptable
>   kernels due to performance concerns. Send async PF to
>   non-preemptable guests only when the vcpu is executing userspace
>   code.
>  v3->v4
>   Provide an alternative page fault handler in the PV guest instead of
>   adding a hook to the standard page fault handler and patching it out
>   on non-PV guests.
>   Allow only a limited number of outstanding async page faults per
>   vcpu.
>   Unify gfn_to_pfn and gfn_to_pfn_async code.
>   Cancel outstanding slow work on reset.
>  v4->v5
>   Move async pv cpu initialization into a cpu hotplug notifier.
>   Use GFP_NOWAIT instead of GFP_ATOMIC for allocations that shouldn't
>   sleep.
>   Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before
>   changing cr3 back.
>  v5->v6
>   Too many to list; only the major changes are here.
>   Replace slow work with work queues.
>   Halt vcpu for non-pv guests.
>   Handle async PF in nested SVM mode.
>   Do not prefault swapped-in page for the non-tdp case.
>  v6->v7
>   Fix the "GUP fail in work thread" problem.
>   Do prefault only if the mmu is in direct map mode.
>   Use cpu->request to ask for vcpu halt (drop the optimization that
>   tried to skip non-present apf injection if the page is swapped in
>   before the next vmentry).
>   Keep track of synthetic halt in separate state to prevent it from
>   leaking during migration.
>   Fix memslot tracking problems.
>   More documentation.
>   Other small comments are addressed.
>
> Gleb Natapov (12):
>   Add get_user_pages() variant that fails if major fault is required.
>   Halt vcpu if page it tries to access is swapped out.
>   Retry fault before vmentry
>   Add memory slot versioning and use it to provide fast guest write
>     interface
>   Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
>   Add PV MSR to enable asynchronous page faults delivery.
>   Add async PF initialization to PV guest.
>   Handle async PF in a guest.
>   Inject asynchronous page fault into a PV guest if page is swapped
>     out.
>   Handle async PF in non preemptable context
>   Let host know whether the guest can handle async PF in non-userspace
>     context.
>   Send async PF when guest is not in userspace too.
>
>  Documentation/kernel-parameters.txt |    3 +
>  Documentation/kvm/cpuid.txt         |    3 +
>  Documentation/kvm/msr.txt           |   36 ++++-
>  arch/x86/include/asm/kvm_host.h     |   28 +++-
>  arch/x86/include/asm/kvm_para.h     |   24 +++
>  arch/x86/include/asm/traps.h        |    1 +
>  arch/x86/kernel/entry_32.S          |   10 +
>  arch/x86/kernel/entry_64.S          |    3 +
>  arch/x86/kernel/kvm.c               |  315 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/kvmclock.c          |   13 +--
>  arch/x86/kvm/Kconfig                |    1 +
>  arch/x86/kvm/Makefile               |    1 +
>  arch/x86/kvm/mmu.c                  |   61 ++++++-
>  arch/x86/kvm/paging_tmpl.h          |    8 +-
>  arch/x86/kvm/svm.c                  |   45 ++++-
>  arch/x86/kvm/x86.c                  |  192 +++++++++++++++-
>  fs/ncpfs/mmap.c                     |    2 +
>  include/linux/kvm.h                 |    1 +
>  include/linux/kvm_host.h            |   39 +++++
>  include/linux/kvm_types.h           |    7 +
>  include/linux/mm.h                  |    5 +
>  include/trace/events/kvm.h          |   95 +++++++++++
>  mm/filemap.c                        |    3 +
>  mm/memory.c                         |   31 +++-
>  mm/shmem.c                          |    8 +-
>  virt/kvm/Kconfig                    |    3 +
>  virt/kvm/async_pf.c                 |  213 +++++++++++++++++++
>  virt/kvm/async_pf.h                 |   36 ++++
>  virt/kvm/kvm_main.c                 |  132 ++++++++---
>  29 files changed, 1255 insertions(+), 64 deletions(-)
>  create mode 100644 virt/kvm/async_pf.c
>  create mode 100644 virt/kvm/async_pf.h
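>
> For reference, the way a guest turns the feature on (patch "Add PV MSR
> to enable asynchronous page faults delivery") is a single wrmsr of a
> 64-byte aligned address plus an enable bit, per the Documentation/
> kvm/msr.txt update in this series. The snippet below only computes and
> prints the value a guest would write; it does not execute wrmsr, which
> needs ring 0 inside a guest, and the MSR number and constant names
> should be taken from the series' msr.txt, not from here.
>
> === apf_msr_sketch.c (illustration only) ===
>
> #include <stdio.h>
> #include <stdint.h>
>
> #define MSR_KVM_ASYNC_PF_EN  0x4b564d02	/* from the 0x4b564dxx PV range */
> #define KVM_ASYNC_PF_ENABLED (1ULL << 0)
>
> /* 64-byte aligned per-vcpu area the host writes the fault reason to. */
> static uint64_t apf_reason[8] __attribute__((aligned(64)));
>
> int main(void)
> {
> 	/* A real guest passes the *physical* address of apf_reason and
> 	 * performs this wrmsr once per vcpu. */
> 	uint64_t val = (uint64_t)(uintptr_t)apf_reason | KVM_ASYNC_PF_ENABLED;
>
> 	printf("wrmsr(0x%x, 0x%llx) enables async PF delivery\n",
> 	       MSR_KVM_ASYNC_PF_EN, (unsigned long long)val);
> 	return 0;
> }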
>
> === benchmark.c ===
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <pthread.h>
>
> #define FAULTING_THREADS 1
> #define WORKING_THREADS 1
> #define TIMEOUT 5
> #define MEMORY (1024*1024*1024)
>
> pthread_barrier_t barrier;
> volatile int stop;
> size_t pages;
>
> /* Touch random pages of the array to generate host page faults. */
> void *fault_thread(void *p)
> {
> 	char *mem = p;
>
> 	pthread_barrier_wait(&barrier);
>
> 	while (!stop)
> 		mem[(random() % pages) << 12] = 10;
>
> 	pthread_barrier_wait(&barrier);
>
> 	return NULL;
> }
>
> /* Do "work": increment a per-thread counter as fast as possible. */
> void *work_thread(void *p)
> {
> 	unsigned long *i = p;
>
> 	pthread_barrier_wait(&barrier);
>
> 	while (!stop)
> 		(*i)++;
>
> 	pthread_barrier_wait(&barrier);
>
> 	return NULL;
> }
>
> int main(int argc, char **argv)
> {
> 	int ft = FAULTING_THREADS, wt = WORKING_THREADS;
> 	unsigned int timeout = TIMEOUT;
> 	size_t mem = MEMORY;
> 	void *buf;
> 	int i, opt, verbose = 0;
> 	pthread_t t;
> 	pthread_attr_t pattr;
> 	unsigned long *res, sum = 0;
>
> 	while ((opt = getopt(argc, argv, "f:w:m:t:v")) != -1) {
> 		switch (opt) {
> 		case 'f':
> 			ft = atoi(optarg);
> 			break;
> 		case 'w':
> 			wt = atoi(optarg);
> 			break;
> 		case 'm':
> 			mem = atol(optarg);
> 			break;
> 		case 't':
> 			timeout = atoi(optarg);
> 			break;
> 		case 'v':
> 			verbose++;
> 			break;
> 		default:
> 			fprintf(stderr, "Usage: %s [-f num] [-w num] [-m bytes] [-t secs] [-v]\n",
> 				argv[0]);
> 			exit(1);
> 		}
> 	}
>
> 	if (verbose)
> 		printf("fault=%d work=%d mem=%zu timeout=%u\n",
> 		       ft, wt, mem, timeout);
>
> 	pages = mem >> 12;
> 	if (posix_memalign(&buf, 4096, pages << 12)) {
> 		perror("posix_memalign");
> 		exit(1);
> 	}
> 	res = calloc(wt, sizeof(unsigned long));
>
> 	pthread_attr_init(&pattr);
> 	pthread_barrier_init(&barrier, NULL, ft + wt + 1);
>
> 	for (i = 0; i < ft; i++) {
> 		pthread_create(&t, &pattr, fault_thread, buf);
> 		pthread_detach(t);
> 	}
>
> 	for (i = 0; i < wt; i++) {
> 		pthread_create(&t, &pattr, work_thread, &res[i]);
> 		pthread_detach(t);
> 	}
>
> 	/* prefault memory */
> 	memset(buf, 0, pages << 12);
> 	printf("start\n");
>
> 	/* release all threads, then re-arm the barrier for the stop
> 	 * handshake below */
> 	pthread_barrier_wait(&barrier);
>
> 	pthread_barrier_destroy(&barrier);
> 	pthread_barrier_init(&barrier, NULL, ft + wt + 1);
>
> 	sleep(timeout);
> 	stop = 1;
>
> 	pthread_barrier_wait(&barrier);
>
> 	for (i = 0; i < wt; i++) {
> 		sum += res[i];
> 		printf("worker %d: %lu\n", i, res[i]);
> 	}
> 	printf("total: %lu\n", sum);
>
> 	return 0;
> }

--
	Gleb.