Ignore this please. Something bad happened to the From: header.

On Thu, Oct 14, 2010 at 11:16:58AM +0200, y@xxxxxxxxxx wrote:
> From: Gleb Natapov <gleb@xxxxxxxxxx>
>
> KVM virtualizes guest memory by means of shadow pages or HW assistance
> like NPT/EPT. Not all memory used by a guest is mapped into the guest
> address space or even present in host memory at any given time. When a
> vcpu tries to access a memory page that is not mapped into the guest
> address space, KVM is notified about it. KVM maps the page into the
> guest address space and resumes vcpu execution. If the page is swapped
> out from host memory, vcpu execution is suspended until the page is
> swapped in again. This is inefficient, since the vcpu could do other
> work (run another task or serve interrupts) while the page is being
> swapped in.
>
> The patch series tries to mitigate this problem by introducing two
> mechanisms. The first one is used with non-PV guests and works like
> this: when a vcpu tries to access a swapped-out page it is halted, and
> the requested page is swapped in by another thread. That way the vcpu
> can still process interrupts while the io happens in parallel and,
> with any luck, an interrupt will cause the guest to schedule another
> task on the vcpu, so it has work to do instead of waiting for the page
> to be swapped in.
>
> The second mechanism introduces PV notification about swapped page
> state to a guest (asynchronous page fault). Instead of halting the
> vcpu upon access to a swapped-out page and hoping that some interrupt
> will cause a reschedule, we immediately inject an asynchronous page
> fault into the vcpu. A PV-aware guest knows that upon receiving such
> an exception it should schedule another task to run on the vcpu. The
> current task is put to sleep until another kind of asynchronous page
> fault is received, notifying the guest that the page is now in host
> memory, so the task that waits for it can run again.
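>
> To make the guest side concrete, here is a minimal, self-contained
> sketch of the token bookkeeping such a guest needs: "page not present"
> records a token and puts the faulting task to sleep, "page ready"
> looks the token up and wakes the task. All names here are illustrative
> only; the real handler in this series lives in arch/x86/kernel/kvm.c
> and of course sleeps and wakes real tasks instead of printing.
>
> === apf_token_sketch.c (illustration only) ===
>
> #include <stdio.h>
> #include <stdlib.h>
>
> #define APF_HASH_SIZE 32
>
> struct apf_wait {
> 	unsigned int token;	/* token the host sent with the fault */
> 	struct apf_wait *next;
> };
>
> static struct apf_wait *apf_hash[APF_HASH_SIZE];
>
> /* "Page not present": remember the token; a real guest would now put
>  * the current task to sleep and schedule another one on this vcpu. */
> void apf_page_not_present(unsigned int token)
> {
> 	struct apf_wait *w = calloc(1, sizeof(*w));
>
> 	if (!w)
> 		abort();
> 	w->token = token;
> 	w->next = apf_hash[token % APF_HASH_SIZE];
> 	apf_hash[token % APF_HASH_SIZE] = w;
> 	printf("token %u: task sleeps, vcpu runs something else\n", token);
> }
>
> /* "Page ready": forget the token and wake the task waiting on it. */
> void apf_page_ready(unsigned int token)
> {
> 	struct apf_wait **p = &apf_hash[token % APF_HASH_SIZE];
>
> 	for (; *p; p = &(*p)->next) {
> 		if ((*p)->token == token) {
> 			struct apf_wait *w = *p;
>
> 			*p = w->next;
> 			free(w);
> 			printf("token %u: page is in, task wakes\n", token);
> 			return;
> 		}
> 	}
> 	/* "ready" can race with "not present"; the real code keeps a
> 	 * marker for a token that becomes ready before anyone sleeps. */
> 	printf("token %u: no sleeper yet\n", token);
> }
>
> int main(void)
> {
> 	apf_page_not_present(42);	/* guest task blocks on token 42 */
> 	apf_page_ready(42);		/* host swapped the page back in */
> 	return 0;
> }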
>
> To measure the performance benefits I use a simple benchmark program
> (below) that starts a number of threads. Some of them do work
> (increment a counter); others access a huge array at random locations,
> trying to generate host page faults. The size of the array is smaller
> than guest memory but bigger than host memory, so we are guaranteed
> that the host will swap out part of the array.
>
> I ran the benchmark on three setups: with current kvm.git (master),
> with my patch series + non-pv guest (nonpv), and with my patch series
> + pv guest (pv).
>
> Each guest had 4 cpus and 2G memory and was launched inside a 512M
> memory container. The command line was "./bm -f 4 -w 4 -t 60" (run 4
> faulting threads and 4 working threads for a minute).
>
> Below is the total amount of "work" each guest managed to do
> (average of 10 runs):
>
>             total work      std error
>  master:  122789420615    (3818565029)
>  nonpv:   138455939001     (773774299)
>  pv:      234351846135   (10461117116)
>
> Changes:
>  v1->v2
>   Use MSR instead of hypercall.
>   Move most of the code into an arch-independent place.
>   Halt inside a guest instead of doing a "wait for page" hypercall if
>   preemption is disabled.
>  v2->v3
>   Use MSR from range 0x4b564dxx.
>   Add slot version tracking.
>   Support migration by restarting all guest processes after migration.
>   Drop the patch that tracked preemptability for non-preemptable
>   kernels due to performance concerns. Send async PF to
>   non-preemptable guests only when the vcpu is executing userspace
>   code.
>  v3->v4
>   Provide an alternative page fault handler in the PV guest instead of
>   adding a hook to the standard page fault handler and patching it out
>   on non-PV guests.
>   Allow only a limited number of outstanding async page faults per
>   vcpu.
>   Unify gfn_to_pfn and gfn_to_pfn_async code.
>   Cancel outstanding slow work on reset.
>  v4->v5
>   Move async pv cpu initialization into a cpu hotplug notifier.
>   Use GFP_NOWAIT instead of GFP_ATOMIC for allocations that shouldn't
>   sleep.
>   Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before
>   changing cr3 back.
>  v5->v6
>   Too many to list; only the major changes are here.
>   Replace slow work with work queues.
>   Halt vcpu for non-pv guests.
>   Handle async PF in nested SVM mode.
>   Do not prefault swapped-in page for the non-tdp case.
>  v6->v7
>   Fix the "GUP fail in work thread" problem.
>   Do prefault only if the mmu is in direct map mode.
>   Use cpu->request to ask for vcpu halt (drop the optimization that
>   tried to skip non-present apf injection if the page is swapped in
>   before the next vmentry).
>   Keep track of synthetic halt in separate state to prevent it from
>   leaking during migration.
>   Fix memslot tracking problems.
>   More documentation.
>   Other small comments are addressed.
>
> Gleb Natapov (12):
>   Add get_user_pages() variant that fails if major fault is required.
>   Halt vcpu if page it tries to access is swapped out.
>   Retry fault before vmentry
>   Add memory slot versioning and use it to provide fast guest write
>     interface
>   Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
>   Add PV MSR to enable asynchronous page faults delivery.
>   Add async PF initialization to PV guest.
>   Handle async PF in a guest.
>   Inject asynchronous page fault into a PV guest if page is swapped
>     out.
>   Handle async PF in non preemptable context
>   Let host know whether the guest can handle async PF in non-userspace
>     context.
>   Send async PF when guest is not in userspace too.
>
>  Documentation/kernel-parameters.txt |    3 +
>  Documentation/kvm/cpuid.txt         |    3 +
>  Documentation/kvm/msr.txt           |   36 ++++-
>  arch/x86/include/asm/kvm_host.h     |   28 +++-
>  arch/x86/include/asm/kvm_para.h     |   24 +++
>  arch/x86/include/asm/traps.h        |    1 +
>  arch/x86/kernel/entry_32.S          |   10 +
>  arch/x86/kernel/entry_64.S          |    3 +
>  arch/x86/kernel/kvm.c               |  315 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/kvmclock.c          |   13 +--
>  arch/x86/kvm/Kconfig                |    1 +
>  arch/x86/kvm/Makefile               |    1 +
>  arch/x86/kvm/mmu.c                  |   61 ++++++-
>  arch/x86/kvm/paging_tmpl.h          |    8 +-
>  arch/x86/kvm/svm.c                  |   45 ++++-
>  arch/x86/kvm/x86.c                  |  192 +++++++++++++++-
>  fs/ncpfs/mmap.c                     |    2 +
>  include/linux/kvm.h                 |    1 +
>  include/linux/kvm_host.h            |   39 +++++
>  include/linux/kvm_types.h           |    7 +
>  include/linux/mm.h                  |    5 +
>  include/trace/events/kvm.h          |   95 +++++++++++
>  mm/filemap.c                        |    3 +
>  mm/memory.c                         |   31 +++-
>  mm/shmem.c                          |    8 +-
>  virt/kvm/Kconfig                    |    3 +
>  virt/kvm/async_pf.c                 |  213 +++++++++++++++++++
>  virt/kvm/async_pf.h                 |   36 ++++
>  virt/kvm/kvm_main.c                 |  132 ++++++++---
>  29 files changed, 1255 insertions(+), 64 deletions(-)
>  create mode 100644 virt/kvm/async_pf.c
>  create mode 100644 virt/kvm/async_pf.h
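>
> For reference, the way a guest turns the feature on (patch "Add PV MSR
> to enable asynchronous page faults delivery") is a single wrmsr of a
> 64-byte aligned address plus an enable bit, per the Documentation/
> kvm/msr.txt update in this series. The snippet below only computes and
> prints the value a guest would write; it does not execute wrmsr, which
> needs ring 0 inside a guest, and the MSR number and constant names
> should be taken from the series' msr.txt, not from here.
>
> === apf_msr_sketch.c (illustration only) ===
>
> #include <stdio.h>
> #include <stdint.h>
>
> #define MSR_KVM_ASYNC_PF_EN  0x4b564d02	/* from the 0x4b564dxx PV range */
> #define KVM_ASYNC_PF_ENABLED (1ULL << 0)
>
> /* 64-byte aligned per-vcpu area the host writes the fault reason to. */
> static uint64_t apf_reason[8] __attribute__((aligned(64)));
>
> int main(void)
> {
> 	/* A real guest passes the *physical* address of apf_reason and
> 	 * performs this wrmsr once per vcpu. */
> 	uint64_t val = (uint64_t)(uintptr_t)apf_reason | KVM_ASYNC_PF_ENABLED;
>
> 	printf("wrmsr(0x%x, 0x%llx) enables async PF delivery\n",
> 	       MSR_KVM_ASYNC_PF_EN, (unsigned long long)val);
> 	return 0;
> }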
>
> === benchmark.c ===
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <pthread.h>
>
> #define FAULTING_THREADS 1
> #define WORKING_THREADS 1
> #define TIMEOUT 5
> #define MEMORY (1024*1024*1024)
>
> pthread_barrier_t barrier;
> volatile int stop;
> size_t pages;
>
> /* Touch random pages of the array to generate host page faults. */
> void *fault_thread(void *p)
> {
> 	char *mem = p;
>
> 	pthread_barrier_wait(&barrier);
>
> 	while (!stop)
> 		mem[(random() % pages) << 12] = 10;
>
> 	pthread_barrier_wait(&barrier);
>
> 	return NULL;
> }
>
> /* Do "work": increment a per-thread counter as fast as possible. */
> void *work_thread(void *p)
> {
> 	unsigned long *i = p;
>
> 	pthread_barrier_wait(&barrier);
>
> 	while (!stop)
> 		(*i)++;
>
> 	pthread_barrier_wait(&barrier);
>
> 	return NULL;
> }
>
> int main(int argc, char **argv)
> {
> 	int ft = FAULTING_THREADS, wt = WORKING_THREADS;
> 	unsigned int timeout = TIMEOUT;
> 	size_t mem = MEMORY;
> 	void *buf;
> 	int i, opt, verbose = 0;
> 	pthread_t t;
> 	pthread_attr_t pattr;
> 	unsigned long *res, sum = 0;
>
> 	while ((opt = getopt(argc, argv, "f:w:m:t:v")) != -1) {
> 		switch (opt) {
> 		case 'f':
> 			ft = atoi(optarg);
> 			break;
> 		case 'w':
> 			wt = atoi(optarg);
> 			break;
> 		case 'm':
> 			mem = atol(optarg);
> 			break;
> 		case 't':
> 			timeout = atoi(optarg);
> 			break;
> 		case 'v':
> 			verbose++;
> 			break;
> 		default:
> 			fprintf(stderr, "Usage: %s [-f num] [-w num] [-m bytes] [-t secs] [-v]\n",
> 				argv[0]);
> 			exit(1);
> 		}
> 	}
>
> 	if (verbose)
> 		printf("fault=%d work=%d mem=%zu timeout=%u\n",
> 		       ft, wt, mem, timeout);
>
> 	pages = mem >> 12;
> 	if (posix_memalign(&buf, 4096, pages << 12)) {
> 		perror("posix_memalign");
> 		exit(1);
> 	}
> 	res = calloc(wt, sizeof(unsigned long));
>
> 	pthread_attr_init(&pattr);
> 	pthread_barrier_init(&barrier, NULL, ft + wt + 1);
>
> 	for (i = 0; i < ft; i++) {
> 		pthread_create(&t, &pattr, fault_thread, buf);
> 		pthread_detach(t);
> 	}
>
> 	for (i = 0; i < wt; i++) {
> 		pthread_create(&t, &pattr, work_thread, &res[i]);
> 		pthread_detach(t);
> 	}
>
> 	/* prefault memory */
> 	memset(buf, 0, pages << 12);
> 	printf("start\n");
>
> 	/* release all threads, then re-arm the barrier for the stop
> 	 * handshake below */
> 	pthread_barrier_wait(&barrier);
>
> 	pthread_barrier_destroy(&barrier);
> 	pthread_barrier_init(&barrier, NULL, ft + wt + 1);
>
> 	sleep(timeout);
> 	stop = 1;
>
> 	pthread_barrier_wait(&barrier);
>
> 	for (i = 0; i < wt; i++) {
> 		sum += res[i];
> 		printf("worker %d: %lu\n", i, res[i]);
> 	}
> 	printf("total: %lu\n", sum);
>
> 	return 0;
> }

--
	Gleb.