Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> · Mon, 06 Nov 2017 10:14:41 +0100

Wanpeng Li <kernellwp@xxxxxxxxx> writes:

> 2017-08-03 0:09 GMT+08:00 Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>:
>> Changes since v9:
>> - Rebase to 4.13-rc3.
>> - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
>>   functional dependencies on this patch so the series can go through a different tree
>>   (and it actually belongs to x86 if I got Ingo's comment right).
>> - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
>> - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
>>   hyperv_flush_tlb_others() [Andy Shevchenko]
>> - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
>>   reported by kbuild test robot (#include <asm/io.h>)
>> - Add Steven's 'Reviewed-by:' to PATCH9.
>>
>> Original description:
>>
>> Hyper-V supports hypercalls for doing local and remote TLB flushing and
>> gives its guests hints when using hypercall is preferred. While doing
>> hypercalls for local TLB flushes is probably not practical (and is not
>> being suggested by modern Hyper-V versions) remote TLB flush with a
>> hypercall brings significant improvement.
>>
>> To test the series I wrote a special 'TLB thrasher': on a 16 vCPU guest I
>> was creating 32 threads which were doing 100000 mmap/munmaps each on some
>> big file. Here are the results:
>>
>> Before:
>> # time ./pthread_mmap ./randfile
>> real    3m33.118s
>> user    0m3.698s
>> sys     3m16.624s
>>
>> After:
>> # time ./pthread_mmap ./randfile
>> real    2m19.920s
>> user    0m2.662s
>> sys     2m9.948s
>>
>> This series brings a number of small improvements along the way: fast
>> hypercall implementation and using it for event signaling, rep hypercalls
>> implementation, hyperv tracing subsystem (which only traces the newly added
>> remote TLB flush for now).
>>
>
> Hi Vitaly,
>
> Could you attach your benchmark? I'm interested in trying the
> implementation in paravirt KVM.
>

Oh, this would be cool :) I briefly discussed the idea with Radim (one of
the KVM maintainers) during the last KVM Forum and he wasn't opposed to
it. I need to talk to Paolo too. The good thing is that we have everything
in place for guests now (HAVE_RCU_TABLE_FREE is enabled globally on x86).

Please see the microbenchmark attached. Adjust the defines at the beginning
to match your needs. It is nothing smart, basically just a TLB
thrasher.

In theory, the best result is achieved when we're overcommitting the host
by running multiple vCPUs on each pCPU. In this case PV TLB flush avoids
touching vCPUs which are not scheduled and avoids the wait on the main
CPU.

-- 
  Vitaly

#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define nthreads 48
#define pagecount 16384
#define nrounds 1000
#define nchunks 20
#define PAGE_SIZE 4096

int fd;
unsigned long v;

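/* Worker: repeatedly map, touch and unmap nchunks windows of the file. */
void *threadf(void *ptr)
{
	unsigned long *addr[nchunks];
	int i, j;
	struct timespec ts = {0};

	(void)ptr;	/* unused */

	/* Short random pause (below 1024 ns) so the threads drift apart. */
	ts.tv_nsec = random() % 1024;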

	for (j = 0; j < nrounds; j++) {
		for (i = 0; i < nchunks; i++) {
			addr[i] = mmap(NULL, PAGE_SIZE * pagecount, PROT_READ, MAP_SHARED, fd, i * PAGE_SIZE);
			if (addr[i] == MAP_FAILED) {
				perror("mmap");
				exit(1);
			}
		}

		nanosleep(&ts, NULL);

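		/*
		 * Read the first word of every mapping so the page is faulted in.
		 * v is only a sink; the unsynchronized updates are racy, but the
		 * value is never used.
		 */
		for (i = 0; i < nchunks; i++) {
			v += *addr[i];
		}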

		nanosleep(&ts, NULL);

		for (i = 0; i < nchunks; i++) {
			munmap(addr[i], PAGE_SIZE * pagecount);
		}
	}

	return NULL;
}

int main(int argc, char *argv[]) {
	pthread_t thr[nthreads];
	int i;

	srandom(time(NULL));

	if (argc < 2) {
		fprintf(stderr, "usage: %s <some-big-file>\n", argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < nthreads; i++) {
		if(pthread_create(&thr[i], NULL, threadf, NULL)) {
			fprintf(stderr, "pthread_create\n");
			exit(1);
		}
	}

	for (i = 0; i < nthreads; i++) {
		if(pthread_join(thr[i], NULL)) {
			fprintf(stderr, "pthread_join\n");
			exit(1);
		}
	}

	return 0;
}