On 14/05/2016 11:35, Nadav Amit wrote:
> I encountered a strange phenomenon and I would appreciate your sanity check
> and opinion. It looks as if 'invlpg' that runs in a VM causes a very broad
> flush.
>
> I created a small kvm-unit-test (below) to show what I am talking about. The
> test touches 50 pages, and then either: (1) runs a full flush, (2) runs
> invlpg on an arbitrary (other) address, or (3) runs a memory barrier.
>
> It appears that the execution time of the test is indeed determined by TLB
> misses, since the runtime of the memory barrier flavor is considerably lower.

Did you check the performance counters?  Another explanation is that there
are no TLB misses in the invlpg case, and that CR3 writes are optimized in
such a way that they do not incur TLB misses either.  (Disclaimer: I didn't
check the performance counters to prove the alternative theory ;)).

> What I find strange is that if I compute the net access time for tests 1 & 2,
> by subtracting the time of the flushes, the time is almost identical. I am
> aware that invlpg flushes the page-walk caches, but I would still expect the
> invlpg flavor to run considerably faster than the full-flush flavor.

That's interesting.  I guess you're using EPT, because I get very similar
numbers on an Ivy Bridge laptop:

    with invlpg:          902,224,568
    with full flush:      880,103,513
    invlpg only           113,186,461
    full flushes only     100,236,620
    access net            104,454,125
    w/full flush net      779,866,893
    w/invlpg net          789,038,107

(commas added for readability).

Out of curiosity I tried making all pages global (patch after my signature).
Both invlpg and writes to CR3 become much faster, but invlpg now is faster
than full flush, even though in theory it should be the opposite...

    with invlpg:          223,079,661
    with full flush:      294,280,788
    invlpg only           126,236,334
    full flushes only     107,614,525
    access net             90,830,503
    w/full flush net      186,666,263
    w/invlpg net           96,843,327

Thanks for the interesting test!

Paolo

diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 7ce7bbc..3b9b81a 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -2,6 +2,7 @@
 #include "vm.h"
 #include "libcflat.h"
 
+#define PTE_GLOBAL 256
 #define PAGE_SIZE 4096ul
 #ifdef __x86_64__
 #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
@@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
 				  void *virt)
 {
 	return install_pte(cr3, 2, virt,
-			   phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
+			   phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
 }
 
 unsigned long *install_page(unsigned long *cr3,
 			    unsigned long phys,
 			    void *virt)
 {
-	return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
+	return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
 }
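One caveat about the patch above: the PTE global bit (bit 8, the PTE_GLOBAL
define) only takes effect while CR4.PGE (bit 7) is set; I haven't
double-checked whether setup_vm() already enables it.  If it doesn't, a
minimal sketch of the missing step could look like this (read_cr4() and
write_cr4() are the usual kvm-unit-tests helpers; the X86_CR4_PGE name and
the helper are mine):

/* Hypothetical helper: PTE.G is ignored unless CR4.PGE is set, and only
 * then do global translations survive a CR3 write. */
#define X86_CR4_PGE (1ul << 7)

static void enable_global_pages(void)
{
	write_cr4(read_cr4() | X86_CR4_PGE);
}

Note that once PGE is enabled, a plain CR3 write no longer flushes global
entries at all, which would be consistent with the "with full flush" number
dropping so much in the second set of results above.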
> Am I missing something?
>
>
> On my Haswell EP I get the following results:
>
> with invlpg:          948965249
> with full flush:      1047927009
> invlpg only           127682028
> full flushes only     224055273
> access net            107691277  --> considerably lower than w/flushes
> w/full flush net      823871736
> w/invlpg net          821283221  --> almost identical to full-flush net
>
> ---
>
> #include "libcflat.h"
> #include "fwcfg.h"
> #include "vm.h"
> #include "smp.h"
>
> #define N_PAGES (50)
> #define ITERATIONS (500000)
> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
>
> int main(void)
> {
>         void *another_addr = (void*)0x50f9000;
>         int i, j;
>         unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
>                       t_access;
>         unsigned long cr3;
>         char v = 0;
>
>         setup_vm();
>
>         cr3 = read_cr3();
>
>         t_start = rdtsc();
>         for (i = 0; i < ITERATIONS; i++) {
>                 invlpg(another_addr);
>                 for (j = 0; j < N_PAGES; j++)
>                         v = buf[PAGE_SIZE * j];
>         }
>         t_single = rdtsc() - t_start;
>         printf("with invlpg: %lu\n", t_single);
>
>         t_start = rdtsc();
>         for (i = 0; i < ITERATIONS; i++) {
>                 write_cr3(cr3);
>                 for (j = 0; j < N_PAGES; j++)
>                         v = buf[PAGE_SIZE * j];
>         }
>         t_full = rdtsc() - t_start;
>         printf("with full flush: %lu\n", t_full);
>
>         t_start = rdtsc();
>         for (i = 0; i < ITERATIONS; i++)
>                 invlpg(another_addr);
>         t_single_only = rdtsc() - t_start;
>         printf("invlpg only %lu\n", t_single_only);
>
>         t_start = rdtsc();
>         for (i = 0; i < ITERATIONS; i++)
>                 write_cr3(cr3);
>         t_full_only = rdtsc() - t_start;
>         printf("full flushes only %lu\n", t_full_only);
>
>         t_start = rdtsc();
>         for (i = 0; i < ITERATIONS; i++) {
>                 for (j = 0; j < N_PAGES; j++)
>                         v = buf[PAGE_SIZE * j];
>                 mb();
>         }
>         t_access = rdtsc() - t_start;
>         printf("access net %lu\n", t_access);
>         printf("w/full flush net %lu\n", t_full - t_full_only);
>         printf("w/invlpg net %lu\n", t_single - t_single_only);
>
>         (void)v;
>         return 0;
> }
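P.S.: regarding "did you check the performance counters" above, here is a
rough, untested sketch of the kind of check I mean.  It counts dTLB load
misses around a region of interest with perf_event_open(2), so as written it
only works for a host-side (bare-metal) reproduction of the test; inside a
kvm-unit-test guest you would have to program the PMU MSRs directly, and the
counts are only meaningful if the PMU is virtualized.

/* Untested sketch: count dTLB load misses around a code region using
 * perf_event_open(2) on a Linux host. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_dtlb_miss_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HW_CACHE;
	/* cache id | (op << 8) | (result << 16), per perf_event_open(2) */
	attr.config = PERF_COUNT_HW_CACHE_DTLB |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	/* pid 0, cpu -1: count for the calling thread on any CPU. */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	int fd = open_dtlb_miss_counter();
	uint64_t misses;

	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* ... the access loop under test goes here ... */

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
		printf("dTLB load misses: %llu\n", (unsigned long long)misses);
	close(fd);
	return 0;
}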