Thanks! I appreciate it.

I think your experiment with global paging just corroborates that the
latency is caused by TLB misses. I measured TLB misses (and especially
STLB misses) in other experiments, but not in this one. I will run some
more experiments, specifically to test how AMD behaves.

I should note this is a byproduct of a study I did; it is not as if I was
looking for strange behaviors (no more validation papers for me!). The
strangest thing is that on bare metal I don't see this phenomenon, so I
doubt it is a CPU "feature". Once we understand it, at the very least it
may affect the recommended value of "tlb_single_page_flush_ceiling",
which controls when the kernel performs a full TLB flush vs. selective
flushes.

Nadav

Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:

> On 14/05/2016 11:35, Nadav Amit wrote:
>> I encountered a strange phenomenon and I would appreciate your sanity
>> check and opinion. It looks as if an 'invlpg' that runs in a VM causes
>> a very broad flush.
>>
>> I created a small kvm-unit-test (below) to show what I am talking
>> about. The test touches 50 pages, and then either: (1) runs a full
>> flush, (2) runs invlpg on an arbitrary (other) address, or (3) runs a
>> memory barrier.
>>
>> It appears that the execution time of the test is indeed determined by
>> TLB misses, since the runtime of the memory-barrier flavor is
>> considerably lower.
>
> Did you check the performance counters? Another explanation is that
> there are no TLB misses, but CR3 writes are optimized in such a way
> that they do not incur TLB misses either. (Disclaimer: I didn't check
> the performance counters to prove the alternative theory ;)).
>
>> What I find strange is that if I compute the net access time for tests
>> 1 & 2, by deducting the time of the flushes, the times are almost
>> identical. I am aware that invlpg flushes the page-walk caches, but I
>> would still expect the invlpg flavor to run considerably faster than
>> the full-flush flavor.
>
> That's interesting. I guess you're using EPT, because I get very
> similar numbers on an Ivy Bridge laptop:
>
> with invlpg: 902,224,568
> with full flush: 880,103,513
> invlpg only 113,186,461
> full flushes only 100,236,620
> access net 104,454,125
> w/full flush net 779,866,893
> w/invlpg net 789,038,107
>
> (commas added for readability).
>
> Out of curiosity I tried making all pages global (patch after my
> signature). Both invlpg and the write to CR3 become much faster, but
> invlpg now is faster than full flush, even though in theory it
> should be the opposite...
>
> with invlpg: 223,079,661
> with full flush: 294,280,788
> invlpg only 126,236,334
> full flushes only 107,614,525
> access net 90,830,503
> w/full flush net 186,666,263
> w/invlpg net 96,843,327
>
> Thanks for the interesting test!
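A side note on the global-page experiment: architecturally, a mov to CR3
does not invalidate TLB entries for global pages, while invlpg invalidates
the targeted translation whether it is global or not; flushing the global
entries wholesale requires toggling CR4.PGE. Below is a minimal sketch of
that distinction, assuming read_cr4()/write_cr4() helpers along the lines
of kvm-unit-tests' processor.h (the helper names and the X86_CR4_PGE
define are assumptions here, not code taken from the patch or the test):

#define X86_CR4_PGE	(1ul << 7)	/* CR4.PGE is bit 7 */

/*
 * Invalidate all TLB entries, including global ones. Changing CR4.PGE
 * flushes global translations, which a plain CR3 write leaves in place
 * (Intel SDM Vol. 3, "Invalidation of TLBs and Paging-Structure Caches").
 */
static inline void flush_tlb_all_global(void)
{
	unsigned long cr4 = read_cr4();

	write_cr4(cr4 & ~X86_CR4_PGE);	/* drops global entries too */
	write_cr4(cr4);			/* restore global-page enable */
}

This is only meant to illustrate why the two flush flavors are no longer
equivalent once PTE_GLOBAL is set; it is not part of the patch or the test
quoted below.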
>
> Paolo
>
> diff --git a/lib/x86/vm.c b/lib/x86/vm.c
> index 7ce7bbc..3b9b81a 100644
> --- a/lib/x86/vm.c
> +++ b/lib/x86/vm.c
> @@ -2,6 +2,7 @@
>  #include "vm.h"
>  #include "libcflat.h"
>
> +#define PTE_GLOBAL 256
>  #define PAGE_SIZE 4096ul
>  #ifdef __x86_64__
>  #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
> @@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
>  			    void *virt)
>  {
>  	return install_pte(cr3, 2, virt,
> -			   phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
> +			   phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
>  }
>
>  unsigned long *install_page(unsigned long *cr3,
>  			    unsigned long phys,
>  			    void *virt)
>  {
> -	return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
> +	return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
>  }
>
>
>> Am I missing something?
>>
>>
>> On my Haswell EP I get the following results:
>>
>> with invlpg: 948965249
>> with full flush: 1047927009
>> invlpg only 127682028
>> full flushes only 224055273
>> access net 107691277 --> considerably lower than w/flushes
>> w/full flush net 823871736
>> w/invlpg net 821283221 --> almost identical to full-flush net
>>
>> ---
>>
>>
>> #include "libcflat.h"
>> #include "fwcfg.h"
>> #include "vm.h"
>> #include "smp.h"
>>
>> #define N_PAGES (50)
>> #define ITERATIONS (500000)
>> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
>>
>> int main(void)
>> {
>> 	void *another_addr = (void *)0x50f9000;
>> 	int i, j;
>> 	unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
>> 		      t_access;
>> 	unsigned long cr3;
>> 	char v = 0;
>>
>> 	setup_vm();
>>
>> 	cr3 = read_cr3();
>>
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++) {
>> 		invlpg(another_addr);
>> 		for (j = 0; j < N_PAGES; j++)
>> 			v = buf[PAGE_SIZE * j];
>> 	}
>> 	t_single = rdtsc() - t_start;
>> 	printf("with invlpg: %lu\n", t_single);
>>
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++) {
>> 		write_cr3(cr3);
>> 		for (j = 0; j < N_PAGES; j++)
>> 			v = buf[PAGE_SIZE * j];
>> 	}
>> 	t_full = rdtsc() - t_start;
>> 	printf("with full flush: %lu\n", t_full);
>>
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++)
>> 		invlpg(another_addr);
>> 	t_single_only = rdtsc() - t_start;
>> 	printf("invlpg only %lu\n", t_single_only);
>>
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++)
>> 		write_cr3(cr3);
>> 	t_full_only = rdtsc() - t_start;
>> 	printf("full flushes only %lu\n", t_full_only);
>>
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++) {
>> 		for (j = 0; j < N_PAGES; j++)
>> 			v = buf[PAGE_SIZE * j];
>> 		mb();
>> 	}
>> 	t_access = rdtsc() - t_start;
>> 	printf("access net %lu\n", t_access);
>> 	printf("w/full flush net %lu\n", t_full - t_full_only);
>> 	printf("w/invlpg net %lu\n", t_single - t_single_only);
>>
>> 	(void)v;
>> 	return 0;
>> }
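For completeness, here is the kind of performance-counter check the
"did you check the performance counters?" question is about, as a
minimal bare-metal sketch (this is not what I ran, and it is not a
kvm-unit-test: it is plain Linux user space using perf_event_open()).
Note that the generic PERF_COUNT_HW_CACHE_DTLB event is mapped by the
kernel to a CPU-specific counter, so whether it reflects first-level
dTLB or STLB misses depends on that mapping; counting inside a guest
also requires a virtual PMU to be exposed.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define N_PAGES    50
#define ITERATIONS 500000
#define PAGE_SIZE  4096ul

int main(void)
{
	struct perf_event_attr attr;
	volatile char *buf;
	uint64_t misses;
	char v = 0;
	int fd, i, j;

	/* Count dTLB read misses for this thread, user space only. */
	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HW_CACHE;
	attr.config = PERF_COUNT_HW_CACHE_DTLB |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;
	attr.exclude_kernel = 1;
	attr.exclude_hv = 1;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	buf = mmap(NULL, N_PAGES * PAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* Same access pattern as the test above, without any flushes. */
	for (i = 0; i < ITERATIONS; i++)
		for (j = 0; j < N_PAGES; j++)
			v = buf[PAGE_SIZE * j];

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) {
		perror("read");
		return 1;
	}
	printf("dTLB read misses: %lu\n", (unsigned long)misses);

	(void)v;
	return 0;
}

Wrapping the flush variants of the loop with the same counter (or with an
STLB/page-walk event where the CPU provides one) should tell the full-flush
and invlpg cases apart directly.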