Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:

> On 16/05/2016 18:51, Nadav Amit wrote:
>> Thanks! I appreciate it.
>>
>> I think your experiment with global paging just corroborates that the
>> latency is caused by TLB misses. I measured TLB misses (and especially
>> STLB misses) in other experiments, but not in this one. I will run some
>> more experiments, specifically to test how AMD behaves.
>
> I'm curious about AMD too now...
>
> with invlpg:        285,639,427
> with full flush:    584,419,299
> invlpg only          70,681,128
> full flushes only   265,238,766
> access net          242,538,804
> w/full flush net    319,180,533
> w/invlpg net        214,958,299
>
> Roughly the same with and without pte.g. So AMD behaves as it should.
>
>> I should note this is a byproduct of a study I did, and it is not as if
>> I was looking for strange behaviors (no more validation papers for me!).
>>
>> The strangest thing is that on bare metal I don’t see this phenomenon,
>> so I doubt it is a CPU “feature”. Once we understand it, at the very
>> least it may affect the recommended value of
>> “tlb_single_page_flush_ceiling”, which controls when the kernel performs
>> a full TLB flush vs. selective flushes.
>
> Do you have a kernel module to reproduce the test on bare metal? (/me is
> lazy).

It came to my mind that I never told you what eventually turned out to be
the issue. (Yes, I know it is a very old thread, but you may still be
interested.)

It turns out that Intel has something called “page fracturing”: once the
TLB caches a translation that came from a 2MB guest page backed by a 4KB
host page, INVLPG ends up flushing the entire TLB.

I guess they need to do it to follow SDM 4.10.4.1 (regarding pages larger
than 4 KBytes):

  “The INVLPG instruction and page faults provide the same assurances that
  they provide when a single TLB entry is used: they invalidate all TLB
  entries corresponding to the translation specified by the paging
  structures.”

Thanks again for your help,
Nadav
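
P.S. Since you asked about reproducing the test on bare metal: below is a
rough, hypothetical sketch of the kind of kernel-module measurement
involved, not the actual code from my experiments. The module and function
names (tlbtest_init, flush_one, flush_all) are made up, it assumes x86-64,
and it disables interrupts and preemption around the timed loops. One
caveat: kernel mappings are normally global, so the plain CR3 reload below
does not evict them; a faithful full-flush measurement would toggle
CR4.PGE or target a non-global user mapping.

/*
 * Hypothetical sketch only (not the module used for the numbers above):
 * compare the cost of INVLPG + re-access against a full CR3-reload flush
 * + re-access on bare metal. Assumes x86-64.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>

#define ITERS 100000

static inline u64 rdtsc_serialized(void)
{
	u32 lo, hi;

	/* LFENCE keeps RDTSC from executing ahead of earlier instructions. */
	asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi) : : "memory");
	return ((u64)hi << 32) | lo;
}

static inline void flush_one(const volatile void *addr)
{
	asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}

static inline void flush_all(void)
{
	unsigned long cr3;

	/*
	 * Reloading CR3 flushes non-global TLB entries. Kernel mappings
	 * are normally global, so this does NOT evict them; a real test
	 * would toggle CR4.PGE or use a user mapping instead.
	 */
	asm volatile("mov %%cr3, %0; mov %0, %%cr3"
		     : "=r" (cr3) : : "memory");
}

static int __init tlbtest_init(void)
{
	volatile char *page;
	unsigned long flags;
	u64 start, t_invlpg, t_full;
	int i;

	page = vmalloc(PAGE_SIZE);	/* vmalloc gives a 4KB kernel mapping */
	if (!page)
		return -ENOMEM;

	local_irq_save(flags);
	preempt_disable();

	start = rdtsc_serialized();
	for (i = 0; i < ITERS; i++) {
		flush_one(page);
		(void)page[0];		/* re-access to refill the TLB */
	}
	t_invlpg = rdtsc_serialized() - start;

	start = rdtsc_serialized();
	for (i = 0; i < ITERS; i++) {
		flush_all();
		(void)page[0];
	}
	t_full = rdtsc_serialized() - start;

	preempt_enable();
	local_irq_restore(flags);

	pr_info("tlbtest: invlpg+access %llu cycles, full flush+access %llu cycles\n",
		t_invlpg, t_full);

	vfree((void *)page);
	return -EAGAIN;		/* nothing to keep resident; refuse to load */
}

module_init(tlbtest_init);
MODULE_LICENSE("GPL");

The results land in dmesg on insmod; the module intentionally fails to
load so there is nothing to rmmod afterwards.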