Re: x86: strange behavior of invlpg

Thanks! I appreciate it.

I think your experiment with global paging just corroborates that the
latency is caused by TLB misses. I measured TLB misses (and especially STLB
misses) in other experiments but not in this one. I will run some more
experiments, specifically to test how AMD behaves.
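
For reference, TLB-miss counts around a loop can be gathered from Linux
userspace with perf_event_open(2) along the lines of the sketch below. This is
only an illustrative host-side harness (the kvm-unit-test itself has no
syscalls, so the same access pattern would have to be run as a host program),
and STLB-specific counts need model-specific raw events rather than the
generic cache event used here:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_dtlb_miss_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    /* dTLB | read | miss, encoded as described in perf_event_open(2) */
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* count for the calling thread, on any CPU */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    long long misses;
    int fd = open_dtlb_miss_counter();

    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... page-touching loop under test goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &misses, sizeof(misses));
    printf("dTLB load misses: %lld\n", misses);
    close(fd);
    return 0;
}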

I should note this is a byproduct of a study I did, and it is not as if I were
looking for strange behaviors (no more validation papers for me!).

The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
it is a CPU “feature”. Once we understand it, at the very least it may affect
the recommended value of “tlb_single_page_flush_ceiling”, which controls when
the kernel performs a full TLB flush vs. selective flushes.
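
For reference, the decision that this ceiling controls looks roughly like the
sketch below (a simplification of arch/x86/mm/tlb.c, not the actual kernel
code; the real code also handles mm-wide flushes and remote CPUs, and the
value is tunable at runtime through debugfs). write_cr3()/read_cr3()/invlpg()
stand in for the kernel's own primitives and are the same helpers the test
below uses:

#define PAGE_SIZE 4096ul

unsigned long tlb_single_page_flush_ceiling = 33;    /* current default */

static void flush_tlb_range_sketch(unsigned long start, unsigned long end)
{
    unsigned long nr_pages = (end - start) / PAGE_SIZE;
    unsigned long addr;

    if (nr_pages > tlb_single_page_flush_ceiling) {
        /* one full (non-global) TLB flush: reload CR3 */
        write_cr3(read_cr3());
    } else {
        /* selective flush: one invlpg per page */
        for (addr = start; addr < end; addr += PAGE_SIZE)
            invlpg((void *)addr);
    }
}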

Nadav

Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:

> On 14/05/2016 11:35, Nadav Amit wrote:
>> I encountered a strange phenomenon and I would appreciate your sanity check
>> and opinion. It looks as if 'invlpg' that runs in a VM causes a very broad
>> flush.
>> 
>> I created a small kvm-unit-test (below) to show what I am talking about. The test
>> touches 50 pages, and then either: (1) runs a full flush, (2) runs invlpg on
>> an arbitrary (other) address, or (3) runs a memory barrier.
>> 
>> It appears that the execution time of the test is indeed determined by TLB
>> misses, since the runtime of the memory barrier flavor is considerably lower.
> 
> Did you check the performance counters?  Another explanation is that 
> there are no TLB misses, but CR3 writes are optimized in such a way 
> that they do not incur TLB misses either.  (Disclaimer: I didn't check 
> the performance counters to prove the alternative theory ;)).
> 
>> What I find strange is that if I compute the net access time for tests 1 & 2,
>> by deducting the time of the flushes, the time is almost identical. I am aware
>> that invlpg flushes the page-walk caches, but I would still expect the invlpg
>> flavor to run considerably faster than the full-flush flavor.
> 
> That's interesting.  I guess you're using EPT because I get very 
> similar numbers on an Ivy Bridge laptop:
> 
>  with invlpg:        902,224,568
>  with full flush:    880,103,513
>  invlpg only         113,186,461
>  full flushes only   100,236,620
>  access net          104,454,125
>  w/full flush net    779,866,893
>  w/invlpg net        789,038,107
> 
> (commas added for readability).
> 
> Out of curiosity I tried making all pages global (patch after my
> signature).  Both invlpg and write to CR3 become much faster, but
> invlpg now is faster than full flush, even though in theory it
> should be the opposite...
> 
>  with invlpg:        223,079,661
>  with full flush:    294,280,788
>  invlpg only         126,236,334
>  full flushes only   107,614,525
>  access net           90,830,503
>  w/full flush net    186,666,263
>  w/invlpg net         96,843,327
> 
> Thanks for the interesting test!
> 
> Paolo
> 
> diff --git a/lib/x86/vm.c b/lib/x86/vm.c
> index 7ce7bbc..3b9b81a 100644
> --- a/lib/x86/vm.c
> +++ b/lib/x86/vm.c
> @@ -2,6 +2,7 @@
> #include "vm.h"
> #include "libcflat.h"
> 
> +#define PTE_GLOBAL      256
> #define PAGE_SIZE 4096ul
> #ifdef __x86_64__
> #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
> @@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
> 				  void *virt)
> {
>     return install_pte(cr3, 2, virt,
> -		       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
> +		       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
> }
> 
> unsigned long *install_page(unsigned long *cr3,
> 			    unsigned long phys,
> 			    void *virt)
> {
> -    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
> +    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
> }
> 
> 
> 
>> Am I missing something?
>> 
>> 
>> On my Haswell EP I get the following results:
>> 
>> with invlpg:        948965249
>> with full flush:    1047927009
>> invlpg only         127682028
>> full flushes only   224055273
>> access net          107691277	--> considerably lower than w/flushes
>> w/full flush net    823871736
>> w/invlpg net        821283221	--> almost identical to full-flush net
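>> 
>> (The "net" lines are simply the combined run minus the flush-only run,
>> e.g. w/invlpg net = 948965249 - 127682028 = 821283221.)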
>> 
>> ---
>> 
>> 
>> #include "libcflat.h"
>> #include "fwcfg.h"
>> #include "vm.h"
>> #include "smp.h"
>> 
>> #define N_PAGES	(50)
>> #define ITERATIONS (500000)
>> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
>> 
>> int main(void)
>> {
>>    void *another_addr = (void*)0x50f9000;
>>    int i, j;
>>    unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
>> 		  t_access;
>>    unsigned long cr3;
>>    char v = 0;
>> 
>>    setup_vm();
>> 
>>    cr3 = read_cr3();
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++) {
>>        invlpg(another_addr);
>>        for (j = 0; j < N_PAGES; j++)
>>            v = buf[PAGE_SIZE * j];
>>    }
>>    t_single = rdtsc() - t_start;
>>    printf("with invlpg:        %lu\n", t_single);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++) {
>>        write_cr3(cr3);
>>        for (j = 0; j < N_PAGES; j++)
>>            v = buf[PAGE_SIZE * j];
>>    }
>>    t_full = rdtsc() - t_start;
>>    printf("with full flush:    %lu\n", t_full);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++)
>>        invlpg(another_addr);
>>    t_single_only = rdtsc() - t_start;
>>    printf("invlpg only         %lu\n", t_single_only);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++)
>>        write_cr3(cr3);
>>    t_full_only = rdtsc() - t_start;
>>    printf("full flushes only   %lu\n", t_full_only);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++) {
>>        for (j = 0; j < N_PAGES; j++)
>>            v = buf[PAGE_SIZE * j];
>>        mb();
>>    }
>>    t_access = rdtsc() - t_start;
>>    printf("access net          %lu\n", t_access);
>>    printf("w/full flush net    %lu\n", t_full - t_full_only);
>>    printf("w/invlpg net        %lu\n", t_single - t_single_only);
>> 
>>    (void)v;
>>    return 0;
>> }

