On 10/11/2014 11:45, Gleb Natapov wrote:
> > I tried making also the other shared MSRs the same between guest and
> > host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
> > has nothing to do.  That saves about 4-500 cycles on inl_from_qemu.  I
> > do want to dig out my old Core 2 and see how the new test fares, but it
> > really looks like your patch will be in 3.19.
> Please test on wide variety of HW before final decision.

Yes, definitely.

> Also it would
> be nice to ask Intel what is expected overhead. It is awesome if they
> mange to add EFER switching with non measurable overhead, but also hard
> to believe :)

So let's see what happens.  Sneak preview: the result is definitely worth
asking Intel about.

I ran these benchmarks with a stock 3.16.6 KVM; instead of patching KVM,
I patched kvm-unit-tests to set EFER.SCE in enable_nx (a rough sketch of
the tweak is in the P.S. below).  This makes it much simpler for others
to reproduce the results.  I only ran the inl_from_qemu test.

Perf stat reports that the processor goes from 0.46 to 0.66 instructions
per cycle, which is consistent with the improvement from 19k to 12k
cycles per iteration.

Unpatched KVM-unit-tests:

     3,385,586,563  cycles                    #  3.189 GHz                      [83.25%]
     2,475,979,685  stalled-cycles-frontend   #  73.13% frontend cycles idle    [83.37%]
     2,083,556,270  stalled-cycles-backend    #  61.54% backend  cycles idle    [66.71%]
     1,573,854,041  instructions              #   0.46  insns per cycle
                                              #   1.57  stalled cycles per insn [83.20%]

       1.108486526 seconds time elapsed

Patched KVM-unit-tests:

     3,252,297,378  cycles                    #  3.147 GHz                      [83.32%]
     2,010,266,184  stalled-cycles-frontend   #  61.81% frontend cycles idle    [83.36%]
     1,560,371,769  stalled-cycles-backend    #  47.98% backend  cycles idle    [66.51%]
     2,133,698,018  instructions              #   0.66  insns per cycle
                                              #   0.94  stalled cycles per insn [83.45%]

       1.072395697 seconds time elapsed

Playing with other events shows that the unpatched benchmark has an awful
load of TLB misses:

Unpatched:

            30,311  iTLB-loads
       464,641,844  dTLB-loads
        10,813,839  dTLB-load-misses          #     2.33% of all dTLB cache hits
        20,436,027  iTLB-load-misses          # 67421.16% of all iTLB cache hits

Patched:

         1,440,033  iTLB-loads
       640,970,836  dTLB-loads
         2,345,112  dTLB-load-misses          #     0.37% of all dTLB cache hits
           270,884  iTLB-load-misses          #    18.81% of all iTLB cache hits

This is 100% reproducible.  The meaning of the numbers is clearer if you
look up the raw event numbers in the Intel manuals:

- iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB
  [second-level TLB] hits. No page walk."

- iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that
  cause page walks."

So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.")
and friends show that the unpatched KVM wastes about 0.1 seconds more
than the patched KVM on page walks:

Unpatched:

        24,430,676  r449  (cycles on dTLB store miss page walks)
       196,017,693  r408  (cycles on dTLB load miss page walks)
       213,266,243  r485  (cycles on iTLB miss page walks)
    ------------------------
       433,714,612  total

Patched:

        22,583,440  r449  (cycles on dTLB store miss page walks)
        40,452,018  r408  (cycles on dTLB load miss page walks)
         2,115,981  r485  (cycles on iTLB miss page walks)
    ------------------------
        65,151,439  total

These 0.1 seconds are probably all spent on instructions that would
otherwise have been fast, since the slow instructions responsible for the
low IPC are the microcoded ones, including VMX and other privileged stuff.

Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM
and 3k in patched KVM.
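(For the record, all of these raw events can be fed straight to perf.
Something along these lines should reproduce the numbers above; the exact
invocation is an educated guess, assuming the usual x86-run wrapper and
that the inl_from_qemu subtest can be selected with -append:

    perf stat -e r1085,r185,r449,r408,r485,r20bd \
        ./x86-run x86/vmexit.flat -append inl_from_qemu

perf record -e r20bd on the same run, followed by perf report, breaks the
flushes down per function.)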
Let's see where they come from:

Unpatched:

+  98.97%  qemu-kvm  [kernel.kallsyms]  [k] native_write_msr_safe
+   0.70%  qemu-kvm  [kernel.kallsyms]  [k] page_fault

It's expected that most TLB misses happen just before a page fault (there
are also events to count how many TLB misses do result in a page fault,
if you care about that), and thus are accounted to the first instruction
of the exception handler.  We do not know what causes second-level TLB
_flushes_, but it's quite expected that you'll have a TLB miss after them,
and possibly a page fault.  And anyway, 98.97% of them coming from
native_write_msr_safe is totally anomalous.

A patched benchmark shows that no second-level TLB flush occurs after a
WRMSR:

+  72.41%  qemu-kvm  [kernel.kallsyms]  [k] page_fault
+   9.07%  qemu-kvm  [kvm_intel]        [k] vmx_flush_tlb
+   6.60%  qemu-kvm  [kernel.kallsyms]  [k] set_pte_vaddr_pud
+   5.68%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_mm_range
+   4.87%  qemu-kvm  [kernel.kallsyms]  [k] native_flush_tlb
+   1.36%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_page

So basically VMX EFER writes are optimized, while non-VMX EFER writes
cause a TLB flush, at least on a Sandy Bridge.  Ouch!

I'll try to reproduce on the Core 2 Duo soon, and ask Intel about it.

Paolo
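P.S. For completeness, the kvm-unit-tests tweak is essentially a one-liner.
A rough sketch, not the exact patch: it assumes the rdmsr/wrmsr helpers
from lib/x86 and spells out the MSR index and EFER bits for clarity.

/*
 * Sketch of the kvm-unit-tests change: OR EFER.SCE in next to EFER.NX,
 * so that the guest's EFER matches the host's and KVM's user return
 * notifier has no EFER write left to do on return to userspace.
 */
#define MSR_EFER        0xc0000080
#define EFER_SCE        (1ull << 0)     /* SYSCALL/SYSRET enable */
#define EFER_NX         (1ull << 11)    /* no-execute enable */

static void enable_nx(void)
{
        wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX | EFER_SCE);
}

With SCE set the guest EFER is identical to the host one, so no EFER WRMSR
happens on the host side and the anomalous TLB flushes above disappear.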