On 10/11/2014 11:45, Gleb Natapov wrote:
> > I tried making also the other shared MSRs the same between guest and
> > host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
> > has nothing to do.  That saves about 4-500 cycles on inl_from_qemu.  I
> > do want to dig out my old Core 2 and see how the new test fares, but it
> > really looks like your patch will be in 3.19.
> Please test on wide variety of HW before final decision.

Yes, definitely.

> Also it would
> be nice to ask Intel what is expected overhead. It is awesome if they
> mange to add EFER switching with non measurable overhead, but also hard
> to believe :)

So let's see what happens.  Sneak preview: the result is definitely worth
asking Intel about.

I ran these benchmarks with a stock 3.16.6 KVM; instead of patching KVM,
I patched kvm-unit-tests to set EFER.SCE in enable_nx (a rough sketch of
the tweak is in the P.S. below).  This makes it much simpler for others
to reproduce the results.  I only ran the inl_from_qemu test.

Perf stat reports that the processor goes from 0.46 to 0.66 instructions
per cycle, which is consistent with the improvement from 19k to 12k
cycles per iteration.

Unpatched KVM-unit-tests:

     3,385,586,563  cycles                    #  3.189 GHz                      [83.25%]
     2,475,979,685  stalled-cycles-frontend   #  73.13% frontend cycles idle    [83.37%]
     2,083,556,270  stalled-cycles-backend    #  61.54% backend  cycles idle    [66.71%]
     1,573,854,041  instructions              #   0.46  insns per cycle
                                              #   1.57  stalled cycles per insn [83.20%]

       1.108486526 seconds time elapsed

Patched KVM-unit-tests:

     3,252,297,378  cycles                    #  3.147 GHz                      [83.32%]
     2,010,266,184  stalled-cycles-frontend   #  61.81% frontend cycles idle    [83.36%]
     1,560,371,769  stalled-cycles-backend    #  47.98% backend  cycles idle    [66.51%]
     2,133,698,018  instructions              #   0.66  insns per cycle
                                              #   0.94  stalled cycles per insn [83.45%]

       1.072395697 seconds time elapsed

Playing with other events shows that the unpatched benchmark has an awful
load of TLB misses:

Unpatched:

            30,311  iTLB-loads
       464,641,844  dTLB-loads
        10,813,839  dTLB-load-misses          #     2.33% of all dTLB cache hits
        20,436,027  iTLB-load-misses          # 67421.16% of all iTLB cache hits

Patched:

         1,440,033  iTLB-loads
       640,970,836  dTLB-loads
         2,345,112  dTLB-load-misses          #     0.37% of all dTLB cache hits
           270,884  iTLB-load-misses          #    18.81% of all iTLB cache hits

This is 100% reproducible.  The meaning of the numbers is clearer if you
look up the raw event numbers in the Intel manuals:

- iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB
  [second-level TLB] hits. No page walk."

- iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that
  cause page walks."

So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.")
and friends show that the unpatched KVM wastes about 0.1 seconds more
than the patched KVM on page walks:

Unpatched:

        24,430,676  r449  (cycles on dTLB store miss page walks)
       196,017,693  r408  (cycles on dTLB load miss page walks)
       213,266,243  r485  (cycles on iTLB miss page walks)
    ------------------------
       433,714,612  total

Patched:

        22,583,440  r449  (cycles on dTLB store miss page walks)
        40,452,018  r408  (cycles on dTLB load miss page walks)
         2,115,981  r485  (cycles on iTLB miss page walks)
    ------------------------
        65,151,439  total

These 0.1 seconds are probably all spent on instructions that would
otherwise have been fast, since the slow instructions responsible for the
low IPC are the microcoded ones, including VMX and other privileged stuff.

Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM
and 3k in patched KVM.
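(For the record, all of these raw events can be fed straight to perf.
Something along these lines should reproduce the numbers above; the exact
invocation is an educated guess, assuming the usual x86-run wrapper and
that the inl_from_qemu subtest can be selected with -append:

    perf stat -e r1085,r185,r449,r408,r485,r20bd \
        ./x86-run x86/vmexit.flat -append inl_from_qemu

perf record -e r20bd on the same run, followed by perf report, breaks the
flushes down per function.)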
Let's see where they come from:

Unpatched:

+  98.97%  qemu-kvm  [kernel.kallsyms]  [k] native_write_msr_safe
+   0.70%  qemu-kvm  [kernel.kallsyms]  [k] page_fault

It's expected that most TLB misses happen just before a page fault (there
are also events to count how many TLB misses do result in a page fault,
if you care about that), and thus are accounted to the first instruction
of the exception handler.  We do not know what causes second-level TLB
_flushes_, but it's quite expected that you'll have a TLB miss after them,
and possibly a page fault.  And anyway, 98.97% of them coming from
native_write_msr_safe is totally anomalous.

A patched benchmark shows that no second-level TLB flush occurs after a
WRMSR:

+  72.41%  qemu-kvm  [kernel.kallsyms]  [k] page_fault
+   9.07%  qemu-kvm  [kvm_intel]        [k] vmx_flush_tlb
+   6.60%  qemu-kvm  [kernel.kallsyms]  [k] set_pte_vaddr_pud
+   5.68%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_mm_range
+   4.87%  qemu-kvm  [kernel.kallsyms]  [k] native_flush_tlb
+   1.36%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_page

So basically VMX EFER writes are optimized, while non-VMX EFER writes
cause a TLB flush, at least on a Sandy Bridge.  Ouch!

I'll try to reproduce on the Core 2 Duo soon, and ask Intel about it.

Paolo
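P.S. For completeness, the kvm-unit-tests tweak is essentially a one-liner.
A rough sketch, not the exact patch: it assumes the rdmsr/wrmsr helpers
from lib/x86 and spells out the MSR index and EFER bits for clarity.

/*
 * Sketch of the kvm-unit-tests change: OR EFER.SCE in next to EFER.NX,
 * so that the guest's EFER matches the host's and KVM's user return
 * notifier has no EFER write left to do on return to userspace.
 */
#define MSR_EFER        0xc0000080
#define EFER_SCE        (1ull << 0)     /* SYSCALL/SYSRET enable */
#define EFER_NX         (1ull << 11)    /* no-execute enable */

static void enable_nx(void)
{
        wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX | EFER_SCE);
}

With SCE set the guest EFER is identical to the host one, so no EFER WRMSR
happens on the host side and the anomalous TLB flushes above disappear.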