I've recently been looking at our entry/exit costs, and profiling figures did show some very low hanging fruit. The most obvious cost is that accessing the GIC HW is slow. As in "deadly slow", especially when GICv2 is involved. So not hammering the HW when there is nothing to write (or even to read) is immediately beneficial, as this is the most common case (whatever people seem to think, interrupts are a *rare* event). Similar work has also been done for GICv3, with a reduced impact (it was less "bad" to start with).

Another easy thing to fix is the way we handle trapped system registers. We insist on (mostly) sorting them, but we still perform a linear search on each trap. We can switch to a binary search for free, and get an immediate benefit (the PMU code, being extremely trap-happy, gains the most from this).

With these in place, I see an improvement of 10 to 40% (depending on the platform) in our world-switch cycle count when running a set of hand-crafted guests that are designed to only perform traps. Please note that VM exits are actually a rare event on ARM, so don't expect your guest to be 40% faster; this will hardly make a noticeable difference.

Methodology:

* NULL-hypercall guest: Perform 2^20 PSCI_0_2_FN_PSCI_VERSION calls,
  and then power off:

__start:
	mov	x19, #(1 << 20)
1:	mov	x0, #0x84000000
	hvc	#0
	sub	x19, x19, #1
	cbnz	x19, 1b
	mov	x0, #0x84000000
	add	x0, x0, #9
	hvc	#0
	b	.
* Self IPI guest: Inject and handle 2^20 SGI0 using GICv2 or GICv3,
  and then power off:

__start:
	mov	x19, #(1 << 20)
	mrs	x0, id_aa64pfr0_el1
	ubfx	x0, x0, #24, #4
	and	x0, x0, #0xf
	cbz	x0, do_v2
	mrs	x0, s3_0_c12_c12_5	// ICC_SRE_EL1
	and	x0, x0, #1		// SRE bit
	cbnz	x0, do_v3

do_v2:
	mov	x0, #0x3fff0000		// Dist
	mov	x1, #0x3ffd0000		// CPU
	mov	w2, #1
	str	w2, [x0]		// Enable Group0
	ldr	w2, =0xa0a0a0a0
	str	w2, [x0, 0x400]		// A0 priority for SGI0-3
	mov	w2, #0x0f
	str	w2, [x0, #0x100]	// Enable SGI0-3
	mov	w2, #0xf0
	str	w2, [x1, #4]		// PMR
	mov	w2, #1
	str	w2, [x1]		// Enable CPU interface
1:	mov	w2, #(2 << 24)		// Interrupt self with SGI0
	str	w2, [x0, #0xf00]
2:	ldr	w2, [x1, #0x0c]		// GICC_IAR
	cmp	w2, #0x3ff
	b.ne	3f
	wfi
	b	2b
3:	str	w2, [x1, #0x10]		// EOI
	sub	x19, x19, #1
	cbnz	x19, 1b

die:
	mov	x0, #0x84000000
	add	x0, x0, #9
	hvc	#0
	b	.

do_v3:
	mov	x0, #0x3fff0000		// Dist
	mov	x1, #0x3fbf0000		// Redist 0
	mov	x2, #0x10000
	add	x1, x1, x2		// SGI page
	mov	w2, #2
	str	w2, [x0]		// Enable Group1
	ldr	w2, =0xa0a0a0a0
	str	w2, [x1, 0x400]		// A0 priority for SGI0-3
	mov	w2, #0x0f
	str	w2, [x1, #0x100]	// Enable SGI0-3
	mov	w2, #0xf0
	msr	S3_0_c4_c6_0, x2	// PMR
	mov	w2, #1
	msr	S3_0_C12_C12_7, x2	// Enable Group1
1:	mov	x2, #1
	msr	S3_0_c12_c11_5, x2	// Self SGI0
2:	mrs	x2, S3_0_c12_c12_0	// Read IAR1
	cmp	w2, #0x3ff
	b.ne	3f
	wfi
	b	2b
3:	msr	S3_0_c12_c12_1, x2	// EOI
	sub	x19, x19, #1
	cbnz	x19, 1b
	b	die

* sysreg trap guest: Perform 2^20 PMSELR_EL0 accesses, and then
  power off:

__start:
	mov	x19, #(1 << 20)
1:	mrs	x0, PMSELR_EL0
	sub	x19, x19, #1
	cbnz	x19, 1b
	mov	x0, #0x84000000
	add	x0, x0, #9
	hvc	#0
	b	.

* These guests are profiled using perf and kvmtool:

taskset -c 1 perf stat -e cycles:kh lkvm run -c1 --kernel do_sysreg.bin 2>&1 >/dev/null | grep cycles

The result is then divided by the number of iterations (2^20).

These tests have been run on three different platforms (two GICv2
based, and one with GICv3 and its legacy mode) and have shown
significant improvements in all cases.
I've only touched the arm64 GIC code, but obviously the 32bit code
should use it as well once we've migrated it to C.

Vanilla v4.5-rc4:

		A	B	C-v2	C-v3
Null HVC:	 8462	6566	6572	6505
Self SGI:	11961	8690	9541	8629
SysReg:		 8952	6979	7212	7180

Patched v4.5-rc4:

		A		B		C-v2		C-v3
Null HVC:	5219 -38%	3957 -39%	5175 -21%	5158 -20%
Self SGI:	8946 -25%	6658 -23%	8547 -10%	7299 -15%
SysReg:		5314 -40%	4190 -40%	5417 -25%	5414 -24%

I've pushed out a branch (kvm-arm64/suck-less) to the usual location,
based on -rc4 + a few fixes I also posted today.

Thanks,

	M.

* From v1:
  - Fixed a nasty bug dealing with the active Priority Register
  - Maintenance interrupt lazy saving
  - More LR hackery
  - Adapted most of the series for GICv3 as well

Marc Zyngier (17):
  arm64: KVM: Switch the sys_reg search to be a binary search
  ARM: KVM: Properly sort the invariant table
  ARM: KVM: Enforce sorting of all CP tables
  ARM: KVM: Rename struct coproc_reg::is_64 to is_64bit
  ARM: KVM: Switch the CP reg search to be a binary search
  KVM: arm/arm64: timer: Add active state caching
  arm64: KVM: vgic-v2: Avoid accessing GICH registers
  arm64: KVM: vgic-v2: Save maintenance interrupt state only if required
  arm64: KVM: vgic-v2: Move GICH_ELRSR saving to its own function
  arm64: KVM: vgic-v2: Do not save an LR known to be empty
  arm64: KVM: vgic-v2: Only wipe LRs on vcpu exit
  arm64: KVM: vgic-v2: Make GICD_SGIR quicker to hit
  arm64: KVM: vgic-v3: Avoid accessing ICH registers
  arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
  arm64: KVM: vgic-v3: Do not save an LR known to be empty
  arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
  arm64: KVM: vgic-v3: Do not save ICH_AP0Rn_EL2 for GICv2 emulation

 arch/arm/kvm/arm.c              |   1 +
 arch/arm/kvm/coproc.c           |  74 +++++----
 arch/arm/kvm/coproc.h           |   8 +-
 arch/arm64/kvm/hyp/vgic-v2-sr.c | 144 +++++++++++++----
 arch/arm64/kvm/hyp/vgic-v3-sr.c | 333 ++++++++++++++++++++++++++--------------
 arch/arm64/kvm/sys_regs.c       |  40 ++---
 include/kvm/arm_arch_timer.h    |   5 +
 include/kvm/arm_vgic.h          |   8 +-
 virt/kvm/arm/arch_timer.c       |  31 ++++
 virt/kvm/arm/vgic-v2-emul.c     |  10 +-
 virt/kvm/arm/vgic-v3.c          |   4 +-
 11 files changed, 452 insertions(+), 206 deletions(-)

-- 
2.1.4