Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

David Hildenbrand <david@xxxxxxxxxx> · Tue, 8 Aug 2017 13:25:31 +0200

On 08.08.2017 06:05, Longpeng(Mike) wrote:
> This is a simple optimization for kvm_vcpu_on_spin, the
> main idea is described in patch-1's commit msg.
> 
> I did some tests based on the RFC version; the results show
> that it improves performance slightly.
> 
> == Geekbench-3.4.1 ==
> VM1: 	8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
> 	running Geekbench-3.4.1 *10 runs*
> VM2/VM3/VM4: configuration is the same as VM1
> 	stress each vcpu usage (as seen by top in guest) to 40%
> 
> The comparison of each testcase's score:
> (higher is better)
> 		before		after		improve
> Integer
>  single		1176.7		1179.0		0.2%
>  multi		3459.5		3426.5		-0.9%
> Float
>  single		1150.5		1150.9		0.0%
>  multi		3364.5		3391.9		0.8%
> Memory(stream)
>  single		1768.7		1773.1		0.2%
>  multi		2511.6		2557.2		1.8%
> Overall
>  single		1284.2		1286.2		0.2%
>  multi		3231.4		3238.4		0.2%
> 
> 
> == kernbench-0.42 ==
> VM1:    8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>         running "kernbench -n 10"
> VM2/VM3/VM4: configuration is the same as VM1
>         stress each vcpu usage (as seen by top in guest) to 40%
> 
> The comparison of 'Elapsed Time':
> (lower is better)
> 		before		after		improve
> load -j4	12.762		12.751		0.1%
> load -j32	9.743		8.955		8.1%
> load -j		9.688		9.229		4.7%
> 
> 
> Physical Machine:
>   Architecture:          x86_64
>   CPU op-mode(s):        32-bit, 64-bit
>   Byte Order:            Little Endian
>   CPU(s):                24
>   On-line CPU(s) list:   0-23
>   Thread(s) per core:    2
>   Core(s) per socket:    6
>   Socket(s):             2
>   NUMA node(s):          2
>   Vendor ID:             GenuineIntel
>   CPU family:            6
>   Model:                 45
>   Model name:            Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>   Stepping:              7
>   CPU MHz:               2799.902
>   BogoMIPS:              5004.67
>   Virtualization:        VT-x
>   L1d cache:             32K
>   L1i cache:             32K
>   L2 cache:              256K
>   L3 cache:              15360K
>   NUMA node0 CPU(s):     0-5,12-17
>   NUMA node1 CPU(s):     6-11,18-23
> 
> ---
> Changes since V1:
>  - split the implementation of s390 & arm. [David]
>  - refactor the impls according to the suggestion. [Paolo]
> 
> Changes since RFC:
>  - only cache result for X86. [David & Cornelia & Paolo]
>  - add performance numbers. [David]
>  - impls arm/s390. [Christoffer & David]
>  - refactor the impls. [me]
> 
> ---
> Longpeng(Mike) (4):
>   KVM: add spinlock optimization framework
>   KVM: X86: implement the logic for spinlock optimization
>   KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>   KVM: arm: implements the kvm_arch_vcpu_in_kernel()
> 
>  arch/arm/kvm/handle_exit.c      |  2 +-
>  arch/arm64/kvm/handle_exit.c    |  2 +-
>  arch/mips/kvm/mips.c            |  6 ++++++
>  arch/powerpc/kvm/powerpc.c      |  6 ++++++
>  arch/s390/kvm/diag.c            |  2 +-
>  arch/s390/kvm/kvm-s390.c        |  6 ++++++
>  arch/x86/include/asm/kvm_host.h |  5 +++++
>  arch/x86/kvm/hyperv.c           |  2 +-
>  arch/x86/kvm/svm.c              | 10 +++++++++-
>  arch/x86/kvm/vmx.c              | 16 +++++++++++++++-
>  arch/x86/kvm/x86.c              | 11 +++++++++++
>  include/linux/kvm_host.h        |  3 ++-
>  virt/kvm/arm/arm.c              |  5 +++++
>  virt/kvm/kvm_main.c             |  4 +++-
>  14 files changed, 72 insertions(+), 8 deletions(-)
> 

I am curious: is there any architecture that allows triggering
kvm_vcpu_on_spin(vcpu) while _not_ in kernel mode?

I would have guessed that user space should never be allowed to make
CPU-wide decisions (giving up the CPU to the hypervisor).

E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
only valid from kernel space.

I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu) at all, or is
"me_in_kernel" basically always true?
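
To make the question concrete, here is a rough, self-contained sketch of
the two shapes the interface could take. It is not the code from the
series: struct vcpu_sketch and both pick_target_*() helpers are
hypothetical stand-ins, and the filtering rule (a spinner in kernel mode
skips candidates that are not in kernel mode) is my assumption about what
the "me_in_kernel" parameter is meant to control.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm_vcpu; not the real structure. */
struct vcpu_sketch {
	bool in_kernel;	/* what kvm_arch_vcpu_in_kernel() would report */
};

/*
 * Variant A: the caller states whether the spinning vcpu is in kernel
 * mode. Assumed rule: a kernel-mode spinner skips candidates that are
 * not in kernel mode. Returns the index of the first acceptable
 * candidate, or -1 if there is none.
 */
int pick_target_with_param(const struct vcpu_sketch *vcpus, int n,
			   int me, bool me_in_kernel)
{
	for (int i = 0; i < n; i++) {
		if (i == me)
			continue;
		if (me_in_kernel && !vcpus[i].in_kernel)
			continue;
		return i;
	}
	return -1;
}

/*
 * Variant B: no parameter. This is only equivalent to variant A if the
 * spinner can never reach this path from user mode, i.e. if
 * "me_in_kernel" is always true.
 */
int pick_target_no_param(const struct vcpu_sketch *vcpus, int n, int me)
{
	return pick_target_with_param(vcpus, n, me, true);
}
```

If no architecture can get here from user mode, variant B would be
enough and the extra argument could go away.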

-- 

Thanks,

David